
Research Article
Exploiting Semantic Annotations and Q-Learning for Constructing an Efficient Hierarchy/Graph Texts Organization

Asmaa M. El-Said, Ali I. Eldesoky, and Hesham A. Arafat

Department of Computers and Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt

Correspondence should be addressed to Asmaa M. El-Said; ammansedueg@gmail.com

Received 15 July 2014; Revised 10 December 2014; Accepted 10 December 2014

Academic Editor: Miguel-Angel Sicilia

Copyright © 2015 Asmaa M. El-Said et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, The Scientific World Journal, Volume 2015, Article ID 136172, 19 pages. http://dx.doi.org/10.1155/2015/136172

Tremendous growth in the number of textual documents has produced daily requirements for effective development to explore, analyze, and discover knowledge from these textual documents. Conventional text mining and managing systems mainly use the presence or absence of key words to discover and analyze useful information from textual documents. However, simple word counts and frequency distributions of term appearances do not capture the meaning behind the words, which results in limiting the ability to mine the texts. This paper proposes an efficient methodology for constructing a hierarchy/graph-based texts organization and representation scheme based on semantic annotation and Q-learning. This methodology is based on semantic notions to represent the text in documents, to infer unknown dependencies and relationships among concepts in a text, to measure the relatedness between text documents, and to apply mining processes using the representation and the relatedness measure. The representation scheme reflects the existing relationships among concepts and facilitates accurate relatedness measurements that result in a better mining performance. An extensive experimental evaluation is conducted on real datasets from various domains, indicating the importance of the proposed approach.

1. Introduction

As the sheer number of textual documents available online increases exponentially, the need to manage these textual documents also increases. This growth of online textual documents plays a vital role in exploring information and knowledge [1]. The massive volume of available information and knowledge should be discovered, and the tasks of managing, analyzing, searching, filtering, and summarizing the information in documents should be automated [1, 2]. Four main aspects pertain to most of the textual document mining and managing approaches: (a) representation models [3], (b) relatedness measures [4], (c) mining and managing processes [2], and (d) evaluation methods.

Selecting an appropriate data representation model is essential for text characterization, including text mining and managing, dictating how data should be organized and what the key features are. The "relatedness measures" are used to determine the closeness of the objects in the representation space, while the "mining and managing processes" are the algorithms that describe the steps of a specific task to fulfill specific requirements. The evaluation methods are used to judge the quality of the mining process results that are produced [1].

At a certain level of simplicity, such mining and managing operations as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents seem relatively similar. All of these operations make use of a text representation model and a relatedness measure to perform their specific tasks. Dealing with natural language documents requires an adequate text representation model to understand them [5]. Accordingly, the trend toward reliance on a semantic understanding-based approach is necessary. Knowledge-rich representations of text combined with accurate semantic relatedness measures are required. This paper introduces an efficient semantic hierarchy/graph-based representation scheme (SHGRS) based on exploiting the semantic structure to improve the effectiveness of the mining and managing operations. Therefore, document clustering in this paper is used as a case study to illustrate the working and the efficacy of the proposed representation scheme.



Figure 1: The proposed semantic text organization/representation scheme for information access and knowledge acquisition processes. (The original figure shows a textual documents repository feeding the proposed hierarchy/graph-based representation scheme, SHGRS, which in turn supports visualization, mining, extraction, clustering, summarization, filtration, search, and categorization for text mining, information retrieval, information access, and knowledge acquisition.)

The semantic representation scheme is a general description or a conceptual system for understanding how information in the text is represented and used. The proposed representation scheme, as shown in Figure 1, is an essential step for the actual information access and knowledge acquisition. As revealed by the figure, an effective textual static structure with dynamic behavior is provided to evolve processes for information access and knowledge acquisition. The main contributions of this paper are summarized as follows:

(1) proposing an efficient representation scheme, called SHGRS, that relies on an understanding-based approach;

(2) exploiting the knowledge-rich notion to introduce an efficient semantic relatedness measure at a document level;

(3) suggesting a novel linkage process for documents through identifying their shared subgraphs at a corpus level;

(4) conducting extensive experiments on real-world datasets to study the effectiveness of the proposed representation scheme along with the proposed relatedness measure and linkage process, which allow more effective document mining and managing processes.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the related work and basic concepts, while Section 3 illustrates the details of the proposed framework for constructing an efficient text representation scheme based on semantic annotation. Section 4 reports the experimental results and discussions. Conclusions and suggestions for future work are given in Section 5.

2. Basic Concepts and Related Works

In an effort to keep up with the tremendous growth of the online textual documents, many research projects target the organization of such information in a way that will make it easier for the end users to find the information they want efficiently and accurately [1]. Related works here can roughly be classified into three major categories of studies: text representation models, semantic relatedness measures, and documents linkage processes.

Firstly, a growing amount of recent research in this field focuses on how the use of a semantic representation model is beneficial for text categorization [3], text summarization [8], word sense disambiguation [9], methods for automatic evaluation of machine translation [10], and documents classification [11]. Although the text representation model is a well-researched problem in computer science, the current representation models could be improved with the use of external knowledge that is impossible to extract from the source document itself. One of the widely used sources of external knowledge for the text representation model is WordNet [12], a network of related words organized into synonym sets, where these sets are based on the lexical underlying concept. Furthermore, machine learning is not yet advanced enough to allow large-scale extraction of information from textual documents without human input. Numerous semantic annotation tools [13, 14] have been developed to aid the process of human text markup to guide machines. The main task in the text representation model includes morphological, syntactic, and semantic analysis that uses the results of the low-level analysis as well as dictionary and reference information to construct a formalized representation of NL text. The semantic level implies not only linguistic but also logical relations between language objects to be represented [15]. At the semantic level of understanding of text documents [5], models such as the WordNet-based semantic model, conceptual dependence model, semantic graph model, ontology-based knowledge model, and Universal Networking Language can be used. In [16], the WordNet-based semantic model captures the semantic structure of each term within a sentence and document rather than the frequency of the term within a document only. This model analyzes the terms and their corresponding synonyms and hypernyms on the sentence and document levels, ignoring dependencies among terms at the sentence level. In [17], the conceptual dependence model identifies, characterizes, and understands the effect of the existing dependencies among the entities in the model, considering only nouns as concepts. The semantic graph model proposes a semantic representation based on a graph-based structure with a description of text structure [5]. In the semantic graph model, the traditional method to generate the edges between two nodes in the graph is usually based on the cooccurrence and similarity measures of the two nodes, and the most frequent word is not necessarily chosen to be important. In [18], ontology knowledge, represented by entities connected by naming relationships and the defined taxonomy of classes, may become a much more powerful tool in information exploration. Traditional text analysis and organization methods can also be enriched with ontology-based information concerning the cooccurring entities or whole neighborhoods of entities [19]. However, automatic ontology construction is a difficult task because of the failure to support order among the objects and the attributes. In [20], Universal Networking Language (UNL) is used to represent every document as a graph with concepts as nodes and relations between them as links, ignoring the sequence and orders of concepts in a sentence at the document level.

Secondly, the semantic relatedness measure has become an area of research as one of the hotspots in the area of information technology. Semantic relatedness and semantic similarity are sometimes confused in the research literature, and they are not identical [21, 22]. Semantic relatedness measures usually include various types of relationships, such as hypernym, hyponym, subsumption, synonym, antonym, holonym, and meronymy. Semantic similarity is a special case of semantic relatedness that only considers synonymous relationships and subsumption relationships. Several measures for semantic relatedness have recently been conducted. According to the parameters used, they can be classified into three major categories: distance-based methods (Rada and Wu and Palmer [21]), which select the shortest path among all the possible paths between concepts to be more similar; information-based methods (Resnik, Jiang and Conrath, and Lin [21]), which consider the use of external corpora, avoiding the unreliability of path distances and taxonomy; and hybrid methods (T. Hong and D. Smith and Zili Zhou [21]), which combine the first two measures. In this paper, the information-based methods are focused on. In [23], the semantic relatedness of any concept is based on a similarity theorem, in which the similarity of two concepts is measured by the ratio of the amount of information needed for the commonality of the two concepts to the amount of information needed to describe them. The information content (IC) of their lowest common subsumer (LCS) captures the commonality of the two concepts, along with the IC of the two concepts themselves. The lowest common subsumer (LCS) is the most specific concept which is a shared ancestor of the two concepts. The pointwise mutual information (PMI) [24, 25] is a simple method for computing corpus-based similarity of words. The pointwise mutual information is defined as the ratio of the probability that the two words cooccur, $p(w_1 \text{ AND } w_2)$, to the product of the probabilities of each one, $p(w_1) \cdot p(w_2)$. If $w_1$ and $w_2$ are statistically independent, then the probability that they cooccur is given by the product $p(w_1) \cdot p(w_2)$. If they are not independent, and they have a tendency to cooccur, then $p(w_1 \text{ AND } w_2)$ will be greater than $p(w_1) \cdot p(w_2)$. As is clear, there is extensive literature on measuring the semantic relatedness between long texts or documents [7], but there is less work related to the measurement of similarity between sentences or short texts [26, 27]. Such methods are usually effective when dealing with long documents, because similar documents will usually contain a degree of cooccurring words. However, in short documents, the word cooccurrence may be rare or even null. This is mainly due to the inherent flexibility of natural language, enabling people to express similar meanings using quite different sentences in terms of structure and word content.

Finally, in order to overcome the limitations of previous models, the documents linkage process is conducted relying on a graph-based document representation model [4, 10, 11]. The main benefit of graph-based techniques is that they allow us to keep the inherent structural information of the original document. Mao and Chu [10] used a graph-based K-means algorithm for clustering web documents. Yoo and Hu [28] represent a set of documents as bipartite graphs using domain knowledge in an ontology; the documents are then clustered based on the average-link clustering algorithm. However, these approaches relying on a graph-based model are not intended to be used on an incrementally growing set of documents.

To utilize the structural and semantic information in the document, in this paper a formal semantic representation of linguistic input is introduced to build an SHGRS scheme for the documents. This representation scheme is constructed through the accumulation of syntactic and semantic analysis outputs. A new semantic relatedness measure is developed to determine the relatedness among concepts of the document as well as the relatedness between contents of the documents for long/short texts. An efficient linkage process is required to connect the related documents by identifying their shared subgraphs. The proposed representation scheme, along with the proposed relatedness measure and linkage process, is believed to enable more effective document mining and managing processes.

3. The Proposed Semantic Representation Scheme and Linkage Processes

In this section, a new framework is introduced for constructing the SHGRS and conducting the document linkage processes using multidimensional analysis and primary decision support. In fact, the elements in a sentence are not equally important, and the most frequent word is not necessarily the most important one. As a result, the extraction of text main features (MFs) as concepts, as well as their attributes and relationships, is important. Combinations of semantic annotation [13, 14] and reinforcement learning (RL) [29] techniques are used to extract the MFs of the text and to infer unknown dependencies and relationships among these MFs. The RL fulfills sequential decision-making tasks with a long-run accumulated reward to achieve the largest amount of interdependence among MFs. This framework focuses on three key criteria: (1) how the framework refines text to select the MFs and their attributes, (2) what learning algorithm is used to explore the unknown dependencies and relationships among these MFs, and (3) how the framework conducts the document linkage process.

Details of this framework's three stages are given in Figure 2. The first stage aims to refine textual documents to select the MFs and their attributes with the aid of OpenNLP [13] and AlchemyAPI [14]. A hierarchy-based structure is used to represent each sentence with its MFs and their attributes, which achieves dimension reduction with more understanding. The second stage aims to compute the proposed MFs semantic relatedness (MFsSR), which contributes to the detection of the closest synonyms of the MFs and to inferring the relationships and dependencies of the MFs. A graph-based structure is used to represent the relationships and dependencies among these MFs, which achieves more correlation through many-to-many relationships.

Figure 2: A framework for constructing a semantic hierarchy/graph text representation scheme. (The original figure shows three stages: a semantic text refining and annotating stage with word-tagging, syntactic, and semantic analyses via semantic annotation (AlchemyAPI), MFs extraction, and MFs hierarchy construction; a dependencies/relationships exploring and inferring stage with the MFsSR algorithm, WordNet, and SALA for detecting closest synonyms, inferring dependency and semantic relationships, and building the relationships graph structure; and a documents linkage conducting stage with FLG matching and a subgraph-relatedness measure, yielding the semantic hierarchy/graph document object.)

The third stage aims to conduct the document linkage process using a subgraph matching process that relies on finding all of the distinct similarities common to the subgraphs. The main proceedings of this framework can be summarized as follows:

(1) extracting the MFs of sentences and their attributes to build the hierarchy-based structure for efficient dimension reduction;

(2) estimating a novel semantic relatedness measure with consideration of direct relevance and indirect relevance between MFs and their attributes for promising performance improvements;

(3) detecting the closest synonyms of MFs for better disambiguation;

(4) exploring and inferring dependencies and relationships of MFs for more understanding;

(5) representing the interdependence among the MFs of sentences in a graph-based structure so as to be more precise and informative;

(6) conducting a document linkage process for more effectiveness.

3.1. Semantic Text Refining and Annotating Stage. This stage is responsible for refining the text to discover the text MFs, with additional annotation for more semantic understanding, and aims to:

(1) accurately parse each sentence and identify POS, subject-action-object, and named-entity recognition;

(2) discover the MFs of each sentence in the textual document;

(3) exploit semantic information in each sentence through detection of the attributes of the MFs;

(4) reduce the dimensions as much as possible;

(5) generate an effective descriptive sentence object (DSO) with a hierarchical sentence object automatically.

This stage can be achieved through these processes. First, the NL text is studied at different linguistic levels, that is, words, sentence, and meaning, for semantic analysis and annotation [30, 31]. Second, exploiting the information "who is doing what to whom" clarifies dependencies between verbs and their arguments for the extraction of MFs. Finally, building the MFs hierarchy-based structure explains the sentences' MFs and their attributes.

With regard to extracting MFs and building the MFs hierarchy-based structure, the following points must be highlighted. First, there is a dire need to refine the text content by representing each sentence with its MFs instead of a series of terms, to reduce the dimensions as much as possible. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. Furthermore, AlchemyAPI extracts semantic metadata from content, such as information on subject-action-object relation extraction, people, places, companies, topics, facts, relationships, authors, and languages. Based on the NL concept by OpenNLP and AlchemyAPI, the sentence MFs are identified as the subject "Sub," the main verb "MV," and the object "Obj" (direct or indirect object). In addition, such other terms remaining in the sentence as the complement "Com" (subject complement or object complement) and modifiers (Mod) are considered as attributes of these MFs.

Second, automatic annotation is essential for each of the MFs, adding information for more semantic understanding. Based on the semantic annotation by OpenNLP and AlchemyAPI, the sentence MFs are annotated with feature value (FVal), part-of-speech (POS), named-entity recognition (Ner), and a list of feature attributes (FAtts).
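As an illustration of the kind of output this stage produces, the sketch below extracts Sub/MV/Obj and a few attribute candidates from a dependency parse. It substitutes spaCy (with the en_core_web_sm model assumed installed) for the OpenNLP/AlchemyAPI tools used in the paper, and the chosen dependency labels are assumptions of this sketch rather than the authors' exact rules.

```python
import spacy

# spaCy stands in here for the OpenNLP/AlchemyAPI pipeline described above.
nlp = spacy.load("en_core_web_sm")

def extract_mfs(sentence: str) -> dict:
    doc = nlp(sentence)
    mfs = {"Sub": None, "MV": None, "Obj": None, "FAtts": []}
    for tok in doc:
        ann = (tok.text, tok.pos_, tok.ent_type_)      # FVal, POS, Ner
        if tok.dep_ == "ROOT" and tok.pos_ in ("VERB", "AUX"):
            mfs["MV"] = ann                             # main verb
        elif tok.dep_ in ("nsubj", "nsubjpass"):
            mfs["Sub"] = ann                            # subject
        elif tok.dep_ in ("dobj", "dative", "pobj"):
            mfs["Obj"] = ann                            # direct/indirect object
        elif tok.dep_ in ("amod", "advmod", "acomp", "attr"):
            mfs["FAtts"].append((tok.text, tok.dep_, tok.pos_))  # modifiers/complements
    return mfs

print(extract_mfs("The employee guides the customer."))
```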


Figure 3: The hierarchy-based structure of the textual document. (The original figure shows a root node over DSO nodes; each DSO contains MF1 ... MFn, each annotated with FVal, POS, and Ner; each MF carries attributes FATT1 ... FATTm, each annotated with Val, Type, POS, and Ner.)

The list of FAtts is constructed from the remaining terms in the sentence (complement or modifier), relying on the grammatical relation. Each attribute is annotated with attribute value (Val), attribute type (Type), which is complement or modifier, attribute POS (POS), and attribute named-entity recognition (Ner).

Finally, the hierarchy-based structure of the textual document is represented as an object [32] containing the hierarchical structure of sentences with the MFs of the sentences and their attributes. This hierarchical structure maintains the dependency between the terms at the sentence level for more understanding, which provides sentences with fast access and retrieval. Figure 3 shows the hierarchical structure of the text in a theme tree "tree_DSO": the first level of the hierarchical model denotes all sentence objects, which are called DSO; the next level of nodes denotes the MFs of each sentence object; and the last level of nodes denotes the attributes of the MFs.

tree_DSO.DSO[i] indicates the annotation of sentence object i, which is fitted with additional information as in Heuristic 1. This heuristic simplifies the hierarchical model of the textual document with its DSO objects and its MFs. The feature score is used to measure the contribution of the MF in the sentence. This score is based on the number of the terms grammatically related to the MF (summation of feature attributes) as follows:

\[
\text{Feature Score (FS)} = \sum_{i} \mathrm{MF}_{i}.\mathrm{FAtt}. \tag{1}
\]

Heuristic 1. Let tree_DSO = {Sibling of DSO nodes}, and let each DSO ∈ tree_DSO = {SEN_key, SEN_value, SEN_State, Sibling_of_MF_node[]}, where SEN_key indicates the order of the sentence in the document as an index; SEN_value indicates the sentence parsed value, as in (S (NP (NN employee)) (VP (VBZ guides) (NP (DT the) (NN customer))) (·)); SEN_State indicates the sentence tense and state (tense: past, state: negative); Sibling_of_MF_node[] indicates the list of sibling nodes of MFs for each sentence, fitted with Val, POS, Ner, and the list of FAtts.
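A minimal sketch of the hierarchy described by Heuristic 1 and equation (1) might look as follows; the class and field names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class FAtt:
    """Attribute of a main feature (complement or modifier)."""
    val: str
    type: str        # "complement" or "modifier"
    pos: str
    ner: str

@dataclass
class MF:
    """Main feature node: Sub, MV, or Obj."""
    fval: str
    pos: str
    ner: str
    fatts: list = field(default_factory=list)    # list of FAtt

    def feature_score(self) -> int:
        # equation (1): FS = number of terms grammatically tied to the MF
        return len(self.fatts)

@dataclass
class DSO:
    """Descriptive sentence object (first level of the hierarchy)."""
    sen_key: int      # order of the sentence in the document
    sen_value: str    # parsed value, e.g. "(S (NP ...) (VP ...))"
    sen_state: dict   # e.g. {"tense": "past", "state": "negative"}
    mfs: list = field(default_factory=list)      # list of MF

tree_DSO: list = []   # siblings of DSO nodes, i.e. the whole document
```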

In the text refining and annotating (TRN) algorithm (see Algorithm 1), the textual document is converted to an object model with the hierarchical DSO. The TRN algorithm uses the OpenNLP and AlchemyAPI tools for MFs extraction and the hierarchy model construction. As a result, the constructed hierarchy structure leads to the reduction of the dimensions as much as possible and achieves efficient information access and knowledge acquisition.

3.2. Dependencies/Relationships Exploring and Inferring Stage. This stage is responsible for exploring how the dependencies and relationships among MFs have an effect and aims to:

(1) represent the interdependence among sentence MFs in a graph-based structure known as the feature linkage graph (FLG);

(2) formulate an accurate measure for the semantic relatedness of MFs;

(3) detect the closest synonyms for each of the MFs;

(4) infer the relationships and dependencies of MFs with each other.

This stage can be achieved in three processes. First, building an FLG represents the dependencies and relationships among sentences. Second, an efficient MFsSR measure is proposed to contribute to the detection of the closest synonyms and to infer the relationships and dependencies of the MFs. This measure considers the direct relevance and the indirect relevance among MFs. The direct relevance is the synonyms and association capabilities (similarity, contiguity, contrast, and causality) among the MFs. These association capabilities indicate the relationships that give the largest amount of interdependence among the MFs, while the indirect relevance refers to other relationships among the attributes of these MFs. Finally, the unknown dependencies and relationships among MFs are explored by exploiting semantic information about their texts at a document level.

In addition, a semantic actionable learning agent (SALA) plays a fundamental role in detecting the closest synonyms and inferring the relationships and dependencies of the MFs. Learning is needed to improve the SALA functionality to achieve multiple goals. Many studies show that the RL agent has a high reproductive capability for human-like behaviors [29]. As a result, the SALA performs an adaptive approach combining thesaurus-based and distributional mechanisms. In the thesaurus-based mechanism, words are compared in terms of how they are related in the thesaurus (e.g., WordNet), while in the distributional mechanism, words are compared in terms of the shared number of contexts in which they may appear.

3.2.1. Graph-Based Organization Constructing Process. The relationships and dependencies among the MFs are organized into a graph-like structure of nodes with links, known as the FLG and defined in Heuristic 2. The graph structure FLG represents many-to-many relationships among MFs. The FLG is a directed acyclic graph FLG(V, A) that would be represented in a collection of vertices and a collection of directed arcs. These arcs connect pairs of vertices with no path returning to the same vertex (acyclic). In FLG(V, A), V is the set of vertices or states of the graph and A is the set of arcs between vertices. The FLG construction is based on two sets of data. The first set represents the vertices in a one-dimensional array Vertex(V) of Feature Vertex objects (FVOs).


Input: list of sentences of the document d_i
Output: List_DSO /* list of DSO objects and their MFs with attributes */
Procedure:
 D ← new document
 List_DSO ← empty list /* list of sentence objects of the document */
 for each sentence s_i in document D do
  AlchemyAPI(s_i) /* call AlchemyAPI to determine all MFs and their NER */
  for each feature mf_j ∈ {mf_1, mf_2, ..., mf_n} in DSO do
   get_MF_Attributes(mf_j) /* determine the attributes of each MF */
   OpenNLP(s_i) /* call OpenNLP to determine POS and NER for the MF and its attributes */
   compute_MF_score(mf_j) /* determine the score of each MF */
  end
  List_DSO.add(DSO)
 end

Algorithm 1: TRN algorithm.

The FVO is annotated with Sentence Feature key (SFKey), feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and feature weight (FW), where SFKey combines the sentence key and the feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object has the value of the feature, the Fsyn object has a list of the detected closest synonyms of the Fval, and the FAss object has a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Type between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional array Adjacency(V, V) of features link weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let Adjacency(V, V) = {FLW(0,0), FLW(0,1), ..., FLW(1,0), FLW(1,1), ..., FLW(n,n)}, where FVO indicates an object of the MF in the vertex vector and is annotated with additional information as FVO = {SFKey, Fval, Fsyn, FAss, FW}; SFKey indicates the combination of the sentence key and feature key in a DSO; Fval indicates the value of the feature; Fsyn indicates a list of the closest synonyms of the Fval; FAss indicates a list of the feature associations and relationships for the Fval, and each object in this list is defined as FAssItem = {Rel_Type, Rel_vertex}, where Rel_Type indicates the value of the relation type linkage between the two features and Rel_vertex indicates the index of the related vertex; FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document; FLW indicates the value of the linkage weight in an adjacency matrix.
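A compact sketch of the FLG of Heuristic 2, with the vertex vector and the adjacency matrix of FLWs, could be structured as follows (class and field names are illustrative):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FVO:
    """Feature vertex object of Heuristic 2."""
    sf_key: tuple                               # (sentence key, feature key)
    fval: str                                   # value of the feature
    fsyn: list = field(default_factory=list)    # closest synonyms
    fass: list = field(default_factory=list)    # {"Rel_Type": ..., "Rel_vertex": ...}
    fw: float = 0.0                             # accumulated feature weight

class FLG:
    """Directed acyclic feature linkage graph: vertex vector + FLW matrix."""
    def __init__(self, fvos):
        self.vertex = list(fvos)
        n = len(self.vertex)
        self.adjacency = np.zeros((n, n))       # FLW(i, j) link weights

    def link(self, i, j, rel_type, weight):
        # record the relation on the source vertex and store its FLW
        self.vertex[i].fass.append({"Rel_Type": rel_type, "Rel_vertex": j})
        self.vertex[i].fw += weight             # accumulate FW as links are added
        self.adjacency[i, j] = weight
```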

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summary, text annotation, text mining, word sense disambiguation, automatic indexing, machine translation, and other aspects [2, 3]. In this paper, one of the information-based methods is utilized by considering IC. The semantic relatedness that has been investigated concerns the direct relevance and the indirect relevance among MFs. The MFs may be one word or more, and thereby the MF is considered as a concept. The IC is considered as a measure of quantifying the amount of information a concept expresses. Lin [23], who assumed IC for a concept c, let p(c) be the probability of encountering an instance of concept c; the IC value is obtained by considering the negative log likelihood of encountering a concept in a given corpus. Traditionally, the semantic relatedness of concepts is usually based on the cooccurrence information in a large corpus. However, the cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, the development of a new information content method for the proposed MFsSR, based on cocontributions instead of cooccurrences, is important.

The new information content (NIC) measure is an extension of the information content measure. The NIC measures are based on the relations defined in the WordNet ontology. The NIC measure uses hypernym/hyponym, synonym/antonym, holonym/meronymy, and entail/cause to quantify the informativeness of concepts. For example, a hyponym relation could be "bicycle is a vehicle," and a meronym relation could be "a bicycle has wheels." NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronymy, entail, and cause relationships, normalized by the maximum number of MFs/attributes objects in the textual document, using

\[
\mathrm{NIC}(c) = 1 - \frac{\log\bigl(\mathrm{Hype\_Hypo}(c) + \mathrm{Syn\_Anto}(c) + \mathrm{Holo\_Mero}(c) + \mathrm{Enta\_Cause}(c) + 1\bigr)}{\log(\mathrm{max\_concept})}, \tag{2}
\]

where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronymy values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MFs/attributes objects in the considered text. The max_concept normalizes the NIC value, and hence the NIC values fall in the range of [0, 1].

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\[
\mathrm{SemRel} = \lambda \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} \right) + (1 - \lambda) \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(\mathrm{att}\_c_1, \mathrm{att}\_c_2))}{\mathrm{NIC}(\mathrm{att}\_c_1) + \mathrm{NIC}(\mathrm{att}\_c_2)} \right), \tag{3}
\]

where λ ∈ [0, 1] decides the relative contribution of the direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, λ ∈ [0.5, 1].
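Equations (2) and (3) translate directly into code. The sketch below assumes the relation counts per concept have already been collected from the FLG; the `flg_counts` mapping and the `lcs` callback are placeholders for those steps, not the authors' implementation.

```python
import math

def nic(c, flg_counts, max_concept):
    """Equation (2); flg_counts[c] holds the per-relation FVO counts
    collected from the FLG (an assumed precomputed mapping)."""
    related = (flg_counts[c]["hype_hypo"] + flg_counts[c]["syn_anto"] +
               flg_counts[c]["holo_mero"] + flg_counts[c]["enta_cause"])
    return 1 - math.log(related + 1) / math.log(max_concept)

def sem_rel(c1, c2, att_c1, att_c2, lcs, flg_counts, max_concept, lam=0.7):
    """Equation (3); lcs(a, b) returns the lowest common subsumer, and
    lam weights the direct relevance, lam in [0.5, 1]."""
    n = lambda x: nic(x, flg_counts, max_concept)
    direct = 2 * n(lcs(c1, c2)) / (n(c1) + n(c2))
    indirect = 2 * n(lcs(att_c1, att_c2)) / (n(att_c1) + n(att_c2))
    return lam * direct + (1 - lam) * indirect
```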

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonyms detection, the SALA carries out the first action to extract the closest synonyms for each MF, where SALA addresses the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonyms detection, SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned in a list called Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernyms, hyponyms, meronyms, holonymy, antonymy, entail, and cause. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned in a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array as follows:

\[
\mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}(\mathrm{SRL}(i).\mathrm{item}, \mathrm{FVO}_j), \tag{4}
\]

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in the SRL with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms. SALA applies the threshold to select the closest synonyms with the highest score according to the specified threshold as follows:

\[
\text{Closest-Synonyms} = \max_{\mathrm{th}}\bigl(\mathrm{F\_SRList.SRList.score()}\bigr). \tag{5}
\]

Then, each SALA retains the closest synonyms in the Fsyn object of each FVO, and SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA links algorithm, the SALA receives each FVO for detecting links among the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.
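The scoring-and-filtering step of C-SynD can be sketched as follows, using NLTK's WordNet interface to build each SRL. Here `sem_rel_words` stands in for a word-level wrapper around equation (3), and the particular relation set used is an assumption of this sketch.

```python
from nltk.corpus import wordnet as wn

# sem_rel_words(a, b): hypothetical word-level wrapper around equation (3)

def closest_synonyms(fvo, fvos, sem_rel_words, threshold=0.5):
    """Score each WordNet sense of an MF against all FVOs and keep the
    best synonym-relation list (SRL), mirroring equations (4) and (5)."""
    f_srlist = []
    for synset in wn.synsets(fvo.fval):
        # SRL: the sense's synonyms plus terms from its related synsets
        srl = [lemma.name() for lemma in synset.lemmas()]
        for rel in synset.hypernyms() + synset.hyponyms():
            srl += [lemma.name() for lemma in rel.lemmas()]
        # equation (4): accumulate relatedness of SRL items vs. every FVO
        score = sum(sem_rel_words(item, other.fval)
                    for item in srl for other in fvos)
        f_srlist.append((score, srl))
    if not f_srlist:
        return []
    best_score, best_srl = max(f_srlist, key=lambda t: t[0])
    # equation (5): keep the highest-scoring list above the threshold
    return best_srl if best_score >= threshold else []
```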

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, SALA aims to explore the implicit dependencies and relationships in the context, as a human would, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. These association capabilities are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronymy relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as (give), and the other is called resultant, such as (have), in WordNet.

To equip SALA with decision-making and experience learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of SALA is to select an optimal association to maximize the total long-run accumulated reward. Hence, SALA can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.


Input: array of Feature Vertex Objects (FVO), WordNet
Output: list of the closest synonyms of the FVO objects /* the detected closest synonyms set the Fsyn value */
Procedure:
 SRL ← empty list /* one synonym and its relationships */
 F_SRList ← empty list /* list of SRLs (all synonyms and their relationship lists) */
 /* Parallel.ForEach is used for parallel processing of all FVOs */
 Parallel.ForEach(FVO, fvo => {
  Fsyn ← empty list
  F_SRList = get_all_Syn_Relations_List(fvo) /* get all synonyms and their relationship lists from WordNet */
  /* compute the semantic relatedness of each synonym and its relationships with the fvo object, then score each list */
  for (int i = 0; i < F_SRList.Count(); i++) {
   score[i] = sum(Get_SemRel(F_SRList[i], fvo)) /* the score enables accurate filtering of the returned synonyms and selection of the closest one */
  }
  Fsyn = max_th(score[]) /* select the SRL with the highest score according to the specified threshold */
 }) /* ParallelFor */

Algorithm 2: C-SynD algorithm.

Input: action type and Feature Vertex Object (FVO)
Output: links detected between feature vertices, with the features link weight (FLW) assigned
Procedure:
 i ← index of the input FVO in the FLG
 numVerts ← number of FVO objects in Vertex(V)
 for (int j = i; j <= numVerts; j++) {
  if (Action_type == "Closest-Synonyms") then {
   Fsyn = get_C-SynD(FVO[i]) /* execute the C-SynD algorithm */
   if (FVO[j].contains(Fsyn)) then {
    FLW ← SemRel(FVO[i], FVO[j]) /* compute the semantic relatedness between FVO[i] and FVO[j] */
    Rel_Type ← 5 /* relation type 5 indicates a synonym */
   }
  }
  else if (Action_type == "Association") then {
   optimal_action = get_IOAC-QL(FVO) /* execute the IOAC-QL algorithm */
   FLW ← reward /* the reward value of the returned optimal action */
   Rel_Type ← action_type /* type of the returned optimal action (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality) */
  }
  Rel_vertex = j /* index of the related vertex */
  Adjacent[i, j] = FLW /* set the linkage weight in the adjacency matrix between vertex[i] and vertex[j] */
 }

Algorithm 3: SALA links algorithm.


(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do but not how to do it through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach goal states or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in the state x. SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x on, $\max(\sum_{i=0} \gamma^{i} r_{i})$, where γ is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing $r + \gamma f(y) - Q(x, a)$, where $f(y) = \max_{a} Q(y, a)$ and y is the new state resulting from action a.


Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s is defined under a policy π, denoted by Q(s, a), as the expected return starting from s and taking the action a. Action policy π is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way chosen in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALA. The action with the maximum summed value is then chosen to execute.

Environment, State, and Actions. For the effective association capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn and share their knowledge and can explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space. In each state, the MF object is fitted with the closest-synonym relations; there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is the optimal for each state. In summary, once s_k (MF object) is received from the FLG, SALA chooses one of the associations as the action a_k according to SALA's policy π*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by SALA at round k:

\[
a_k = \pi^{*}(s_k). \tag{6}
\]

Therein, one has the following:

(i) a_k ∈ action space = {similarity, contiguity, contrast, causality};

(ii) $\pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k)$.

Once the optimal policy (π*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning. Rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts to select the action with the highest expected future reward from each state. The immediate reward, which the SALA gets for executing an action from a state s, plus the value of an optimal policy, is the Q-value. A higher Q-value points to a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then, each SALA performs the following steps in Heuristic 4.

Heuristic 4.

(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.

(ii) Receive the immediate reward r_k and observe the new state s'_k.

(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: $a_k = \pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k)$.

(iv) Update the Q(s_k, a_k) as follows:

\[
Q(s_k, a_k) = (1 - \alpha)\, Q(s_k, a_k) + \alpha \Bigl( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \Bigr), \tag{7}
\]

where 119876(119904119896 119886119896) is worthy of selecting the action (119886

119896)

at the state (119904119896) 119876(1199041015840

119896 1198861015840

119896) is worthy of selecting the

next action (1198861015840119896) at the next state (1199041015840

119896) The 119903(119904

119896 119886119896)

is a reward corresponding to the acquired payoff A(1199041015840119896) is a set of possible actions at the next state (1199041015840

119896) 120572

(0 lt 120572 le 1) is learning rate 120574 (0 le 120574 le 1) is discountrate
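A direct transcription of the update in equation (7), together with the greedy policy of Heuristic 3, is sketched below; `reward` stands in for equation (8), and the string-based state encoding is illustrative only.

```python
from collections import defaultdict

ACTIONS = ("similarity", "contiguity", "contrast", "causality")
Q = defaultdict(float)                    # Q[(state, action)], zero-initialized

def greedy_action(state):
    """Heuristic 3: pi*(s_k) = argmax_a Q*(s_k, a_k)."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One update step of equation (7)."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# toy round, where reward() would implement equation (8):
# a = greedy_action("FVO_1"); q_update("FVO_1", a, reward("FVO_1", a), "FVO_7")
```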

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_ik for the action value i in document k as follows:

\[
\mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \bigl( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \bigr), \tag{8}
\]


Input: Feature Vertex Object (FVO), WordNet
Output: the optimal action for the FVO relationships
Procedure:
 initialize Q[]
 /* implement the Q-learning algorithm to get the action with the highest Q-value */
 a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
 /* Parallel.ForEach is used for parallel processing of all actions */
 Parallel.For(0, action_space.Count(), k => {
  R[k] = Get_Action_Reward(a[k]) /* function implementing equation (8) */
  Q[i, k] = (1 − α) * Q[i, k] + α * (R[k] + γ * max_a(Q[i, k + 1]))
 }) /* ParallelFor */
 return Get_highest_Q-value_action(Q[]) /* select the action with the maximum summed Q-value */

 /* execute equation (8) to get the reward value */
 Get_Action_Reward(a_k):
  AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
  for (int j = 0; j < FVO.Count(); j++) {
   FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
   SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
   Sum += FW_j * SemRel(i, j)
  }
  return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_ik is the weight of action value i in document k, FW_jk is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], used to measure AVW_ik and FW_jk. The FW_jk in (8) is calculated likewise; AVW_ik is based on (9) and (10).

The action value weight (AVW_ik) is a measure to calculate the weight of the action value, that is, the scalar product of the action value frequency and the inverse document frequency, as in (9). The action value frequency (F) measures the importance of this value in a document. The inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

\[
\mathrm{AVW}_{ik} = \frac{F_{ik}}{\sum_{k} F_{ik}} \cdot \mathrm{IDF}_{i}, \tag{9}
\]

where F_ik represents the number of cocontributions of this action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k (normalized to prevent a bias towards longer documents). The IDF is obtained by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\[
\mathrm{IDF}_{i} = \log\left( \frac{|D|}{\bigl|\{ d_k : v_i \in d_k \}\bigr|} \right), \tag{10}
\]

where |D| is the total number of documents in the corpus and |{d_k : v_i ∈ d_k}| is the number of documents containing action value v_i.
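The weighting of equations (9) and (10) can be sketched as follows. Normalizing the frequency within the document follows the textual description above, and the corpus layout (`docs` as lists of cocontribution values) is an assumption of this sketch.

```python
import math

def avw(action_value, k, docs):
    """Equations (9)-(10): weight of an action value in document k.
    docs[k] is the list of cocontribution values in document k; the
    action value is assumed to occur somewhere in the corpus."""
    doc = docs[k]
    f = doc.count(action_value) / len(doc)         # normalized frequency
    n_containing = sum(1 for d in docs if action_value in d)
    idf = math.log(len(docs) / n_containing)       # equation (10)
    return f * idf                                 # equation (9)
```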

Thus, in the IOAC-QL algorithm (see Algorithm 4) implementation shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel_Type object and a Rel_vertex object. The Rel_Type object value equals the action type object value, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix with the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm's actions is illustrated in Figure 4.

For state FVO_1, one has the following:

action similarity: ActionResult State = {FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125}; Prob: 8; reward: 184; PrevState: FVO_1; Q(s, a) = 91.

action contiguity: ActionResult State = {FVO_103, FVO_17, FVO_44}; Prob: 3; reward: 74; PrevState: FVO_1; Q(s, a) = 61.

action contrast: ActionResult State = {FVO_89}; Prob: 1; reward: 22; PrevState: FVO_1; Q(s, a) = 329.

action causality: ActionResult State = {FVO_63, FVO_249}; Prob: 2; reward: 25; PrevState: FVO_1; Q(s, a) = 429.

As revealed in the case of FVO_1, the SALA receives the environmental information data FVO_1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated above.

Figure 4: The IOAC-QL algorithm implementation. (The original figure shows the object model construction from DSOs into a descriptive document object; the feature linkage graph (FLG) with its Vertex[V] array of FVOs {SFKey, Fval, Fsyn[], FAss[], FW} and its Adjacent[V, V] matrix of FLW values; and the SALA inference engine, in which a Q-learning mechanism assigns Q-values Q(s_i, a_j) over the FLG environment and performs action selection with the aid of WordNet.)

The action with the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Due to the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents through finding all distinct common similarity or relatedness subgraphs is required. The objective of finding common similar or related subgraphs as a relatedness measure is for it to be utilized in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric. A metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), such that the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations while holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i and y_i, as defined in (11). The subgraph matching process is measured using the following formula:

$$\text{Sub-graph-Rel}(SG_1, SG_2) = \frac{\sum_{i \in SG_1} \sum_{j \in SG_2} \operatorname{SemRel}(N_i, N_j)}{\sum_{i \in SG_1} N_i + \sum_{j \in SG_2} N_j} \quad (11)$$

where SG_1 is the selected subgraph from the FLG of one document, SG_2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for both nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
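A small sketch of how (11) could be computed is shown below. The pairwise SemRel function is passed in as a parameter, and the denominator is read here as the total node count of the two subgraphs; both readings are assumptions about the notation rather than the authors' implementation.

```python
def subgraph_rel(sg1, sg2, sem_rel, threshold=0.5):
    """Relatedness between two subgraphs, after equation (11).

    sg1, sg2  -- lists of node labels taken from the FLGs of two documents
    sem_rel   -- callable returning SemRel(n_i, n_j) in [0, 1]
    threshold -- the 50% node-matching threshold from the text
    """
    total = 0.0
    for n_i in sg1:
        for n_j in sg2:
            score = sem_rel(n_i, n_j)
            if score >= threshold:   # pairs under the threshold do not match
                total += score
    # Denominator interpreted as the combined node count (assumption).
    return total / (len(sg1) + len(sg2))
```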

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. To explore the effectiveness of the proposed SHGRS scheme, the correlation of the proposed semantic relatedness measure must be examined against previous relatedness measures. The impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations (IOAC-QL), implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is computed.

Table 1: Action values table with the SALA decision.

State   Action       ActionResult State                                        Total reward   Q(s, a)   Decision
FVO1    Similarity   FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125     184            94        Yes
FVO1    Contiguity   FVO103, FVO17, FVO44                                      74             61        No
FVO1    Contrast     FVO89                                                     22             329       No
FVO1    Causality    FVO63, FVO249                                             25             429       No

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task. Some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering methods. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over the real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                           Human judgment
                           True    False
System judgment   True     TP      FP
                  False    FN      TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

$$r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} Y_i^2\right)}} \quad (12)$$

Acceptance Rate (AR). Acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

$$\text{AR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad (13)$$

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

$$\text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}} \quad (14)$$

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
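For reference, (13) and (14) translate directly into code; the function names are ours:

```python
def acceptance_rate(tp, fn):
    # AR = TP / (TP + FN), equation (13)
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Acc = (TP + TN) / (TP + FN + FP + TN), equation (14)
    return (tp + tn) / (tp + fn + fp + tn)
```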

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, a precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined by computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 (Experiment 1, content-based evaluation), Miller and Charles (MC)¹: RG consists of 65 pairs of nouns extracted from the WordNet, rated by multiple human annotators.

DS2 (Experiment 1, content-based evaluation), Microsoft Research Paraphrase Corpus (MRPC)²: The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as relatedness by human annotators. The whole set is divided into a training subset (4076 sentences, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2, coselection-based evaluation, closest-synonym detection), British National Corpus (BNC)³: BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2, coselection-based evaluation, closest-synonym detection), SN (semantic neighbors)⁴: SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2, coselection-based evaluation, semantic relationships exploration), BLESS⁶: BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2, coselection-based evaluation, semantic relationships exploration), TREC⁵: TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2, coselection-based evaluation, semantic relationships exploration), IJCNLP 2011-New York Times (NYT)⁶: NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2, coselection-based evaluation, semantic relationships exploration), IJCNLP 2011-Wikipedia⁸: The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3, task-based evaluation), Reuters-21578⁷: Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3, task-based evaluation), 20 Newsgroups⁸: The 20 newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

¹Available at http://www.cs.cmu.edu/~mfaruqui/suite.html. ²Available at http://research.microsoft.com/en-us/downloads. ³Available at http://corpus.byu.edu/bnc. ⁴Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv. ⁵Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp. ⁶Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip. ⁷Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection. ⁸Available at http://www.csmining.org/index.php/id-20-newsgroups.html.

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but not detected closest synonyms or semantic relationships explored in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and worst value at 0. Precision, recall, and F-measure are defined as follows:

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad F\text{-measure} = \frac{2(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \quad (15)$$
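Likewise, (15) in code form (names are ours):

```python
def precision_recall_f(tp, fp, fn):
    # Equations (15): precision, recall, and their harmonic mean
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f_measure
```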


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as follows [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure, which combines the precision and recall measures, is used. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

$$P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r} \quad (16)$$

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

$$F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)} \quad (17)$$

A per-class F score is calculated as follows:

$$F(L_r) = \max_{S_i \in C} F(L_r, S_i) \quad (18)$$

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

$$F_C = \frac{\sum_{L_r} \left(|N_r| \cdot F(L_r)\right)}{\sum_{L_r} |N_r|} \quad (19)$$

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
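Equations (16)-(19) can be computed in one pass over a class-by-cluster contingency matrix; the following sketch (our own vectorized reading, not the authors' code) returns the macroaveraged F_C:

```python
import numpy as np

def clustering_f_measure(counts):
    """Macroaveraged F_C per (16)-(19).

    counts[r][i] = N_ri, the number of documents of class L_r in cluster S_i.
    """
    counts = np.asarray(counts, dtype=float)
    n_r = counts.sum(axis=1)                  # documents per class, N_r
    n_i = counts.sum(axis=0)                  # documents per cluster, N_i
    precision = counts / n_i                  # P(L_r, S_i) = N_ri / N_i, (16)
    recall = counts / n_r[:, None]            # R(L_r, S_i) = N_ri / N_r, (16)
    denom = precision + recall
    f = np.divide(2 * precision * recall, denom,
                  out=np.zeros_like(denom), where=denom > 0)   # (17)
    f_class = f.max(axis=1)                   # F(L_r) = max over clusters, (18)
    return (n_r * f_class).sum() / n_r.sum()  # weighted average F_C, (19)

# Example: two classes spread over three clusters.
print(clustering_f_measure([[40, 5, 5], [2, 30, 18]]))
```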

4.2. Evaluation Setup (Dataset). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over the real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 newsgroups datasets.

4.3. Evaluation Results. This section reports the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence-based in experiment 1, document-based in experiment 2, and corpus-based in experiment 3. In experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 are used to investigate the effectiveness of closest-synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process in a clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of MFsSR for closest-synonym detection and of IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as more effort is made to improve the accuracy of measuring the relatedness between documents using semantic information in SHGRS.


Table 4: The results of the comparison of the correlation coefficient between human judgment and some relatedness measures, including the proposed semantic relatedness measure.

Measure                                  Relevance correlation with M&C
Distance-based measures
  Rada                                   0.688
  Wu and Palmer                          0.765
Information-based measures
  Resnik                                 0.77
  Jiang and Conrath                      0.848
  Lin                                    0.853
Hybrid measures
  T. Hong and D. Smith                   0.879
  Zili Zhou                              0.882
Information/feature-based measures
  The proposed MFsSR                     0.937

The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of the semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, the correlation is computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between the eight componential approaches and the mean of the Miller and Charles ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering direct/indirect relevance is specifically important, giving at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below are based on the best Acc and AR across all thresholds instead of under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for closest-synonym detection and the IOAC-QL for semantic relationship exploration; hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the threshold on the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision.


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

Training subset (4076), human judgment: 2753 true
Relatedness threshold   Islam and Inkpen [7] Acc / AR   Proposed MFsSR Acc / AR
0.1                     0.67 / 1                        0.68 / 1
0.2                     0.67 / 1                        0.68 / 1
0.3                     0.67 / 1                        0.68 / 1
0.4                     0.67 / 1                        0.68 / 1
0.5                     0.69 / 0.98                     0.68 / 1
0.6                     0.72 / 0.89                     0.68 / 1
0.7                     0.68 / 0.78                     0.70 / 0.98
0.8                     0.56 / 0.4                      0.72 / 0.86
0.9                     0.37 / 0.09                     0.60 / 0.49
1                       0.33 / 0                        0.34 / 0.02

Test subset (1725), human judgment: 1147 true
Relatedness threshold   Islam and Inkpen [7] Acc / AR   Proposed MFsSR Acc / AR
0.1                     0.66 / 1                        0.67 / 1
0.2                     0.66 / 1                        0.67 / 1
0.3                     0.66 / 1                        0.67 / 1
0.4                     0.66 / 1                        0.67 / 1
0.5                     0.68 / 0.98                     0.67 / 1
0.6                     0.72 / 0.89                     0.66 / 1
0.7                     0.68 / 0.78                     0.66 / 0.98
0.8                     0.56 / 0.4                      0.64 / 0.85
0.9                     0.38 / 0.09                     0.49 / 0.5
1                       0.33 / 0                        0.34 / 0.03

Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Achieving better precision values is clear, with a high percentage across different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, the Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to implicit relationship extraction based on the CRFs and TFICF approaches; the comparison was conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. Thus, IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm over the CRFs approach reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation)


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

Dataset   PMI (Precision / Recall / F-measure)   ICA (Precision / Recall / F-measure)   C-SynD (Precision / Recall / F-measure)
SN        0.606 / 0.606 / 0.61                   0.795 / 0.716 / 0.753                  0.85 / 0.80 / 0.82
BNC       0.745 / 0.679 / 0.71                   0.76 / 0.678 / 0.72                    0.82 / 0.719 / 0.77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches.

Dataset     CRFs (Precision / Recall / F-measure)   TFICF (Precision / Recall / F-measure)   IOAC-QL (Precision / Recall / F-measure)
TREC        0.7508 / 0.602 / 0.6628                 0.893 / 0.714 / 0.787                     0.898 / 0.881 / 0.886
BLESS       0.7304 / 0.6266 / 0.6703                0.738 / 0.695 / 0.716                     0.950 / 0.835 / 0.889
NYT         0.6846 / 0.5402 / 0.6038                0.860 / 0.650 / 0.740                     0.900 / 0.740 / 0.812
Wikipedia   0.560 / 0.420 / 0.480                   0.646 / 0.5488 / 0.5934                   0.700 / 0.440 / 0.540

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms.

Dataset         Algorithm   Term-based FC   Concept-based FC   Graph-based FC
Reuters-21578   HAC         0.73            0.857              0.902
Reuters-21578   K-means     0.51            0.846              0.918
20 Newsgroups   HAC         0.53            0.82               0.89
20 Newsgroups   K-means     0.48            0.793              0.87

This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis, and the experimental results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct common similarity subgraphs. The semantic representations of textual documents, determining the relatedness between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base case (term-based) approach reach up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to 36% by HAC and 39% by K-means for the 20 newsgroups dataset. Furthermore, improvements over the concept-based approach were also achieved by the graph-based approach, of up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to 7% by HAC and 8% by K-means for the 20 newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, and extensive experiments were conducted on 10 different datasets; the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.
[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.
[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.
[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.
[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.
[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.
[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.
[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.
[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.
[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.
[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.
[12] WordNet, http://wordnet.princeton.edu/.
[13] OpenNLP, http://opennlp.apache.org/.
[14] AlchemyAPI, http://www.alchemyapi.com/.
[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.
[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.
[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.
[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.
[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.
[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.
[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.
[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.
[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.
[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.
[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.
[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.
[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.
[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.
[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.
[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.
[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web - ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.
[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.
[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.
[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.
[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.
[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.
[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.
[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.
[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.
[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.



[Figure 1: The proposed semantic text organization/representation scheme for information access and knowledge acquisition processes. The SHGRS mediates between a textual documents repository and text mining/information retrieval operations, including extraction, clustering, summarization, filtration, search, categorization, and visualization.]

the working and the efficacy of the proposed representation scheme. The semantic representation scheme is a general description, or a conceptual system, for understanding how information in the text is represented and used. The proposed representation scheme, as shown in Figure 1, is an essential step for the actual information access and knowledge acquisition. As revealed by the figure, an effective textual static structure with dynamic behavior is provided to evolve processes for information access and knowledge acquisition. The main contributions of this paper are summarized as follows:

(1) proposing an efficient representation scheme, called SHGRS, that relies on an understanding-based approach;

(2) exploiting the knowledge-rich notion to introduce an efficient semantic relatedness measure at a document level;

(3) suggesting a novel linkage process for documents through identifying their shared subgraphs at a corpus level;

(4) conducting extensive experiments on real-world datasets to study the effectiveness of the proposed representation scheme, along with the proposed relatedness measure and linkage process, which allow more effective document mining and managing processes.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the related work and basic concepts, while Section 3 illustrates the details of the proposed framework for constructing an efficient text representation scheme based on semantic annotation. Section 4 reports the experimental results and discussions. Conclusions and suggestions for future work are given in Section 5.

2. Basic Concepts and Related Works

In an effort to keep up with the tremendous growth of online textual documents, many research projects target the organization of such information in a way that will make it easier for end users to find the information they want efficiently and accurately [1]. Related works can roughly be classified into three major categories: the text representation model, semantic relatedness measures, and the documents linkage process.

Firstly, a growing amount of recent research in this field focuses on how the use of a semantic representation model is beneficial for text categorization [3], text summarization [8], word sense disambiguation [9], methods for automatic evaluation of machine translation [10], and document classification [11]. Although the text representation model is a well-researched problem in computer science, the current representation models could be improved with the use of external knowledge that is impossible to extract from the source document itself. One of the widely used sources of external knowledge for the text representation model is WordNet [12], a network of related words organized into synonym sets, where these sets are based on the underlying lexical concept. Furthermore, machine learning is not yet advanced enough to allow large-scale extraction of information from textual documents without human input. Numerous semantic annotation tools [13, 14] have been developed to aid the process of human text markup to guide machines. The main task in the text representation model includes morphological, syntactic, and semantic analysis that uses the results of the low-level analysis as well as dictionary and reference information to construct a formalized representation of NL text. The semantic level implies not only linguistic but also logical relations between the language objects to be represented [15]. At the semantic level of understanding of text documents [5], models such as the WordNet-based semantic model, the conceptual dependence model, the semantic graph model, the ontology-based knowledge model, and Universal Networking Language can be used. In [16], the WordNet-based semantic model captures the semantic structure of each term within a sentence and document rather than only the frequency of the term within a document. This model analyzes the terms and their corresponding synonyms and hypernyms on the sentence and document levels, ignoring dependencies among terms at the sentence level. In [17], the conceptual dependence model identifies, characterizes, and understands the effect of the existing dependencies among the entities in the model, considering only nouns as concepts. The semantic graph model proposes a semantic representation based on a graph-based structure with a description of text structure [5]. In the semantic graph model, the traditional method to generate the edges between two nodes in the graph is usually based on the cooccurrence and similarity measures of the two nodes, and the most frequent word is not necessarily chosen to be important. In [18], ontology knowledge, represented by entities connected by naming relationships and a defined taxonomy of classes, may become a much more powerful tool for information exploration. Traditional text analysis and organization methods can also be enriched with ontology-based information concerning cooccurring entities or whole neighborhoods of entities [19]. However, automatic ontology construction is a difficult task because of the failure to support order among the objects and the attributes. In [20], Universal Networking Language (UNL) is


used to represent every document as a graph, with concepts as nodes and relations between them as links, ignoring the sequence and orders of concepts in a sentence at the document level.

Secondly, the semantic relatedness measure has become an area of research and one of the hotspots in information technology. Semantic relatedness and semantic similarity are sometimes confused in the research literature, but they are not identical [21, 22]. Semantic relatedness measures usually include various types of relationships, such as hypernymy, hyponymy, subsumption, synonymy, antonymy, holonymy, and meronymy. Semantic similarity is a special case of semantic relatedness that only considers synonymous relationships and subsumption relationships. Several measures for semantic relatedness have recently been proposed. According to the parameters used, they can be classified into three major categories: distance-based methods (Rada and Wu and Palmer [21]), which select the shortest path among all possible paths between concepts as the most similar; information-based methods (Resnik, Jiang and Conrath, and Lin [21]), which consider the use of external corpora, avoiding the unreliability of path distances and taxonomy; and hybrid methods (T. Hong and D. Smith and Zili Zhou [21]), which combine the first two. In this paper, the information-based methods are the focus. In [23], the semantic relatedness of any two concepts is based on a similarity theorem, in which the similarity of two concepts is measured by the ratio of the amount of information needed to state the commonality of the two concepts to the amount of information needed to describe them. The information content (IC) of their lowest common subsumer (LCS) captures the commonality of the two concepts, together with the IC of the two concepts themselves. The lowest common subsumer (LCS) is the most specific concept that is a shared ancestor of the two concepts. The pointwise mutual information (PMI) [24, 25] is a simple method for computing the corpus-based similarity of words. The pointwise mutual information is defined as the ratio of the probability that the two words cooccur, p(w1 AND w2), to the product of the probabilities of each occurring, p(w1) · p(w2). If w1 and w2 are statistically independent, then the probability that they cooccur is given by the product p(w1) · p(w2). If they are not independent and have a tendency to cooccur, then p(w1 AND w2) will be greater than p(w1) · p(w2). As is clear, there is extensive literature on measuring the semantic relatedness between long texts or documents [7], but there is less work related to the measurement of similarity between sentences or short texts [26, 27]. Such methods are usually effective when dealing with long documents, because similar documents will usually contain a degree of cooccurring words. However, in short documents, the word cooccurrence may be rare or even null. This is mainly due to the inherent flexibility of natural language, enabling people to express similar meanings using quite different sentences in terms of structure and word content.
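For concreteness, the two information-based quantities discussed above can be written compactly. These are the standard formulations behind [23-25], restated here in their usual textbook notation rather than the paper's own:

$$\operatorname{sim}_{\text{Lin}}(c_1, c_2) = \frac{2 \cdot \operatorname{IC}(\operatorname{LCS}(c_1, c_2))}{\operatorname{IC}(c_1) + \operatorname{IC}(c_2)}, \qquad \operatorname{PMI}(w_1, w_2) = \log \frac{p(w_1 \text{ AND } w_2)}{p(w_1)\, p(w_2)}$$

with $\operatorname{IC}(c) = -\log p(c)$, so that a PMI of zero corresponds to statistical independence and a positive PMI to a tendency to cooccur.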

Finally, in order to overcome the limitations of previous models, the documents linkage process is conducted relying on a graph-based document representation model [4, 10, 11]. The main benefit of graph-based techniques is that they allow us to keep the inherent structural information of the original document. Mao and Chu [10] used a graph-based K-means algorithm for clustering web documents. Yoo and Hu [28] represent a set of documents as bipartite graphs using domain knowledge in an ontology; documents are then clustered based on an average-link clustering algorithm. However, these approaches relying on the graph-based model are not intended to be used on an incrementally growing set of documents.

To utilize the structural and semantic information in the document, this paper introduces a formal semantic representation of the linguistic input to build an SHGRS scheme for the documents. This representation scheme is constructed through the accumulation of syntactic and semantic analysis outputs. A new semantic relatedness measure is developed to determine the relatedness among concepts of the document, as well as the relatedness between the contents of documents, for long/short texts. An efficient linkage process is required to connect related documents by identifying their shared subgraphs. The proposed representation scheme, along with the proposed relatedness measure and linkage process, is believed to enable more effective document mining and managing processes.

3. The Proposed Semantic Representation Scheme and Linkage Processes

In this section, a new framework is introduced for constructing the SHGRS and conducting the document linkage processes using multidimensional analysis and primary decision support. In fact, the elements in a sentence are not equally important, and the most frequent word is not necessarily the most important one. As a result, the extraction of the text main features (MFs) as concepts, as well as their attributes and relationships, is important. Combinations of semantic annotation [13, 14] and reinforcement learning (RL) [29] techniques are used to extract the MFs of the text and to infer unknown dependencies and relationships among these MFs. The RL fulfills sequential decision-making tasks with a long-run accumulated reward to achieve the largest amount of interdependence among MFs. This framework focuses on three key criteria: (1) how the framework refines text to select the MFs and their attributes; (2) what learning algorithm is used to explore the unknown dependencies and relationships among these MFs; and (3) how the framework conducts the document linkage process.

Details of this framework's three-stage process are given in Figure 2. The first stage aims to refine textual documents to select the MFs and their attributes with the aid of OpenNLP [13] and AlchemyAPI [14]. A hierarchy-based structure is used to represent each sentence with its MFs and their attributes, which achieves dimension reduction with more understanding. The second stage aims to compute the proposed MFs semantic relatedness (MFsSR), which contributes to the detection of the closest synonyms of the MFs and to inferring the relationships and dependencies of the MFs. A graph-based structure is used to represent the relationships and dependencies among these MFs, which achieves more correlation through many-to-many relationships.


[Figure 2: A framework for constructing a semantic hierarchy/graph text representation scheme. Three stages are shown: the semantic text refining and annotating stage (word-tagging, syntactic, and semantic analyses with OpenNLP/AlchemyAPI semantic annotation, MFs extraction, and MFs hierarchy construction); the dependencies/relationships exploring and inferring stage (the MFsSR algorithm with WordNet and the SALA, detecting closest synonyms, inferring dependency and semantic relationships, and building the relationships graph structure to construct the semantic hierarchy/graph document object); and the documents linkage conducting stage (FLG matching and computation of the subgraph-relatedness measure).]

The third stage aims to conduct the document linkage process using a subgraph matching process that relies on finding all of the distinct common similarity subgraphs. The main proceedings of this framework can be summarized as follows:

(1) extracting the MFs of sentences and their attributes to build the hierarchy-based structure for efficient dimension reduction;

(2) estimating a novel semantic relatedness measure with consideration of direct relevance and indirect relevance between MFs and their attributes for promising performance improvements;

(3) detecting the closest synonyms of MFs for better disambiguation;

(4) exploring and inferring the dependencies and relationships of MFs for more understanding;

(5) representing the interdependence among the MFs of sentences in a graph-based structure so as to be more precise and informative;

(6) conducting a document linkage process for more effectiveness.

3.1. Semantic Text Refining and Annotating Stage. This stage is responsible for refining the text to discover the text MFs, with additional annotation for more semantic understanding, and aims to:

(1) accurately parse each sentence and identify POS, subject-action-object, and named-entity recognition;

(2) discover the MFs of each sentence in the textual document;

(3) exploit semantic information in each sentence through detecting attributes of the MFs;

(4) reduce the dimensions as much as possible;

(5) generate an effective descriptive sentence object (DSO) with a hierarchical sentence object automatically.

This stage is achieved through the following processes. First, the NL text is studied at different linguistic levels, that is, words, sentences, and meaning, for semantic analysis and annotation [30, 31]. Second, exploiting the information "who is doing what to whom" clarifies dependencies between verbs and their arguments for the extraction of MFs. Finally, an MFs hierarchy-based structure is built that lays out the sentences, their MFs, and their attributes.

With regard to extracting MFs and building the MFs hierarchy-based structure, the following points must be highlighted. First, there is a dire need to refine the text content by representing each sentence with its MFs instead of a series of terms, to reduce the dimensions as much as possible. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. Furthermore, AlchemyAPI extracts semantic metadata from content, such as information on subject-action-object relation extraction, people, places, companies, topics, facts, relationships, authors, and languages. Based on the NL concept by OpenNLP and AlchemyAPI, the sentence MFs are identified as the subject "Sub", the main verb "MV", and the object "Obj" (direct or indirect object). In addition, other terms remaining in the sentence, such as the complement "Com" (subject complement or object complement) and modifiers (Mod), are considered attributes of these MFs.

Second, automatic annotation of each of the MFs with additional information is essential for more semantic understanding. Based on the semantic annotation by OpenNLP and AlchemyAPI, the sentence MFs are annotated with a feature value (FVal), part-of-speech (POS), named-entity recognition (Ner), and a list of feature attributes


Figure 3: The hierarchy-based structure of the textual document (root → DSO nodes annotated with FVal, POS, and Ner; each DSO holds MF1 … MFn; each MF holds FATT1 … FATTm, annotated with Val, Type, POS, and Ner).

(FAtts). The list of FAtts is constructed from the remaining terms in the sentence (complements or modifiers), relying on the grammatical relation. Each attribute is annotated with an attribute value (Val), an attribute type (Type), which is complement or modifier, an attribute POS (POS), and an attribute named-entity recognition (Ner).

Finally, the hierarchy-based structure of the textual document is represented as an object [32] containing the hierarchical structure of sentences with the MFs of the sentences and their attributes. This hierarchical structure maintains the dependency between the terms at the sentence level for more understanding, which provides sentences with fast access and retrieval. Figure 3 shows the hierarchical structure of the text in a theme tree "tree_DSO": the first level of the hierarchical model denotes all sentence objects, which are called DSOs; the next level of nodes denotes the MFs of each sentence object; and the last level of nodes denotes the attributes of the MFs.

tree_DSO.DSO[i] indicates the annotation of sentence object i, which is fitted with additional information as in Heuristic 1. This heuristic simplifies the hierarchical model of the textual document with its DSO objects and its MFs. The feature score is used to measure the contribution of an MF in the sentence. This score is based on the number of the terms grammatically related to the MF (the summation of its feature attributes), as follows:

\[ \text{Feature Score (FS)} = \sum_{i} \mathrm{MF.FAtt}_i \quad (1) \]

Heuristic 1. Let tree_DSO = ⟨sibling DSO nodes⟩ and let each DSO ∈ tree_DSO = {SEN_key, SEN_value, SEN_State, sibling_MF_nodes[]}, where SEN_key indicates the order of the sentence in the document as an index; SEN_value indicates the sentence's parsed value, as in (S(NP(NN employee))(VP(VBZ guides)(NP(DT the)(NN customer)))(.)); SEN_State indicates the sentence tense and state (tense: past; state: negative); and sibling_MF_nodes[] indicates the list of sibling MF nodes for each sentence, fitted with Val, POS, Ner, and a list of FAtts.
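To make the hierarchy concrete, the following minimal Python sketch mirrors the tree_DSO/DSO/MF/FAtt structure of Heuristic 1 and the feature score of equation (1); the class layout and type choices are our own illustration, not the paper's implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FAtt:                      # feature attribute leaf node
    val: str                     # attribute value (Val)
    att_type: str                # "complement" or "modifier" (Type)
    pos: str                     # part-of-speech tag (POS)
    ner: str = ""                # named-entity label (Ner), if any

@dataclass
class MFNode:                    # main feature: Sub, MV, or Obj
    fval: str                    # feature value (FVal)
    pos: str
    ner: str = ""
    fatts: List[FAtt] = field(default_factory=list)

    def feature_score(self) -> int:
        # Equation (1): number of terms grammatically related to the MF
        return len(self.fatts)

@dataclass
class DSO:                       # descriptive sentence object
    sen_key: int                 # order of the sentence in the document
    sen_value: str               # parsed value, e.g. "(S(NP(NN employee))...)"
    sen_state: str               # e.g. "tense: past; state: negative"
    mfs: List[MFNode] = field(default_factory=list)

tree_DSO: List[DSO] = []         # sibling DSO nodes form the document hierarchy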

In the text refining and annotating (TRN) algorithm (see Algorithm 1), the textual document is converted to an object model with the hierarchical DSO. The TRN algorithm uses the OpenNLP and AlchemyAPI tools for MFs extraction and hierarchy model construction. As a result, the constructed hierarchy structure reduces the dimensions as much as possible and achieves efficient information access and knowledge acquisition.

3.2. Dependencies/Relationships Exploring and Inferring Stage. This stage is responsible for exploring how the dependencies and relationships among MFs take effect and aims to:

(1) represent the interdependence among sentence MFs in a graph-based structure known as the feature linkage graph (FLG);

(2) formulate an accurate measure for the semantic relatedness of MFs;

(3) detect the closest synonyms for each of the MFs;

(4) infer the relationships and dependencies of MFs with each other.

This stage is achieved in three processes. First, an FLG is built to represent the dependencies and relationships among sentences. Second, an efficient MFsSR measure is proposed to contribute to the detection of the closest synonyms and to infer the relationships and dependencies of the MFs. This measure considers the direct relevance and the indirect relevance among MFs. The direct relevance covers the synonyms and association capabilities (similarity, contiguity, contrast, and causality) among the MFs; these association capabilities indicate the relationships that give the largest amount of interdependence among the MFs, while the indirect relevance refers to other relationships among the attributes of these MFs. Finally, the unknown dependencies and relationships among MFs are explored by exploiting semantic information about their texts at the document level.

In addition, a semantic actionable learning agent (SALA) plays a fundamental role in detecting the closest synonyms and inferring the relationships and dependencies of the MFs. Learning is needed to improve the SALA functionality to achieve multiple goals. Many studies show that an RL agent has a high capability for reproducing human-like behaviors [29]. As a result, the SALA performs an adaptive approach combining thesaurus-based and distributional mechanisms. In the thesaurus-based mechanism, words are compared in terms of how they appear in the thesaurus (e.g., WordNet), while in the distributional mechanism, words are compared in terms of the number of shared contexts in which they may appear.

3.2.1. Graph-Based Organization Constructing Process. The relationships and dependencies among the MFs are organized into a graph-like structure of nodes with links, known as the FLG and defined in Heuristic 2. The graph structure FLG represents many-to-many relationships among MFs. The FLG is a directed acyclic graph FLG(V, A), represented as a collection of vertices and a collection of directed arcs; these arcs connect pairs of vertices with no path returning to the same vertex (acyclic). In FLG(V, A), V is the set of vertices or states of the graph and A is the set of arcs between vertices. The FLG construction is based on two sets of data. The first set represents the vertices in a one-dimensional array Vertex(V) of Feature Vertex


Input: list of sentences of the document d_i
Output: List_DSO /* list of DSO objects and their MFs with attributes */
Procedure:
  D ← new document
  List_DSO ← empty list /* list of sentence objects of the document */
  for each sentence s_i in document D do
    AlchemyAPI(s_i) /* calls AlchemyAPI to determine all MFs and their NER */
    for each feature mf_j ∈ {mf_1, mf_2, ..., mf_n} in the DSO do
      get_MF_Attributes(mf_j) /* determines the attributes of each MF */
      OpenNLP(s_i) /* calls OpenNLP to determine the POS and NER for the MF and its attributes */
      compute_MF_score(mf_j) /* determines the score of each MF */
    end
    List_DSO.add(DSO)
  end

Algorithm 1: TRN algorithm.

objects (FVOs). The FVO is annotated with a Sentence Feature key (SFKey), a feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and a feature weight (FW), where SFKey combines the sentence key and the feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object has the value of the feature, the Fsyn object has a list of the detected closest synonyms of the Fval, and the FAss object has a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Type between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional array Adjacency(V, V) of Features Link Weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let

\[ \mathrm{Adjacency}(V, V) = \begin{bmatrix} \mathrm{FLW}_{(0,0)} & \mathrm{FLW}_{(0,1)} & \cdots \\ \mathrm{FLW}_{(1,0)} & \mathrm{FLW}_{(1,1)} & \cdots \\ \vdots & & \mathrm{FLW}_{(n,n)} \end{bmatrix} \]

where FVO indicates an object of the MF in the vertex vector, annotated with additional information as FVO = {SFKey, Fval, Fsyn, FAss, FW}; SFKey indicates the combination of the sentence key and the feature key in a DSO; Fval indicates the value of the feature; Fsyn indicates a list of the closest synonyms of the Fval; FAss indicates a list of the feature associations and relationships for the Fval, with each object in this list defined as FAssItem = {Rel_Type, Rel_vertex}, where Rel_Type indicates the value of the relation type linking the two features and Rel_vertex indicates the index of the related vertex; FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document; and FLW indicates the value of the linkage weight in the adjacency matrix.
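As a rough illustration of Heuristic 2, the sketch below holds the FVO vertex array and the FLW adjacency matrix in one object; the container choices and the link() helper (which mirrors what the SALA links algorithm records) are assumptions, not the paper's code.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FAssItem:
    rel_type: int        # 1=similarity, 2=contiguity, 3=contrast, 4=causality, 5=synonym
    rel_vertex: int      # index of the related vertex

@dataclass
class FVO:
    sf_key: Tuple[int, int]                            # (sentence key, feature key) in the DSO
    fval: str                                          # value of the feature
    fsyn: List[str] = field(default_factory=list)      # closest synonyms
    fass: List[FAssItem] = field(default_factory=list) # explored relationships
    fw: float = 0.0                                    # accumulated relationship weights

class FLG:
    def __init__(self, vertices: List[FVO]):
        self.vertex = vertices
        n = len(vertices)
        # Adjacency(V, V): FLW link weights; 0.0 means no explored link
        self.adjacency = [[0.0] * n for _ in range(n)]

    def link(self, i: int, j: int, flw: float, rel_type: int) -> None:
        # record a bidirectional link, as the SALA links algorithm does
        self.adjacency[i][j] = self.adjacency[j][i] = flw
        self.vertex[i].fass.append(FAssItem(rel_type, j))
        self.vertex[j].fass.append(FAssItem(rel_type, i))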

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summarization, text annotation, text mining, word sense disambiguation, automatic indexing, and machine translation [2, 3]. In this paper, one of the information-based methods is utilized by considering information content (IC). The semantic relatedness investigated here concerns the direct relevance and the indirect relevance among MFs. An MF may be one word or more, and thereby the MF is considered a concept. The IC is a measure quantifying the amount of information a concept expresses. Lin [23] proposed an IC in which, for a concept c, p(c) is the probability of encountering an instance of concept c, and the IC value is obtained as the negative log likelihood of encountering the concept in a given corpus. Traditionally, the semantic relatedness of concepts is based on cooccurrence information over a large corpus. However, cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, the development of a new information content method based on cocontributions, instead of the cooccurrences, is important for the proposed MFsSR.

The new information content (NIC) measure is an extension of the information content measure. The NIC measure is based on the relations defined in the WordNet ontology, using hypernym/hyponym, synonym/antonym, holonym/meronym, and entail/cause relations to quantify the informativeness of concepts. For example, a hyponym relation could be "a bicycle is a vehicle", and a meronym relation could be "a bicycle has wheels". NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronym, entail, and cause relationships, normalized by the maximum number of MFs/attributes objects in the textual document, using

\[ \mathrm{NIC}(c) = 1 - \frac{\log\big(\mathrm{Hype\_Hypo}(c) + \mathrm{Syn\_Anto}(c) + \mathrm{Holo\_Mero}(c) + \mathrm{Enta\_Cause}(c) + 1\big)}{\log(\mathrm{max\_concept})} \quad (2) \]


where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronym values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MFs/attributes objects in the considered text. The max_concept normalizes the NIC value, and hence the NIC values fall in the range [0, 1].

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\[ \mathrm{SemRel} = \lambda \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} \right) + (1 - \lambda) \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(\mathrm{att}\_c_1, \mathrm{att}\_c_2))}{\mathrm{NIC}(\mathrm{att}\_c_1) + \mathrm{NIC}(\mathrm{att}\_c_2)} \right) \quad (3) \]

where λ ∈ [0, 1] decides the relative contribution of direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, λ ∈ [0.5, 1].
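A minimal sketch of equations (2) and (3) follows; the counts lookup, the lcs() least-common-subsumer helper, and the default λ = 0.7 are illustrative placeholders rather than the paper's code.

import math

def nic(c, counts, max_concept: int) -> float:
    # Equation (2): counts maps a concept to its relation counts in the FLG,
    # as a tuple (hype_hypo, syn_anto, holo_mero, enta_cause)
    hype_hypo, syn_anto, holo_mero, enta_cause = counts[c]
    total = hype_hypo + syn_anto + holo_mero + enta_cause + 1
    return 1.0 - math.log(total) / math.log(max_concept)

def sem_rel(c1, c2, att_c1, att_c2, counts, max_concept, lcs, lam: float = 0.7) -> float:
    # Equation (3): lam in [0.5, 1] weights direct over indirect relevance;
    # lcs(a, b) is assumed to return the least common subsumer of two concepts
    direct = 2 * nic(lcs(c1, c2), counts, max_concept) / (
        nic(c1, counts, max_concept) + nic(c2, counts, max_concept))
    indirect = 2 * nic(lcs(att_c1, att_c2), counts, max_concept) / (
        nic(att_c1, counts, max_concept) + nic(att_c2, counts, max_concept))
    return lam * direct + (1 - lam) * indirect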

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonyms detection, the SALA carries out the first action to extract the closest synonyms for each MF, where the SALA tackles the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonym detection, the SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, the SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called the Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernyms, hyponyms, meronyms, holonyms, antonyms, entail, and cause. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned to a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

\[ \mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}\big(\mathrm{SRL}_k(i).\mathrm{item}, \mathrm{FVO}_j\big) \quad (4) \]

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in the SRL with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms; the SALA applies the threshold to select the closest synonyms with the highest scores, as follows:

\[ \text{Closest-Synonyms} = \max_{\mathrm{th}}\big(\mathrm{F\_SRList.SRList.score()}\big) \quad (5) \]

Then, each SALA retains the closest synonyms in the Fsyn object of each FVO, and the SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA_LINKS algorithm, the SALA receives each FVO and detects links to the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, the SALA aims to explore the implicit dependencies and relationships in the context, as a human would, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. They are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronym relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as "give", and the other is called resultant, such as "have", in WordNet.
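These four definitions amount to a lookup from the acting part of speech to the WordNet relation pair that realizes the association; the following table-as-code is merely a Python restatement of the definitions above, not code from the paper.

ASSOCIATION_CAPABILITIES = {
    "similarity": {"pos": "noun",    "wordnet": ("hypernym", "hyponym")},  # is-a siblings
    "contiguity": {"pos": "noun",    "wordnet": ("holonym", "meronym")},   # part-of siblings
    "contrast":   {"pos": "adj/adv", "wordnet": ("synonym", "antonym")},   # opposite siblings
    "causality":  {"pos": "verb",    "wordnet": ("cause", "entail")},      # e.g. give -> have
}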

To equip the SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes an RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of the SALA is to select an optimal association to maximize the total long-run accumulated reward. Hence, the SALA


Input: array of Feature Vertex Objects (FVO), WordNet
Output: list of closest synonyms of the FVO objects /* the detected closest synonyms set the Fsyn value */
Procedure:
  SRL ← empty list /* list of each synonym and its relationships */
  F_SRList ← empty list /* list of SRLs (all synonyms and their relationship lists) */
  /* Parallel.ForEach is used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo => {
    Fsyn ← empty list
    F_SRList = get_all_Syn_Relations_List(fvo) /* gets all synonyms and their relationship lists from WordNet */
    /* compute the semantic relatedness of each synonym and its relationships with the fvo object, then get the score of each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(Get_SemRel(F_SRList[i], fvo)) /* the score of each synonym list contributes to accurately filtering the returned synonyms and selecting the closest ones */
    Fsyn = max_th(score[]) /* selects the SRL with the highest score according to the specific threshold */
  }) /* Parallel.For */

Algorithm 2: C-SynD algorithm.

Input: action type and a Feature Vertex Object (FVO)
Output: detected links between feature vertices and the assigned Features Link Weight (FLW)
Procedure:
  i ← index of the FVO input in the FLG
  numVerts ← number of Feature Vertex Objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++)
  {
    if (Action_type == "Closest-Synonyms") then
      Fsyn = get_C-SynD(FVO[i]) /* executes the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) then
        FLW ← SemRel(FVO[i], FVO[j]) /* computes the semantic relatedness between FVO[i] and FVO[j] */
        Rel-Type ← 5 /* relation type 5 indicates a synonym */
    elseif (Action_type == "Association") then
      optimal_action = get_IOAC-QL(FVO) /* executes the IOAC-QL algorithm */
      FLW ← reward /* the reward value of the returned optimal action */
      Rel-Type ← action type /* the type of the returned optimal action (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality) */
    Rel_vertex = j /* index of the related vertex */
    Adjacent[i, j] = FLW /* sets the linkage weight in the adjacency matrix between vertex[i] and vertex[j] */
  }

Algorithm 3: SALA links algorithm.

can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach a goal state or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in state x. The SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x on, \( \max \sum_{i=0} \gamma^{i} r_{i} \), where γ is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing \( r + \gamma f(y) - Q(x, a) \), where \( f(y) = \max_{a} Q(y, a) \) and y is the new state resulting from action a. Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. The SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s under a policy π, denoted Q(s, a), is defined as the expected return starting from s and taking the action a. The action policy π is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way used in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALAs. The action with the maximum summed value is then chosen for execution.

Environment, State, and Actions. For the effective association capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn, share their knowledge, and explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space; in each state, the MF object is fitted with the closest-synonym relations, and there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is optimal for each state. In summary, once s_k (an MF object) is received from the FLG, the SALA chooses one of the associations as the action a_k according to the SALA's policy π*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by the SALA at round k:

\[ a_k = \pi^{*}(s_k) \quad (6) \]

Therein, one has the following:
(i) a_k ∈ action space = {similarity, contiguity, contrast, causality};
(ii) \( \pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k) \).

Once the optimal policy (π*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy, is the Q-value; a higher Q-value points to a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then, each SALA performs the steps in Heuristic 4.

Heuristic 4.
(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.
(ii) Receive the immediate reward r_k and observe the new state s'_k.
(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: \( a_k = \pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k) \).
(iv) Update Q(s_k, a_k) as follows:

\[ Q(s_k, a_k) = (1 - \alpha)\, Q(s_k, a_k) + \alpha \Big( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \Big) \quad (7) \]

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k; Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k; r(s_k, a_k) is the reward corresponding to the acquired payoff; A(s'_k) is the set of possible actions at the next state s'_k; α (0 < α ≤ 1) is the learning rate; and γ (0 ≤ γ ≤ 1) is the discount rate.
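For concreteness, here is a minimal Python sketch of the Heuristic 4 update rule in equation (7), together with the greedy choice of Heuristic 3; the dictionary representation of Q and the ε-greedy exploration are our own assumptions, not part of the paper's IOAC-QL specification.

import random

ACTIONS = ["similarity", "contiguity", "contrast", "causality"]

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Equation (7): Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def choose_action(Q, s, epsilon=0.1):
    # greedy choice is argmax_a Q(s,a) (Heuristic 3); epsilon-greedy adds exploration
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))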

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_ik for the action value i in document k as follows:

\[ \mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \big( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \big) \quad (8) \]


Input: Feature Vertex Object (FVO), WordNet
Output: the optimal action for the FVO relationships
Procedure:
  initialize Q[,]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.For is used for parallel processing of all actions */
  Parallel.For(0, action_space.Count(), k => {
    R[k] = Get_Action_Reward(a[k]) /* implements (8) */
    Q[i, k] = (1 − α) * Q[i, k] + α * (R[k] + γ * max_a(Q[i, k + 1]))
  }) /* Parallel.For */
  return Get_highest_Q-value_action(Q[,]) /* selects the action with the maximum summed Q-value */

  /* executes (8) to get the reward value */
  Get_Action_Reward(a_k):
    AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
    for (int j = 0; j < FVO.Count(); j++)
    {
      FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
      SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
      Sum += FW_j * SemRel(i, j)
    }
    return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_ik is the weight of action value i in document k, FW_jk is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], which is used to measure AVW_ik and FW_jk. The FW_jk in (8) is calculated likewise; AVW_ik is based on (9) and (10).

The action value weight (AVW_ik) is a measure that calculates the weight of the action value as the scalar product of the action value frequency and the inverse document frequency, as in (9). The action value frequency (F) measures the importance of this value in a document; the inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

\[ \mathrm{AVW}_{ik} = \frac{F_{ik}}{\sum_{k} F_{ik}} \cdot \mathrm{IDF}_i \quad (9) \]

where F_ik represents the number of cocontributions of action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k to prevent a bias towards longer documents. The IDF is obtained by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\[ \mathrm{IDF}_i = \log\left( \frac{|D|}{|\{d_k : v_i \in d_k\}|} \right) \quad (10) \]

where |D| is the total number of documents in the corpus and |{d_k : v_i ∈ d_k}| is the number of documents containing the action value v_i.
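The following short Python sketch restates equations (8)–(10); the cocontribution counts and the sem_rel values are assumed to be supplied by the FLG and the MFsSR measure, respectively, so the function arguments are illustrative placeholders.

import math

def idf(num_docs: int, docs_with_value: int) -> float:
    # Equation (10): general importance of the action value in the corpus
    return math.log(num_docs / docs_with_value)

def avw(f_ik: int, f_i_total: int, idf_i: float) -> float:
    # Equation (9): normalized action-value frequency times IDF
    return (f_ik / f_i_total) * idf_i

def reward(avw_ik: float, fw: list, sem_rel_row: list) -> float:
    # Equation (8): AVW_ik plus the relatedness-weighted feature weights FW_jk
    return avw_ik + sum(fw_j * rel_j for fw_j, rel_j in zip(fw, sem_rel_row))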

Thus, in the IOAC-QL algorithm (see Algorithm 4), whose implementation is shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel-Type object and a Rel_vertex object: the Rel-Type object value equals the action type object value, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix to the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO1, one has the following:

action: similarity; ActionResult State: {FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125}; Prob: 8; reward: 18.4; PrevState: FVO1; Q(s, a) = 9.4

action: contiguity; ActionResult State: {FVO103, FVO17, FVO44}; Prob: 3; reward: 7.4; PrevState: FVO1; Q(s, a) = 6.1

action: contrast; ActionResult State: {FVO89}; Prob: 1; reward: 2.2; PrevState: FVO1; Q(s, a) = 3.29

action: causality; ActionResult State: {FVO63, FVO249}; Prob: 2; reward: 2.5; PrevState: FVO1; Q(s, a) = 4.29

As revealed in the case of FVO1, the SALA receives the environmental information data FVO1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated; the action with


Figure 4: The IOAC-QL algorithm implementation (object model construction from the DSOs of the descriptive document object; the feature linkage graph (FLG) with Vertex[V] entries — SFKey, Fval, Fsyn[], FAss[], FW — and an Adjacent[V, V] matrix of FLWs; the semantic actionable learning agent (SALA) inference engine applying the Q-learning mechanism for action selection, assigning Q-values Q(S_m, a) within the FLG environment against WordNet).

the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Given the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents through finding all distinct common similarity or relatedness subgraphs is required. The objective of finding common similar or related subgraphs as a relatedness measure is for it to be utilized in document mining and managing processes, such as clustering, classification, and retrieval. It is essential to establish that the measure is a metric: a metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), so the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations while holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The relatedness subgraphs function utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i and y_i, as used in (11). The subgraph matching process is measured using the following formula:

\[ \text{Sub-graph-Rel}(\mathrm{SG1}, \mathrm{SG2}) = \frac{\sum_{i \in \mathrm{SG1}} \sum_{j \in \mathrm{SG2}} \mathrm{SemRel}(N_i, N_j)}{\sum_{i \in \mathrm{SG1}} N_i + \sum_{j \in \mathrm{SG2}} N_j} \quad (11) \]

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for the nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
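A sketch of this linkage computation under stated assumptions: select_subgraph seeds at the highest-FW vertex of an FLG (as in the sketch after Heuristic 2) and gathers its adjacency neighbourhood, while subgraph_rel scores two subgraphs per equation (11) with the 50% node-matching threshold; sem_rel is a placeholder for the MFsSR measure, and the helper names are ours.

def select_subgraph(flg) -> list:
    # seed at the most important node (highest FVO.fw), then take its related vertices
    seed = max(range(len(flg.vertex)), key=lambda i: flg.vertex[i].fw)
    return [seed] + [j for j, w in enumerate(flg.adjacency[seed]) if w > 0]

def subgraph_rel(sg1, sg2, sem_rel, threshold=0.5) -> float:
    # Equation (11): sum above-threshold pairwise node matches, normalized
    # by the combined subgraph sizes
    total = 0.0
    for n1 in sg1:
        for n2 in sg2:
            rel = sem_rel(n1, n2)
            if rel >= threshold:    # 50% node-matching threshold from the text
                total += rel
    return total / (len(sg1) + len(sg2))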

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure with other previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is


Table 1: Action values table with the SALA decision.

State: FVO1
Action             | Similarity                                             | Contiguity           | Contrast | Causality
ActionResult State | FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125  | FVO103, FVO17, FVO44 | FVO89    | FVO63, FVO249
Total reward       | 18.4                                                   | 7.4                  | 2.2      | 2.5
Q(s, a)            | 9.4                                                    | 6.1                  | 3.29     | 4.29
Decision           | Yes                                                    | No                   | No       | No

studied and compared to PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed IOAC-QL for discovering semantic relationships or associations, implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task; some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering. In this paper, the content-based, coselection-based, and task-based evaluations are used to validate the implementation of the proposed scheme over a real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                         Human judgment
                         True    False
System judgment   True   TP      FP
                  False  FN      TN

Therefore, the correlation coefficient can be considered the coefficient of the linear relationship between corresponding values of X and Y. The correlation coefficient r between the sequences X = {x_i}, i = 1, ..., n, and Y = {y_i}, i = 1, ..., n, is defined by

\[ r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 \right) \left( \sum_{i=1}^{n} Y_i^2 \right)}} \quad (12) \]

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences; a high acceptance rate means recognizing almost all similar or related sentences:

\[ \mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (13) \]

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

\[ \mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \quad (14) \]

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
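Equations (13) and (14) reduce to two one-line functions over the Table 2 counts; a trivial Python sketch:

def acceptance_rate(tp: int, fn: int) -> float:
    return tp / (tp + fn)                      # Equation (13)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + fn + fp + tn)     # Equation (14)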

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, precision P, recall R, and the F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 — Experiment 1, content-based evaluation — Miller and Charles (MC)^1: RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 — Experiment 1, content-based evaluation — Microsoft Research Paraphrase Corpus (MRPC)^2: the corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 — Experiment 2, coselection-based evaluation (closest-synonym detection) — British National Corpus (BNC)^3: the BNC is a carefully selected collection of 4124 contemporary written and spoken English texts, containing a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 — Experiment 2, coselection-based evaluation (closest-synonym detection) — SN (semantic neighbors)^4: SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 — Experiment 2, coselection-based evaluation (semantic relationships exploration) — BLESS^6: BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 — Experiment 2, coselection-based evaluation (semantic relationships exploration) — TREC^5: TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities — person (1685), location (1968), and organization (978) — in addition to a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located_in (406) indicates that one location is located inside another location; work_for (394) indicates that a person works for an organization; OrgBased_in (451) indicates that an organization is based in a location; live_in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 — Experiment 2, coselection-based evaluation (semantic relationships exploration) — IJCNLP 2011-New York Times (NYT)^6: NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 — Experiment 2, coselection-based evaluation (semantic relationships exploration) — IJCNLP 2011-Wikipedia^8: the Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 — Experiment 3, task-based evaluation — Reuters-21578^7: Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 — Experiment 3, task-based evaluation — 20 Newsgroups^8: the 20 Newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
^2 Available at http://research.microsoft.com/en-us/downloads
^3 Available at http://corpus.byu.edu/bnc
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected ones, and let FN be the number of correct but undetected ones in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where the F-measure reaches its best value at 1 and its worst value at 0. Precision, recall, and the F-measure are defined as follows:

\[ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad F\text{-measure} = \frac{2 (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \quad (15) \]


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative clustering (hierarchical), K-means (flat, hard), and the EM algorithm (flat, soft).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used as the quality measure, combining the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[ P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r} \quad (16) \]

where N_ri is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is the harmonic mean of precision and recall, is defined as

\[ F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)} \quad (17) \]

A per-class F score is calculated as follows:

\[ F(L_r) = \max_{S_i \in C} F(L_r, S_i) \quad (18) \]

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

\[ F_C = \frac{\sum_{L_r} \big( |N_r| \cdot F(L_r) \big)}{\sum_{L_r} |N_r|} \quad (19) \]

where |N_r| is the number of objects in class L_r. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes.
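As a sketch of equations (16)–(19), the following function computes the overall macroaveraged F-measure from parallel lists of gold class labels and assigned cluster ids; the list-based interface is our own assumption.

from collections import Counter

def overall_f(labels: list, clusters: list) -> float:
    class_sizes, cluster_sizes = Counter(labels), Counter(clusters)
    pair_counts = Counter(zip(labels, clusters))          # N_ri per (class, cluster)
    score = 0.0
    for lr, n_r in class_sizes.items():
        best = 0.0
        for si, n_i in cluster_sizes.items():
            n_ri = pair_counts[(lr, si)]
            if n_ri == 0:
                continue
            p, r = n_ri / n_i, n_ri / n_r                 # Equation (16)
            best = max(best, 2 * p * r / (p + r))         # Equations (17) and (18)
        score += n_r * best                               # Equation (19) numerator
    return score / len(labels)                            # divided by sum of |N_r|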

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over a real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus; Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS; Experiment 3 uses Reuters-21578 and 20 Newsgroups.

4.3. Evaluation Results. This section reports the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence-based in Experiment 1, document-based in Experiment 2, and corpus-based in Experiment 3. In Experiment 1, DS1 and DS2 were used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In Experiment 2, DS3 and DS4 were used to investigate the effectiveness of the closest-synonym detection using the MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 were used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In Experiment 3, DS9 and DS10 were used to investigate the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct relevance and indirect relevance are included in the process of measuring the semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measure is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms; the performance of the MFsSR for closest-synonym detection and the IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as more effort is made to improve the accuracy of measuring the relatedness between documents using the semantic information in the SHGRS.


Table 4: The results of the comparison of the correlation coefficient between human judgment and some relatedness measures, including the proposed semantic relatedness measure.

Measure                                  Correlation with M&C
Distance-based measures
  Rada                                   0.688
  Wu and Palmer                          0.765
Information-based measures
  Resnik                                 0.77
  Jiang and Conrath                      0.848
  Lin                                    0.853
Hybrid measures
  T. Hong and D. Smith                   0.879
  Zili Zhou                              0.882
Information/feature-based measures
  The proposed MFsSR in SHGRS            0.937

The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, the correlation was computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between these approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches: unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts, and the indirect relationships between the attributes of each concept, when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, giving at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds, instead of under the same relatedness threshold. This improvement in the Acc and AR values is due to the increase in the number of pairs predicted correctly after considering the direct/indirect relevance; this relevance takes into account the closest synonyms and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is achieved through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed IOAC-QL for discovering semantic relationships or associations, the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest-synonym detection and the IOAC-QL for the semantic relationship exploration algorithms. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI; according to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

MRPC dataset: Training subset (4076 pairs; human judgment TP + FN = 2753 true)

Relatedness threshold | Islam and Inkpen [7] Acc / AR | The proposed MFsSR Acc / AR
0.1 | 0.67 / 1    | 0.68 / 1
0.2 | 0.67 / 1    | 0.68 / 1
0.3 | 0.67 / 1    | 0.68 / 1
0.4 | 0.67 / 1    | 0.68 / 1
0.5 | 0.69 / 0.98 | 0.68 / 1
0.6 | 0.72 / 0.89 | 0.68 / 1
0.7 | 0.68 / 0.78 | 0.70 / 0.98
0.8 | 0.56 / 0.4  | 0.72 / 0.86
0.9 | 0.37 / 0.09 | 0.60 / 0.49
1   | 0.33 / 0    | 0.34 / 0.02

MRPC dataset: Test subset (1725 pairs; human judgment TP + FN = 1147 true)

Relatedness threshold | Islam and Inkpen [7] Acc / AR | The proposed MFsSR Acc / AR
0.1 | 0.66 / 1    | 0.67 / 1
0.2 | 0.66 / 1    | 0.67 / 1
0.3 | 0.66 / 1    | 0.67 / 1
0.4 | 0.66 / 1    | 0.67 / 1
0.5 | 0.68 / 0.98 | 0.67 / 1
0.6 | 0.72 / 0.89 | 0.66 / 1
0.7 | 0.68 / 0.78 | 0.66 / 0.98
0.8 | 0.56 / 0.4  | 0.64 / 0.85
0.9 | 0.38 / 0.09 | 0.49 / 0.5
1   | 0.33 / 0    | 0.34 / 0.03

disturbing the precision. Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithms using the PMI and ICA approaches. This comparison was conducted on two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach, by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach, by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm; better precision values are clearly achieved, with a high percentage, across different datasets and domains.

Inferring the MFs relationships and dependencies is a process of achieving a large number of interdependences among the MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependence among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonyms. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. Thus, IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm over the CRFs approach were approximately up to 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation).


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA approaches for detecting the closest synonym.

Dataset | PMI: Precision / Recall / F-measure | ICA: Precision / Recall / F-measure | C-SynD: Precision / Recall / F-measure
SN  | 0.606 / 0.606 / 0.61 | 0.795 / 0.716 / 0.753 | 0.85 / 0.80 / 0.82
BNC | 0.745 / 0.679 / 0.71 | 0.76 / 0.678 / 0.72   | 0.82 / 0.719 / 0.77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Dataset | CRFs: Precision / Recall / F-measure | TFICF: Precision / Recall / F-measure | IOAC-QL: Precision / Recall / F-measure
TREC      | 75.08 / 60.2 / 66.28  | 89.3 / 71.4 / 78.7   | 89.8 / 88.1 / 88.6
BLESS     | 73.04 / 62.66 / 67.03 | 73.8 / 69.5 / 71.6   | 95.0 / 83.5 / 88.9
NYT       | 68.46 / 54.02 / 60.38 | 86.0 / 65.0 / 74.0   | 90.0 / 74.0 / 81.2
Wikipedia | 56.0 / 42.0 / 48.0    | 64.6 / 54.88 / 59.34 | 70.0 / 44.0 / 54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using the K-means and HAC clustering algorithms (weighted average F-measure per class, FC).

Dataset | Algorithm | Term-based FC | Concept-based FC | Graph-based FC
Reuters-21578 | HAC     | 73 | 85.7 | 90.2
Reuters-21578 | K-means | 51 | 84.6 | 91.8
20 Newsgroups | HAC     | 53 | 82   | 89
20 Newsgroups | K-means | 48 | 79.3 | 87

This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis. Experimental results demonstrate the substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similarity common subgraphs. The semantic representations of textual documents determining the relatedness between them are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MF-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment was repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 Newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base case (term-based) approach were up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to approximately 36% by HAC and 39% by K-means for the 20 Newsgroups dataset. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach were up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to approximately 7% by HAC and 8% by K-means for the 20 Newsgroups dataset.
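An evaluation of this shape can be reproduced with standard tooling. The following minimal Python sketch (an illustration under stated assumptions, not the authors' code) clusters documents directly from a precomputed relatedness matrix, as done in this experiment, by converting relatedness to distance and applying average-link HAC; the matrix values are hypothetical placeholders.

# Minimal sketch: HAC over a precomputed document-relatedness matrix.
# The relatedness values below are hypothetical placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

relatedness = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

distance = 1.0 - relatedness           # turn relatedness into a distance
np.fill_diagonal(distance, 0.0)        # zero diagonal required by squareform
condensed = squareform(distance)       # condensed form expected by linkage

Z = linkage(condensed, method="average")          # average-link HAC
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
print(labels)                                     # e.g., [1 1 2]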


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by their MFs and attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, extensive experiments were conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference—North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu/.

[13] OpenNLP, http://opennlp.apache.org/.

[14] AlchemyAPI, http://www.alchemyapi.com/.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web—ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining, August 2000.



used to represent every document as a graph with concepts as nodes and relations between them as links, ignoring the sequence and order of concepts in a sentence at the document level.

Secondly, the semantic relatedness measure has become one of the hotspots of research in the area of information technology. Semantic relatedness and semantic similarity are sometimes confused in the research literature, but they are not identical [21, 22]. The semantic relatedness measures usually include various types of relationships, such as hypernym, hyponym, subsumption, synonym, antonym, holonym, and meronymy. Semantic similarity is a special case of semantic relatedness that only considers synonymous relationships and subsumption relationships. Several measures for semantic relatedness have recently been conducted. According to the parameters used, they can be classified into three major categories: distance-based methods (Rada, and Wu and Palmer [21]), which select the shortest path among all the possible paths between concepts as the most similar; information-based methods (Resnik, Jiang and Conrath, and Lin [21]), which consider the use of external corpora, avoiding the unreliability of path distances and taxonomy; and hybrid methods (T. Hong and D. Smith, and Zili Zhou [21]), which combine the first two measures. This paper focuses on the information-based methods. In [23], the semantic relatedness of any concept is based on a similarity theorem, in which the similarity of two concepts is measured by the ratio of the amount of information needed to state the commonality of the two concepts to the amount of information needed to describe them. The information content (IC) of their lowest common subsumer (LCS) captures the commonality of the two concepts, together with the IC of the two concepts themselves. The lowest common subsumer (LCS) is the most specific concept that is a shared ancestor of the two concepts. The pointwise mutual information (PMI) [24, 25] is a simple method for computing the corpus-based similarity of words. The pointwise mutual information is defined by the ratio of the probability that the two words cooccur, p(w1 AND w2), to the product of their individual probabilities, p(w1) · p(w2). If w1 and w2 are statistically independent, then the probability that they cooccur is given by the product p(w1) · p(w2). If they are not independent and have a tendency to cooccur, then p(w1 AND w2) will be greater than p(w1) · p(w2). As is clear, there is extensive literature on measuring the semantic relatedness between long texts or documents [7], but there is less work related to the measurement of similarity between sentences or short texts [26, 27]. Such methods are usually effective when dealing with long documents, because similar documents will usually contain a degree of cooccurring words. However, in short documents the word cooccurrence may be rare or even null. This is mainly due to the inherent flexibility of natural language, enabling people to express similar meanings using quite different sentences in terms of structure and word content.
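As a concrete illustration of the PMI definition above, the following minimal Python sketch (not from the paper; the corpus counts are hypothetical placeholders) estimates PMI from cooccurrence counts:

# Minimal sketch of pointwise mutual information (PMI) from counts.
# The counts below are hypothetical placeholders.
import math

n_windows = 10_000          # total number of context windows in the corpus
count_w1 = 120              # windows containing w1
count_w2 = 300              # windows containing w2
count_w1_w2 = 40            # windows containing both w1 and w2

p_w1 = count_w1 / n_windows
p_w2 = count_w2 / n_windows
p_joint = count_w1_w2 / n_windows

# PMI > 0 when the words cooccur more often than independence predicts.
pmi = math.log2(p_joint / (p_w1 * p_w2))
print(f"PMI(w1, w2) = {pmi:.3f}")   # ~3.47 for these counts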

Finally in order to overcome the limitations of previousmodels the documents linkage process is conducted relyingon graph-based document representation model [4 10 11]The main benefit of graph-based techniques is that they

allow us to keep inherent structural information of orig-inal document Mao and Chu [10] used graph based 119870-means algorithm for clustering web documents Yoo andHu [28] represent a set of documents as bipartite graphsusing domain knowledge in ontology Then documents areclustered based on average-link clustering algorithm How-ever these approaches relying on graph-based model arenot intended to be used in an incrementally growing set ofdocuments

To utilize the structural and semantic information in the document, this paper introduces a formal semantic representation of the linguistic input to build an SHGRS scheme for the documents. This representation scheme is constructed through the accumulation of syntactic and semantic analysis outputs. A new semantic relatedness measure is developed to determine the relatedness among concepts of the document, as well as the relatedness between the contents of documents for long/short texts. An efficient linkage process is required to connect the related documents by identifying their shared subgraphs. The proposed representation scheme, along with the proposed relatedness measure and linkage process, is believed to enable more effective document mining and managing processes.

3. The Proposed Semantic Representation Scheme and Linkage Processes

In this section, a new framework is introduced for constructing the SHGRS and conducting the document linkage processes using multidimensional analysis and primary decision support. In fact, the elements in a sentence are not equally important, and the most frequent word is not necessarily the most important one. As a result, the extraction of the text main features (MFs) as concepts, as well as their attributes and relationships, is important. Combinations of semantic annotation [13, 14] and reinforcement learning (RL) [29] techniques are used to extract the MFs of the text and to infer unknown dependencies and relationships among these MFs. The RL fulfills sequential decision-making tasks with a long-run accumulated reward to achieve the largest amount of interdependence among the MFs. This framework focuses on three key criteria: (1) how the framework refines text to select the MFs and their attributes, (2) what learning algorithm is used to explore the unknown dependencies and relationships among these MFs, and (3) how the framework conducts the document linkage process.

Details of this framework's three stages are given in Figure 2. The first stage aims to refine textual documents to select the MFs and their attributes with the aid of OpenNLP [13] and AlchemyAPI [14]. A hierarchy-based structure is used to represent each sentence with its MFs and their attributes, which achieves dimension reduction with more understanding. The second stage aims to compute the proposed MFs semantic relatedness (MFsSR), which contributes to the detection of the closest synonyms of the MFs and to inferring the relationships and dependencies of the MFs.


[Figure 2: A framework for constructing a semantic hierarchy/graph text representation scheme. The figure shows the three stages: the semantic text refining and annotating stage (word-tagging, syntactic, and semantic analyses with AlchemyAPI semantic annotation, MFs extraction, and building the MFs hierarchy structure); the dependencies/relationships exploring and inferring stage (the MFsSR algorithm with WordNet and SALA, detecting closest synonyms, inferring dependency and semantic relationships, and building the relationships graph structure to construct the semantic hierarchy/graph document object); and the documents linkage conducting stage (FLG matching and computing the subgraph-relatedness measure).]

A graph-based structure is used to represent the relationships and dependencies among these MFs, which achieves more correlation through many-to-many relationships. The third stage aims to conduct the document linkage process using a subgraph matching process that relies on finding all of the distinct similarities common to the subgraphs. The main proceedings of this framework can be summarized as follows:

(1) extracting the MFs of sentences and their attributes to build the hierarchy-based structure for efficient dimension reduction;

(2) estimating a novel semantic relatedness measure with consideration of direct relevance and indirect relevance between MFs and their attributes for promising performance improvements;

(3) detecting the closest synonyms of MFs for better disambiguation;

(4) exploring and inferring dependencies and relationships of MFs for more understanding;

(5) representing the interdependence among the MFs of sentences in a graph-based structure so as to be more precise and informative;

(6) conducting a document linkage process for more effectiveness.

3.1. Semantic Text Refining and Annotating Stage. This stage is responsible for refining the text to discover the text MFs, with additional annotation for more semantic understanding, and aims to:

(1) accurately parse each sentence and identify POS, subject-action-object, and named-entity recognition;

(2) discover the MFs of each sentence in the textual document;

(3) exploit the semantic information in each sentence through detecting the attributes of the MFs;

(4) reduce the dimensions as much as possible;

(5) generate an effective descriptive sentence object (DSO) with a hierarchical sentence object automatically.

This stage is achieved through the following processes. First, the text NL is studied at different linguistic levels, that is, words, sentences, and meaning, for semantic analysis and annotation [30, 31]. Second, exploiting the information "who is doing what to whom" clarifies the dependencies between verbs and their arguments for the extraction of MFs. Finally, building the MFs hierarchy-based structure explains the sentences' MFs and their attributes.

With regard to extracting the MFs and building the MFs hierarchy-based structure, the following points must be highlighted. First, there is a dire need to refine the text content by representing each sentence with its MFs instead of a series of terms, to reduce the dimensions as much as possible. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. Furthermore, AlchemyAPI extracts semantic metadata from content, such as information on subject-action-object relation extraction, people, places, companies, topics, facts, relationships, authors, and languages. Based on the NL concepts produced by OpenNLP and AlchemyAPI, the sentence MFs are identified as the subject "Sub", main verb "MV", and object "Obj" (direct or indirect object). In addition, the other terms remaining in the sentence, such as complements "Com" (subject complement or object complement) and modifiers (Mod), are considered as attributes of these MFs.

Second, automatic annotation is essential for each of the MFs, adding information for more semantic understanding. Based on the semantic annotation by OpenNLP and AlchemyAPI, the sentence MFs are annotated with the feature value (FVal), part-of-speech (POS), named-entity recognition (Ner), and a list of feature attributes (FAtts).


[Figure 3: The hierarchy-based structure of the textual document. The root links to the DSO nodes; each DSO is annotated with FVal, POS, and Ner and links to its MF nodes (MF1, MF2, ..., MFn), which in turn link to their attribute nodes (FATT1, FATT2, ..., FATTm), each annotated with Val, POS, Ner, and Type.]

The list of FAtts is constructed from the remaining terms in the sentence (complement or modifier), relying on the grammatical relations. Each attribute is annotated with the attribute value (Val), attribute type (Type), which is complement or modifier, attribute POS (POS), and attribute named-entity recognition (Ner).

Finally, the hierarchy-based structure of the textual document is represented as an object [32] containing the hierarchical structure of sentences, with the MFs of the sentences and their attributes. This hierarchical structure maintains the dependency between the terms at the sentence level for more understanding, which provides sentences with fast access and retrieval. Figure 3 shows the hierarchical structure of the text in a theme tree "tree_DSO": the first level of the hierarchical model denotes all sentence objects, which are called DSOs; the next level of nodes denotes the MFs of each sentence object; and the last level of nodes denotes the attributes of the MFs.

tree_DSO.DSO[i] indicates the annotation of sentence object i, which is fitted with additional information as in Heuristic 1. This heuristic simplifies the hierarchical model of the textual document with its DSO objects and its MFs. The feature score is used to measure the contribution of an MF in the sentence. This score is based on the number of terms grammatically related to the MF (the summation of its feature attributes), as follows:

Feature Score (FS) = \sum_{i} MF_i.FAtt.  (1)

Heuristic 1. Let tree_DSO = {sibling DSO nodes}, and let each DSO ∈ tree_DSO = {SEN_key, SEN_value, SEN_State, Sibling_of_MF_node[]}, where SEN_key indicates the order of the sentence in the document as an index; SEN_value indicates the sentence's parsed value, as in (S(NP(NN employee))(VP(VBZ guides)(NP(DT the)(NN customer)))(.)); SEN_State indicates the sentence tense and state (tense: past, state: negative); and Sibling_of_MF_node[] indicates the list of sibling MF nodes for each sentence, fitted with Val, POS, Ner, and the list of FAtts.
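To make the hierarchy concrete, here is a minimal Python sketch, an illustrative reading of Heuristic 1 and equation (1) rather than the authors' implementation; the field names mirror the text, while the sample sentence data is hypothetical:

# Minimal sketch of the DSO hierarchy of Heuristic 1 and the
# feature score of equation (1). Sample data is a hypothetical placeholder.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FAtt:                      # attribute of a main feature (MF)
    val: str
    type: str                    # "complement" or "modifier"
    pos: str
    ner: str

@dataclass
class MF:                        # main feature: Sub, MV, or Obj
    fval: str
    pos: str
    ner: str
    fatts: List[FAtt] = field(default_factory=list)

    def feature_score(self) -> int:
        # Equation (1): FS = number of grammatically related terms (FAtts)
        return len(self.fatts)

@dataclass
class DSO:                       # descriptive sentence object
    sen_key: int
    sen_value: str               # parsed value of the sentence
    sen_state: str
    mfs: List[MF] = field(default_factory=list)

tree_dso = [DSO(0, "(S(NP(NN employee))(VP(VBZ guides)(NP(DT the)(NN customer))))",
                "tense: present, state: affirmative",
                [MF("employee", "NN", "JobTitle"),
                 MF("guides", "VBZ", "O"),
                 MF("customer", "NN", "Person",
                    [FAtt("the", "modifier", "DT", "O")])])]
print(tree_dso[0].mfs[2].feature_score())   # -> 1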

In the text refining and annotating (TRN) algorithm (see Algorithm 1), the textual document is converted to an object model with the hierarchical DSOs. The TRN algorithm uses the OpenNLP and AlchemyAPI tools for MFs extraction and hierarchy model construction. As a result, the constructed hierarchy structure leads to the reduction of the dimensions as much as possible and achieves efficient information access and knowledge acquisition.

3.2. Dependencies/Relationships Exploring and Inferring Stage. This stage is responsible for exploring how the dependencies and relationships among the MFs take effect and aims to:

(1) represent the interdependence among sentence MFs in a graph-based structure known as the feature linkage graph (FLG);

(2) formulate an accurate measure for the semantic relatedness of MFs;

(3) detect the closest synonyms for each of the MFs;

(4) infer the relationships and dependencies of the MFs with each other.

This stage is achieved in three processes. First, an FLG is built to represent the dependencies and relationships among sentences. Second, an efficient MFsSR measure is proposed to contribute to the detection of the closest synonyms and to infer the relationships and dependencies of the MFs. This measure considers the direct relevance and the indirect relevance among MFs. The direct relevance covers the synonyms and association capabilities (similarity, contiguity, contrast, and causality) among the MFs; these association capabilities indicate the relationships that give the largest amount of interdependence among the MFs, while the indirect relevance refers to the other relationships among the attributes of these MFs. Finally, the unknown dependencies and relationships among the MFs are explored by exploiting semantic information about their texts at the document level.

In addition, a semantic actionable learning agent (SALA) plays a fundamental role in detecting the closest synonyms and inferring the relationships and dependencies of the MFs. Learning is needed to improve the SALA functionality to achieve multiple goals. Many studies show that an RL agent has a high reproductive capability for human-like behaviors [29]. As a result, the SALA performs an adaptive approach combining thesaurus-based and distributional mechanisms. In the thesaurus-based mechanism, words are compared in terms of how they appear in the thesaurus (e.g., WordNet), while in the distributional mechanism, words are compared in terms of the shared number of contexts in which they may appear.

3.2.1. Graph-Based Organization Constructing Process. The relationships and dependencies among the MFs are organized into a graph-like structure of nodes with links, known as the FLG and defined in Heuristic 2. The graph structure FLG represents many-to-many relationships among MFs. The FLG is a directed acyclic graph FLG(V, A), represented by a collection of vertices and a collection of directed arcs. These arcs connect pairs of vertices with no path returning to the same vertex (acyclic). In FLG(V, A), V is the set of vertices or states of the graph, and A is the set of arcs between vertices. The FLG construction is based on two sets of data.


Input: list of sentences of the document d_i
Output: List_DSO /* list of DSO objects and their MFs with attributes */
Procedure:
  D ← new document
  List_DSO ← empty list /* list of sentence objects of the document */
  for each sentence s_i in document D do
    AlchemyAPI(s_i) /* calls AlchemyAPI to determine all MFs and their NER */
    for each feature mf_j ∈ {mf_1, mf_2, ..., mf_n} in DSO do
      get_MF_Attributes(mf_j) /* determines the attributes of each MF */
      OpenNLP(s_i) /* calls OpenNLP to determine POS and NER for the MF and its attributes */
      compute_MF_score(mf_j) /* determines the score of each MF */
    end
    List_DSO.add(DSO)
  end

Algorithm 1: TRN algorithm.

The first set represents the vertices in a one-dimensional array Vertex(V) of Feature Vertex Objects (FVOs). The FVO is annotated with the Sentence-Feature key (SFKey), feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and feature weight (FW), where SFKey combines the sentence key and feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object holds the value of the feature, the Fsyn object holds a list of the detected closest synonyms of the Fval, and the FAss object holds a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Type between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional array Adjacency(V, V) of feature link weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let Adjacency(V, V) = {FLW(0,0), FLW(0,1), ..., FLW(1,0), FLW(1,1), ..., FLW(n,n)}, where FVO indicates an object of the MF in the vertex vector, annotated with the additional information FVO = {SFKey, Fval, Fsyn, FAss, FW}; SFKey indicates the combination of the sentence key and feature key in a DSO; Fval indicates the value of the feature; Fsyn indicates a list of the closest synonyms of the Fval; FAss indicates a list of the feature associations and relationships for the Fval, with each object in this list defined as FAssitem = {Rel_Type, Rel_vertex}, where Rel_Type indicates the value of the relation type linking the two features and Rel_vertex indicates the index of the related vertex; FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document; and FLW indicates the value of the linkage weight in the adjacency matrix.

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summarization, text annotation, text mining, word sense disambiguation, automatic indexing, machine translation, and other areas [2, 3]. In this paper, one of the information-based methods is utilized by considering the IC. The semantic relatedness investigated here concerns the direct relevance and the indirect relevance among MFs. An MF may be one word or more, and thereby the MF is considered as a concept. The IC is considered a measure quantifying the amount of information a concept expresses. Lin [23] proposed the IC assuming that, for a concept c, p(c) is the probability of encountering an instance of concept c; the IC value is obtained by taking the negative log likelihood of encountering the concept in a given corpus. Traditionally, the semantic relatedness of concepts is based on cooccurrence information from a large corpus. However, cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, the development of a new information content method based on cocontributions instead of cooccurrences for the proposed MFsSR is important.

The new information content (NIC) measure is an extension of the information content measure. The NIC measure is based on the relations defined in the WordNet ontology; it uses hypernym/hyponym, synonym/antonym, holonym/meronymy, and entail/cause relations to quantify the informativeness of concepts. For example, a hyponym relation could be "a bicycle is a vehicle," and a meronym relation could be "a bicycle has wheels." NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronymy, entail, and cause relationships, normalized by the maximum number of MF/attribute objects in the textual document:

NIC(c) = 1 − log(Hype_Hypo(c) + Syn_Anto(c) + Holo_Mero(c) + Enta_Cause(c) + 1) / log(max_concept).  (2)


Here, Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronymy values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MF/attribute objects in the considered text; it normalizes the NIC value, and hence the NIC values fall in the range [0, 1].
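A minimal sketch of equation (2) follows (illustrative; in the paper the relation counts come from the FLG via WordNet, while the values used here are hypothetical):

# Minimal sketch of the NIC measure of equation (2).
# The relation counts are hypothetical placeholders; in the paper
# they are counted over the FLG using WordNet relations.
import math

def nic(hype_hypo: int, syn_anto: int, holo_mero: int,
        enta_cause: int, max_concept: int) -> float:
    relations = hype_hypo + syn_anto + holo_mero + enta_cause
    return 1.0 - math.log(relations + 1) / math.log(max_concept)

# Concept linked by 3 hypernym/hyponym, 2 synonym/antonym,
# 1 holonym/meronym, and 0 entail/cause relations in a 200-object document.
print(nic(3, 2, 1, 0, 200))   # ~0.63, which falls in [0, 1]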

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

SemRel = λ · (2 · NIC(LCS(c1, c2)) / (NIC(c1) + NIC(c2))) + (1 − λ) · (2 · NIC(LCS(att_c1, att_c2)) / (NIC(att_c1) + NIC(att_c2))),  (3)

where λ ∈ [0, 1] decides the relative contribution of the direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, λ ∈ [0.5, 1].
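Continuing the sketch above, equation (3) composes two Lin-style ratios; the following is illustrative only, where nic_of and the LCS lookup are hypothetical helpers standing in for the FLG/WordNet machinery:

# Minimal sketch of the MFsSR of equation (3), reusing a NIC function.
# nic_of() and lcs() are hypothetical helpers: in the paper, NIC values
# come from equation (2) over the FLG and the LCS from the WordNet hierarchy.

def sem_rel(c1: str, c2: str, att_c1: str, att_c2: str,
            nic_of, lcs, lam: float = 0.7) -> float:
    assert 0.5 <= lam <= 1.0   # direct relevance is weighted more heavily
    direct = 2 * nic_of(lcs(c1, c2)) / (nic_of(c1) + nic_of(c2))
    indirect = 2 * nic_of(lcs(att_c1, att_c2)) / (nic_of(att_c1) + nic_of(att_c2))
    return lam * direct + (1 - lam) * indirect

# Toy stand-ins, just to make the sketch runnable:
toy_nic = {"car": 0.8, "bicycle": 0.7, "vehicle": 0.6,
           "red": 0.5, "blue": 0.5, "color": 0.4}
print(sem_rel("car", "bicycle", "red", "blue",
              nic_of=toy_nic.get,
              lcs=lambda a, b: "vehicle" if a in ("car", "bicycle") else "color"))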

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonym detection, the SALA carries out the first action to extract the closest synonyms for each MF, where the SALA addresses the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonym detection, the SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, the SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts and exploits hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for the synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called the Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernyms, hyponyms, meronyms, holonyms, antonyms, entails, and causes. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned to a list called the F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

Score(SRL_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} SemRel(SRL_k(i).item, FVO_j),  (4)

where n is the number of FVOs in the Feature_Vertex array, with index j; m is the number of items in the SRL, with index i; and k is the index of SRL_k in the F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms. The SALA applies the threshold to select the closest synonyms with the highest scores, as follows:

Closest-Synonyms = max_th(F_SRList.SRList.score()).  (5)

Then, each SALA retains the closest synonyms in the Fsyn object of each FVO, and the SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA links algorithm, the SALA receives each FVO for detecting links among the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.
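The scoring and thresholding of equations (4) and (5) can be sketched as follows (illustrative; sem_rel stands for the MFsSR of equation (3), and the relative-threshold policy is an assumption):

# Minimal sketch of synonym-list scoring (equation (4)) and
# threshold-based selection (equation (5)) in C-SynD.
# sem_rel() is the MFsSR of equation (3); a toy stand-in is used here.
from typing import Callable, List

def score_srl(srl: List[str], fvos: List[str],
              sem_rel: Callable[[str, str], float]) -> float:
    # Equation (4): sum relatedness of every SRL item to every FVO.
    return sum(sem_rel(item, fvo) for item in srl for fvo in fvos)

def closest_synonyms(f_srlist: List[List[str]], fvos: List[str],
                     sem_rel: Callable[[str, str], float],
                     threshold: float) -> List[List[str]]:
    # Equation (5): keep the synonym lists whose score clears the threshold.
    scored = [(score_srl(srl, fvos), srl) for srl in f_srlist]
    best = max(s for s, _ in scored)
    return [srl for s, srl in scored if s >= threshold * best]

toy_rel = lambda a, b: 1.0 if a[0] == b[0] else 0.1   # toy relatedness
fvos = ["car", "cab", "road"]
f_srlist = [["auto", "automobile"], ["railcar", "railway"], ["cart", "carriage"]]
print(closest_synonyms(f_srlist, fvos, toy_rel, threshold=0.8))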

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring the MFs relationships and dependencies, the SALA aims to explore the implicit dependencies and relationships in the context, as a human would, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. They are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronymy relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as "give," and the other is called resultant, such as "have," in WordNet.

To equip the SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of the SALA is to select an optimal association to maximize the total long-run accumulated reward.


Input: array of Feature Vertex Objects (FVO), WordNet
Output: list of the closest synonyms of the FVO objects /* the detected closest synonyms set the Fsyn values */
Procedure:
  SRL ← empty list /* list of each synonym and its relationships */
  F_SRList ← empty list /* list of SRLs (all synonyms and their relationship lists) */
  /* Parallel.ForEach used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo => {
    Fsyn ← empty list
    F_SRList = get_all_Syn_Relations_List(fvo)
      /* gets all synonyms and their relationship lists from WordNet */
    /* compute the semantic relatedness of each synonym and its relationships
       with the fvo object, then get the score of each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(Get_SemRel(F_SRList[i], fvo));
      /* the score of each synonym list contributes to accurately filtering
         the returned synonyms and selecting the closest ones */
    Fsyn = max_th(score[]);
      /* selects the SRL with the highest score according to the specified threshold */
  }); /* Parallel.ForEach */

Algorithm 2: C-SynD algorithm.

Input: action type and a Feature Vertex Object (FVO)
Output: detected links between feature vertices, with the feature link weights (FLW) assigned
Procedure:
  i ← index of the input FVO in the FLG
  numVerts ← number of Feature Vertex Objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++)
    if (Action_type == "Closest-Synonyms") then
      Fsyn = get_C-SynD(FVO[i]) /* executes the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) then
        FLW ← SemRel(FVO[i], FVO[j]) /* computes the semantic relatedness between FVO[i] and FVO[j] */
        Rel_Type ← 5 /* relation type 5 indicates a synonym */
    elseif (Action_type == "Association") then
      optimal_action = get_IOAC-QL(FVO) /* executes the IOAC-QL algorithm */
      FLW ← reward /* the reward value of the returned optimal action */
      Rel_Type ← action_type /* type of the returned optimal action
        (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality) */
    Rel_vertex = j /* index of the related vertex */
    Adjacent[i, j] = FLW /* sets the linkage weight in the adjacency matrix between vertex[i] and vertex[j] */

Algorithm 3: SALA links algorithm.

Hence, the SALA can achieve the largest amount of interdependence among the MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach a goal state or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in state x. The SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x onward, max(\sum_{i=0} γ^i r_i), where γ is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing r + γf(y) − Q(x, a), where f(y) = max_a Q(y, a) and y is the new state resulting from action a. Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. The SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. A promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association-capability exploration is performed by estimating an action-value function. The value of taking action a in state s is defined under a policy π, denoted by Q(s, a), as the expected return starting from s and taking the action a. The action policy π is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way used in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALAs. The action with the maximum summed value is then chosen for execution.

Environment, State, and Actions. For the effective association-capability exploration problem, the FLG is provided as an environment in which intelligent agents can learn, share their knowledge, and explore the dependencies and relationships among the MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space. In each state, the MF object is fitted with the closest-synonym relations, and there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is optimal for each state. In summary, once s_k (an MF object) is received from the FLG, the SALA chooses one of the associations as the action a_k according to the SALA's policy π*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by the SALA at round k:

a_k = π*(s_k),  (6)

where (i) a_k ∈ action space = {similarity, contiguity, contrast, causality}; (ii) π*(s_k) = argmax_a Q*(s_k, a_k).

Once the optimal policy (π*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among the MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The Q-value is the immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy thereafter. A higher Q-value points to a greater chance of that action being chosen. First, all the Q(s, a) values are initialized to zero. Then, each SALA performs the steps in Heuristic 4.

Heuristic 4.

(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.

(ii) Receive the immediate reward r_k and observe the new state s'_k.

(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: a_k = π*(s_k) = argmax_a Q*(s_k, a_k).

(iv) Update Q(s_k, a_k) as follows:

Q(s_k, a_k) = (1 − α) Q(s_k, a_k) + α (r(s_k, a_k) + γ max_{a' ∈ A(s'_k)} Q(s'_k, a'_k)),  (7)

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k; Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k; r(s_k, a_k) is the reward corresponding to the acquired payoff; A(s'_k) is the set of possible actions at the next state s'_k; α (0 < α ≤ 1) is the learning rate; and γ (0 ≤ γ ≤ 1) is the discount rate.

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_ik for the action value i in document k as follows:

Reward(r) = AVW_ik = AVW_ik + \sum_{j=1}^{n} (FW_jk × SemRel(i, j)),  (8)


Input: Feature Vertex Object (FVO), WordNet
Output: the optimal action for the FVO relationships
Procedure:
  initialize Q[]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.For used for parallel processing of all actions */
  Parallel.For(0, action_space.Count(), k => {
    R[k] = Get_Action_Reward(a[k]); /* implements equation (8) */
    Q[i, k] = (1 − α) * Q[i, k] + α * (R[k] + γ * max_a(Q[i, k + 1]));
  }); /* Parallel.For */
  return Get_highest_Q-value_action(Q[]);
  /* selects the action with the maximum summed Q-value */

/* executes equation (8) to get the reward value */
Get_Action_Reward(a_k):
  AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
  for (int j = 0; j < FVO.Count(); j++) {
    FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
    SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
    Sum += FW_j * SemRel(i, j)
  }
  return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i; AVW_ik is the weight of action value i in document k; FW_jk is the weight of each feature object FVO_j in document k; and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], which is used to measure AVW_ik and FW_jk. The FW_jk in (8) is calculated likewise; AVW_ik is based on (9) and (10). The action value weight (AVW_ik) is a measure calculating the weight of the action value as the scalar product of the action value frequency and the inverse document frequency, as in (9). The action value frequency (F) measures the importance of this value in a document, and the inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

AVW_ik = (F_ik / \sum_k F_ik) × IDF_i,  (9)

where F_ik represents the number of cocontributions of this action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k to prevent a bias towards longer documents. The IDF is performed by dividing the number of all documents by the number of documents containing this action value, defined as follows:

IDF_i = log(|D| / |{d_k : v_i ∈ d_k}|),  (10)

where |D| is the total number of documents in the corpus and |{d_k : v_i ∈ d_k}| is the number of documents containing the action value v_i.
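A minimal sketch of equations (8)–(10) follows (illustrative; the cocontribution counts, corpus statistics, and relatedness function are hypothetical placeholders):

# Minimal sketch of the AVW reward of equations (8)-(10).
# Counts and the toy relatedness function are hypothetical placeholders.
import math
from typing import Callable, Dict, List

def tfidf(freq: int, total_freq: int, n_docs: int, n_docs_with: int) -> float:
    # Equations (9)-(10): normalized frequency times inverse document frequency.
    return (freq / total_freq) * math.log(n_docs / n_docs_with)

def action_reward(action_val: str, fvos: List[str],
                  freqs: Dict[str, int], total_freq: int,
                  n_docs: int, doc_freq: Dict[str, int],
                  sem_rel: Callable[[str, str], float]) -> float:
    # Equation (8): AVW of the action value plus relatedness-weighted FW terms.
    avw = tfidf(freqs[action_val], total_freq, n_docs, doc_freq[action_val])
    for fvo in fvos:
        fw = tfidf(freqs[fvo], total_freq, n_docs, doc_freq[fvo])
        avw += fw * sem_rel(action_val, fvo)
    return avw

freqs = {"vehicle": 4, "car": 6, "road": 3}
doc_freq = {"vehicle": 20, "car": 50, "road": 40}
reward = action_reward("vehicle", ["car", "road"], freqs, total_freq=13,
                       n_docs=100, doc_freq=doc_freq,
                       sem_rel=lambda a, b: 0.8 if b == "car" else 0.2)
print(round(reward, 3))   # ~0.79 for these placeholder counts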

Thus, in the IOAC-QL algorithm (see Algorithm 4), whose implementation is shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel_Type object and a Rel_vertex object: the Rel_Type object value equals the action type, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix to the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO1, one has the following:

action similarity: ActionResult State = {FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125}; Prob = 8; reward = 184; PrevState = FVO1; Q(s, a) = 91;

action contiguity: ActionResult State = {FVO103, FVO17, FVO44}; Prob = 3; reward = 74; PrevState = FVO1; Q(s, a) = 61;

action contrast: ActionResult State = {FVO89}; Prob = 1; reward = 22; PrevState = FVO1; Q(s, a) = 329;

action causality: ActionResult State = {FVO63, FVO249}; Prob = 2; reward = 25; PrevState = FVO1; Q(s, a) = 429.

As revealed in the case of FVO1, the SALA receives the environmental information data FVO1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated; the action with the highest reward is selected, and the elected action is executed.

Figure 4: The IOAC-QL algorithm implementation (descriptive document and sentence objects, object model construction of the FVOs, the feature linkage graph (FLG), the Q-learning mechanism assigning Q-values within the FLG environment, WordNet, and the SALA inference engine).

The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Given the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents requires finding all distinct common similarity or relatedness subgraphs. The objective is for the common similar or related subgraphs to serve as a relatedness measure that can be utilized in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric: a metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), so the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, and it then collects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations while holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i and y_i as defined in (11). The subgraph matching process is measured using the following formula:

\[ \text{Sub-graph-Rel}(\mathrm{SG}_1, \mathrm{SG}_2) = \frac{\sum_{i \in \mathrm{SG}_1} \sum_{j \in \mathrm{SG}_2} \mathrm{SemRel}(N_i, N_j)}{\sum_{i \in \mathrm{SG}_1} N_i + \sum_{j \in \mathrm{SG}_2} N_j} \tag{11} \]

where SG_1 is the selected subgraph from the FLG of one document, SG_2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for the nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
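The following Python sketch mirrors the shape of the computation in (11); representing subgraphs as node lists and supplying SemRel as a callback are assumptions for illustration, and the 50% node-matching threshold follows the text above.

def subgraph_relatedness(sg1, sg2, sem_rel, threshold=0.5):
    # Sub-graph-Rel per (11): sum pairwise node relatedness, normalized by
    # the total number of nodes in both subgraphs.
    total = 0.0
    for n1 in sg1:
        for n2 in sg2:
            s = sem_rel(n1, n2)    # degree of matching of the two nodes
            if s >= threshold:     # node-matching threshold (50% match)
                total += s
    return total / (len(sg1) + len(sg2))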

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. Exploring the effectiveness of the proposed SHGRS scheme requires examining the correlation of the proposed semantic relatedness measure against previous relatedness measures. The impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI [25, 37] and independent component analysis (ICA) [38] approaches for detecting the best near-synonym.


Table 1: Action values table with the SALA decision.

Action (state FVO1) | ActionResult State | Total reward | Q(s, a) | Decision
Similarity | FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125 | 184 | 94 | Yes
Contiguity | FVO103, FVO17, FVO44 | 74 | 61 | No
Contrast | FVO89 | 22 | 329 | No
Causality | FVO63, FVO249 | 25 | 429 | No

The difference between the proposed IOAC-QL for discovering semantic relationships or associations, implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is also computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task; such tasks include information retrieval, question answering, document classification, document summarizing, and document clustering. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over the real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment versus system judgment.

                        | Human judgment: True | Human judgment: False
System judgment: True   | TP                   | FP
System judgment: False  | FN                   | TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

\[ r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 \right) \left( \sum_{i=1}^{n} Y_i^2 \right)}} \tag{12} \]

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

\[ \mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{13} \]

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

\[ \mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \tag{14} \]

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
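A brief Python rendering of (12)-(14) follows; the inputs (paired score vectors and confusion counts) are assumed to be collected beforehand.

import math

def correlation(x, y):
    # Correlation coefficient r per (12), in the uncentered form printed above.
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def acceptance_rate(tp, fn):
    # AR per (13): share of related sentence pairs that were recognized.
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Acc per (14): share of all pairs predicted correctly.
    return (tp + tn) / (tp + fn + fp + tn)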

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 (Experiment 1: content-based evaluation). Miller and Charles (MC)^1: RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 (Experiment 1). Microsoft Research Paraphrase Corpus (MRPC)^2: the corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2: coselection-based evaluation, closest-synonym detection). British National Corpus (BNC)^3: a carefully selected collection of 4124 contemporary written and spoken English texts; a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2: closest-synonym detection). SN (semantic neighbors)^4: relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2: coselection-based evaluation, semantic relationships exploration). BLESS^6: relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2: semantic relationships exploration). TREC^5: includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2: semantic relationships exploration). IJCNLP 2011-New York Times (NYT)^6: contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2: semantic relationships exploration). IJCNLP 2011-Wikipedia^8: the Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3: task-based evaluation). Reuters-21578^7: contains 21578 documents (12902 are used), categorized into 10 categories.

DS10 (Experiment 3: task-based evaluation). 20 Newsgroups^8: contains 20000 documents (18846 are used), categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
^2 Available at http://research.microsoft.com/en-us/downloads
^3 Available at http://corpus.byu.edu/bnc
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number incorrectly detected, and let FN be the number of correct ones that were not detected in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and its worst value at 0. Precision, recall, and F-measure are defined as follows:

\[ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad F\text{-measure} = \frac{2 (\mathrm{Recall} \times \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \tag{15} \]
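For completeness, the coselection measures in (15) reduce to a few lines of Python over the TP/FP/FN counts defined above:

def precision_recall_f(tp, fp, fn):
    # Precision, recall, and F-measure per (15).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure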


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for the task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). The algorithms considered are agglomerative clustering (hierarchical), K-means (flat, hard clustering), and the EM algorithm (flat, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]; the algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure, which combines the precision and recall measures, is used. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[ P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r} \tag{16} \]

where N_ri is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is the harmonic mean of precision and recall, is defined as

\[ F(L_r, S_i) = \frac{2 \times P(L_r, S_i) \times R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)} \tag{17} \]

A per-class F score is calculated as follows:

\[ F(L_r) = \max_{S_i \in C} F(L_r, S_i) \tag{18} \]

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total set of clusters. The overall F-measure for the clustering result is the weighted average of the F-measure for each class L_r (macroaverage):

\[ F_C = \frac{\sum_{L_r} \left( |N_r| \times F(L_r) \right)}{\sum_{L_r} |N_r|} \tag{19} \]

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, owing to the higher accuracy of the clusters' mapping to the original classes.
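The overall clustering score in (16)-(19) can be sketched in Python as follows; representing classes and clusters as sets of document identifiers is an assumption for illustration.

def overall_f_measure(classes, clusters):
    # F_C per (16)-(19): for each class, take the best-matching cluster's
    # F-measure, then macroaverage weighted by class size.
    weighted_sum, total = 0.0, 0
    for members in classes.values():          # class label -> set of doc ids
        best_f = 0.0
        for docs in clusters.values():        # cluster id -> set of doc ids
            n_ri = len(members & docs)
            if n_ri == 0:
                continue
            p = n_ri / len(docs)              # precision, per (16)
            r = n_ri / len(members)           # recall, per (16)
            best_f = max(best_f, 2 * p * r / (p + r))  # (17) and (18)
        weighted_sum += len(members) * best_f # numerator of (19)
        total += len(members)
    return weighted_sum / total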

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over the real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses Reuters-21578 and 20 newsgroups.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence based in experiment 1, document based in experiment 2, and corpus based in experiment 3. In experiment 1, DS1 and DS2 investigated the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 investigated the effectiveness of the closest-synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 investigated the effectiveness of the IOAC-QL for semantic relationships exploration compared to the implicit relation extraction CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In experiment 3, DS9 and DS10 investigated the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring the semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as more attempts are made to improve the accuracy of measuring the relatedness between documents using semantic information.


Table 4: The results of the comparison of the correlation coefficient between human judgment, previous relatedness measures, and the proposed semantic relatedness measure.

Measure | Relevance correlation with M&C
Distance-based measures:
  Rada | 0.688
  Wu and Palmer | 0.765
Information-based measures:
  Resnik | 0.77
  Jiang and Conrath | 0.848
  Lin | 0.853
Hybrid measures:
  T. Hong and D. Smith | 0.879
  Zili Zhou | 0.882
Information/feature-based measures:
  The proposed MFsSR in SHGRS | 0.937

The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, the correlation was computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between the compared approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach were based on the best Acc and AR across all thresholds, instead of under the same relatedness threshold. This improvement in the Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is realized through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI and ICA approaches for detecting the best near-synonym. The difference among the proposed IOAC-QL for discovering semantic relationships or associations, the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest-synonym detection and the IOAC-QL for the semantic relationship exploration algorithms. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing by the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision.


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

Training subset (4076 pairs; human judgment: 2753 true):

Threshold | Islam and Inkpen [7]: Acc / AR | Proposed MFsSR: Acc / AR
0.1 | 0.67 / 1 | 0.68 / 1
0.2 | 0.67 / 1 | 0.68 / 1
0.3 | 0.67 / 1 | 0.68 / 1
0.4 | 0.67 / 1 | 0.68 / 1
0.5 | 0.69 / 0.98 | 0.68 / 1
0.6 | 0.72 / 0.89 | 0.68 / 1
0.7 | 0.68 / 0.78 | 0.70 / 0.98
0.8 | 0.56 / 0.4 | 0.72 / 0.86
0.9 | 0.37 / 0.09 | 0.60 / 0.49
1 | 0.33 / 0 | 0.34 / 0.02

Test subset (1725 pairs; human judgment: 1147 true):

Threshold | Islam and Inkpen [7]: Acc / AR | Proposed MFsSR: Acc / AR
0.1 | 0.66 / 1 | 0.67 / 1
0.2 | 0.66 / 1 | 0.67 / 1
0.3 | 0.66 / 1 | 0.67 / 1
0.4 | 0.66 / 1 | 0.67 / 1
0.5 | 0.68 / 0.98 | 0.67 / 1
0.6 | 0.72 / 0.89 | 0.66 / 1
0.7 | 0.68 / 0.78 | 0.66 / 0.98
0.8 | 0.56 / 0.4 | 0.64 / 0.85
0.9 | 0.38 / 0.09 | 0.49 / 0.5
1 | 0.33 / 0 | 0.34 / 0.03

Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are clearly achieved, with a high percentage, in different datasets and domains.

Inferring the MFs relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, the Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in the precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF: IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm are up to 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset, approximately, over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents.


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

Dataset | PMI: Precision / Recall / F-measure (%) | ICA: Precision / Recall / F-measure (%) | C-SynD: Precision / Recall / F-measure (%)
SN | 60.6 / 60.6 / 61 | 79.5 / 71.6 / 75.3 | 85 / 80 / 82
BNC | 74.5 / 67.9 / 71 | 76 / 67.8 / 72 | 82 / 71.9 / 77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Dataset | CRFs: Precision / Recall / F-measure (%) | TFICF: Precision / Recall / F-measure (%) | IOAC-QL: Precision / Recall / F-measure (%)
TREC | 75.08 / 60.2 / 66.28 | 89.3 / 71.4 / 78.7 | 89.8 / 88.1 / 88.6
BLESS | 73.04 / 62.66 / 67.03 | 73.8 / 69.5 / 71.6 | 95.0 / 83.5 / 88.9
NYT | 68.46 / 54.02 / 60.38 | 86.0 / 65.0 / 74.0 | 90.0 / 74.0 / 81.2
Wikipedia | 56.0 / 42.0 / 48.0 | 64.6 / 54.88 / 59.34 | 70.0 / 44.0 / 54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms.

Dataset | Algorithm | Term-based FC (%) | Concept-based FC (%) | Graph-based FC (%)
Reuters-21578 | HAC | 73 | 85.7 | 90.2
Reuters-21578 | K-means | 51 | 84.6 | 91.8
20 Newsgroups | HAC | 53 | 82 | 89
20 Newsgroups | K-means | 48 | 79.3 | 87

Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis, and the experimental results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which relies on finding all distinct common similarity subgraphs. The semantic representations of textual documents and the relatedness determined between them are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment was repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents: the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach reach up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to 36% by HAC and 39% by K-means for the 20 newsgroups dataset, approximately, over the base case (term-based) approach. Furthermore, improvements of up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to 7% by HAC and 8% by K-means for the 20 newsgroups dataset, were achieved over the concept-based approach.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by their MFs and attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented in a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, with extensive experiments conducted on 10 different datasets; the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu.

[13] OpenNLP, http://opennlp.apache.org.

[14] AlchemyAPI, http://www.alchemyapi.com.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.



Figure 2: A framework for constructing a semantic hierarchy/graph text representation scheme (semantic text refining and annotating stage: word-tagging, syntactic, and semantic analyses with semantic annotation by AlchemyAPI, MFs extraction, and MFs hierarchy construction; dependencies/relationships exploring and inferring stage: the MFsSR algorithm with WordNet and the SALA detecting closest synonyms, inferring dependency and semantic relationships, and building the relationships graph structure; documents linkage conducting stage: FLG matching and computation of the subgraph-relatedness measure).

correlation through many-to-many relationships. The third stage aims to conduct the document linkage process using a subgraph matching process that relies on finding all of the distinct similarities common to the subgraphs. The main proceedings of this framework can be summarized as follows:

(1) extracting the MFs of sentences and their attributes to build the hierarchy-based structure for efficient dimension reduction;

(2) estimating a novel semantic relatedness measure with consideration of direct relevance and indirect relevance between MFs and their attributes for promising performance improvements;

(3) detecting the closest synonyms of MFs for better disambiguation;

(4) exploring and inferring dependencies and relationships of MFs for more understanding;

(5) representing the interdependence among the MFs of sentences in a graph-based structure so as to be more precise and informative;

(6) conducting a document linkage process for more effectiveness.

3.1. Semantic Text Refining and Annotating Stage. This stage is responsible for refining the text to discover the text MFs, with additional annotation for more semantic understanding, and aims to

(1) accurately parse each sentence and identify POS, subject-action-object, and named-entity recognition;

(2) discover the MFs of each sentence in the textual document;

(3) exploit semantic information in each sentence through detection of the attributes of the MFs;

(4) reduce the dimensions as much as possible;

(5) generate an effective descriptive sentence object (DSO) with a hierarchical sentence object automatically.

This stage can be achieved through the following processes. First, the text NL is studied at different linguistic levels, that is, words, sentence, and meaning, for semantic analysis and annotation [30, 31]. Second, exploiting the information "who is doing what to whom" clarifies the dependencies between verbs and their arguments for the extraction of MFs. Finally, building the MFs hierarchy-based structure lays out the sentences, their MFs, and their attributes.

With regard to extracting MFs and building the MFs hierarchy-based structure, the following points must be highlighted. First, there is a dire need to refine the text content by representing each sentence with its MFs instead of a series of terms, to reduce the dimensions as much as possible. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. Furthermore, AlchemyAPI extracts semantic metadata from content, such as information on subject-action-object relation extraction, people, places, companies, topics, facts, relationships, authors, and languages. Based on the NL concept by OpenNLP and AlchemyAPI, the sentence MFs are identified as subject "Sub", main verb "MV", and object "Obj" (direct or indirect object). In addition, other terms remaining in the sentence, such as complement "Com" (subject complement or object complement) and modifiers (Mod), are considered as attributes of these MFs.

Second, automatic annotation is essential for each of the MFs, adding information for more semantic understanding. Based on the semantic annotation by OpenNLP and AlchemyAPI, the sentence MFs are annotated with feature value (FVal), part-of-speech (POS), named-entity recognition (Ner), and a list of feature attributes


Figure 3: The hierarchy-based structure of the textual document (a root with DSO nodes; each DSO with MF nodes MF1, ..., MFn annotated with FVal, POS, and Ner; each MF with attribute nodes FATT1, ..., FATTm annotated with Val, POS, Ner, and Type).

(FAtts). The list of FAtts is constructed from the remaining terms in the sentence (complement or modifier), relying on the grammatical relation. Each attribute is annotated with attribute value (Val), attribute type (Type), which is complement or modifier, attribute POS (POS), and attribute named-entity recognition (Ner).

Finally, the hierarchy-based structure of the textual document is represented as an object [32] containing the hierarchical structure of sentences, with the MFs of the sentences and their attributes. This hierarchical structure maintains the dependency between the terms at the sentence level for more understanding, which provides sentences with fast access and retrieval. Figure 3 shows the hierarchical structure of the text in a theme tree "tree_DSO": the first level of the hierarchical model denotes all sentence objects, which are called DSOs; the next level of nodes denotes the MFs of each sentence object; and the last level of nodes denotes the attributes of the MFs.

tree_DSO.DSO[i] indicates the annotation of sentence object i, which is fitted with additional information as in Heuristic 1. This heuristic simplifies the hierarchical model of the textual document with its DSO objects and its MFs. The feature score is used to measure the contribution of an MF in the sentence. This score is based on the number of the terms grammatically related to the MF (the summation of its feature attributes), as follows:

\[ \text{Feature Score (FS)} = \sum_{i} \mathrm{MF}_i.\mathrm{FAtt} \tag{1} \]

Heuristic 1. Let tree_DSO = {sibling of DSO nodes} and let each DSO ∈ tree_DSO = {SEN_key, SEN_value, SEN_State, sibling of MF_node[]}, where SEN_key indicates the order of the sentence in the document as an index; SEN_value indicates the sentence parsed value, as in (S(NP(NN employee))(VP(VBZ guides)(NP(DT the)(NN customer)))(.)); SEN_State indicates the sentence tense and state (tense: past, state: negative); and sibling of MF_node[] indicates the list of sibling nodes of MFs for each sentence, fitted with Val, POS, Ner, and a list of FAtts.
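As an illustration of Heuristic 1 and the feature score in (1), the Python sketch below models a DSO and computes FS as the number of grammatically related attributes; the class layout itself is an assumption, with field names mirroring the heuristic.

from dataclasses import dataclass, field

@dataclass
class MFNode:
    val: str                                    # feature value, e.g., "employee"
    pos: str = ""                               # part-of-speech tag
    ner: str = ""                               # named-entity label
    fatts: list = field(default_factory=list)   # complement/modifier attributes

    def feature_score(self):
        # FS per (1): count of terms grammatically related to the MF.
        return len(self.fatts)

@dataclass
class DSO:
    sen_key: int        # order of the sentence in the document
    sen_value: str      # parsed sentence value
    sen_state: dict     # e.g., {"tense": "past", "state": "negative"}
    mfs: list = field(default_factory=list)     # sibling MF nodes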

In the text refining and annotating (TRN) algorithm (see Algorithm 1), the textual document is converted to an object model with the hierarchical DSO. The TRN algorithm uses the OpenNLP and AlchemyAPI tools for MFs extraction and the hierarchy model construction. As a result, the constructed hierarchy structure leads to the reduction of the dimensions as much as possible and achieves efficient information access and knowledge acquisition.

3.2. Dependencies/Relationships Exploring and Inferring Stage. This stage is responsible for exploring how the dependencies and relationships among MFs have an effect and aims to

(1) represent the interdependence among sentence MFs in a graph-based structure known as the feature linkage graph (FLG);

(2) formulate an accurate measure for the semantic relatedness of MFs;

(3) detect the closest synonyms for each of the MFs;

(4) infer the relationships and dependencies of MFs with each other.

This stage can be achieved in three processes. First, an FLG is built to represent the dependencies and relationships among sentences. Second, an efficient MFsSR measure is proposed to contribute to the detection of the closest synonyms and to infer the relationships and dependencies of the MFs. This measure considers the direct relevance and the indirect relevance among MFs. The direct relevance covers the synonyms and association capabilities (similarity, contiguity, contrast, and causality) among the MFs; these association capabilities indicate the relationships that give the largest amount of interdependence among the MFs, while the indirect relevance refers to other relationships among the attributes of these MFs. Finally, the unknown dependencies and relationships among MFs are explored by exploiting semantic information about their texts at the document level.

In addition, a semantic actionable learning agent (SALA) plays a fundamental role in detecting the closest synonyms and inferring the relationships and dependencies of the MFs. Learning is needed to improve the SALA functionality to achieve multiple goals; many studies show that an RL agent has a high reproductive capability for human-like behaviors [29]. As a result, the SALA performs an adaptive approach combining thesaurus-based and distributional mechanisms. In the thesaurus-based mechanism, words are compared in terms of how they appear in the thesaurus (e.g., WordNet), while in the distributional mechanism, words are compared in terms of the shared number of contexts in which they may appear.

3.2.1. Graph-Based Organization Constructing Process. The relationships and dependencies among the MFs are organized into a graph-like structure of nodes with links, known as the FLG and defined in Heuristic 2. The graph structure FLG represents many-to-many relationships among MFs. The FLG is a directed acyclic graph FLG(V, A), represented as a collection of vertices and a collection of directed arcs; these arcs connect pairs of vertices with no path returning to the same vertex (acyclic). In FLG(V, A), V is the set of vertices or states of the graph, and A is the set of arcs between vertices. The FLG construction is based on two sets of data. The first set represents the vertices in a one-dimensional array Vertex(V) of Feature Vertex


Input: list of sentences of the document d_i
Output: List_DSO  /* list of DSO objects and their MFs with attributes */
Procedure:
  D ← new document
  List_DSO ← empty list  /* list of sentence objects of the document */
  for each sentence s_i in document D do
    AlchemyAPI(s_i)  /* calls AlchemyAPI to determine all MFs and their NER */
    for each feature mf_j ∈ {mf_1, mf_2, ..., mf_n} in DSO do
      get_MF_Attributes(mf_j)  /* determines the attributes of each MF */
      OpenNLP(s_i)  /* calls OpenNLP to determine POS and NER for the MF and attributes */
      compute_MF_score(mf_j)  /* determines the score of each MF */
    end
    List_DSO.add(DSO)
  end

Algorithm 1: TRN algorithm.

objects (FVOs). The FVO is annotated with Sentence Feature key (SFKey), feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and feature weight (FW), where SFKey combines the sentence key and the feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object holds the value of the feature, the Fsyn object holds a list of the detected closest synonyms of the Fval, and the FAss object holds a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Type between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional array Adjacency(V, V) of feature link weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let Adjacency(V, V) = {FLW_(0,0), FLW_(0,1), ..., FLW_(1,0), FLW_(1,1), ..., FLW_(n,n)}, where FVO indicates an object of the MF in the vertex vector, annotated with additional information as FVO = {SFKey, Fval, Fsyn, FAss, FW}. SFKey indicates the combination of the sentence key and the feature key in a DSO; Fval indicates the value of the feature; Fsyn indicates a list of the closest synonyms of the Fval; FAss indicates a list of the feature associations and relationships for the Fval, with each object in this list defined as FAssitem = {Rel_Type, Rel_vertex}, where Rel_Type indicates the value of the relation type linkage between the two features and Rel_vertex indicates the index of the related vertex; FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document; and FLW indicates the value of the linkage weight in an adjacency matrix.
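A compact Python sketch of the FLG layout in Heuristic 2 follows; the concrete containers (a vertex list and a dense adjacency matrix) are illustrative choices, not prescribed by the paper.

from dataclasses import dataclass, field

@dataclass
class FVO:
    # Feature Vertex Object per Heuristic 2.
    sf_key: tuple                                 # (sentence key, feature key)
    fval: str                                     # value of the feature
    fsyn: list = field(default_factory=list)      # detected closest synonyms
    fass: list = field(default_factory=list)      # (Rel_Type, Rel_vertex) pairs
    fw: float = 0.0                               # accumulated relationship weight

class FLG:
    # Directed acyclic feature linkage graph: vertex vector plus adjacency matrix.
    def __init__(self, vertices):
        self.vertex = list(vertices)
        n = len(self.vertex)
        self.flw = [[0.0] * n for _ in range(n)]  # feature link weights

    def link(self, i, j, rel_type, weight):
        # Record a relationship of rel_type (1-5) from vertex i to vertex j.
        self.vertex[i].fass.append((rel_type, j))
        self.flw[i][j] = weight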

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summary, text annotation, text mining, word sense disambiguation, automatic indexing, machine translation, and other aspects [2, 3]. In this paper, one of the information-based methods is utilized by considering IC. The semantic relatedness investigated here concerns the direct relevance and the indirect relevance among MFs. An MF may be one word or more, and thereby the MF is considered as a concept. The IC is a measure quantifying the amount of information a concept expresses. Lin [23] assumed that, for a concept c, with p(c) the probability of encountering an instance of concept c, the IC value is obtained by considering the negative log likelihood of encountering the concept in a given corpus. Traditionally, the semantic relatedness of concepts is based on cooccurrence information in a large corpus. However, cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, developing a new information content method based on cocontributions, instead of cooccurrences, for the proposed MFsSR is important.

The new information content (NIC) measure is an extension of the information content measure. The NIC measures are based on the relations defined in the WordNet ontology. The NIC measure uses hypernym/hyponym, synonym/antonym, holonym/meronymy, and entail/cause to quantify the informativeness of concepts. For example, a hyponym relation could be "bicycle is a vehicle," and a meronym relation could be "a bicycle has wheels." NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronymy, entail, and cause relationships, normalized by the maximum number of MFs/attributes objects in the textual document, using

\[ \mathrm{NIC}(c) = 1 - \frac{\log\bigl(\mathrm{Hype\_Hypo}(c) + \mathrm{Syn\_Anto}(c) + \mathrm{Holo\_Mero}(c) + \mathrm{Enta\_Cause}(c) + 1\bigr)}{\log(\mathrm{max\_concept})} \]  (2)


where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronymy values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MFs/attributes objects in the considered text. The max_concept normalizes the NIC value, and hence the NIC values fall in the range of [0, 1].
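As a quick illustration of (2), the following Python sketch (an assumed reading, with the relation counts supplied by the caller rather than extracted from the FLG) computes the NIC value:

import math

def nic(hype_hypo: int, syn_anto: int, holo_mero: int,
        enta_cause: int, max_concept: int) -> float:
    """New information content of a concept c, per (2). The four counts are
    the numbers of FVOs in the FLG related to c through each WordNet relation
    pair; max_concept (assumed > 1) is the total number of MFs/attributes
    objects in the text, so the result falls in [0, 1]."""
    numerator = math.log(hype_hypo + syn_anto + holo_mero + enta_cause + 1)
    return 1.0 - numerator / math.log(max_concept)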

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\[ \mathrm{SemRel} = \lambda \, \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} + (1 - \lambda) \, \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(att\_c_1, att\_c_2))}{\mathrm{NIC}(att\_c_1) + \mathrm{NIC}(att\_c_2)} \]  (3)

where λ ∈ [0, 1] decides the relative contribution of direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, λ ∈ [0.5, 1].
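Equation (3) can be read as a λ-weighted blend of two Lin-style ratios, one at the concept level and one at the attribute level. A minimal Python sketch, assuming the NIC values and LCS lookups are computed elsewhere, might look as follows:

def sem_rel(nic_c1: float, nic_c2: float, nic_lcs_c: float,
            nic_att1: float, nic_att2: float, nic_lcs_att: float,
            lam: float = 0.7) -> float:
    """MFsSR per (3): direct (concept-level) plus indirect (attribute-level)
    relevance, with lam in [0.5, 1] so direct relevance dominates. The
    default 0.7 is an illustrative choice, not a value from the paper."""
    direct = 2.0 * nic_lcs_c / (nic_c1 + nic_c2)
    indirect = 2.0 * nic_lcs_att / (nic_att1 + nic_att2)
    return lam * direct + (1.0 - lam) * indirect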

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonyms detection, the SALA carries out the first action to extract the closest synonyms for each MF, where SALA tackles the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonyms detection, SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called the Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as Hypernyms, Hyponym, Meronyms, Holonymy, Antonymy, Entail, and Cause. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then all synonyms and their relationship lists are assigned to a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

\[ \mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}\bigl(\mathrm{SRL}_k(i).\mathrm{item}, \mathrm{FVO}_j\bigr) \]  (4)

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in SRL_k with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms; SALA applies the threshold to select the closest synonyms with the highest score, as follows:

\[ \text{Closest-Synonyms} = \max_{\mathrm{th}} \bigl( \mathrm{F\_SRList.SRList.score()} \bigr) \]  (5)
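A minimal Python sketch of the scoring and selection in (4) and (5) follows; the helper names are hypothetical, and the threshold semantics reflect our reading of (5), keeping the highest-scoring SRL among those that clear the threshold:

def score_srl(srl_items, fvo_vertices, sem_rel):
    # Score(SRL_k) per (4): sum the relatedness of every item in the
    # synonym-relations list against every FVO in the Feature_Vertex array.
    return sum(sem_rel(item, fvo) for item in srl_items for fvo in fvo_vertices)

def closest_synonyms(f_sr_list, fvo_vertices, sem_rel, threshold):
    # Selection per (5): among the SRLs whose score clears the threshold,
    # return the one with the highest score (None if none qualifies).
    scored = [(srl, score_srl(srl, fvo_vertices, sem_rel)) for srl in f_sr_list]
    eligible = [(srl, s) for srl, s in scored if s >= threshold]
    return max(eligible, key=lambda p: p[1])[0] if eligible else None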

Then each SALA retains the closest synonyms in the Fsyn object of each FVO, and SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA links algorithm, the SALA receives each FVO and detects links to the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, SALA aims to explore the implicit dependencies and relationships in the context, much as a human does, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. These association capabilities are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronymy relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as (give), and the other is called resultant, such as (have), in WordNet.

To equip SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of SALA is to select an optimal association to maximize the total long-run accumulated reward. Hence, SALA


Input: Array of Feature Vertex Objects (FVO), WordNet
Output: List of closest synonyms of the FVO objects /* the detected closest synonyms to set the Fsyn value */
Procedure:
  SRL <- empty list /* list of each synonym and its relationships */
  F_SRList <- empty list /* list of SRLs (all synonyms and their relationships lists) */
  /* Parallel.ForEach used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo =>
  {
    Fsyn <- empty list
    F_SRList = get_all_Syn_Relations_List(fvo) /* get all synonyms and their relationships lists from WordNet */
    /* compute the semantic relatedness of each synonym and its relationships with the fvo object, then get the score of each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(Get_SemRel(F_SRList[i], fvo)) /* the score of each synonym list; this score contributes to accurately filtering the returned synonyms and selecting the closest one */
    Fsyn = max_th(score[]) /* select the SRL with the highest score according to the specific threshold */
  }) /* ParallelFor */

Algorithm 2: C-SynD algorithm.

Input: Action type and Feature Vertex Object (FVO)
Output: Detected links between feature vertices and the assigned Features Link Weight (FLW)
Procedure:
  i <- index of the input FVO in the FLG
  numVerts <- number of Feature Vertex objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++)
  {
    if (Action_type == "Closest-Synonyms") then
      Fsyn = get_C-SynD(FVO[i]) /* execute the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) then
        FLW <- SemRel(FVO[i], FVO[j]) /* compute the semantic relatedness between FVO[i] and FVO[j] */
        Rel_Type <- 5 /* relation type 5 indicates a synonym */
    elseif (Action_type == "Association") then
      optimal_action = get_IOAC-QL(FVO) /* execute the IOAC-QL algorithm */
      FLW <- reward /* the reward value of the returned optimal action */
      Rel_Type <- action_type /* the type of the returned optimal action (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality) */
    Rel_vertex = j /* index of the related vertex */
    Adjacent[i, j] = FLW /* set the linkage weight in the adjacency matrix between vertex[i] and vertex[j] */
  }

Algorithm 3: SALA links algorithm.

can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach a goal state or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in the state x. SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x, \( \max \sum_{i=0}^{\infty} \gamma^i r_i \), where γ is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing \( r + \gamma f(y) - Q(x, a) \), where \( f(y) = \max_a Q(y, a) \) and y is the new state resulting from action a. Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association-capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s under a policy π, denoted by Q(s, a), is defined as the expected return starting from s and taking the action a. The action policy π is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way chosen in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALAs. The action with the maximum summed value is then chosen to execute.

Environment, State, and Actions. For the effective association-capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn and share their knowledge and can explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space; in each state, where the MF object is fitted with the closest-synonyms relations, there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is the optimal one for each state. In summary, once s_k (MF object) is received from the FLG, SALA chooses one of the associations as the action a_k according to SALA's policy π*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by SALA at round k:

\[ a_k = \pi^*(s_k) \]  (6)

Therein, one has the following:
(i) a_k ∈ action space = {similarity, contiguity, contrast, causality};
(ii) \( \pi^*(s_k) = \arg\max_a Q^*(s_k, a_k) \).

Once the optimal policy (π*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy, is the Q-value; a higher Q-value points to a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then each SALA performs the steps in Heuristic 4.

Heuristic 4.
(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.
(ii) Receive the immediate reward r_k and observe the new state s'_k.
(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: \( a_k = \pi^*(s_k) = \arg\max_a Q^*(s_k, a_k) \).
(iv) Update Q(s_k, a_k) as follows:

\[ Q(s_k, a_k) = (1 - \alpha) \, Q(s_k, a_k) + \alpha \Bigl( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \Bigr) \]  (7)

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k, Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k, r(s_k, a_k) is the reward corresponding to the acquired payoff, A(s'_k) is the set of possible actions at the next state s'_k, α (0 < α ≤ 1) is the learning rate, and γ (0 ≤ γ ≤ 1) is the discount rate.
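The update in (7) is the standard Q-learning rule. The following Python sketch (illustrative only, with a dictionary-backed Q-table rather than the paper's action-value table; α, γ, and the example values are assumptions) shows one update step and the greedy policy of Heuristic 3:

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    # One Q-learning update per (7); Q is a dict keyed by (state, action).
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)

def choose_action(Q, state, actions):
    # Greedy policy pi*(s) = argmax_a Q(s, a) from Heuristic 3.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: one round over the four association actions.
ACTIONS = ("similarity", "contiguity", "contrast", "causality")
Q = {}
q_update(Q, "FVO_1", "similarity", reward=1.84, next_state="FVO_7",
         actions=ACTIONS)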

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_ik for the action value i in document k as follows:

\[ \mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \bigl( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \bigr) \]  (8)


Input: Feature Vertex Object (FVO), WordNet
Output: The optimal action for the FVO relationships
Procedure:
  initialize Q[]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.For used for parallel processing of all actions */
  Parallel.For(0, action_space.count(), k =>
  {
    R[k] = Get_Action_Reward(a[k]) /* implements (8) */
    Q[i, k] = (1 - α) * Q[i, k] + α * (R[k] + γ * max_a(Q[i, k + 1]))
  }) /* ParallelFor */
  return Get_highest_Q-value_action(Q[]) /* select the action with the maximum summed Q-value */

/* execute (8) to get the reward value */
Get_Action_Reward(a_k):
  AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
  for (int j = 0; j < FVO.Count(); j++)
  {
    FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
    SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
    Sum += FW_j * SemRel(i, j)
  }
  return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_ik is the weight of action value i in document k, FW_jk is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme, the normalized word frequency TF-IDF [34], is used to measure AVW_ik and FW_jk. The FW_jk in (8) is calculated likewise; AVW_ik is based on (9) and (10). The action value weight (AVW_ik) is a measure calculating the weight of the action value as the scalar product of the action value frequency and the inverse document frequency, as in (9). The action value frequency (F) measures the importance of this value in a document; the inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

\[ \mathrm{AVW}_{ik} = \frac{F_{ik}}{\sum_{k} F_{ik}} \cdot \mathrm{IDF}_i \]  (9)

where F_ik represents the number of cocontributions of this action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k (normalization prevents a bias towards longer documents). IDF is computed by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\[ \mathrm{IDF}_i = \log \left( \frac{|D|}{\bigl| \{ d_k : v_i \in d_k \} \bigr|} \right) \]  (10)

where |D| is the total number of documents in the corpus and |{d_k : v_i ∈ d_k}| is the number of documents containing the action value v_i.
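Reading (8)-(10) together, the immediate reward is a TF-IDF-style action-value weight augmented by relatedness-weighted feature weights. A minimal Python sketch under that reading (all inputs precomputed; the function names are hypothetical) follows:

import math

def idf(num_docs: int, num_docs_with_value: int) -> float:
    # IDF_i per (10).
    return math.log(num_docs / num_docs_with_value)

def action_value_weight(freq_ik: float, total_freq_k: float,
                        idf_i: float) -> float:
    # AVW_ik per (9): normalized cocontribution frequency scaled by IDF.
    return (freq_ik / total_freq_k) * idf_i

def reward(avw_ik: float, feature_weights, sem_rel_scores) -> float:
    # Reward(r) per (8): AVW_ik plus the relatedness-weighted sum of the
    # feature weights FW_jk over all FVOs in the document.
    return avw_ik + sum(fw * rel
                        for fw, rel in zip(feature_weights, sem_rel_scores))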

Thus, in the IOAC-QL algorithm (see Algorithm 4), whose implementation is shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel-Type object and a Rel_vertex object: the Rel-Type object value equals the action type, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix with the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO_1, one has the following:

action: similarity; ActionResult_State: {FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125}; Prob: 0.8; reward: 1.84; PrevState: FVO_1; Q(s, a) = 0.91.

action: contiguity; ActionResult_State: {FVO_103, FVO_17, FVO_44}; Prob: 0.3; reward: 0.74; PrevState: FVO_1; Q(s, a) = 0.61.

action: contrast; ActionResult_State: {FVO_89}; Prob: 0.1; reward: 0.22; PrevState: FVO_1; Q(s, a) = 0.329.

action: causality; ActionResult_State: {FVO_63, FVO_249}; Prob: 0.2; reward: 0.25; PrevState: FVO_1; Q(s, a) = 0.429.

As revealed in the case of FVO_1, the SALA receives the environmental information data FVO_1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated above; the action with


[Figure 4: The IOAC-QL algorithm implementation. The figure shows the semantic actionable learning agent (SALA) inference engine: descriptive document and sentence objects (DDO/DSO) with features F_1, ..., F_n feed the object model construction of FVOs (SFKey, Fval, Fsyn[], FAss[], FW) in Vertex[V] and FLW values in Adjacent[V, V]; the feature linkage graph (FLG) serves as the environment in which the Q-learning mechanism assigns Q-values Q(S_0, a_0), ..., Q(S_m, a_4) with WordNet for action selection, forming a reinforcement learning model for semantic relational graph construction.]

the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Given the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents through finding all distinct common similarity or relatedness subgraphs is required. The objective of finding common similar or related subgraphs as a relatedness measure is to utilize it in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric: a metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), such that the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations, holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness_subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i, y_i as defined in (11). The subgraph matching process is measured using the following formula:

\[ \text{Sub-graph-Rel}(\mathrm{SG1}, \mathrm{SG2}) = \frac{\sum_{i \in \mathrm{SG1}} \sum_{j \in \mathrm{SG2}} \mathrm{SemRel}(N_i, N_j)}{\sum_{i \in \mathrm{SG1}} N_i + \sum_{j \in \mathrm{SG2}} N_j} \]  (11)

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for the nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for further mining processes. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure against previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is


Table 1: Action values table with the SALA decision.

State | Action     | ActionResult_State                                            | Total reward | Q(s, a) | Decision
FVO_1 | Similarity | FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125 | 1.84         | 0.94    | Yes
FVO_1 | Contiguity | FVO_103, FVO_17, FVO_44                                       | 0.74         | 0.61    | No
FVO_1 | Contrast   | FVO_89                                                        | 0.22         | 0.329   | No
FVO_1 | Causality  | FVO_63, FVO_249                                               | 0.25         | 0.429   | No

studied and compared to PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations, IOAC-QL, and implicit relation extraction with conditional random fields (CRFs) [6, 39] and term frequency and inverse cluster frequency (TFICF) [40] is computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task; some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering methods. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over a real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure; it is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                         Human judgment
                         True    False
System judgment   True   TP      FP
                  False  FN      TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

\[ r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 \right) \left( \sum_{i=1}^{n} Y_i^2 \right)}} \]  (12)

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences; a high acceptance rate means recognizing almost all similar or related sentences:

\[ \mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \]  (13)

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

\[ \mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \]  (14)

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
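For completeness, (13) and (14) reduce to a few lines of Python over the contingency counts of Table 2:

def acceptance_rate(tp: int, fn: int) -> float:
    # AR per (13): proportion of truly related pairs that were recognized.
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Acc per (14): proportion of all pairs that were predicted correctly.
    return (tp + tn) / (tp + fn + fp + tn)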

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, a precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 (Experiment 1: content-based evaluation): Miller and Charles (MC)^1. Consists of 65 pairs of nouns extracted from the WordNet, rated by multiple human annotators.

DS2 (Experiment 1: content-based evaluation): Microsoft Research Paraphrase Corpus (MRPC)^2. The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2: coselection-based evaluation, closest-synonym detection): British National Corpus (BNC)^3. The BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2: coselection-based evaluation, closest-synonym detection): SN (semantic neighbors)^4. SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2: coselection-based evaluation, semantic relationships exploration): BLESS^6. BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2: coselection-based evaluation, semantic relationships exploration): TREC^5. TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities, person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located_in (406) indicates that one location is located inside another location, work_for (394) indicates that a person works for an organization, OrgBased_in (451) indicates that an organization is based in a location, live_in (521) indicates that a person lives at a location, and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2: coselection-based evaluation, semantic relationships exploration): IJCNLP 2011-New York Times (NYT)^6. NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2: coselection-based evaluation, semantic relationships exploration): IJCNLP 2011-Wikipedia^8. The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3: task-based evaluation): Reuters-21578^7. Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3: task-based evaluation): 20 Newsgroups^8. The 20 newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
^2 Available at http://research.microsoft.com/en-us/downloads/
^3 Available at http://corpus.byu.edu/bnc/
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but undetected closest synonyms or semantic relationships in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure

reaches its best value at 1 and worst value at 0. Precision, recall, and F-measure are defined as follows:

\[ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad F\text{-measure} = \frac{2 \, (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \]  (15)


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for the task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering), (b) hierarchical clustering (which creates a hierarchy of clusters), (c) hard clustering (which assigns each document/object as a member of exactly one cluster), and (d) soft clustering (which distributes the documents/objects over all clusters). The algorithms considered are agglomerative (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure, which combines the precision and recall measures, is used as the quality measure. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[ P(L_r, S_i) = \frac{N_{ri}}{N_i}, \quad R(L_r, S_i) = \frac{N_{ri}}{N_r} \]  (16)

where N_ri is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

\[ F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)} \]  (17)

A per-class F score is calculated as follows:

\[ F(L_r) = \max_{S_i \in C} F(L_r, S_i) \]  (18)

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

\[ F_C = \frac{\sum_{L_r} \bigl( |N_r| \cdot F(L_r) \bigr)}{\sum_{L_r} |N_r|} \]  (19)

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
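A compact Python sketch of (16)-(19), assuming the clusters and gold-standard classes are given as sets of document ids, is shown below:

def clustering_f_measure(clusters: dict, classes: dict) -> float:
    """Overall F_C per (16)-(19). `clusters` maps cluster id -> set of doc
    ids; `classes` maps class label -> set of doc ids (the gold standard)."""
    weighted_sum, total = 0.0, 0
    for label, members in classes.items():
        best_f = 0.0
        for cluster in clusters.values():
            overlap = len(members & cluster)           # N_ri
            if overlap == 0:
                continue
            p = overlap / len(cluster)                 # P(L_r, S_i) per (16)
            r = overlap / len(members)                 # R(L_r, S_i) per (16)
            best_f = max(best_f, 2 * p * r / (p + r))  # F(L_r) per (17)-(18)
        weighted_sum += len(members) * best_f          # |N_r| * F(L_r)
        total += len(members)
    return weighted_sum / total                        # F_C per (19)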

4.2. Evaluation Setup (Dataset). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over a real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus; Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS; Experiment 3 uses Reuters-21578 and 20 newsgroups.

4.3. Evaluation Results. This section reports the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence based in Experiment 1, document based in Experiment 2, and corpus based in Experiment 3. In Experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In Experiment 2, DS3 and DS4 are used to investigate the effectiveness of the closest-synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In Experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring the semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of MFsSR for closest-synonym detection and of IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as more is done to improve the accuracy of measuring the relatedness between documents using semantic information


Table 4: The results of the comparison of the correlation coefficient between human judgment and some relatedness measures, including the proposed semantic relatedness measure.

Measure                              | Relevance correlation with M&C
Distance-based measures
  Rada                               | 0.688
  Wu and Palmer                      | 0.765
Information-based measures
  Resnik                             | 0.77
  Jiang and Conrath                  | 0.848
  Lin                                | 0.853
Hybrid measures
  T. Hong and D. Smith               | 0.879
  Zili Zhou                          | 0.882
Information/feature-based measures
  The proposed MFsSR                 | 0.937

in the SHGRS. The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of the clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, the correlation is computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between nine componential approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds rather than under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance; this relevance takes into account the closest synonyms and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is achieved through the closest-synonym detection and semantic relationship exploration algorithms, so the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations, IOAC-QL, the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationship exploration. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

MRPC dataset | Relatedness threshold | Islam and Inkpen [7] Acc, AR | The proposed MFsSR Acc, AR

Training subset (4076); human judgment (TP + FN): 2753 true
0.1 | 0.67, 1    | 0.68, 1
0.2 | 0.67, 1    | 0.68, 1
0.3 | 0.67, 1    | 0.68, 1
0.4 | 0.67, 1    | 0.68, 1
0.5 | 0.69, 0.98 | 0.68, 1
0.6 | 0.72, 0.89 | 0.68, 1
0.7 | 0.68, 0.78 | 0.70, 0.98
0.8 | 0.56, 0.4  | 0.72, 0.86
0.9 | 0.37, 0.09 | 0.60, 0.49
1   | 0.33, 0    | 0.34, 0.02

Test subset (1725); human judgment (TP + FN): 1147 true
0.1 | 0.66, 1    | 0.67, 1
0.2 | 0.66, 1    | 0.67, 1
0.3 | 0.66, 1    | 0.67, 1
0.4 | 0.66, 1    | 0.67, 1
0.5 | 0.68, 0.98 | 0.67, 1
0.6 | 0.72, 0.89 | 0.66, 1
0.7 | 0.68, 0.78 | 0.66, 0.98
0.8 | 0.56, 0.4  | 0.64, 0.85
0.9 | 0.38, 0.09 | 0.49, 0.5
1   | 0.33, 0    | 0.34, 0.03

disturbing the precision. Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted on two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm; better precision values are clearly achieved, with a high percentage across different datasets and domains.

Inferring the MFs relationships and dependencies is a process of achieving a large number of interdependences among MFs resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationships) that gains the largest amount of interdependences among the MFs,

resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonyms. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF: IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm over the CRFs approach reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

Datasets | PMI: Precision, Recall, F-measure | ICA: Precision, Recall, F-measure | C-SynD: Precision, Recall, F-measure
SN       | 60.6, 60.6, 61                    | 79.5, 71.6, 75.3                  | 85, 80, 82
BNC      | 74.5, 67.9, 71                    | 76, 67.8, 72                      | 82, 71.9, 77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches.

Datasets  | CRFs: Precision, Recall, F-measure | TFICF: Precision, Recall, F-measure | IOAC-QL: Precision, Recall, F-measure
TREC      | 75.08, 60.2, 66.28                 | 89.3, 71.4, 78.7                    | 89.8, 88.1, 88.6
BLESS     | 73.04, 62.66, 67.03                | 73.8, 69.5, 71.6                    | 95.0, 83.5, 88.9
NYT       | 68.46, 54.02, 60.38                | 86.0, 65.0, 74.0                    | 90.0, 74.0, 81.2
Wikipedia | 56.0, 42.0, 48.0                   | 64.6, 54.88, 59.34                  | 70.0, 44.0, 54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms.

Datasets      | Algorithm | Term-based FC | Concept-based FC | Graph-based FC
Reuters-21578 | HAC       | 73            | 85.7             | 90.2
Reuters-21578 | K-means   | 51            | 84.6             | 91.8
20 Newsgroups | HAC       | 53            | 82               | 89
20 Newsgroups | K-means   | 48            | 79.3             | 87

proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis, and the results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct common similarity subgraphs. The semantic representations of textual documents, and the relatedness between them that they determine, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in

a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted on two corpora of text documents, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. Over the base case (term-based) approach, the improvements achieved by the graph-based approach reach 17% with HAC and 40% with K-means for the Reuters-21578 dataset, and 36% with HAC and 39% with K-means for the 20 newsgroups dataset. Furthermore, over the concept-based approach, the improvements achieved by the graph-based approach reach 5% with HAC and 7% with K-means for the Reuters-21578 dataset, and 7% with HAC and 8% with K-means for the 20 newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by their MFs and attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, with extensive experiments conducted on 10 different datasets; the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference—North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu.

[13] OpenNLP, http://opennlp.apache.org.

[14] AlchemyAPI, http://www.alchemyapi.com.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.


[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web—ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.



Figure 3: The hierarchy-based structure of the textual document.

(FAtts). The list of FAtts is constructed from the remaining terms in the sentence (complement or modifier), relying on the grammatical relation. Each attribute is annotated with attribute value (Val), attribute type (Type), which is complement or modifier, attribute POS (POS), and attribute named-entity recognition (Ner).

Finally, the hierarchy-based structure of the textual document is represented as an object [32] containing the hierarchical structure of sentences with the MFs of the sentences and their attributes. This hierarchical structure maintains the dependency between the terms at the sentence level for more understanding, which provides sentences with fast access and retrieval. Figure 3 shows the hierarchical structure of the text in a theme tree "tree_DSO": the first level of the hierarchical model denotes all sentence objects, which are called DSO; the next level of nodes denotes the MFs of each sentence object; and the last level of nodes denotes the attributes of the MFs.

tree_DSO.DSO[i] indicates the annotation of sentence object i, which is fitted with additional information as in Heuristic 1. This heuristic simplifies the hierarchical model of the textual document with its DSO objects and its MFs. The feature score is used to measure the contribution of the MF in the sentence. This score is based on the number of the terms grammatically related to the MF (the summation of feature attributes), as follows:

\text{Feature Score (FS)} = \sum_{i} \mathrm{MF{.}FAtt}_{i}.  (1)

Heuristic 1. Let tree_DSO = {Sibling of DSO nodes}, and let each DSO ∈ tree_DSO = {SEN_key, SEN_value, SEN_State, Sibling of MF_node[]}, where SEN_key indicates the order of the sentence in the document as an index; SEN_value indicates the sentence parsed value, as in (S (NP (NN employee)) (VP (VBZ guides) (NP (DT the) (NN customer))) (·)); SEN_State indicates the sentence tense and state (tense: past, state: negative); and Sibling of MF_node[] indicates the list of sibling nodes of MFs for each sentence, fitted with Val, POS, Ner, and the list of FAtts.
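To make the hierarchy concrete, the following is a minimal Python sketch of the tree_DSO object model described in Heuristic 1; the class and field names (DSO, MF, FAtt, feature_score) are illustrative choices, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FAtt:                      # attribute of a main feature (MF)
    val: str                     # attribute value (Val)
    typ: str                     # "complement" or "modifier" (Type)
    pos: str                     # part of speech (POS)
    ner: str = ""                # named-entity tag (Ner)

@dataclass
class MF:                        # main feature of a sentence
    val: str
    pos: str
    ner: str = ""
    fatts: List[FAtt] = field(default_factory=list)

    def feature_score(self) -> int:
        # Equation (1): FS counts the terms grammatically related to the MF
        return len(self.fatts)

@dataclass
class DSO:                       # descriptive sentence object
    sen_key: int                 # order of the sentence in the document
    sen_value: str               # parsed value of the sentence
    sen_state: str               # e.g., "tense: past, state: negative"
    mfs: List[MF] = field(default_factory=list)

tree_dso: List[DSO] = []         # first level of the hierarchy: all sentence objects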

In the text refining and annotating (TRN) algorithm (see Algorithm 1), the textual document is converted to an object model with the hierarchical DSO. The TRN algorithm uses the OpenNLP and AlchemyAPI tools for MFs extraction and the hierarchy model construction. As a result, the constructed hierarchy structure reduces the dimensions as much as possible and achieves efficient information access and knowledge acquisition.

3.2. Dependencies/Relationships Exploring and Inferring Stage. This stage is responsible for exploring how the dependencies and relationships among MFs have an effect, and aims to:

(1) represent the interdependence among sentence MFs in a graph-based structure known as the feature linkage graph (FLG);

(2) formulate an accurate measure for the semantic relatedness of MFs;

(3) detect the closest synonyms for each of the MFs;

(4) infer the relationships and dependencies of MFs with each other.

This stage can be achieved in three processes. First, an FLG is built to represent the dependencies and relationships among sentences. Second, an efficient MFsSR measure is proposed to contribute to the detection of the closest synonyms and to infer the relationships and dependencies of the MFs. This measure considers the direct relevance and the indirect relevance among MFs. The direct relevance covers the synonyms and association capabilities (similarity, contiguity, contrast, and causality) among the MFs; these association capabilities indicate the relationships that give the largest amount of interdependence among the MFs, while the indirect relevance refers to other relationships among the attributes of these MFs. Finally, the unknown dependencies and relationships among MFs are explored by exploiting semantic information about their texts at a document level.

In addition, a semantic actionable learning agent (SALA) plays a fundamental role in detecting the closest synonyms and inferring the relationships and dependencies of the MFs. Learning is needed to improve the SALA functionality to achieve multiple goals. Many studies show that the RL agent has a high reproductive capability for human-like behaviors [29]. As a result, the SALA performs an adaptive approach combining thesaurus-based and distributional mechanisms. In the thesaurus-based mechanism, words are compared in terms of how they appear in the thesaurus (e.g., WordNet), while in the distributional mechanism, words are compared in terms of the shared number of contexts in which they may appear.

3.2.1. Graph-Based Organization Constructing Process. The relationships and dependencies among the MFs are organized into a graph-like structure of nodes with links, known as the FLG and defined in Heuristic 2. The graph structure FLG represents many-to-many relationships among MFs. The FLG is a directed acyclic graph FLG(V, A), represented by a collection of vertices and a collection of directed arcs. These arcs connect pairs of vertices, with no path returning to the same vertex (acyclic). In FLG(V, A), V is the set of vertices or states of the graph, and A is the set of arcs between vertices. The FLG construction is based on two sets of data. The first set represents the vertices in a one-dimensional array Vertex(V) of Feature Vertex


Input: List of sentences of the document d_i
Output: List_DSO /* list of DSO objects and their MFs with attributes */
Procedure:
  D ← new document
  List_DSO ← empty list /* list of sentence objects of the document */
  for each sentence s_i in document D do
    AlchemyAPI(s_i) /* calls AlchemyAPI to determine all MFs and their NER */
    for each feature mf_j ∈ {mf_1, mf_2, ..., mf_n} in DSO do
      get_MF_Attributes(mf_j) /* determines the attributes of each MF */
      OpenNLP(s_i) /* calls OpenNLP to determine POS and NER for the MF and attributes */
      compute_MF_score(mf_j) /* determines the score of each MF, as in (1) */
    end
    List_DSO.add(DSO)
  end

Algorithm 1: TRN algorithm.

objects (FVOs). The FVO is annotated with sentence-feature key (SFKey), feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and feature weight (FW). SFKey combines the sentence key and the feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object holds the value of the feature; the Fsyn object holds a list of the detected closest synonyms of the Fval; and the FAss object holds a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Typ between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional array Adjacency(V, V) of feature link weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let Adjacency(V, V) = {FLW_{(0,0)}, FLW_{(0,1)}, ..., FLW_{(1,0)}, FLW_{(1,1)}, ..., FLW_{(n,n)}}, where FVO indicates an object of the MF in the vertex vector, annotated with the additional information as FVO = {SFKey, Fval, Fsyn, FAss, FW}. SFKey indicates the combination of the sentence key and the feature key in a DSO; Fval indicates the value of the feature; Fsyn indicates a list of the closest synonyms of the Fval; FAss indicates a list of the feature associations and relationships for the Fval, where each object in this list is defined as FAss_item = {Rel_Type, Rel_vertex}, with Rel_Type indicating the value of the relation-type linkage between the two features and Rel_vertex indicating the index of the related vertex; FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document; and FLW indicates the value of the linkage weight in an adjacency matrix.
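A minimal sketch of the FLG containers from Heuristic 2 follows, assuming a dense adjacency matrix; the class and method names are illustrative, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FVO:                                   # feature vertex object
    sf_key: Tuple[int, int]                  # (sentence key, feature key) in the DSO
    fval: str                                # feature value
    fsyn: List[str] = field(default_factory=list)               # closest synonyms
    fass: List[Tuple[int, int]] = field(default_factory=list)   # (Rel_Type, Rel_vertex)
    fw: float = 0.0                          # accumulated relationship weights

class FLG:                                   # directed acyclic feature linkage graph
    def __init__(self, n_vertices: int):
        self.vertex: List[FVO] = []
        # adjacency[i][j] holds the feature link weight (FLW) between vertices i and j
        self.adjacency = [[0.0] * n_vertices for _ in range(n_vertices)]

    def add_vertex(self, fvo: FVO) -> int:
        self.vertex.append(fvo)
        return len(self.vertex) - 1

    def add_link(self, i: int, j: int, rel_type: int, flw: float) -> None:
        # record the relation on the source vertex and the weight in the matrix
        self.vertex[i].fass.append((rel_type, j))
        self.adjacency[i][j] = flw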

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summary, text annotation, text mining, word sense disambiguation, automatic indexing, machine translation, and other aspects [2, 3]. In this paper, one of the information-based methods is utilized by considering IC. The semantic relatedness that has been investigated concerns the direct relevance and the indirect relevance among MFs. The MFs may be one word or more, and thereby the MF is considered as a concept. The IC is considered as a measure of quantifying the amount of information a concept expresses. Lin [23], who proposed IC, assumed that for a concept c, p(c) is the probability of encountering an instance of concept c. The IC value is obtained by considering the negative log likelihood of encountering a concept in a given corpus. Traditionally, the semantic relatedness of concepts is usually based on the cooccurrence information in a large corpus. However, the cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, the development of a new information content method based on cocontributions, instead of the cooccurrences, for the proposed MFsSR is important.

The new information content (NIC) measure is an extension of the information content measure. The NIC measures are based on the relations defined in the WordNet ontology. The NIC measure uses hypernym/hyponym, synonym/antonym, holonym/meronymy, and entail/cause to quantify the informativeness of concepts. For example, a hyponym relation could be "bicycle is a vehicle," and a meronym relation could be "a bicycle has wheels." NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronymy, entail, and cause relationships, normalized by the maximum number of MFs/attributes objects in the textual document, using

\mathrm{NIC}(c) = 1 - \frac{\log(\mathrm{Hype\_Hypo}(c) + \mathrm{Syn\_Anto}(c) + \mathrm{Holo\_Mero}(c) + \mathrm{Enta\_Cause}(c) + 1)}{\log(\mathrm{max\_concept})},  (2)


where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronymy values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MFs/attributes objects in the considered text. The max_concept normalizes the NIC value, and hence the NIC values fall in the range of [0, 1].
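A direct transcription of (2) is sketched below; the four relation counters are assumed to be precomputed lookups over the FLG, as described above.

import math

def nic(hype_hypo, syn_anto, holo_mero, enta_cause, max_concept):
    """New information content per equation (2).

    Each argument is the count of FVOs in the FLG related to the concept by
    the corresponding WordNet relation; max_concept is the total number of
    MFs/attributes objects in the text (assumed > 1, so the log is nonzero).
    """
    related = hype_hypo + syn_anto + holo_mero + enta_cause
    return 1.0 - math.log(related + 1) / math.log(max_concept)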

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\mathrm{SemRel} = \lambda \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} \right) + (1 - \lambda) \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(\mathrm{att}_{c_1}, \mathrm{att}_{c_2}))}{\mathrm{NIC}(\mathrm{att}_{c_1}) + \mathrm{NIC}(\mathrm{att}_{c_2})} \right),  (3)

where \lambda \in [0, 1] decides the relative contribution of direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, \lambda \in [0.5, 1].
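Equation (3) combines the direct (concept) and indirect (attribute) terms; a sketch follows, assuming the NIC values of the two concepts, their attributes, and their least common subsumers (LCS) have already been computed with nic() above. The function name and default lambda are illustrative.

def sem_rel(nic_c1, nic_c2, nic_lcs, nic_att1, nic_att2, nic_att_lcs, lam=0.7):
    """Equation (3): lambda weighs direct vs. indirect relevance; the paper
    takes lam from [0.5, 1] since direct relevance dominates."""
    direct = 2.0 * nic_lcs / (nic_c1 + nic_c2)
    indirect = 2.0 * nic_att_lcs / (nic_att1 + nic_att2)
    return lam * direct + (1.0 - lam) * indirect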

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonyms detection, the SALA carries out the first action to extract the closest synonyms for each MF, where SALA addresses the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonyms detection, SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernyms, hyponyms, meronyms, holonymy, antonymy, entail, and cause. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned to a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

\mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}(\mathrm{SRL}_k(i).\mathrm{item}, \mathrm{FVO}_j),  (4)

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in SRL_k with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms. SALA applies the threshold to select the closest synonyms with the highest score, according to the specific threshold, as follows:

\text{Closest-Synonyms} = \max_{\mathrm{th}}(\mathrm{F\_SRList.SRList.score}()).  (5)
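A compact sketch of the scoring and thresholding steps in (4)-(5) is given below. It assumes F_SRList is a list of (synonym, SRL-items) pairs and that sem_rel_pair(item, fvo) wraps the MFsSR measure; these shapes and names are assumptions for illustration, not the authors' data layout.

def score_srl(srl_items, fvos, sem_rel_pair):
    """Equation (4): score one synonym-relations list (SRL) against all FVOs."""
    return sum(sem_rel_pair(item, fvo) for item in srl_items for fvo in fvos)

def closest_synonyms(f_srlist, fvos, sem_rel_pair, threshold):
    """Equation (5): keep synonyms whose SRL score clears the threshold,
    highest-scoring first."""
    scored = [(score_srl(srl, fvos, sem_rel_pair), syn) for syn, srl in f_srlist]
    return [syn for s, syn in sorted(scored, reverse=True) if s >= threshold]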

Then, each SALA retains the closest synonyms in an object Fsyn of each FVO, and SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA links algorithm, the SALA receives each FVO for detecting links among the other FVOs according to the Fsyn object values and FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, SALA aims to explore the implicit dependencies and relationships in the context, as a human would, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. They are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronymy relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as (give), and the other is called resultant, such as (have), in WordNet.

To equip SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of SALA is to select an optimal association to maximize the total long-run accumulated reward. Hence, SALA


Input: Array of Feature Vertex Objects (FVO), WordNet
Output: List of closest synonyms of the FVO objects /* the detected closest synonyms to set the Fsyn value */
Procedure:
  SRL ← empty list /* list of each synonym and its relationships */
  F_SRList ← empty list /* list of SRLs (all synonyms and their relationship lists) */
  /* Parallel.ForEach used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo => {
    Fsyn ← empty list
    F_SRList = get_all_Syn_Relations_List(fvo) /* gets all synonyms and their relationship lists from WordNet */
    /* compute the semantic relatedness of each synonym and its relationships with the fvo object, then get the score of each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(function_Get_SemRel(F_SRList[i], fvo)) /* computes the score of each synonym list; this score contributes to accurately filtering the returned synonyms and selecting the closest one */
    Fsyn = max_th(score[]) /* selects the SRL with the highest score according to the specific threshold */
  }) /* end Parallel.ForEach */

Algorithm 2: C-SynD algorithm.

Input: Action type and Feature Vertex Object (FVO)
Output: Detected links between feature vertices, with the assigned feature link weight (FLW)
Procedure:
  i ← index of the FVO input in the FLG
  numVerts ← number of Feature Vertex Objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++)
    if (Action_type == "Closest-Synonyms") then
      Fsyn = get_C-SynD(FVO[i]) /* executes the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) then
        FLW ← SemRel(FVO[i], FVO[j]) /* computes the semantic relatedness between FVO[i] and FVO[j] */
        Rel-Type ← 5 /* relation type 5 indicates a synonym */
    else if (Action_type == "Association") then
      optimal_action = get_IOAC-QL(FVO) /* executes the IOAC-QL algorithm */
      FLW ← reward /* the reward value of the returned optimal action */
      Rel-Type ← action_type /* the type of the returned optimal action (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality) */
    Rel_vertex = j /* index of the related vertex */
    Adjacent[i, j] = FLW /* sets the linkage weight in the adjacency matrix between vertex[i] and vertex[j] */

Algorithm 3: SALA links algorithm.

can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach goal states or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in the state x. SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x, \max(\sum_{i=0} \gamma^i r_i), where \gamma is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing r + \gamma f(y) - Q(x, a), where f(y) = \max_a Q(y, a) and y is the new state resulting from action


a. Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association-capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s is defined under a policy \pi, denoted by Q(s, a), as the expected return starting from s and taking the action a. Action policy \pi is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way chosen in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALA. The action with the maximum summed value is then chosen to execute.

Environment, State, and Actions. For the effective association-capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn and share their knowledge and can explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space. In each state, the MF object is fitted with the closest-synonym relations, and there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is optimal for each state. In summary, once s_k (MF object) is received from the FLG, SALA chooses one of the associations as the action a_k according to SALA's policy \pi^*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by SALA at round k:

a_k = \pi^*(s_k).  (6)

Therein, one has the following:
(i) a_k ∈ action space = {similarity, contiguity, contrast, causality};
(ii) \pi^*(s_k) = \arg\max_a Q^*(s_k, a_k).

Once the optimal policy (\pi^*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy, is the Q-value; a higher Q-value points to a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then, each SALA performs the steps in Heuristic 4.

Heuristic 4.
(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.
(ii) Receive the immediate reward r_k and observe the new state s'_k.
(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: a_k = \pi^*(s_k) = \arg\max_a Q^*(s_k, a_k).
(iv) Update Q(s_k, a_k) as follows:

Q(s_k, a_k) = (1 - \alpha)\, Q(s_k, a_k) + \alpha \left( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \right),  (7)

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k, and Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k. The r(s_k, a_k) is a reward corresponding to the acquired payoff, A(s'_k) is the set of possible actions at the next state s'_k, \alpha (0 < \alpha \le 1) is the learning rate, and \gamma (0 \le \gamma \le 1) is the discount rate.
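The update in (7) is the standard tabular Q-learning rule [33]; the following is a minimal sketch with illustrative state/action keys and default hyperparameters.

ACTIONS = ("similarity", "contiguity", "contrast", "causality")

def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """Equation (7): temporal-difference update of Q(s, a).

    Q: dict mapping (state, action) -> value; alpha in (0, 1], gamma in [0, 1].
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (reward + gamma * best_next)

def best_action(Q, s):
    # pi*(s) = argmax_a Q(s, a), as in Heuristic 3
    return max(ACTIONS, key=lambda a: Q.get((s, a), 0.0))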

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_{ik} for the action value i in document k as follows:

\mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \left( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \right),  (8)


Input: Feature Vertex Object (FVO), WordNet
Output: The optimal action for FVO relationships
Procedure:
  initialize Q[]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.ForEach used for parallel processing of all actions */
  Parallel.For(0, action_space.Count(), k => {
    R[k] = Get_Action_Reward(a[k]) /* implements (8) */
    Q[i, k] = (1 - α) · Q[i, k] + α · (R[k] + γ · max_a(Q[i, k + 1]))
  }) /* end Parallel.For */
  return Get_highest_Q-value_action(Q[]) /* selects the action with the maximum summed Q-value */

/* executes (8) to get the reward value */
Get_Action_Reward(a_k):
  AVW_k = (freq(a_k.val)/FVO.Count()) · log(N/n_a)
  for (int j = 0; j < FVO.Count(); j++)
    FW_j = (freq(FVO[j])/FVO.Count()) · log(N/n_f)
    SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
    Sum += FW_j · SemRel(i, j)
  return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_{ik} is the weight of action value i in document k, FW_{jk} is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], used to measure AVW_{ik} and FW_{jk}. The FW_{jk} in (8) is calculated likewise; AVW_{ik} is based on (9) and (10). The action value weight (AVW_{ik}) is a measure that calculates the weight of the action value, that is, the scalar product of the action-value frequency and the inverse document frequency, as in (9). The action-value frequency (F) measures the importance of this value in a document; the inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

\mathrm{AVW}_{ik} = \frac{F_{ik}}{\sum_{k} F_{ik}} \cdot \mathrm{IDF}_i,  (9)

where F_{ik} represents the number of cocontributions of this action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k (normalized to prevent a bias towards longer documents). IDF is obtained by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\mathrm{IDF}_i = \log \left( \frac{|D|}{|\{d_k : v_i \in d_k\}|} \right),  (10)

where |D| is the total number of documents in the corpus and |\{d_k : v_i \in d_k\}| is the number of documents containing the action value v_i.
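Equations (8)-(10) combine a TF-IDF-style weight with semantic relatedness to form the immediate reward; a sketch follows, with illustrative helper names and precomputed counts as assumptions.

import math

def idf(n_docs_total, n_docs_containing):
    """Equation (10): inverse document frequency of an action value."""
    return math.log(n_docs_total / n_docs_containing)

def avw(freq_in_doc, total_cocontributions, n_docs_total, n_docs_containing):
    """Equation (9): normalized action-value frequency times IDF."""
    return (freq_in_doc / total_cocontributions) * idf(n_docs_total, n_docs_containing)

def reward(avw_ik, fw_weights, sem_rels):
    """Equation (8): immediate reward for action value i in document k.

    fw_weights[j]: weight FW_jk of feature object FVO_j (computed like avw);
    sem_rels[j]: SemRel between the action value and FVO_j.
    """
    return avw_ik + sum(fw * sr for fw, sr in zip(fw_weights, sem_rels))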

Thus, in the IOAC-QL algorithm (see Algorithm 4), whose implementation is shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel-Type object and a Rel_vertex object: the Rel-Type object value equals the action type, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix with the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO_1, one has the following:

action: similarity; ActionResult State: {FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125}; Prob: 8; reward: 184; PrevState: FVO_1; Q(s, a) = 91;
action: contiguity; ActionResult State: {FVO_103, FVO_17, FVO_44}; Prob: 3; reward: 74; PrevState: FVO_1; Q(s, a) = 61;
action: contrast; ActionResult State: {FVO_89}; Prob: 1; reward: 22; PrevState: FVO_1; Q(s, a) = 329;
action: causality; ActionResult State: {FVO_63, FVO_249}; Prob: 2; reward: 25; PrevState: FVO_1; Q(s, a) = 429.

As revealed in the case of FVO_1, the SALA receives the environmental information data FVO_1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated; the action with


Figure 4: The IOAC-QL algorithm implementation (object model construction from the DSOs, the feature linkage graph (FLG) with Vertex[V] and Adjacent[V, V], and the SALA inference engine with the Q-learning mechanism assigning Q-values over the FLG environment, backed by WordNet).

the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships, according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Due to the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents, through finding all distinct common similarity or relatedness subgraphs, is required. The objective is for finding common similar or related subgraphs, as a relatedness measure, to be utilized in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is a metric. A metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), such that the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations, holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i, y_i, as defined in (11). The subgraph matching process is measured using the following formula:

\text{Sub-graph-Rel}(\mathrm{SG1}, \mathrm{SG2}) = \frac{\sum_{\mathrm{SG1}_i} \sum_{\mathrm{SG2}_j} \mathrm{SemRel}(N_i, N_j)}{\sum_{\mathrm{SG1}_i} N_i + \sum_{\mathrm{SG2}_j} N_j},  (11)

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for both nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
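A minimal sketch of (11) with the 50% node-matching threshold follows; sem_rel(ni, nj) is assumed to wrap the MFsSR measure, and the function name is illustrative.

def subgraph_rel(sg1_nodes, sg2_nodes, sem_rel, node_threshold=0.5):
    """Equation (11): relatedness of two subgraphs as the sum of pairwise node
    relatedness over the total node count; pairs below the node-matching
    threshold (50% in the paper) contribute nothing."""
    total = 0.0
    for ni in sg1_nodes:
        for nj in sg2_nodes:
            r = sem_rel(ni, nj)
            if r >= node_threshold:
                total += r
    return total / (len(sg1_nodes) + len(sg2_nodes))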

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure against other previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is


Table 1: Action values table with the SALA decision.

| State | FVO_1 | FVO_1 | FVO_1 | FVO_1 |
| Action | Similarity | Contiguity | Contrast | Causality |
| ActionResult State | {FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125} | {FVO_103, FVO_17, FVO_44} | {FVO_89} | {FVO_63, FVO_249} |
| Total reward | 184 | 74 | 22 | 25 |
| Q(s, a) | 94 | 61 | 329 | 429 |
| Decision | Yes | No | No | No |

studied and compared to PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations, IOAC-QL, and implicit relation extraction with conditional random fields (CRFs) [6, 39] and term frequency and inverse cluster frequency (TF-ICF) [40] is computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation, using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task; some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over a real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table of human judgment versus system judgment.

|                        | Human judgment: True | Human judgment: False |
| System judgment: True  | TP | FP |
| System judgment: False | FN | TN |

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between the sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 \right) \cdot \left( \sum_{i=1}^{n} Y_i^2 \right)}}.  (12)

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

\mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.  (13)

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}},  (14)

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
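The content-based measures (12)-(14) are straightforward to compute from the contingency counts in Table 2; a minimal sketch:

import math

def correlation(x, y):
    """Equation (12): correlation between sequences X and Y."""
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
    return num / den

def acceptance_rate(tp, fn):
    """Equation (13): proportion of related sentences predicted as related."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Equation (14): proportion of all sentences predicted correctly."""
    return (tp + tn) / (tp + fn + fp + tn)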

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, a precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

| Dataset | Experimenting | Dataset name | Description |
| DS1 | Experiment 1: content-based evaluation | Miller and Charles (MC)^1 | RG consists of 65 pairs of nouns extracted from the WordNet, rated by multiple human annotators. |
| DS2 | Experiment 1: content-based evaluation | Microsoft Research Paraphrase Corpus (MRPC)^2 | The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as relatedness by human annotators. The whole set is divided into a training subset (4076 sentences, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true). |
| DS3 | Experiment 2: coselection-based evaluation (closest-synonym detection) | British National Corpus (BNC)^3 | BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations. |
| DS4 | Experiment 2: coselection-based evaluation (closest-synonym detection) | SN (semantic neighbors)^4 | SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. |
| DS5 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | BLESS^6 | BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random. |
| DS6 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | TREC^5 | TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located_in (406) indicates that one location is located inside another location; work_for (394) indicates that a person works for an organization; OrgBased_in (451) indicates that an organization is based in a location; live_in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations. |
| DS7 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-New York Times (NYT)^6 | NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset. |
| DS8 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-Wikipedia^8 | The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset. |
| DS9 | Experiment 3: task-based evaluation | Reuters-21578^7 | Reuters-21578 contains 21578 documents (12902 are used), categorized into 10 categories. |
| DS10 | Experiment 3: task-based evaluation | 20 Newsgroups^8 | The 20 newsgroups dataset contains 20000 documents (18846 are used), categorized into 20 categories. |

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html.
^2 Available at http://research.microsoft.com/en-us/downloads.
^3 Available at http://corpus.byu.edu/bnc.
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv.
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp.
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip.
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html.

Let TP denote the number of correctly detected closest synonyms or explored semantic relationships, let FP be the number of incorrectly detected closest synonyms or explored semantic relationships, and let FN be the number of correct but undetected closest synonyms or semantic relationships in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure

reaches its best value at 1 and its worst value at 0. Precision, recall, and F-measure are defined as follows:

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad F\text{-measure} = \frac{2 (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}}.  (15)


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used as the quality measure, combining the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r},  (16)

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is the harmonic mean of precision and recall, is defined as

F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}.  (17)

A per-class F score is calculated as follows:

F(L_r) = \max_{S_i \in C} F(L_r, S_i).  (18)

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

F_C = \frac{\sum_{L_r} \left( |N_r| \cdot F(L_r) \right)}{\sum_{L_r} |N_r|},  (19)

where |N_r| is the number of objects in class L_r. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes.
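Equations (16)-(19) can be computed directly from a class-by-cluster count table; the following is a minimal sketch, with the input layout (a dict of per-class cluster counts) as an illustrative assumption.

def overall_f_measure(n_rc, class_sizes):
    """Equations (16)-(19): macroaveraged F-measure of a clustering.

    n_rc[r][i]: number of documents of class r in cluster i;
    class_sizes[r]: |N_r|, the number of documents of class r.
    """
    n_clusters = len(next(iter(n_rc.values())))
    cluster_sizes = [sum(n_rc[r][i] for r in n_rc) for i in range(n_clusters)]
    total, weighted = 0, 0.0
    for r, n_r in class_sizes.items():
        best = 0.0
        for i in range(n_clusters):
            if n_rc[r][i] == 0:
                continue
            p = n_rc[r][i] / cluster_sizes[i]          # equation (16), precision
            rec = n_rc[r][i] / n_r                     # equation (16), recall
            best = max(best, 2 * p * rec / (p + rec))  # equations (17)-(18)
        weighted += n_r * best                         # equation (19), weighted sum
        total += n_r
    return weighted / total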

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experimenting over a real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus; experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS; and experiment 3 uses the Reuters-21578 and 20 newsgroups datasets.

4.3. Evaluation Results. This section reports the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence-based in experiment 1, document-based in experiment 2, and corpus-based in experiment 3. In experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 are used to investigate the effectiveness of closest-synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TF-ICF [40], based on the availability of semantic relation judgments. In experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring the semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measure is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of MFsSR for closest-synonym detection and of IOAC-QL for semantic relationships exploration is proportionally improved, and a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as more effort is made to improve the accuracy of measuring the relatedness between documents using semantic information


Table 4: The results of the comparison of the correlation coefficient between human judgment and several relatedness measures, including the proposed semantic relatedness measure.

Measure                               Correlation with M&C
Distance-based measures
  Rada                                0.688
  Wu and Palmer                       0.765
Information-based measures
  Resnik                              0.77
  Jiang and Conrath                   0.848
  Lin                                 0.853
Hybrid measures
  T. Hong and D. Smith                0.879
  Zili Zhou                           0.882
Information/feature-based measures
  The proposed MFsSR                  0.937

The relatedness subgraph measure affects the performance of the document managing and mining process in proportion to how much of the semantic information it considers. The most accurate relatedness is achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering improves, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment evaluates the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness measure are comparable with other previous studies in the same field. In this experiment, the correlation is computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between the eight compared approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to capturing the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds, instead of under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonyms and the relationships of sentence MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment studies the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed semantic relationship discovery algorithm IOAC-QL, the implicit relation extraction CRFs approach, and the TFICF approach is also examined. This experiment measures the performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationship exploration. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in recall without disturbing the precision.


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

MRPC dataset             Relatedness   Human judgment   Islam and Inkpen [7]   The proposed MFsSR
                         threshold     (TP + FN)        Acc     AR             Acc     AR
Training subset (4076)   0.1           2753 true        0.67    1              0.68    1
                         0.2                            0.67    1              0.68    1
                         0.3                            0.67    1              0.68    1
                         0.4                            0.67    1              0.68    1
                         0.5                            0.69    0.98           0.68    1
                         0.6                            0.72    0.89           0.68    1
                         0.7                            0.68    0.78           0.70    0.98
                         0.8                            0.56    0.4            0.72    0.86
                         0.9                            0.37    0.09           0.60    0.49
                         1                              0.33    0              0.34    0.02
Test subset (1725)       0.1           1147 true        0.66    1              0.67    1
                         0.2                            0.66    1              0.67    1
                         0.3                            0.66    1              0.67    1
                         0.4                            0.66    1              0.67    1
                         0.5                            0.68    0.98           0.67    1
                         0.6                            0.72    0.89           0.66    1
                         0.7                            0.68    0.78           0.66    0.98
                         0.8                            0.56    0.4            0.64    0.85
                         0.9                            0.38    0.09           0.49    0.5
                         1                              0.33    0              0.34    0.03

Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted on two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach, by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach, by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are also clearly achieved, with a high percentage, on different datasets and domains.

Inferring the MFs relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, the Q-learning is used to select the optimal action (relationships) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of closest synonyms. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but IOAC-QL contributes more than CRFs and TFICF. This is because IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents.


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

Datasets   PMI                             ICA                             C-SynD
           Precision  Recall  F-measure    Precision  Recall  F-measure    Precision  Recall  F-measure
SN         60.6       60.6    61           79.5       71.6    75.3         85         80      82
BNC        74.5       67.9    71           76         67.8    72           82         71.9    77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Datasets    CRFs                            TFICF                           IOAC-QL
            Precision  Recall  F-measure    Precision  Recall  F-measure    Precision  Recall  F-measure
TREC        75.08      60.2    66.28        89.3       71.4    78.7         89.8       88.1    88.6
BLESS       73.04      62.66   67.03        73.8       69.5    71.6         95.0       83.5    88.9
NYT         68.46      54.02   60.38        86.0       65.0    74.0         90.0       74.0    81.2
Wikipedia   56.0       42.0    48.0         64.6       54.88   59.34        70.0       44.0    54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms.

Datasets        Algorithm   Term-based FC   Concept-based FC   Graph-based FC
Reuters-21578   HAC         73              85.7               90.2
                K-means     51              84.6               91.8
20 Newsgroups   HAC         53              82                 89
                K-means     48              79.3               87

Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis. Experimental results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similarity common subgraphs. The semantic representations of textual documents, and the relatedness determined between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and the subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted on two corpora of text documents, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base (term-based) approach reach approximately 17% with HAC and 40% with K-means for the Reuters-21578 dataset, and 36% with HAC and 39% with K-means for the 20 newsgroups dataset. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach reach approximately 5% with HAC and 7% with K-means for the Reuters-21578 dataset, and 7% with HAC and 8% with K-means for the 20 newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by their MFs and attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering is used as a case study, with extensive experiments conducted on 10 different datasets; the results showed that the scheme can achieve a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.
[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.
[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.
[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.
[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.
[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pp. 296–303, June 2006.
[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.
[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.
[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.
[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.
[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.
[12] WordNet, http://wordnet.princeton.edu/.
[13] OpenNLP, http://opennlp.apache.org/.
[14] AlchemyAPI, http://www.alchemyapi.com/.
[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," IAENG Journal of Computer Science, vol. 33, no. 1, pp. 1151–1162, 2006.
[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.
[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.
[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.
[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.
[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.
[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.
[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.
[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.
[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.
[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.
[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.
[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.
[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.
[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.
[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.
[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.
[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.
[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.
[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.
[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.
[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.
[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.
[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.
[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.
[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.


Input: list of sentences of the document d_i
Output: List_DSO /* list of DSO objects and their MFs with attributes */
Procedure:
  D ← new document
  List_DSO ← empty list  /* list of sentence objects of the document */
  for each sentence s_i in document D do
    AlchemyAPI(s_i)  /* this function calls AlchemyAPI to determine all MFs and their NER */
    for each feature mf_j in {mf_1, mf_2, ..., mf_n} in DSO do
      get_MF_Attributes(mf_j)  /* this function determines the attributes of each MF */
      OpenNLP(s_i)  /* this function calls OpenNLP to determine POS and NER for the MF and attributes */
      compute_MF_score(mf_j)  /* this function determines the score of each MF */
    end
    List_DSO.add(DSO)
  end

Algorithm 1: TRN algorithm.

objects (FVOs). The FVO is annotated with a sentence feature key (SFKey), feature value (Fval), feature closest synonyms (Fsyn), feature associations (FAss), and feature weight (FW), where SFKey combines the sentence key and feature key in the DSO; the sentence key is important because it facilitates accessing the parent of the MFs. The Fval object holds the value of the feature, the Fsyn object holds a list of the detected closest synonyms of the Fval, and the FAss object holds a list of the explored feature associations and relationships for the Fval. In this list, each object has a Rel_Typ between two MFs (1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality, and 5 = synonym) and a Rel_vertex, the index of the related vertex. The FW object accumulates the association and relationship weights, clarifying the importance of the Fval in the textual document. The second set represents the arcs in a two-dimensional adjacency array Adjacency(V, V) of feature link weights (FLW), which indicates the value of the linkage weight in an adjacency matrix between related vertices.

Heuristic 2. Let FLG(V, A) = ⟨Vertex(V), Adjacency(V, V)⟩, let Vertex(V) = {FVO_0, FVO_1, ..., FVO_n}, and let Adjacency(V, V) = {FLW_(0,0), FLW_(0,1), ..., FLW_(1,0), FLW_(1,1), ..., FLW_(n,n)}, where FVO indicates an object of the MF in the vertex vector, annotated with the additional information as FVO = {SFKey, Fval, Fsyn, FAss, FW}. SFKey indicates the combination of the sentence key and feature key in a DSO, Fval indicates the value of the feature, Fsyn indicates a list of the closest synonyms of the Fval, and FAss indicates a list of the feature associations and relationships for the Fval, where each object in this list is defined as FAss_item = {Rel_Type, Rel_vertex}, with Rel_Type indicating the value of the relation-type linkage between the two features and Rel_vertex indicating the index of the related vertex. FW indicates the accumulation of the weights of the associations and relationships of the Fval in the textual document. FLW indicates the value of the linkage weight in an adjacency matrix.
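Read literally, Heuristic 2 describes a vertex vector of annotated feature objects plus a weighted adjacency matrix. A minimal Python sketch of these structures follows; the field names track the heuristic, while the container types and the link helper are assumptions:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FAssItem:
        rel_type: int    # 1=similarity, 2=contiguity, 3=contrast, 4=causality, 5=synonym
        rel_vertex: int  # index of the related vertex in the FLG

    @dataclass
    class FVO:
        sf_key: str                                          # sentence key + feature key in the DSO
        fval: str                                            # the feature value itself
        fsyn: List[str] = field(default_factory=list)        # detected closest synonyms
        fass: List[FAssItem] = field(default_factory=list)   # explored associations
        fw: float = 0.0                                      # accumulated relationship weight

    class FLG:
        """Feature linkage graph: vertex vector of FVOs plus an FLW adjacency matrix."""
        def __init__(self, vertices: List[FVO]):
            self.vertex = vertices
            n = len(vertices)
            self.adjacency = [[0.0] * n for _ in range(n)]   # FLW matrix

        def link(self, i: int, j: int, flw: float, rel_type: int):
            # bidirectional link between related vertices, in the spirit of SALA_LINKS
            self.adjacency[i][j] = self.adjacency[j][i] = flw
            self.vertex[i].fass.append(FAssItem(rel_type, j))
            self.vertex[j].fass.append(FAssItem(rel_type, i))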

3.2.2. The MFs Semantic Relatedness (MFsSR) Measuring Process. A primary motivation for measuring semantic relatedness comes from NL processing applications such as information retrieval, information extraction, information filtering, text summarization, text annotation, text mining, word sense disambiguation, automatic indexing, machine translation, and other aspects [2, 3]. In this paper, one of the information-based methods is utilized by considering IC. The semantic relatedness that has been investigated concerns the direct relevance and the indirect relevance among MFs. The MFs may be one word or more, and thereby each MF is considered as a concept. The IC is considered as a measure quantifying the amount of information a concept expresses. Lin [23] assumed that for a concept c, with p(c) the probability of encountering an instance of concept c, the IC value is obtained by taking the negative log likelihood of encountering the concept in a given corpus. Traditionally, the semantic relatedness of concepts is based on cooccurrence information over a large corpus. However, cooccurrences do not achieve much matching in a corpus, and it is essential to take into account the relationships among concepts. Therefore, the development of a new information content method based on cocontributions, instead of the cooccurrences, is important for the proposed MFsSR.

The new information content (NIC) measure is an extension of the information content measure. The NIC measures are based on the relations defined in the WordNet ontology. The NIC measure uses hypernym/hyponym, synonym/antonym, holonym/meronym, and entail/cause relations to quantify the informativeness of concepts. For example, a hyponym relation could be "a bicycle is a vehicle," and a meronym relation could be "a bicycle has wheels." NIC is defined as a function of the hypernym, hyponym, synonym, antonym, holonym, meronym, entail, and cause relationships, normalized by the maximum number of MF/attribute objects in the textual document, using

\[ \mathrm{NIC}(c) = 1 - \frac{\log\left( \mathrm{Hype\_Hypo}(c) + \mathrm{Syn\_Anto}(c) + \mathrm{Holo\_Mero}(c) + \mathrm{Enta\_Cause}(c) + 1 \right)}{\log(\mathrm{max\_concept})}, \tag{2} \]

where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronym values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MF/attribute objects in the considered text. The max_concept normalizes the NIC value, and hence the NIC values fall in the range [0, 1].
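A small sketch of (2), assuming the four relation counts for a concept have already been gathered from the FLG (the dictionary layout is an assumption):

    import math

    def nic(counts, max_concept):
        # Eq. (2): relation counts gathered from the FLG for concept c,
        # normalized by the total number of MF/attribute objects in the text
        total = (counts["hype_hypo"] + counts["syn_anto"]
                 + counts["holo_mero"] + counts["enta_cause"] + 1)
        return 1.0 - math.log(total) / math.log(max_concept)

    # toy usage: a concept with 3 hypernym/hyponym and 1 synonym link in a 100-object document
    print(nic({"hype_hypo": 3, "syn_anto": 1, "holo_mero": 0, "enta_cause": 0}, 100))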

With the consideration of the direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\[ \mathrm{SemRel} = \lambda \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} \right) + (1 - \lambda) \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(\mathrm{att}\_c_1, \mathrm{att}\_c_2))}{\mathrm{NIC}(\mathrm{att}\_c_1) + \mathrm{NIC}(\mathrm{att}\_c_2)} \right), \tag{3} \]

where λ ∈ [0, 1] decides the relative contribution of the direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, λ ∈ [0.5, 1].
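Assuming the NIC values of the two concepts, their attributes, and their least common subsumers (LCS) are precomputed, (3) reduces to a Lin-style weighted combination; a sketch (λ = 0.7 is an arbitrary choice within the stated range):

    def sem_rel(nic_lcs, nic_c1, nic_c2, nic_lcs_att, nic_att1, nic_att2, lam=0.7):
        # Eq. (3): direct relevance between the concepts plus indirect relevance
        # between their attributes, mixed by lambda in [0.5, 1]
        direct = 2 * nic_lcs / (nic_c1 + nic_c2)
        indirect = 2 * nic_lcs_att / (nic_att1 + nic_att2)
        return lam * direct + (1 - lam) * indirect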

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonym detection, the SALA carries out the first action to extract the closest synonyms for each MF, where SALA addresses the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonym detection, SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called the Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernym, hyponym, meronym, holonym, antonym, entail, and cause. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned to a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on

the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

\[ \mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}(\mathrm{SRL}(i).\mathrm{item}, \mathrm{FVO}_j), \tag{4} \]

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in the SRL with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms. SALA applies the threshold to select the closest synonyms with the highest score according to the specific threshold, as follows:

\[ \text{Closest-Synonyms} = \max_{\mathrm{th}} \left( \mathrm{F\_SRList.SRList.score}() \right). \tag{5} \]
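A sketch of this scoring-and-thresholding step, corresponding to (4) and (5); sem_rel is assumed to wrap the MFsSR of (3), and interpreting the threshold as a fraction of the best score is one plausible reading of the text:

    def c_synd(f_srlist, fvos, sem_rel, rel_threshold=0.8):
        # Score each synonym-relation list (SRL) against all FVOs (Eq. (4)),
        # then keep the lists whose score passes the threshold relative to
        # the best one (Eq. (5)); rel_threshold is an assumed parameter
        scored = [(sum(sem_rel(item, fvo) for item in srl for fvo in fvos), srl)
                  for srl in f_srlist]
        best = max(score for score, _ in scored)
        return [srl for score, srl in scored if score >= rel_threshold * best]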

Then, each SALA retains the closest synonyms in the Fsyn object of each FVO, and SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA_LINKS algorithm, the SALA receives each FVO for detecting links among the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, SALA aims to explore the implicit dependencies and relationships in the context, much as a human does, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. These association capabilities are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronym relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as (give), and the other is called resultant, such as (have), in WordNet.

To equip SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of SALA is to select an optimal association to maximize the total long-run accumulated reward.


Input: array of Feature Vertex Objects (FVO), WordNet
Output: list of closest synonyms of the FVO objects /* the detected closest synonyms to set the Fsyn value */
Procedure:
  SRL ← empty list      /* list of each synonym and its relationships */
  F_SRList ← empty list /* list of SRLs (all synonyms and their relationship lists) */
  /* Parallel.ForEach is used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo => {
    Fsyn ← empty list
    F_SRList = get_all_Syn_Relations_List(fvo)
      /* function to get all synonyms and their relationship lists from WordNet */
    /* compute the semantic relatedness of each synonym and its relationships
       with the fvo object, then get the score of each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(Get_SemRel(F_SRList[i], fvo))
        /* function to compute the score of each synonym list; this score contributes to
           accurately filtering the returned synonyms and selecting the closest one */
    Fsyn = max_th(score[])
      /* function to select the SRL with the highest score according to the specific threshold */
  })  /* Parallel.For */

Algorithm 2: C-SynD algorithm.

Input: action type and a Feature Vertex Object (FVO)
Output: detected links between feature vertices, with the assigned feature link weight (FLW)
Procedure:
  i ← index of the input FVO in the FLG
  numVerts ← number of Feature Vertex Objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++)
    if (Action_type == "Closest-Synonyms") then
      Fsyn = get_C-SynD(FVO[i])  /* function to execute the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) then
        FLW ← SemRel(FVO[i], FVO[j])  /* function to compute the value of semantic
                                         relatedness between FVO[i] and FVO[j] */
        Rel_Type ← 5  /* set the relation type to 5, indicating a synonym */
    elseif (Action_type == "Association") then
      optimal_action = get_IOAC-QL(FVO)  /* function to execute the IOAC-QL algorithm */
      FLW ← reward        /* the reward value of the returned optimal action */
      Rel_Type ← action_type  /* the type of the returned optimal action (1 = similarity,
                                 2 = contiguity, 3 = contrast, 4 = causality) */
    Rel_vertex = j  /* index of the related vertex */
    Adjacent[i, j] = FLW  /* set the linkage weight in the adjacency matrix between
                             vertex[i] and vertex[j] */

Algorithm 3: SALA links algorithm.

Hence, SALA can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach goal states or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in state x. SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x onward, max(Σ_{i=0} γ^i r_i), where γ is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing r + γ f(y) − Q(x, a), where f(y) = max_a Q(y, a) and y is the new state resulting from action a.


Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s is defined under a policy π, denoted by Q(s, a), as the expected return starting from s and taking the action a. The action policy π is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way chosen in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALAs. The action with the maximum summed value is then chosen for execution.
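A sketch of this multiple-goal selection rule, assuming each SALA keeps its own table of Q-values keyed by (state, action):

    def select_overall_action(q_tables, state, actions):
        # sum the Q-values of the individual SALAs per action and
        # execute the action with the maximum summed value
        def summed_q(a):
            return sum(q.get((state, a), 0.0) for q in q_tables)
        return max(actions, key=summed_q)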

Environment State and Actions. For the effective association capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn and share their knowledge and can explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space. In each state, the MF object is fitted with the closest-synonym relations, and there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is optimal for each state. In summary, once s_k (an MF object) is received from the FLG, SALA chooses one of the associations as the action a_k according to SALA's policy π*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by SALA at round k:

\[ a_k = \pi^*(s_k). \tag{6} \]

Therein, one has the following:
(i) a_k ∈ action space = {similarity, contiguity, contrast, causality};
(ii) π*(s_k) = argmax_a Q*(s_k, a_k).

Once the optimal policy (π*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationship Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy, is the Q-value. A higher Q-value points to a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then, each SALA performs the following steps in Heuristic 4.

Heuristic 4.
(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.
(ii) Receive the immediate reward r_k and observe the new state s'_k.
(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: a_k = π*(s_k) = argmax_a Q*(s_k, a_k).
(iv) Update Q(s_k, a_k) as follows:

\[ Q(s_k, a_k) = (1 - \alpha) \, Q(s_k, a_k) + \alpha \left( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \right), \tag{7} \]

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k, Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k, r(s_k, a_k) is the reward corresponding to the acquired payoff, A(s'_k) is the set of possible actions at the next state s'_k, α (0 < α ≤ 1) is the learning rate, and γ (0 ≤ γ ≤ 1) is the discount rate.
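A minimal tabular sketch of the update in (7); the state/action encoding, the default α and γ, and the toy reward are assumptions for illustration:

    def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
        # Eq. (7): move Q(s, a) toward the immediate reward plus the
        # discounted best Q-value of the next state
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

    # toy usage over the four association actions
    actions = ["similarity", "contiguity", "contrast", "causality"]
    Q = {}
    q_update(Q, "FVO1", "similarity", r=18.4, s_next="FVO7", actions=actions)
    print(Q[("FVO1", "similarity")])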

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_{ik} for the action value i in document k as follows:

\[ \mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \left( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \right), \tag{8} \]


Input: Feature Vertex Object (FVO), WordNet
Output: the optimal action for the FVO relationships
Procedure:
  initialize Q[]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k ∈ action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.ForEach is used for parallel processing of all actions */
  Parallel.For(0, action_space.Count(), k => {
    R[k] = Get_Action_Reward(a[k])  /* function to implement (8) */
    Q[i, k] = (1 - α) * Q[i, k] + α * (R[k] + γ * max_a(Q[i, k + 1]))
  })  /* Parallel.For */
  return Get_highest_Q-value_action(Q[])
    /* select the action with the maximum summed Q-value */

  /* execute (8) to get the reward value */
  Get_Action_Reward(a_k):
    AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
    for (int j = 0; j < FVO.Count(); j++) {
      FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
      SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
      Sum += FW_j * SemRel(i, j)
    }
    return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_{ik} is the weight of action value i in document k, FW_{jk} is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], used to measure AVW_{ik} and FW_{jk}. The FW_{jk} in (8) is calculated likewise; AVW_{ik} is based on (9) and (10).

The action value weight (AVW_{ik}) is a measure of the weight of the action value, that is, the scalar product of the action value frequency and the inverse document frequency, as in (9). The action value frequency (F) measures the importance of this value in a document. The inverse document frequency (IDF) measures the general importance of this action value in a corpus of documents:

\[ \mathrm{AVW}_{ik} = \frac{F_{ik}}{\sum_k F_{ik}} \cdot \mathrm{IDF}_i, \tag{9} \]

where F_{ik} represents the number of cocontributions of this action value i in document d_k, normalized by the number of cocontributions of all MFs in document d_k to prevent a bias towards longer documents. The IDF is performed by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\[ \mathrm{IDF}_i = \log \left( \frac{|D|}{\left| \{ d_k : v_i \in d_k \} \right|} \right), \tag{10} \]

where |D| is the total number of documents in the corpus and |{d_k : v_i ∈ d_k}| is the number of documents containing the action value v_i.
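A sketch combining (8)–(10): the action-value weight is a normalized frequency times IDF, and the reward adds the FW-weighted relatedness terms (all figures in the usage lines are invented for illustration):

    import math

    def action_value_weight(freq_ik, total_freq_k, n_docs, n_docs_with_value):
        # AVW_ik = (F_ik / sum_k F_ik) * IDF_i, with
        # IDF_i = log(|D| / |{d_k : v_i in d_k}|)   (Eqs. (9)-(10))
        return (freq_ik / total_freq_k) * math.log(n_docs / n_docs_with_value)

    def reward(avw_ik, fw, sem_rel_row):
        # Eq. (8): AVW_ik plus the sum of FW_jk * SemRel(i, j) over the feature objects
        return avw_ik + sum(w * s for w, s in zip(fw, sem_rel_row))

    avw = action_value_weight(3, 50, 1000, 40)
    print(reward(avw, fw=[0.12, 0.05], sem_rel_row=[0.8, 0.3]))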

Thus, in the IOAC-QL algorithm (see Algorithm 4) implementation shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel-Type object and a Rel_vertex object; the Rel-Type object value equals the action type value, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix to the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO_1, one has the following:

action: similarity; ActionResult State: {FVO_1, FVO_7, FVO_9, FVO_22, FVO_75, FVO_92, FVO_112, FVO_125}; Prob: 8; reward: 18.4; PrevState: FVO_1; Q(s, a) = 9.1.
action: contiguity; ActionResult State: {FVO_103, FVO_17, FVO_44}; Prob: 3; reward: 7.4; PrevState: FVO_1; Q(s, a) = 6.1.
action: contrast; ActionResult State: {FVO_89}; Prob: 1; reward: 2.2; PrevState: FVO_1; Q(s, a) = 3.29.
action: causality; ActionResult State: {FVO_63, FVO_249}; Prob: 2; reward: 2.5; PrevState: FVO_1; Q(s, a) = 4.29.

As revealed in the case of FVO_1, the SALA receives the environmental information data FVO_1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated above.


[Figure 4: The IOAC-QL algorithm implementation. The figure depicts the object model construction (a DDO over DSOs, each DSO holding MFs F_1, ..., F_n as FVOs with SFKey, Fval, Fsyn[], FAss[], and FW in Vertex[V], with the FLW adjacency matrix Adjacent[V, V] of the feature linkage graph) and the semantic actionable learning agent (SALA) inference engine: a reinforcement learning model for semantic relational graph construction with action selection and a Q-learning mechanism that assigns Q-values Q(s_i, a_j) in the FLG environment, backed by WordNet.]

The action with the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Given the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents through finding all distinct common similarity or relatedness subgraphs is required. The objective of finding common similar or related subgraphs as a relatedness measure is to utilize it in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric. A metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), so that the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations, holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the visited nodes and their attribute relatedness values.

The function relatedness_subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i and y_i, as defined in (11). The subgraph matching process is measured using the following formula:

\[ \text{Sub-graph-Rel}(SG_1, SG_2) = \frac{\sum_{SG_1^i} \sum_{SG_2^j} \mathrm{SemRel}(N_i, N_j)}{\sum_{SG_1^i} N_i + \sum_{SG_2^j} N_j}, \tag{11} \]

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for the nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
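A sketch of (11) with the 50% node-matching threshold applied pairwise (exactly where the threshold enters is not fully specified above, so this placement is an assumption):

    def subgraph_rel(sg1, sg2, sem_rel, node_threshold=0.5):
        # Eq. (11): sum pairwise node relatedness across the two subgraphs
        # and normalize by their combined number of nodes; pairs below the
        # 50% node-matching threshold are ignored
        total = 0.0
        for n1 in sg1:
            for n2 in sg2:
                r = sem_rel(n1, n2)
                if r >= node_threshold:
                    total += r
        return total / (len(sg1) + len(sg2))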

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively in the further mining process. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure compared with other previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is


Table 1: Action values table with the SALA decision.

State                FVO_1                               FVO_1               FVO_1      FVO_1
Action               Similarity                          Contiguity          Contrast   Causality
ActionResult State   {FVO_1, FVO_7, FVO_9, FVO_22,       {FVO_103, FVO_17,   {FVO_89}   {FVO_63, FVO_249}
                     FVO_75, FVO_92, FVO_112, FVO_125}   FVO_44}
Total reward         18.4                                7.4                 2.2        2.5
Q(s, a)              9.4                                 6.1                 3.29       4.29
Decision             Yes                                 No                  No         No

studied and compared to the PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed semantic relationship or association discovery algorithm IOAC-QL, implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is also computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluations. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation, using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task. Some of these tasks are information retrieval, question answering, document classification, document summarization, and document clustering. In this paper, the content-based, coselection-based, and task-based evaluations are used to validate the implementation of the proposed scheme over a real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment against system judgment.

                           Human judgment
                           True    False
System judgment   True     TP      FP
                  False    FN      TN

Therefore, the correlation coefficient can be considered as the coefficient of the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

119903 =sum119899

119894=1119883119894119884119894

radic(sum119899

119894=11198832119894) lowast (sum

119899

119894=11198842119894)

(12)
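For concreteness, (12) transcribes directly into code; a minimal Python sketch (the function name corr_coef is ours):

    import math

    def corr_coef(X, Y):
        # r = sum(Xi * Yi) / sqrt(sum(Xi^2) * sum(Yi^2)), as in (12)
        num = sum(x * y for x, y in zip(X, Y))
        den = math.sqrt(sum(x * x for x in X) * sum(y * y for y in Y))
        return num / den if den else 0.0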

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences out of all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

\[ \mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \tag{13} \]

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences out of all sentences:

\[ \mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}}, \tag{14} \]

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
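Given these counts, (13) and (14) follow directly; a small illustrative sketch (function names ours):

    def acceptance_rate(tp, fn):
        # AR = TP / (TP + FN), equation (13)
        return tp / (tp + fn)

    def accuracy(tp, tn, fp, fn):
        # Acc = (TP + TN) / (TP + FN + FP + TN), equation (14)
        return (tp + tn) / (tp + fn + fp + tn)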

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Dataset details for the experimental setup.

    DS1 (Experiment 1: content-based evaluation)
        Miller and Charles (MC)^1: the MC/RG set consists of 65 pairs of nouns extracted from WordNet and rated by multiple human annotators.
    DS2 (Experiment 1: content-based evaluation)
        Microsoft Research Paraphrase Corpus (MRPC)^2: the corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).
    DS3 (Experiment 2: coselection-based evaluation, closest-synonym detection)
        British National Corpus (BNC)^3: a carefully selected collection of 4124 contemporary written and spoken English texts; a 100-million-word corpus of samples of written and spoken English with the near-synonym collocations.
    DS4 (Experiment 2: coselection-based evaluation, closest-synonym detection)
        SN (semantic neighbors)^4: SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 meaningful and 7341 random). The SN synonyms come from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.
    DS5 (Experiment 2: coselection-based evaluation, semantic relationships exploration)
        BLESS: BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 meaningful (correct) and 12154 random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.
    DS6 (Experiment 2: coselection-based evaluation, semantic relationships exploration)
        TREC^5: TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); a fourth type, others (705), indicates that the candidate entity is none of the three. There are five types of relations: located_in (406, one location is located inside another location), work_for (394, a person works for an organization), OrgBased_in (451, an organization is based in a location), live_in (521, a person lives at a location), and kill (268, a person killed another person). There are 17007 pairs of entities not related by any of the five relations, which hence have the NR relation between them; NR thus significantly outnumbers the other relations.
    DS7 (Experiment 2: coselection-based evaluation, semantic relationships exploration)
        IJCNLP 2011-New York Times (NYT)^6: NYT contains 150 business articles from the NYT; there are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.
    DS8 (Experiment 2: coselection-based evaluation, semantic relationships exploration)
        IJCNLP 2011-Wikipedia^6: the Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]; there are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.
    DS9 (Experiment 3: task-based evaluation)
        Reuters-21578^7: Reuters-21578 contains 21578 documents (12902 used), categorized into 10 categories.
    DS10 (Experiment 3: task-based evaluation)
        20 Newsgroups^8: the 20 newsgroups dataset contains 20000 documents (18846 used), categorized into 20 categories.

    ^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
    ^2 Available at http://research.microsoft.com/en-us/downloads
    ^3 Available at http://corpus.byu.edu/bnc
    ^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
    ^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
    ^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
    ^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
    ^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored; let FP be the number of incorrectly detected closest synonyms or semantic relationships explored; and let FN be the number of correct but undetected closest synonyms or semantic relationships in a dataset. The F-measure combines precision and recall in one metric and is often used to show efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and worst value at 0. Precision, recall, and F-measure are defined as follows:

\[ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad F\text{-measure} = \frac{2 \, (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}}. \tag{15} \]
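A short sketch mirroring (15) (function names ours):

    def precision(tp, fp):
        # Precision = TP / (TP + FP)
        return tp / (tp + fp)

    def recall(tp, fn):
        # Recall = TP / (TP + FN)
        return tp / (tp + fn)

    def f_measure(p, r):
        # F-measure = 2 * P * R / (P + R), equation (15)
        return 2 * p * r / (p + r) if (p + r) else 0.0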


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative clustering (hierarchical), K-means (flat, hard clustering), and the EM algorithm (flat, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used as the quality measure, combining the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[ P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r}, \tag{16} \]

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents in cluster S_i, and N_r is the number of documents in class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

\[ F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}. \tag{17} \]

A per-class F score is calculated as follows:

\[ F(L_r) = \max_{S_i \in C} F(L_r, S_i). \tag{18} \]

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

\[ F_C = \frac{\sum_{L_r} \left( |N_r| \cdot F(L_r) \right)}{\sum_{L_r} |N_r|}, \tag{19} \]

where |N_r| is the number of objects in class L_r. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters mapping to the original classes.
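As an illustration of (16)-(19), the overall F-measure can be computed from a class-by-cluster contingency matrix; a minimal Python sketch (the matrix layout and names are ours):

    def overall_f(n):
        # n[r][i] = number of documents of class r that fall in cluster i
        n_r = [sum(row) for row in n]                       # class sizes N_r
        n_i = [sum(col) for col in zip(*n)]                 # cluster sizes N_i
        total, weighted = sum(n_r), 0.0
        for r, row in enumerate(n):
            best = 0.0
            for i, nri in enumerate(row):
                if nri == 0:
                    continue
                p, rec = nri / n_i[i], nri / n_r[r]         # equation (16)
                best = max(best, 2 * p * rec / (p + rec))   # equations (17), (18)
            weighted += n_r[r] * best                       # equation (19)
        return weighted / total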

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over a real corpus of documents, and the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) set and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 newsgroups datasets.

4.3. Evaluation Results. This section reports the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation at the sentence level in Experiment 1, the document level in Experiment 2, and the corpus level in Experiment 3. In Experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR against previous studies, based on the availability of semantic relatedness values from human judgments. In Experiment 2, DS3 and DS4 investigate the effectiveness of closest-synonym detection using MFsSR compared with PMI [25, 37] and ICA [38], based on synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 investigate the effectiveness of IOAC-QL for semantic relationships exploration compared with implicit relation extraction using CRFs [6, 39] and TFICF [40], based on the availability of semantic relation judgments. In Experiment 3, DS9 and DS10 investigate the effectiveness of the document linkage process in a clustering case study against previous word-based [16] and concept-based [17] approaches, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for accurately understanding the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measure is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of MFsSR for closest-synonym detection and of IOAC-QL for semantic relationships exploration improves proportionally as a deeper semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as the accuracy of measuring the relatedness between documents improves with the semantic information


Table 4: Correlation coefficients between human judgment (M&C) and several relatedness measures, including the proposed semantic relatedness measure.

    Measure                                Correlation with M&C
    Distance-based measures
      Rada                                 0.688
      Wu and Palmer                        0.765
    Information-based measures
      Resnik                               0.77
      Jiang and Conrath                    0.848
      Lin                                  0.853
    Hybrid measures
      T. Hong and D. Smith                 0.879
      Zili Zhou                            0.882
    Information/feature-based measures
      The proposed MFsSR                   0.937

in the SHGRS. The relatedness subgraph measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering improves, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness measure are comparable with other previous studies in the same field. In this experiment, the correlation is computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against other semantic similarity approaches, namely, those of Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficients between these approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring semantic relatedness. In this experiment, the proposed approach achieved a good correlation with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to capturing the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed are based on the best Acc and AR over all thresholds, rather than under the same relatedness threshold. The improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment studies the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared with the PMI and ICA approaches for detecting the best near-synonym. The difference between the proposed IOAC-QL for discovering semantic relationships or associations and the implicit relation extraction approaches (CRFs and TFICF) is examined. This experiment attempts to measure the performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationship exploration. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the incorrect items are due to wrong syntactic parses from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in recall without disturbing the precision.


Table 5: Comparison of accuracy (Acc) and acceptance rate (AR) between the proposed semantic relatedness measure and Islam and Inkpen [7] on the MRPC dataset.

    Training subset (4076 pairs; human judgment TP + FN = 2753 true):
    Threshold | Islam and Inkpen [7]: Acc, AR | MFsSR: Acc, AR
    0.1       | 0.67, 1                       | 0.68, 1
    0.2       | 0.67, 1                       | 0.68, 1
    0.3       | 0.67, 1                       | 0.68, 1
    0.4       | 0.67, 1                       | 0.68, 1
    0.5       | 0.69, 0.98                    | 0.68, 1
    0.6       | 0.72, 0.89                    | 0.68, 1
    0.7       | 0.68, 0.78                    | 0.70, 0.98
    0.8       | 0.56, 0.4                     | 0.72, 0.86
    0.9       | 0.37, 0.09                    | 0.60, 0.49
    1.0       | 0.33, 0                       | 0.34, 0.02

    Test subset (1725 pairs; human judgment TP + FN = 1147 true):
    0.1       | 0.66, 1                       | 0.67, 1
    0.2       | 0.66, 1                       | 0.67, 1
    0.3       | 0.66, 1                       | 0.67, 1
    0.4       | 0.66, 1                       | 0.67, 1
    0.5       | 0.68, 0.98                    | 0.67, 1
    0.6       | 0.72, 0.89                    | 0.66, 1
    0.7       | 0.68, 0.78                    | 0.66, 0.98
    0.8       | 0.56, 0.4                     | 0.64, 0.85
    0.9       | 0.38, 0.09                    | 0.49, 0.5
    1.0       | 0.33, 0                       | 0.34, 0.03

Thus, the MFsSR between two concepts considers the information content along with most WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared with the best near-synonym algorithms using the PMI and ICA approaches. This comparison was conducted on two text corpora, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are clearly achieved, with a high percentage, across different datasets and domains.

Inferring the MFs relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependence among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared with implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonyms. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to extraction performance, but IOAC-QL contributes more than CRFs and TFICF. This is because IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements in F-measure achieved by the IOAC-QL algorithm over the CRFs approach are approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the


Table 6: Results of the C-SynD algorithm (based on the MFsSR) compared with PMI and ICA for detecting the closest synonym (%).

    Dataset |        PMI         |        ICA         |      C-SynD
            | Prec.  Rec.  F     | Prec.  Rec.  F     | Prec.  Rec.  F
    SN      | 60.6   60.6  61    | 79.5   71.6  75.3  | 85     80    82
    BNC     | 74.5   67.9  71    | 76     67.8  72    | 82     71.9  77

Table 7: Results of the IOAC-QL algorithm (based on AVWs) compared with the CRFs and TFICF approaches (%).

    Dataset   |        CRFs           |       TFICF          |      IOAC-QL
              | Prec.  Rec.   F       | Prec.  Rec.   F      | Prec.  Rec.  F
    TREC      | 75.08  60.2   66.28   | 89.3   71.4   78.7   | 89.8   88.1  88.6
    BLESS     | 73.04  62.66  67.03   | 73.8   69.5   71.6   | 95.0   83.5  88.9
    NYT       | 68.46  54.02  60.38   | 86.0   65.0   74.0   | 90.0   74.0  81.2
    Wikipedia | 56.0   42.0   48.0    | 64.6   54.88  59.34  | 70.0   44.0  54.0

Table 8: Results of the graph-based approach compared with the term-based and concept-based approaches using the K-means and HAC clustering algorithms (FC, %).

    Dataset        | Algorithm | Term-based FC | Concept-based FC | Graph-based FC
    Reuters-21578  | HAC       | 73            | 85.7             | 90.2
                   | K-means   | 51            | 84.6             | 91.8
    20 Newsgroups  | HAC       | 53            | 82               | 89
                   | K-means   | 48            | 79.3             | 87

proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets for text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and traditional analysis. Experimental results demonstrate a substantial enhancement of document clustering quality. In this experiment, the application of semantic understanding to textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similarity common subgraphs. The semantic representations of textual documents, and the relatedness determined between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment was repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted on two text corpora, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. Relative to the base (term-based) approach, the graph-based approach achieved improvements of up to 17% with HAC and 40% with K-means on the Reuters-21578 dataset, and up to 36% with HAC and 39% with K-means on the 20 newsgroups dataset. Furthermore, relative to the concept-based approach, the graph-based approach achieved improvements of up to 5% with HAC and 7% with K-means on the Reuters-21578 dataset, and up to 7% with HAC and 8% with K-means on the 20 newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by their MFs and attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, with extensive experiments conducted on 10 different datasets; the results showed that the scheme yields better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241-250, 2008.

[2] A. Hotho, A. Nurnberger, and G. Paass, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217-220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871-882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834-843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference-North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296-303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76-92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172-212, Springer, New York, NY, USA, 2005.

[12] http://wordnet.princeton.edu/.

[13] OpenNLP, http://opennlp.apache.org/.

[14] http://www.alchemyapi.com/.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151-1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477-482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360-1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21-33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784-797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference on the Semantic Web (ISWC '10), vol. 6496, pp. 615-630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141-166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307-1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172-181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390-405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9-12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261-275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027-1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85-101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web-ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583-596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721-735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548-1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038-1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254-1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146-155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013-1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749-761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217-1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1-28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining, August 2000.




where Hype_Hypo(c) returns the number of FVOs in the FLG related to the hypernym or hyponym values, Syn_Anto(c) returns the number of FVOs in the FLG related to the synonym or antonym values, Holo_Mero(c) returns the number of FVOs in the FLG related to the holonym or meronym values, and Enta_Cause(c) returns the number of FVOs in the FLG related to the entail or cause values. The max_concept is a constant that indicates the total number of MFs/attributes/objects in the considered text. The max_concept normalizes the NIC value; hence, the NIC values fall in the range [0, 1].

With the consideration of direct relevance and indirect relevance of MFs, the proposed MFsSR can be stated as follows:

\[ \mathrm{SemRel} = \lambda \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{NIC}(c_1) + \mathrm{NIC}(c_2)} \right) + (1 - \lambda) \left( \frac{2 \cdot \mathrm{NIC}(\mathrm{LCS}(\mathrm{att}_{c_1}, \mathrm{att}_{c_2}))}{\mathrm{NIC}(\mathrm{att}_{c_1}) + \mathrm{NIC}(\mathrm{att}_{c_2})} \right), \tag{3} \]

where lambda in [0, 1] decides the relative contribution of direct and indirect relevance to the semantic relatedness; because the direct relevance is assumed to be more important than the indirect relevance, lambda in [0.5, 1].
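As a minimal sketch of (3), the measure can be computed from precomputed NIC values (parameter names are ours; the default lam = 0.75 is merely an arbitrary value inside the stated range [0.5, 1]):

    def mfs_sem_rel(nic_lcs_c, nic_c1, nic_c2,
                    nic_lcs_att, nic_att1, nic_att2, lam=0.75):
        # direct relevance between the two concepts: first term of (3)
        direct = 2 * nic_lcs_c / (nic_c1 + nic_c2)
        # indirect relevance between their attributes: second term of (3)
        indirect = 2 * nic_lcs_att / (nic_att1 + nic_att2)
        return lam * direct + (1 - lam) * indirect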

3.2.3. The Closest-Synonyms Detecting Process. In MFs closest-synonyms detection, the SALA carries out the first action to extract the closest synonyms for each MF, where SALA tackles the automatic discovery of similar-meaning words (synonyms). Due to the important role played by a lexical knowledge base in closest-synonym detection, SALA adopts the dictionary-based approach to the disambiguation of the MFs. In the dictionary-based approach, the assumption is that the most plausible sense to assign to multiple shared words is the sense that maximizes the relatedness among the chosen senses. In this respect, SALA detects the synonyms by choosing the meaning whose glosses share the largest number of words with the glosses of the neighboring words through a lexical ontology. Using a lexical ontology such as WordNet allows the capture of semantic relationships based on the concepts, exploiting hierarchies of concepts besides dictionary glosses. One of the problems that can be faced is that not all synonyms are really related to the context of the document. Therefore, the MFs disambiguation is achieved by implementing the closest-synonym detection (C-SynD) algorithm.

In the C-SynD algorithm, the main task of each SALA is to search for synonyms of each FVO through WordNet. Each synonym and its relationships are assigned to a list called Syn_Relation_List (SRL). Each item in the SRL contains one synonym and its relationships, such as hypernyms, hyponyms, meronyms, holonyms, antonyms, entails, and causes. The purpose of using these relations is to eliminate ambiguity and polysemy, because not all of the synonyms are related to the context of the document. Then, all synonyms and their relationship lists are assigned to a list called F_SRList. The SALA starts to filter the irrelevant synonyms according to the score of each SRL. This score is computed based on the semantic relatedness between the SRL and every FVO in the Feature_Vertex array, as follows:

\[ \mathrm{Score}(\mathrm{SRL}_k) = \sum_{i=0}^{m} \sum_{j=0}^{n} \mathrm{SemRel}(\mathrm{SRL}(i).\mathrm{item}, \mathrm{FVO}_j), \tag{4} \]

where n is the number of FVOs in the Feature_Vertex array with index j, m is the number of items in the SRL with index i, and k is the index of SRL_k in F_SRList.

The lists of synonyms and their relationships are used temporarily to serve the C-SynD algorithm (see Algorithm 2). The C-SynD algorithm specifies a threshold value of the score for selecting the closest synonyms. SALA applies the threshold to select the closest synonyms with the highest scores, as follows:

\[ \text{Closest-Synonyms} = \max_{\mathrm{th}} \left( \mathrm{F\_SRList.SRList.score()} \right). \tag{5} \]

Then, each SALA retains the closest synonyms in an object Fsyn of each FVO, and SALA links the related FVOs with a bidirectional effect using the SALA links algorithm (see Algorithm 3). In the SALA_LINKS algorithm, the SALA receives each FVO for detecting links among the other FVOs according to the Fsyn object values and the FAss object values. The SALA assigns the FLW between two FVOs to their intersection location in the FLG.

3.2.4. The MFs Relationships and Dependencies Inferring Process. In inferring MFs relationships and dependencies, SALA aims to explore the implicit dependencies and relationships in the context, much as a human does, relying on four association capabilities. These association capabilities are similarity, contiguity, contrast (antonym), and causality [29], which give the largest amount of interdependence among the MFs. These association capabilities are defined as follows.

Similarity. For nouns, it is the sibling instance of the same parent instance with an is-a relationship in WordNet, indicating hypernym/hyponym relationships.

Contiguity. For nouns, it is the sibling instance of the same parent instance with a part-of relationship in WordNet, indicating a holonym/meronym relationship.

Contrast. For adjectives or adverbs, it is the sibling instance of the same parent instance with an is-a relationship and with an antonym (opposite) attribute in WordNet, indicating a synonym/antonym relationship.

Causality. For verbs, it indicates the connection of a sequence of events by a cause/entail relationship, where a cause picks out two verbs: one of the verbs is causative, such as "give," and the other is called resultant, such as "have," in WordNet.

To equip SALA with decision-making and experience-learning capabilities to achieve multiple-goal RL, this study utilizes the RL method relying on Q-learning to design a SALA inference engine with sequential decision-making. The objective of SALA is to select an optimal association to maximize the total long-run accumulated reward. Hence, SALA


Input: Array of Feature_Vertex Objects (FVO), WordNet
Output: List of the closest synonyms of the FVO objects /* the detected closest synonym sets the Fsyn value */
Procedure:
  SRL <- empty list           /* one synonym and its relationships */
  F_SRList <- empty list      /* list of SRLs: all synonyms and their relationship lists */
  /* Parallel.ForEach is used for parallel processing of all FVOs */
  Parallel.ForEach(FVO, fvo => {
    Fsyn <- empty list
    F_SRList = get_all_Syn_Relations_List(fvo)    /* get all synonyms and their relationship lists from WordNet */
    /* compute the semantic relatedness of each synonym (with its relationships) to the fvo object, then score each list */
    for (int i = 0; i < F_SRList.Count(); i++)
      score[i] = sum(Get_SemRel(F_SRList[i], fvo))  /* the score of each synonym list; it drives the filtering of the returned synonyms and the selection of the closest one */
    Fsyn = max_th(score[])    /* select the SRL with the highest score according to the specified threshold */
  })  /* Parallel.ForEach */

Algorithm 2: C-SynD algorithm.

Input: Action type and a Feature_Vertex Object (FVO)
Output: Links detected between feature vertices, with the Features Link Weight (FLW) assigned
Procedure:
  i <- index of the input FVO in the FLG
  numVerts <- number of Feature_Vertex objects (FVO) in Vertex(V)
  for (int j = i; j <= numVerts; j++) {
    if (Action_type == "Closest-Synonyms") {
      Fsyn = get_C-SynD(FVO[i])          /* execute the C-SynD algorithm */
      if (FVO[j].contains(Fsyn)) {
        FLW <- SemRel(FVO[i], FVO[j])    /* semantic relatedness between FVO[i] and FVO[j] */
        Rel-Type <- 5                    /* relation type 5 indicates a synonym */
      }
    }
    else if (Action_type == "Association") {
      optimal_action = get_IOAC-QL(FVO)  /* execute the IOAC-QL algorithm */
      FLW <- reward                      /* the reward value of the returned optimal action */
      Rel-Type <- action_type            /* type of the returned optimal action: 1 = similarity, 2 = contiguity, 3 = contrast, 4 = causality */
    }
    Rel_vertex = j                       /* index of the related vertex */
    Adjacent[i, j] = FLW                 /* set the linkage weight between vertex[i] and vertex[j] in the adjacency matrix */
  }

Algorithm 3: SALA links algorithm.

can achieve the largest amount of interdependence among MFs through implementing the inference of optimal association capabilities with the Q-learning (IOAC-QL) algorithm.

(1) Utilizing Multiple-Goal Reinforcement Learning (RL). RL specifies what to do, but not how to do it, through the reward function. In sequential decision-making tasks, an agent needs to perform a sequence of actions to reach goal states or multiple-goal states. One popular algorithm for dealing with sequential decision-making tasks is Q-learning [33]. In a Q-learning model, a Q-value is an evaluation of the "quality" of an action in a given state: Q(x, a) indicates how desirable action a is in state x. SALA can choose an action based on Q-values. The easiest way of choosing an action is to choose the one that maximizes the Q-value in the current state.

To acquire the Q-values, the Q-learning algorithm is used to estimate the maximum discounted cumulative reinforcement that the model will receive from the current state x on, max(sum_i gamma^i r_i), where gamma is a discount factor and r_i is the reward received at step i (which may be 0). The updating of Q(x, a) is based on minimizing r + gamma*f(y) - Q(x, a), where f(y) = max_a Q(y, a) and y is the new state resulting from action a. Thus, updating is based on the temporal difference in evaluating the current state and the action chosen. Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences, notably without explicit planning [33]. SALA may eventually converge to a stable function or find an optimal sequence that maximizes the reward received. Hence, the SALA learns to address sequential decision-making tasks.

Problem Formulation. The Q-learning algorithm is used to determine the best possible action, that is, selecting the most effective relations, analyzing the reward function, and updating the related feature weights accordingly. The promising action can be verified by measuring the relatedness score of the action value with the remaining MFs using a reward function. The formulation of the effective association capabilities exploration is performed by estimating an action-value function. The value of taking action a in state s under a policy pi, denoted Q(s, a), is defined as the expected return starting from s and taking the action a. The action policy pi is a description of the behavior of the learning SALA. The policy is the most important criterion in providing the RL with the ability to determine the behavior of the SALA [29].

In single-goal reinforcement learning, these Q-values are used only to rank the order of the actions in a given state. The key observation here is that the Q-values can also be used in multiple-goal problems to indicate the degree of preference for different actions. The way adopted in this paper to select a promising action to execute is to generate an overall Q-value as a simple sum of the Q-values of the individual SALAs. The action with the maximum summed value is then chosen for execution.

Environment, State, and Actions. For the effective association capabilities exploration problem, the FLG is provided as an environment in which intelligent agents can learn and share their knowledge and can explore dependencies and relationships among MFs. The SALAs receive the environmental information data as a state and execute the elected action in the form of an action-value table, which provides the best way of sharing knowledge among agents. The state is defined by the status of the MF space. In each state, the MF object is fitted with the closest-synonym relations, and there is a set of actions. In each round, the SALA decides to explore the effective associations that gain the largest amount of interdependence among the MFs. Actions are defined by selecting a promising action with the highest reward, by learning which action is optimal for each state. In summary, once s_k (an MF object) is received from the FLG, SALA chooses one of the associations as the action a_k according to SALA's policy pi*(s_k), executes the associated action, and finally returns the effective relations. Heuristic 3 defines the action variable and the optimal policy.

Heuristic 3 (action variable). Let a_k symbolize the action that is executed by SALA at round k:

\[ a_k = \pi^{*}(s_k). \tag{6} \]

Therein, one has the following:
(i) a_k in action space = {similarity, contiguity, contrast, causality};
(ii) \( \pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k) \).

Once the optimal policy (pi*) is obtained, the agent chooses the actions using the maximum reward.

Reward Function. The reward function in this paper measures the dependency and the relatedness among MFs. The learner is not told which actions to take, as in most forms of machine learning; rather, the learner must discover which actions yield the most reward by trying them [33].

(2) The SALA Optimal Association/Relationships Inferences. The optimal association actions are achieved according to the IOAC-QL algorithm. This algorithm starts by selecting the action with the highest expected future reward from each state. The Q-value is the immediate reward that the SALA gets for executing an action from a state s, plus the value of an optimal policy; a higher Q-value indicates a greater chance of that action being chosen. First, initialize all the Q(s, a) values to zero. Then, each SALA performs the steps in Heuristic 4.

Heuristic 4.
(i) At every round k, select one action a_k from all the possible actions (similarity, contiguity, contrast, and causality) and execute it.
(ii) Receive the immediate reward r_k and observe the new state s'_k.
(iii) Get the maximum Q-value of the state s'_k based on all the possible actions: \( a_k = \pi^{*}(s_k) = \arg\max_{a} Q^{*}(s_k, a_k) \).
(iv) Update Q(s_k, a_k) as follows:

\[ Q(s_k, a_k) = (1 - \alpha) \, Q(s_k, a_k) + \alpha \left( r(s_k, a_k) + \gamma \max_{a' \in A(s'_k)} Q(s'_k, a'_k) \right), \tag{7} \]

where Q(s_k, a_k) is the worth of selecting the action a_k at the state s_k; Q(s'_k, a'_k) is the worth of selecting the next action a'_k at the next state s'_k; r(s_k, a_k) is the reward corresponding to the acquired payoff; A(s'_k) is the set of possible actions at the next state s'_k; alpha (0 < alpha <= 1) is the learning rate; and gamma (0 <= gamma <= 1) is the discount rate.
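For illustration, the update in (7) can be written compactly; a small Python sketch (the dictionary-based Q-table and the alpha/gamma defaults are our assumptions, not values from the paper):

    ACTIONS = ["similarity", "contiguity", "contrast", "causality"]

    def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
        # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a')), eq. (7)
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
        return Q[(s, a)]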

During each round, after taking the action a, the action value is extracted from WordNet. The weight of the action value is measured to represent the immediate reward, where the SALA computes the weight AVW_ik for the action value i in document k as follows:

\[ \mathrm{Reward}(r) = \mathrm{AVW}_{ik} = \mathrm{AVW}_{ik} + \sum_{j=1}^{n} \left( \mathrm{FW}_{jk} \cdot \mathrm{SemRel}(i, j) \right), \tag{8} \]


Input: Feature_Vertex Object (FVO), WordNet
Output: The optimal action for the FVO relationships
Procedure:
  initialize Q[]
  /* implement the Q-learning algorithm to get the action with the highest Q-value */
  a_k in action_space[] = {similarity, contiguity, contrast, causality}
  /* Parallel.For is used for parallel processing of all actions */
  Parallel.For(0, action_space.Count(), k => {
    R[k] = Get_Action_Reward(a[k])    /* function implementing equation (8) */
    Q[i, k] = (1 - alpha) * Q[i, k] + alpha * (R[k] + gamma * max_a(Q[i, k + 1]))
  })  /* Parallel.For */
  return Get_highest_Q-value_action(Q[])    /* select the action with the maximum Q-value */

  /* executes equation (8) to get the reward value */
  Get_Action_Reward(a_k):
    AVW_k = (freq(a_k.val) / FVO.Count()) * log(N / n_a)
    for (int j = 0; j < FVO.Count(); j++) {
      FW_j = (freq(FVO[j]) / FVO.Count()) * log(N / n_f)
      SemRel(i, j) = Get_SemRel(a_k.val, FVO[j])
      Sum += FW_j * SemRel(i, j)
    }
    return AVW_k = AVW_k + Sum

Algorithm 4: IOAC-QL algorithm.

where r is the reward of the selected action a_i, AVW_ik is the weight of action value i in document k, FW_jk is the weight of each feature object FVO_j in document k, and SemRel(i, j) is the semantic relatedness between the action value i and the feature object FVO_j.

The most popular weighting scheme is the normalized word frequency TF-IDF [34], which is used to measure AVW_ik and FW_jk. The FW_jk in (8) is calculated likewise; AVW_ik is based on (9) and (10).

on (9) and (10)Action value weight (119860119881119882

119894119896) is a measure to calculate

the weight of the action value that is the scalar product ofaction value frequency and inverse document frequency as in(9)The action value frequency (119865) measures the importanceof this value in a document Inverse document frequency(IDF) measures the general importance of this action valuein a corpus of documents

AVW119894119896

=119865119894119896

sum119896119865119894119896

lowast IDF119894 (9)

where F_ik represents the number of co-contributions of action value i in document d_k, normalized by the number of co-contributions of all MFs in document d_k (normalized to prevent a bias towards longer documents). IDF is obtained by dividing the number of all documents by the number of documents containing this action value, defined as follows:

\[ \mathrm{IDF}_{i} = \log \left( \frac{|D|}{\left| \{ d_k : v_i \in d_k \} \right|} \right), \tag{10} \]

where |D| is the total number of documents in the corpus and |{d_k : v_i in d_k}| is the number of documents containing the action value V_i.
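Putting (8)-(10) together, the reward computation can be sketched as follows (variable names are ours; the sem_rel values are assumed to come from the MFsSR of (3)):

    import math

    def avw(freq_ik, total_freq_k, n_docs, n_docs_with_value):
        # equations (9) and (10): normalized action-value frequency times IDF
        idf = math.log(n_docs / n_docs_with_value)
        return (freq_ik / total_freq_k) * idf

    def reward(avw_ik, feature_weights, sem_rel_row):
        # equation (8): AVW_ik + sum_j FW_jk * SemRel(i, j)
        return avw_ik + sum(fw * sr for fw, sr in zip(feature_weights, sem_rel_row))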

Thus, in the IOAC-QL algorithm (see Algorithm 4), whose implementation is shown in Figure 4, the SALA selects the optimal action with the higher reward. Hence, each SALA retains the effective actions in the FAss list object of each FVO. In this list, each object has a Rel-Type object and a Rel_vertex object: the Rel-Type object value equals the action type value, and the Rel_vertex object value equals the related vertex index. The SALA then links the related FVOs with the bidirectional effect using the SALA links algorithm and sets the FLW value of the linkage weight in the adjacency matrix to the reward of the selected action.

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4.

For state FVO1, one has the following:

    action: similarity; ActionResult state: {FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125}; Prob: 8; reward: 184; PrevState: FVO1; Q(s, a) = 91.
    action: contiguity; ActionResult state: {FVO103, FVO17, FVO44}; Prob: 3; reward: 74; PrevState: FVO1; Q(s, a) = 61.
    action: contrast; ActionResult state: {FVO89}; Prob: 1; reward: 22; PrevState: FVO1; Q(s, a) = 329.
    action: causality; ActionResult state: {FVO63, FVO249}; Prob: 2; reward: 25; PrevState: FVO1; Q(s, a) = 429.

As revealed in the case of FVO1, the SALA receives the environmental information data FVO1 as a state, and then the SALA starts the implementation to select the action with the highest expected future reward among the other FVOs. The result of each action is illustrated; the action with the highest reward is selected, and the elected action is executed.

[Figure 4: The IOAC-QL algorithm implementation. The figure depicts the SALA inference engine: descriptive document and sentence objects (DDO/DSO) supply the MFs F1, ..., Fn for object model construction into the feature linkage graph (FLG), whose FVO vertices carry the Fval, SFKey, Fsyn[], FAss[], and FW fields, with Vertex[V] and Adjacent[V, V] holding the FLW linkage weights; a reinforcement learning model for semantic relational graph construction assigns Q-values Q(S0, a0), ..., Q(Sm, a4) over the FLG environment and performs action selection via the Q-learning mechanism, with WordNet as the external knowledge source.]

The SALA decision is accomplished in the FLG with the effective extracted relationships according to the Q(s, a) value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Due to the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents through finding all distinct common similarity or relatedness subgraphs is required. The objective of finding common similar or related subgraphs as a relatedness measure is to be utilized in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric. A metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), so the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations while holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i, y_i as defined in (11). The subgraph matching process is measured using the following formula:

$$\text{Sub-graph-Rel}(SG_1, SG_2) = \frac{\sum_{i \in SG_1} \sum_{j \in SG_2} \text{SemRel}(N_i, N_j)}{\sum_{i \in SG_1} N_i + \sum_{j \in SG_2} N_j} \quad (11)$$

where SG_1 is the selected subgraph from the FLG of one document, SG_2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for both nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
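A small Python sketch of this matching step, under two assumptions not fixed by the text: the subgraphs are given as mappings from nodes to their weights, and the denominator of (11) sums those node weights; sem_rel is a hypothetical stand-in for the paper's SemRel measure.

```python
def subgraph_rel(sg1, sg2, sem_rel, threshold=0.5):
    """Eq. (11): total pairwise SemRel across the two subgraphs, normalized
    by the summed node weights; pairs under the 50% node-matching
    threshold contribute nothing."""
    matched = 0.0
    for n1 in sg1:
        for n2 in sg2:
            score = sem_rel(n1, n2)
            if score >= threshold:
                matched += score
    return matched / (sum(sg1.values()) + sum(sg2.values()))
```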

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure compared with other previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations (IOAC-QL), implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is computed.

Table 1: Action values table with the SALA decision.

State   Action       ActionResult State                                Total reward   Q(s, a)   Decision
FVO1    Similarity   FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125   184            94        Yes
FVO1    Contiguity   FVO103, FVO17, FVO44                              74             61        No
FVO1    Contrast     FVO89                                             22             329       No
FVO1    Causality    FVO63, FVO249                                     25             429       No

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluations. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task. Some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering methods. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over the real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                          Human judgment: True    Human judgment: False
System judgment: True     TP                      FP
System judgment: False    FN                      TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

$$r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} Y_i^2\right)}} \quad (12)$$

Acceptance Rate (AR). Acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

$$\mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (13)$$

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

$$\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}} \quad (14)$$

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
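For concreteness, a short Python sketch of (12)-(14) exactly as defined above; the inputs are plain sequences and confusion counts.

```python
import math

def correlation(x, y):
    """Eq. (12): r = sum(X_i * Y_i) / sqrt(sum(X_i^2) * sum(Y_i^2))."""
    num = sum(a * b for a, b in zip(x, y))
    return num / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def acceptance_rate(tp, fn):
    """Eq. (13): share of truly related pairs that were recognized."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Eq. (14): share of all pairs that were predicted correctly."""
    return (tp + tn) / (tp + fn + fp + tn)
```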

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, a precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 (Experiment 1: content-based evaluation). Miller and Charles (MC)^1: consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 (Experiment 1: content-based evaluation). Microsoft Research Paraphrase Corpus (MRPC)^2: the corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 sentences, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2: coselection-based evaluation, closest-synonym detection). British National Corpus (BNC)^3: a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2: closest-synonym detection). SN (semantic neighbors)^4: relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2: coselection-based evaluation, semantic relationships exploration). BLESS^6: relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2: semantic relationships exploration). TREC^5: includes 1437 sentences annotated with entities and relations (at least one relation each). There are three types of entities, person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2: semantic relationships exploration). IJCNLP 2011-New York Times (NYT)^6: contains 150 business articles from NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2: semantic relationships exploration). IJCNLP 2011-Wikipedia^8: a Wikipedia personal/social relation dataset previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3: task-based evaluation). Reuters-21578^7: contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3: task-based evaluation). 20 Newsgroups^8: contains 20000 documents (18846 are used) categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html.
^2 Available at http://research.microsoft.com/en-us/downloads.
^3 Available at http://corpus.byu.edu/bnc.
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv.
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp.
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip.
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html.

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but not detected closest synonyms or semantic relationships explored in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and worst value at 0. Precision, recall, and F-measure are defined as follows:

$$\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad F\text{-measure} = \frac{2(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \quad (15)$$


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).
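As a sketch of this setup, the snippet below runs agglomerative clustering directly from a pairwise relatedness matrix, converting relatedness to distance first; the conversion dist = 1 - rel and the average-linkage choice are assumptions for illustration, not choices stated by the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hac_from_relatedness(rel, num_clusters):
    """Cluster documents from a pairwise relatedness matrix (values in [0, 1])
    by converting relatedness to distance and running agglomerative clustering."""
    dist = 1.0 - rel                      # higher relatedness -> smaller distance
    iu = np.triu_indices_from(dist, k=1)  # condensed upper-triangle form
    Z = linkage(dist[iu], method="average")
    return fcluster(Z, t=num_clusters, criterion="maxclust")
```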

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used for quality measurement, combining the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

$$P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r} \quad (16)$$

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

$$F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)} \quad (17)$$

A per-class F score is calculated as follows:

$$F(L_r) = \max_{S_i \in C} F(L_r, S_i) \quad (18)$$

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

$$F_C = \frac{\sum_{L_r} |N_r| \cdot F(L_r)}{\sum_{L_r} |N_r|} \quad (19)$$

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
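The per-class maximization and weighted averaging of (16)-(19) can be sketched as follows, with classes and clusters given as label-to-document-set mappings (an assumed input format):

```python
def overall_f_measure(classes, clusters):
    """Eqs. (16)-(19): macroaveraged F-measure of a clustering result.
    `classes` and `clusters` map labels to sets of document ids."""
    total = sum(len(docs) for docs in classes.values())
    f_c = 0.0
    for class_docs in classes.values():
        best_f = 0.0
        for cluster_docs in clusters.values():
            n_ri = len(class_docs & cluster_docs)   # class L_r docs inside S_i
            if n_ri:
                p = n_ri / len(cluster_docs)        # eq. (16), precision
                r = n_ri / len(class_docs)          # eq. (16), recall
                best_f = max(best_f, 2 * p * r / (p + r))  # eqs. (17)-(18)
        f_c += len(class_docs) * best_f             # eq. (19), weighted sum
    return f_c / total
```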

4.2. Evaluation Setup (Dataset). Content-based, coselection-based, and task-based evaluations are used to validate the experimenting over the real corpus of documents; the results are very promising. The experimental setup consists of some datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) and Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 newsgroups.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation as sentence based in experiment 1, document based in experiment 2, and corpus based in experiment 3. In experiment 1, the DS1 and DS2 investigated the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values of human judgments and previous studies. In experiment 2, the DS3 and DS4 investigated the effectiveness of the closest synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, the DS5, DS6, DS7, and DS8 investigated the effectiveness of the IOAC-QL for semantic relationships exploration compared to the implicit relation extraction CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In experiment 3, the DS9 and DS10 investigated the effectiveness of the document linkage process for the clustering case study against previous studies such as word-based [16] and concept-based [17], based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for accurate understanding of the underlying meaning. The proposed SHGRS is exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between MFs of documents. The main purpose of semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is a backbone of the closest synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest synonym detection and the IOAC-QL for semantic relationships exploration algorithms is proportionally improved, and a semantic understanding of text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as more attempts are made to improve the accuracy of measuring the relatedness between documents using the semantic information in the SHGRS. The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of the semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

Table 4: The results of the comparison of the correlation coefficient between human judgment, some relatedness measures, and the proposed semantic relatedness measure.

Measure                               Relevance correlation with M&C
Distance-based measures
  Rada                                0.688
  Wu and Palmer                       0.765
Information-based measures
  Resnik                              0.77
  Jiang and Conrath                   0.848
  Lin                                 0.853
Hybrid measures
  T. Hong and D. Smith                0.879
  Zili Zhou                           0.882
Information/feature-based measures
  The proposed MFsSR in SHGRS         0.937

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity to evaluate the performance of the proposed MFsSR based on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness would be comparable with other previous studies in the same field. In this experiment, the correlation between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3) is computed. Furthermore, the results produced are compared against eight other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between nine componential approaches and the Miller and Charles ratings mean. The semantic relatedness for the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject rating reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with interval 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds instead of under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity to study the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is achieved through the closest-synonym detection and semantic relationship exploration algorithms, so that the results would be comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest synonym detection and the IOAC-QL for semantic relationship exploration algorithms. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the threshold of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision. Thus, the MFsSR between two concepts considers the information content along with most of the WordNet relationships.

Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7] on the MRPC dataset.

Training subset (4076 pairs; human judgment: 2753 true)
Relatedness threshold   Islam and Inkpen [7] Acc / AR   The proposed MFsSR Acc / AR
0.1                     0.67 / 1                        0.68 / 1
0.2                     0.67 / 1                        0.68 / 1
0.3                     0.67 / 1                        0.68 / 1
0.4                     0.67 / 1                        0.68 / 1
0.5                     0.69 / 0.98                     0.68 / 1
0.6                     0.72 / 0.89                     0.68 / 1
0.7                     0.68 / 0.78                     0.70 / 0.98
0.8                     0.56 / 0.4                      0.72 / 0.86
0.9                     0.37 / 0.09                     0.60 / 0.49
1                       0.33 / 0                        0.34 / 0.02

Test subset (1725 pairs; human judgment: 1147 true)
Relatedness threshold   Islam and Inkpen [7] Acc / AR   The proposed MFsSR Acc / AR
0.1                     0.66 / 1                        0.67 / 1
0.2                     0.66 / 1                        0.67 / 1
0.3                     0.66 / 1                        0.67 / 1
0.4                     0.66 / 1                        0.67 / 1
0.5                     0.68 / 0.98                     0.67 / 1
0.6                     0.72 / 0.89                     0.66 / 1
0.7                     0.68 / 0.78                     0.66 / 0.98
0.8                     0.56 / 0.4                      0.64 / 0.85
0.9                     0.38 / 0.09                     0.49 / 0.5
1                       0.33 / 0                        0.34 / 0.03

Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Achieving better precision values is clear, with a high percentage in different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationships) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, which was conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. Thus, IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider is-a and part-of relationships only between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm are approximately up to 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

Dataset   PMI (P / R / F)       ICA (P / R / F)       C-SynD (P / R / F)
SN        60.6 / 60.6 / 61      79.5 / 71.6 / 75.3    85 / 80 / 82
BNC       74.5 / 67.9 / 71      76 / 67.8 / 72        82 / 71.9 / 77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Dataset     CRFs (P / R / F)         TFICF (P / R / F)        IOAC-QL (P / R / F)
TREC        75.08 / 60.2 / 66.28     89.3 / 71.4 / 78.7       89.8 / 88.1 / 88.6
BLESS       73.04 / 62.66 / 67.03    73.8 / 69.5 / 71.6       95.0 / 83.5 / 88.9
NYT         68.46 / 54.02 / 60.38    86.0 / 65.0 / 74.0       90.0 / 74.0 / 81.2
Wikipedia   56.0 / 42.0 / 48.0       64.6 / 54.88 / 59.34     70.0 / 44.0 / 54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms.

Dataset         Algorithm   Term-based FC (%)   Concept-based FC (%)   Graph-based FC (%)
Reuters-21578   HAC         73                  85.7                   90.2
Reuters-21578   K-means     51                  84.6                   91.8
20 Newsgroups   HAC         53                  82                     89
20 Newsgroups   K-means     48                  79.3                   87

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments demonstrate the extensive comparisons between the MF-based analysis and the traditional analysis. Experimental results demonstrate the substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct common similarity subgraphs. The semantic representations of textual documents, determining the relatedness between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base case (term-based) approach were up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to 36% by HAC and 39% by K-means for the 20 newsgroups dataset. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach were up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to 7% by HAC and 8% by K-means for the 20 newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations in textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring relationships and dependencies of MFs, and the relationship between MFs is represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering is used as a case study, and extensive experiments were conducted on 10 different datasets; the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu/.

[13] OpenNLP, http://opennlp.apache.org/.

[14] AlchemyAPI, http://www.alchemyapi.com/.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.


Page 8: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

8 The Scientific World Journal

Input Array Feature Vertex Objects (FVO) WordnetOutput List of Closest-Synonym of FVO objects lowastThe detected closest synonym to set Fsyn value lowastProcedureSRLlarr Empty list list of each synonym and their relationshipsF SRListlarr Empty list list of SRL (all synonyms and their relationships lists)lowastParallelforeach used for parallel processing of all FVO lowastParallelForEach(FVO fvo =gt

Fsynlarr Empty listF SRList = get all Syn Relations List(fvo) lowast function to get all synonyms and their relationships lists from wordnet lowastlowast compute semantic relatedness of each synonym and their relationships with the fvo object then get score of each listlowastfor (int 119894 = 0 119894 lt F SRListCount() 119894++) score[119894] = sum(function Get SemRel(F SRList[119894] fvo)) lowastfunction to compute the score of each

synonym list this score contributes in accurate filtering the returned synonyms and selecting the closest onelowast

Fsyn = maxth(score[]) lowastfunction to select SRL with highest score according to the specific threshold lowast

) ParallelFor

Algorithm 2 C-SynD algorithm

Input Action type and Feature Vertex Object(FVO)Output Detect Links between Feature Vertices and assigned the Features Link Weight(FLW)Procedure ilarr index of the FVO input in FLGnumVertslarr number of Feature Vertex objects (FVO) in Vertex (119881)for (int 119895 = 119894 119895 le numVerts j++)if (Action type == ldquoClosest-Synonymsrdquo) then Fsyn = get C-SynD (FVO[119894]) lowast Function to execute C-SynD algorithm lowastif (FVO[119895]contain(Fsyn)) then FLWlarr SemRel(FVO[119894] FVO[119895]) lowastFunction to compute the value of semantic relatedness betweenthe FVO[119894] and FVO[119895] lowastRel-Typelarr 5 lowast set the relation type with 5 indicates of synonym lowast

elseif (Action type == ldquoAssociationrdquo) then optimal action = get IOAC-QL (FVO) lowast Function to execute IOAC-QL algorithm lowastFLWlarr reward lowastThe reward value of optimal action returnedlowastRel-Typelarr action type lowastThe type of optimal action returned (type = 1 for similarity type = 2 for contiguitytype = 3 for contrast type = 4 for causality) lowast

Rel vertex = 119895 lowast index of the related vertex lowastAdjacent[119894 119895] = FLW lowastset the value of linkage weight in adjacency matrix between related vertex [119894] vertex [119895] lowast

Algorithm 3 SALA links algorithm

can achieve the largest amount of interdependence amongMFs through implementing the inference optimal associationcapabilities with the 119876-learning (IOAC-QL) algorithm

(1) Utilizing Multiple-Goal Reinforcement Learning (RL) RLspecifies what to do but not how to do it through the rewardfunction In sequential decision-making tasks an agent needsto perform a sequence of actions to reach goal states ormultiple-goal states One popular algorithm for dealing withsequential decision-making tasks is 119876-learning [33] In a 119876-learning model a 119876-value is an evaluation of the ldquoqualityrdquo

of an action in a given state 119876(119909 119886) indicates how desirableaction 119886 is in the state 119909 SALA can choose an action based on119876-values The easiest way of choosing an action is to choosethe one that maximizes the 119876-value in the current state

To acquire the119876-values the algorithm119876-learning is usedto estimate the maximum discounted cumulative reinforce-ment that the model will receive from the current state 119909

on max(sum119894=0

120574119894119903119894) where 120574 is a discount factor and 119903

119894is the

reward received at step 119894 (which may be 0) The updating of119876(119909 119886) is based on minimizing 119903 + 120574119891(119910) minus 119876(119909 119886) 119891(119910) =max119886119876(119910 119886) and 119910 is the new state resulting from action

The Scientific World Journal 9

119886 Thus updating is based on the temporal difference inevaluating the current state and the action chosen Throughsuccessive updates of the 119876 function the model can learn totake into account future steps in longer and longer sequencesnotably without explicit planning [33] SALA may eventuallyconverge to a stable function or find an optimal sequence thatmaximizes the reward received Hence the SALA learns toaddress sequential decision-making tasks

Problem Formulation The 119876-learning algorithm is used todetermine the best possible action that is selecting the mosteffective relations analyzing the reward function and updat-ing the related feature weights accordingly The promisingaction can be verified by measuring the relatedness scoreof the action value with the remained MFs using a rewardfunctionThe formulation of the effective association capabil-ities exploration is performed by estimating an action-valuefunction The value of taking action 119886 in state 119904 is definedunder a policy 120587 denoted by 119876(119904 119886) as the expected returnstarting from 119904 and taking the action 119886 Action policy 120587 is adescription of the behavior of the learning SALA The policyis the most important criterion in providing the RL with theability to determine the behavior of the SALA [29]

In single goal reinforcement learning these 119876-values areused only to rank order of the actions in a given state Thekey observation here is that the 119876-values can also be used inmultiple-goal problems to indicate the degree of preferencefor different actions The available way in this paper to selecta promising action to execute is to generate an overall119876-valueas a simple sum of the 119876-values of the individual SALA Theaction with the maximum summed value is then chosen toexecute

Environment State and Actions For the effective associationcapabilities exploration problem the FLG is provided as anenvironment in which intelligent agents can learn and sharetheir knowledge and can explore dependencies and relation-ships among MFs The SALAs receive the environmentalinformation data as a state and execute the elected action inthe form of an action value table which provides the best wayof sharing knowledge among agents The state is defined bythe status of theMF space In each state theMFobject is fittedwith the closest-synonyms relations there is a set of actionsIn each round the SALA decides to explore the effectiveassociations that gain the largest amount of interdependenceamong theMFs Actions are defined by selecting a promisingaction with the highest reward by learning which actionis the optimal for each state In summary once 119904

119896(MF

object) is received from the FLG SALA chooses one ofthe associations as the action 119886

119896according to SALArsquos policy

120587lowast(119904119896) executes the associated action and finally returns the

effective relations Heuristic 3 defines the action variable andthe optimal policy

Heuristic 3 (action variable) Let 119886119896symbolize the action that

is executed by SALA at round 119896

119886119896= 120587lowast(119904119896) (6)

Therein one has the following(i) 119886119896C action space similarity contiguity contrast

and causality(ii) 120587lowast(119904

119896) = argmax

119886119876lowast(119904119896 119886119896)

Once the optimal policy (120587lowast) is obtained the agentchooses the actions using the maximum reward

Reward FunctionThe reward function in this papermeasuresthe dependency and the relatedness among MFs The learneris not told which actions to take as in most forms of machinelearning Rather the learner must discover which actionsyield the most reward by trying them [33]

(2) The SALA Optimal AssociationRelationships InferencesThe optimal association actions are achieved according to theIOAC-QL algorithmThis algorithm starts to select the actionwith the highest expected future reward from each state Theimmediate reward which the SALA gets to execute an actionfrom a state s plus the value of an optimal policy is the 119876-value The higher 119876-value points to the greater chance ofthat action being chosen First initialize all the119876(119904 119886) valuesto zero Then each SALA performs the following steps inHeuristic 4

Heuristic 4(i) At every round 119896 select one action 119886

119896from all the

possible actions (similarity contiguity contrast andcausality) and execute it

(ii) Receive the immediate reward 119903119896and observe the new

state 1199041015840119896

(iii) Get the maximum 119876-value of the state 1199041015840

119896based on

all the possible actions 119886119896= 120587lowast(119904119896) = argmax

119886119876lowast

(119904119896 119886119896)

(iv) Update the 119876(119904119896 119886119896) as follows

119876 (119904119896 119886119896) = (1 minus 120572)119876 (119904

119896 119886119896)

+ 120572(119903 (119904119896 119886119896) + 120574 max1198861015840isin119860(119904119896)

119876 (1199041015840

119896 1198861015840

119896))

(7)

where 119876(119904119896 119886119896) is worthy of selecting the action (119886

119896)

at the state (119904119896) 119876(1199041015840

119896 1198861015840

119896) is worthy of selecting the

next action (1198861015840119896) at the next state (1199041015840

119896) The 119903(119904

119896 119886119896)

is a reward corresponding to the acquired payoff A(1199041015840119896) is a set of possible actions at the next state (1199041015840

119896) 120572

(0 lt 120572 le 1) is learning rate 120574 (0 le 120574 le 1) is discountrate

During each round after taking the action 119886 the actionvalue is extracted from WordNet The weight of the actionvalue is measured to represent the immediate reward wherethe SALA computes the weight AVW

119894119896for the action value 119894

in document 119896 as followsReward (119903) = AVW

119894119896

= AVW119894119896+

119899

sum

119895=1

(FW119895119896

lowast SemRel (119894 119895)) (8)

10 The Scientific World Journal

Input Feature Vertex Object(FVO) WordnetOutput The optimal action for FVO relationshipsProcedure

intialize Q[ ]lowastimplement 119876-learning algorithm to get the action with the highest 119876-value lowast119886119896isin action space[] = similarity contiguity contrast causality

lowastParallelforeach used for parallel processing of all actions lowastParallelfor (0 action spacecount() 119896 =gt 119877[119896] = Get Action Reward(119886[119896]) lowastFunction to implement (8) lowast119876[119894 119896] = (1 minus 120572) lowast 119876[119894 119896] + 120572 lowast (119877[119896] + 120574 lowastmax

119886(119876[119894 119896 + 1]))

) ParallelForreturn Get highest 119876-value action(119876[ ]) lowastselect action with the maximum summed value of 119876-values lowast

lowastexecute (8) to get reward value lowastGet Action Reward(ak)AVW

119896= (freq(119886

119896val)FVOCount()) lowast log(119873119899

119886)

for (int 119895 = 0 119895 lt FVOCount() 119895++) FW119895= (freq(FVO[119895])FVOCount()) lowast log(119873119899

119891)

SemRel(119894 119895) =Get SemRel(119886119896val FVO[119895])

Sum+= FW119895lowast SemRel(119894 119895)

return AVW119896= AVW

119896+ Sum

Algorithm 4 IOAC-QL algorithm

where 119903 is the reward of the selected action 119886119894 AVW

119894119896is the

weight of action value 119894 in document 119896 FW119895119896

is the weightof each feature object FVO

119895in document 119896 and SemRel(119894 119895)

is the semantic relatedness between the action value 119894 andfeature object FVO

119895

The most popular weighting scheme is the normalizedword frequency TFIDF [34] used to measure AVW

119894119896and

FW119895119896The FW

119895119896in (8) is calculated likewise AVW

119894119896is based

on (9) and (10)Action value weight (119860119881119882

119894119896) is a measure to calculate

the weight of the action value that is the scalar product ofaction value frequency and inverse document frequency as in(9)The action value frequency (119865) measures the importanceof this value in a document Inverse document frequency(IDF) measures the general importance of this action valuein a corpus of documents

AVW119894119896

=119865119894119896

sum119896119865119894119896

lowast IDF119894 (9)

where 119865119894119896

represents the number of this action value 119894

cocontributions in document 119889119896 normalized by the number

of cocontributions of all MFs in document 119889119896 normalized to

prevent a bias towards longer documents IDF is performedby dividing the number of all documents by the number ofdocuments containing this action value defined as follows

IDF119894= log( |119863|

1003816100381610038161003816119889119896 V119894 isin 1198891198961003816100381610038161003816

) (10)

where |119863| is the total number of documents in the corpusand |119889

119896 V119894isin 119889119896| is the number of documents containing

action value 119881119894

Thus in the IOAC-QL algorithm (see Algorithm 4)implementation shown in Figure 4 the SALA selects theoptimal action with the higher reward Hence each SALAretains effective actions in the FAss list object of each FVO Inthis list each object hasRel-Type object andRel vertex objectTheRel-Type object value equals the action type object valueand Rel vertex object value equals the related vertex indexThe SALA then links the related FVOs with the bidirectionaleffect using the SALA links algorithm and sets the FLW valueof the linkage weight in the adjacency matrix with the rewardof the selected action

An example of monitoring the SALA decision through the practical implementation of the IOAC-QL algorithm actions is illustrated in Figure 4. For state FVO1, one has the following:

action: similarity; ActionResult State: FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125; Prob: 8; reward: 184; PrevState: FVO1; $Q(s, a) = 91$;
action: contiguity; ActionResult State: FVO103, FVO17, FVO44; Prob: 3; reward: 74; PrevState: FVO1; $Q(s, a) = 61$;
action: contrast; ActionResult State: FVO89; Prob: 1; reward: 22; PrevState: FVO1; $Q(s, a) = 329$;
action: causality; ActionResult State: FVO63, FVO249; Prob: 2; reward: 25; PrevState: FVO1; $Q(s, a) = 429$.

As revealed in the case of FVO1, the SALA receives the environmental information data FVO1 as a state; the SALA then starts the implementation to select the action with the highest expected future reward among the other FVOs.

[Figure 4: The IOAC-QL algorithm implementation. The figure shows the object model construction (a descriptive document object (DDO) over descriptive sentence objects (DSOs) with features F1, ..., Fn; each FVO carries Vertex[V], Adjacent[V, V], FLW, Fval, SFKey, Fsyn[ ], FAss[ ], and FW), the feature linkage graph (FLG), and the reinforcement learning model for semantic relational graph construction, in which the semantic actionable learning agent (SALA) inference engine performs action selection via the Q-learning mechanism, assigning Q-values Q(S_0, a_0), ..., Q(S_m, a_4) within the FLG environment with the aid of WordNet.]

The result of each action is illustrated in the list above; the action with the highest reward is selected, and the elected action is executed. The SALA decision is accomplished in the FLG with the effective extracted relationships according to the $Q(s, a)$ value, as shown in Table 1.

3.3. Document Linkage Conducting Stage. This section presents an efficient linkage process at the corpus level for many documents. Due to the SHGRS scheme of each document, measuring the commonality between the FLGs of the documents by finding all distinct common similarity or relatedness subgraphs is required. The objective is to use the common similar or related subgraphs as a relatedness measure in document mining and managing processes such as clustering, classification, and retrieval. It is essential to establish that the measure is metric: a metric measure ascertains the order of objects in a set, and hence operations such as comparing, searching, and indexing of the objects can be performed.

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), such that the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is selected first, and all of its related vertices are then detected from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations, holding onto the value of the relatedness index of the matched subgraphs. Overlap is eliminated using a dynamic programming technique that keeps track of the visited nodes and their attribute relatedness values.
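The selection loop just described might look as follows; this is an illustrative sketch in which plain vertex weights stand in for the FVO.Fw values and a visited set plays the role of the overlap-elimination bookkeeping:

    def select_subgraphs(weights, adj):
        # Pick the unvisited vertex with the highest weight, take its related
        # vertices from the adjacency matrix as one subgraph, and repeat;
        # tracking visited vertices eliminates overlap between subgraphs.
        visited, subgraphs = set(), []
        for v in sorted(range(len(weights)), key=lambda v: -weights[v]):
            if v in visited:
                continue
            sg = [v] + [u for u, w in enumerate(adj[v])
                        if w > 0 and u not in visited and u != v]
            visited.update(sg)
            subgraphs.append(sg)
        return subgraphs

    adj = [[0, 1, 0], [1, 0, 2], [0, 2, 0]]        # FLW linkage weights
    print(select_subgraphs([0.2, 0.9, 0.5], adj))  # -> [[1, 0, 2]]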

The function relatedness_subgraphs utilizes an array RelNodes[ ] of the related vertices for each document. The function SimRel($x_i$, $y_i$) computes the similarity between the two nodes $x_i$ and $y_i$ as defined in (11). The subgraph matching process is measured using the following formula:

$$\text{Sub-graph-Rel}(\mathrm{SG1}, \mathrm{SG2}) = \frac{\sum_{\mathrm{SG1}_i} \sum_{\mathrm{SG2}_j} \mathrm{SemRel}(N_i, N_j)}{\sum_{\mathrm{SG1}_i} N_i + \sum_{\mathrm{SG2}_j} N_j}, \quad (11)$$

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and $\mathrm{SemRel}(N_i, N_j)$ is the degree of matching for the nodes $N_i$ and $N_j$. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
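Read literally, with the denominator taken as the node counts of the two subgraphs and the 50% node-matching threshold applied pairwise, (11) can be sketched as follows (an interpretation, not the authors' code):

    def subgraph_rel(sg1, sg2, sem_rel, threshold=0.5):
        # Equation (11): sum of SemRel over matching node pairs, divided by the
        # total node count of both subgraphs; pairs under the 50% threshold
        # are treated as non-matching.
        total = 0.0
        for n1 in sg1:
            for n2 in sg2:
                s = sem_rel(n1, n2)
                if s >= threshold:
                    total += s
        denom = len(sg1) + len(sg2)
        return total / denom if denom else 0.0

    rel = lambda a, b: 1.0 if a == b else 0.0  # toy relatedness function
    print(subgraph_rel(["car", "engine"], ["engine", "wheel"], rel))  # 0.25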

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively in the further mining process. To explore the effectiveness of the proposed SHGRS scheme, the correlation of the proposed semantic relatedness measure with other previous relatedness measures must be examined. The impact of the proposed MFsSR for detecting the closest synonym is studied and compared with PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations (IOAC-QL), implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is computed.

Table 1: Action values table with the SALA decision (state FVO1).

Action       ActionResult State                                       Total reward   Q(s, a)   Decision
Similarity   FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125    184            94        Yes
Contiguity   FVO103, FVO17, FVO44                                     74             61        No
Contrast     FVO89                                                    22             329       No
Causality    FVO63, FVO249                                            25             429       No

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation, using precision, recall, and $F$-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task; such tasks include information retrieval, question answering, document classification, document summarizing, and document clustering. In this paper, the content-based, coselection-based, and task-based evaluations are used to validate the implementation of the proposed scheme over a real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness $r$ between two vectors $X$ and $Y$.

Correlation Coefficient $r$. The correlation coefficient is a relatively efficient relatedness measure: a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                            Human judgment
                            True     False
System judgment   True      TP       FP
                  False     FN       TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of $X$ and $Y$. The correlation coefficient $r$ between the sequences $X = \{x_i\}$, $i = 1, \ldots, n$, and $Y = \{y_i\}$, $i = 1, \ldots, n$, is defined by

$$r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\left(\sum_{i=1}^{n} X_i^2\right) \ast \left(\sum_{i=1}^{n} Y_i^2\right)}}. \quad (12)$$

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

$$\mathrm{AR} = \frac{\mathrm{TP}}{(\mathrm{TP} + \mathrm{FN})}. \quad (13)$$

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

$$\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{(\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN})}, \quad (14)$$

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
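All three content-based measures are direct to compute; the sketch below implements (12), (13), and (14) as written (note that the correlation in (12) is the uncentered form printed above):

    import math

    def correlation(x, y):
        # Equation (12): r = sum(X_i * Y_i) / sqrt(sum(X_i^2) * sum(Y_i^2)).
        den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
        return sum(a * b for a, b in zip(x, y)) / den if den else 0.0

    def acceptance_rate(tp, fn):
        # Equation (13): proportion of related pairs that were recognized.
        return tp / (tp + fn)

    def accuracy(tp, tn, fp, fn):
        # Equation (14): proportion of all pairs that were predicted correctly.
        return (tp + tn) / (tp + fn + fp + tn)

    print(correlation([1, 2, 3], [2, 4, 6]))     # 1.0 for proportional vectors
    print(acceptance_rate(tp=90, fn=10))         # 0.9
    print(accuracy(tp=90, tn=80, fp=20, fn=10))  # 0.85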

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, precision $P$, recall $R$, and the $F$-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Datasets details for the experimental setup.

DS1 (Experiment 1, content-based evaluation): Miller and Charles (MC)^1. RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 (Experiment 1, content-based evaluation): Microsoft Research Paraphrase Corpus (MRPC)^2. The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 sentences, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2, coselection-based evaluation, closest-synonym detection): British National Corpus (BNC)^3. BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2, coselection-based evaluation, closest-synonym detection): SN (semantic neighbors)^4. SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2, coselection-based evaluation, semantic relationships exploration): BLESS^6. BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2, coselection-based evaluation, semantic relationships exploration): TREC^5. TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities, person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2, coselection-based evaluation, semantic relationships exploration): IJCNLP 2011-New York Times (NYT)^6. NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2, coselection-based evaluation, semantic relationships exploration): IJCNLP 2011-Wikipedia^8. The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3, task-based evaluation): Reuters-21578^7. Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3, task-based evaluation): 20 Newsgroups^8. The 20 Newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html. ^2 Available at http://research.microsoft.com/en-us/downloads. ^3 Available at http://corpus.byu.edu/bnc. ^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv. ^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp. ^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip. ^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection. ^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html.

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but not detected closest synonyms or semantic relationships in a dataset. The $F$-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an $F$-measure reaches its best value at 1 and worst value at 0. Precision, recall, and $F$-measure are defined as follows:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{(\mathrm{TP} + \mathrm{FN})}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{(\mathrm{TP} + \mathrm{FP})}, \qquad F\text{-measure} = \frac{2\,(\mathrm{Recall} \ast \mathrm{Precision})}{(\mathrm{Recall} + \mathrm{Precision})}. \quad (15)$$
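As a quick illustration of (15) with hypothetical counts:

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_measure(tp, fp, fn):
        # Equation (15): harmonic mean of precision and recall.
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r) if (p + r) else 0.0

    # e.g., 40 correctly detected closest synonyms, 10 spurious, 10 missed:
    print(f_measure(tp=40, fp=10, fn=10))  # 0.8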


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as follows [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). The algorithms used here are agglomerative (hierarchical clustering), $K$-means (flat, hard clustering), and the EM algorithm (flat, soft clustering).

Hierarchical agglomerative clustering (HAC) and the $K$-means algorithm have been applied to text clustering in a straightforward way [16, 17]; the algorithms employed in the experiment included $K$-means clustering and HAC. To evaluate the quality of the clustering, the $F$-measure, which combines the precision and recall measures, is used. These metrics are computed for every (class, cluster) pair. The precision $P$ and recall $R$ of a cluster $S_i$ with respect to a class $L_r$ are defined as

$$P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r}, \quad (16)$$

where $N_{ri}$ is the number of documents of class $L_r$ in cluster $S_i$, $N_i$ is the number of documents of cluster $S_i$, and $N_r$ is the number of documents of class $L_r$.

The $F$-measure, which is the harmonic mean of precision and recall, is defined as

$$F(L_r, S_i) = \frac{2 \ast P(L_r, S_i) \ast R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}. \quad (17)$$

A per-class $F$ score is calculated as follows:

$$F(L_r) = \max_{S_i \in C} F(L_r, S_i). \quad (18)$$

With respect to class $L_r$, the cluster with the highest $F$-measure is considered to be the cluster that maps to class $L_r$, and that $F$-measure becomes the score for class $L_r$; $C$ is the total set of clusters. The overall $F$-measure for the clustering result $C$ is the weighted average of the $F$-measure for each class $L_r$ (macroaverage):

$$F_C = \frac{\sum_{L_r} \left(|N_r| \ast F(L_r)\right)}{\sum_{L_r} |N_r|}, \quad (19)$$

where $|N_r|$ is the number of objects in class $L_r$. The higher the overall $F$-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
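Equations (16)-(19) combine into a short routine; the sketch below assumes the class-cluster contingency counts are available as a nested list (an illustrative structure, not the authors' code):

    def overall_f(counts):
        # counts[r][i] = number of documents of class r assigned to cluster i.
        classes = range(len(counts))
        clusters = range(len(counts[0]))
        n_cluster = [sum(counts[r][i] for r in classes) for i in clusters]
        n_class = [sum(counts[r][i] for i in clusters) for r in classes]

        def f(r, i):
            # (16) and (17): precision, recall, and their harmonic mean.
            if counts[r][i] == 0:
                return 0.0
            p, rec = counts[r][i] / n_cluster[i], counts[r][i] / n_class[r]
            return 2 * p * rec / (p + rec)

        f_class = [max(f(r, i) for i in clusters) for r in classes]  # (18)
        # (19): macroaverage weighted by class sizes.
        return sum(n_class[r] * f_class[r] for r in classes) / sum(n_class)

    # two classes split perfectly across two clusters -> F_C = 1.0
    print(overall_f([[10, 0], [0, 5]]))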

4.2. Evaluation Setup (Dataset). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over the real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) set and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses Reuters-21578 and 20 Newsgroups.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation as sentence-based in experiment 1, document-based in experiment 2, and corpus-based in experiment 3. In experiment 1, DS1 and DS2 investigated the effectiveness of the proposed MFsSR compared with previous studies, based on the availability of semantic relatedness values of human judgments and previous studies. In experiment 2, DS3 and DS4 investigated the effectiveness of the closest-synonym detection using MFsSR compared with PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 investigated the effectiveness of the IOAC-QL for semantic relationships exploration compared with implicit relation extraction using CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In experiment 3, DS9 and DS10 investigated the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] approaches, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of MFsSR for closest-synonym detection and of IOAC-QL for semantic relationships exploration improves proportionally as more semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as more effort is put into improving the accuracy of measuring the relatedness between documents using semantic information. The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of the semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.


Table 4: The results of the comparison of the correlation coefficient between human judgment, some relatedness measures, and the proposed semantic relatedness measure.

Measure                                 Relevance correlation with M&C
Distance-based measures
    Rada                                0.688
    Wu and Palmer                       0.765
Information-based measures
    Resnik                              0.77
    Jiang and Conrath                   0.848
    Lin                                 0.853
Hybrid measures
    T. Hong and D. Smith                0.879
    Zili Zhou                           0.882
Information/feature-based measures
    The proposed MFsSR in SHGRS         0.937

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, the correlation between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3) is computed. Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, those of Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between the eight componential approaches and the mean of the Miller and Charles ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach are based on the best Acc and AR across all thresholds instead of under the same relatedness threshold. This improvement in the Acc and AR values is due to the increase in the number of pairs predicted correctly after considering the direct/indirect relevance; this relevance takes into account the closest synonym and the relationships of the sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is achieved through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared with the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest-synonym detection and of the IOAC-QL for the semantic relationship exploration algorithms. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI; according to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the threshold of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision.

Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

MRPC dataset, training subset (4076); human judgment (TP + FN): 2753 true

Relatedness threshold   Islam and Inkpen [7]: Acc, AR   The proposed MFsSR: Acc, AR
0.1                     0.67, 1                         0.68, 1
0.2                     0.67, 1                         0.68, 1
0.3                     0.67, 1                         0.68, 1
0.4                     0.67, 1                         0.68, 1
0.5                     0.69, 0.98                      0.68, 1
0.6                     0.72, 0.89                      0.68, 1
0.7                     0.68, 0.78                      0.70, 0.98
0.8                     0.56, 0.4                       0.72, 0.86
0.9                     0.37, 0.09                      0.60, 0.49
1                       0.33, 0                         0.34, 0.02

MRPC dataset, test subset (1725); human judgment (TP + FN): 1147 true

Relatedness threshold   Islam and Inkpen [7]: Acc, AR   The proposed MFsSR: Acc, AR
0.1                     0.66, 1                         0.67, 1
0.2                     0.66, 1                         0.67, 1
0.3                     0.66, 1                         0.67, 1
0.4                     0.66, 1                         0.67, 1
0.5                     0.68, 0.98                      0.67, 1
0.6                     0.72, 0.89                      0.66, 1
0.7                     0.68, 0.78                      0.66, 0.98
0.8                     0.56, 0.4                       0.64, 0.85
0.9                     0.38, 0.09                      0.49, 0.5
1                       0.33, 0                         0.34, 0.03

Thus, the MFsSR between two concepts considers the information content along with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared with the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted on two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and $F$-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and $F$-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are clearly achieved, with a high percentage, in different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among the MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, $Q$-learning is used to select the optimal action (relationships) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared with the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and $F$-measure values due to the increase in the number of pairs predicted correctly after considering the direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF: IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the $F$-measure achieved through the IOAC-QL algorithm reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher $F$-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

Table 6: The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym.

              PMI                               ICA                               C-SynD
Datasets      Precision  Recall  F-measure      Precision  Recall  F-measure      Precision  Recall  F-measure
SN            60.6       60.6    61             79.5       71.6    75.3           85         80      82
BNC           74.5       67.9    71             76         67.8    72             82         71.9    77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches.

              CRFs                              TFICF                             IOAC-QL
Datasets      Precision  Recall  F-measure      Precision  Recall  F-measure      Precision  Recall  F-measure
TREC          75.08      60.2    66.28          89.3       71.4    78.7           89.8       88.1    88.6
BLESS         73.04      62.66   67.03          73.8       69.5    71.6           95.0       83.5    88.9
NYT           68.46      54.02   60.38          86.0       65.0    74.0           90.0       74.0    81.2
Wikipedia     56.0       42.0    48.0           64.6       54.88   59.34          70.0       44.0    54.0

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation).

This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis, and the results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct common similarity subgraphs. The semantic representations of textual documents, and the relatedness between them determined, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MF-based analysis and the subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) $K$-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MF-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment was repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the $F$-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using $K$-means and the HAC clustering algorithms.

                                Term-based   Concept-based   Graph-based
Datasets         Algorithm      FC           FC              FC
Reuters-21578    HAC            73           85.7            90.2
                 K-means        51           84.6            91.8
20 Newsgroups    HAC            53           82              89
                 K-means        48           79.3            87

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted on two corpora of text documents, the Reuters-21578 dataset and the 20 Newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base case (term-based) approach reach approximately 17% by HAC and 40% by $K$-means for the Reuters-21578 dataset, and 36% by HAC and 39% by $K$-means for the 20 Newsgroups dataset. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach reach approximately 5% by HAC and 7% by $K$-means for the Reuters-21578 dataset, and 7% by HAC and 8% by $K$-means for the 20 Newsgroups dataset.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, with extensive experiments conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with the traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.
[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.
[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.
[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.
[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.
[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.
[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.
[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.
[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.
[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.
[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.
[12] WordNet, http://wordnet.princeton.edu.
[13] OpenNLP, http://opennlp.apache.org.
[14] AlchemyAPI, http://www.alchemyapi.com.
[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.
[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.
[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.
[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.
[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.
[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.
[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference on the Semantic Web (ISWC '10), vol. 6496, pp. 615–630, 2010.
[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.
[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.
[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.
[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.
[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.
[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.
[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.
[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.
[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.
[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.
[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.
[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.
[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.
[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.
[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.
[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.
[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.
[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.
[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.


Page 9: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

The Scientific World Journal 9

119886 Thus updating is based on the temporal difference inevaluating the current state and the action chosen Throughsuccessive updates of the 119876 function the model can learn totake into account future steps in longer and longer sequencesnotably without explicit planning [33] SALA may eventuallyconverge to a stable function or find an optimal sequence thatmaximizes the reward received Hence the SALA learns toaddress sequential decision-making tasks

Problem Formulation The 119876-learning algorithm is used todetermine the best possible action that is selecting the mosteffective relations analyzing the reward function and updat-ing the related feature weights accordingly The promisingaction can be verified by measuring the relatedness scoreof the action value with the remained MFs using a rewardfunctionThe formulation of the effective association capabil-ities exploration is performed by estimating an action-valuefunction The value of taking action 119886 in state 119904 is definedunder a policy 120587 denoted by 119876(119904 119886) as the expected returnstarting from 119904 and taking the action 119886 Action policy 120587 is adescription of the behavior of the learning SALA The policyis the most important criterion in providing the RL with theability to determine the behavior of the SALA [29]

In single goal reinforcement learning these 119876-values areused only to rank order of the actions in a given state Thekey observation here is that the 119876-values can also be used inmultiple-goal problems to indicate the degree of preferencefor different actions The available way in this paper to selecta promising action to execute is to generate an overall119876-valueas a simple sum of the 119876-values of the individual SALA Theaction with the maximum summed value is then chosen toexecute

Environment State and Actions For the effective associationcapabilities exploration problem the FLG is provided as anenvironment in which intelligent agents can learn and sharetheir knowledge and can explore dependencies and relation-ships among MFs The SALAs receive the environmentalinformation data as a state and execute the elected action inthe form of an action value table which provides the best wayof sharing knowledge among agents The state is defined bythe status of theMF space In each state theMFobject is fittedwith the closest-synonyms relations there is a set of actionsIn each round the SALA decides to explore the effectiveassociations that gain the largest amount of interdependenceamong theMFs Actions are defined by selecting a promisingaction with the highest reward by learning which actionis the optimal for each state In summary once 119904

119896(MF

object) is received from the FLG SALA chooses one ofthe associations as the action 119886

119896according to SALArsquos policy

120587lowast(119904119896) executes the associated action and finally returns the

effective relations Heuristic 3 defines the action variable andthe optimal policy

Heuristic 3 (action variable) Let 119886119896symbolize the action that

is executed by SALA at round 119896

119886119896= 120587lowast(119904119896) (6)

Therein one has the following(i) 119886119896C action space similarity contiguity contrast

and causality(ii) 120587lowast(119904

119896) = argmax

119886119876lowast(119904119896 119886119896)

Once the optimal policy (120587lowast) is obtained the agentchooses the actions using the maximum reward

Reward FunctionThe reward function in this papermeasuresthe dependency and the relatedness among MFs The learneris not told which actions to take as in most forms of machinelearning Rather the learner must discover which actionsyield the most reward by trying them [33]

(2) The SALA Optimal AssociationRelationships InferencesThe optimal association actions are achieved according to theIOAC-QL algorithmThis algorithm starts to select the actionwith the highest expected future reward from each state Theimmediate reward which the SALA gets to execute an actionfrom a state s plus the value of an optimal policy is the 119876-value The higher 119876-value points to the greater chance ofthat action being chosen First initialize all the119876(119904 119886) valuesto zero Then each SALA performs the following steps inHeuristic 4

Heuristic 4(i) At every round 119896 select one action 119886

119896from all the

possible actions (similarity contiguity contrast andcausality) and execute it

(ii) Receive the immediate reward 119903119896and observe the new

state 1199041015840119896

(iii) Get the maximum 119876-value of the state 1199041015840

119896based on

all the possible actions 119886119896= 120587lowast(119904119896) = argmax

119886119876lowast

(119904119896 119886119896)

(iv) Update the 119876(119904119896 119886119896) as follows

119876 (119904119896 119886119896) = (1 minus 120572)119876 (119904

119896 119886119896)

+ 120572(119903 (119904119896 119886119896) + 120574 max1198861015840isin119860(119904119896)

119876 (1199041015840

119896 1198861015840

119896))

(7)

where 119876(119904119896 119886119896) is worthy of selecting the action (119886

119896)

at the state (119904119896) 119876(1199041015840

119896 1198861015840

119896) is worthy of selecting the

next action (1198861015840119896) at the next state (1199041015840

119896) The 119903(119904

119896 119886119896)

is a reward corresponding to the acquired payoff A(1199041015840119896) is a set of possible actions at the next state (1199041015840

119896) 120572

(0 lt 120572 le 1) is learning rate 120574 (0 le 120574 le 1) is discountrate

During each round after taking the action 119886 the actionvalue is extracted from WordNet The weight of the actionvalue is measured to represent the immediate reward wherethe SALA computes the weight AVW

119894119896for the action value 119894

in document 119896 as followsReward (119903) = AVW

119894119896

= AVW119894119896+

119899

sum

119895=1

(FW119895119896

lowast SemRel (119894 119895)) (8)

10 The Scientific World Journal

Input Feature Vertex Object(FVO) WordnetOutput The optimal action for FVO relationshipsProcedure

intialize Q[ ]lowastimplement 119876-learning algorithm to get the action with the highest 119876-value lowast119886119896isin action space[] = similarity contiguity contrast causality

lowastParallelforeach used for parallel processing of all actions lowastParallelfor (0 action spacecount() 119896 =gt 119877[119896] = Get Action Reward(119886[119896]) lowastFunction to implement (8) lowast119876[119894 119896] = (1 minus 120572) lowast 119876[119894 119896] + 120572 lowast (119877[119896] + 120574 lowastmax

119886(119876[119894 119896 + 1]))

) ParallelForreturn Get highest 119876-value action(119876[ ]) lowastselect action with the maximum summed value of 119876-values lowast

lowastexecute (8) to get reward value lowastGet Action Reward(ak)AVW

119896= (freq(119886

119896val)FVOCount()) lowast log(119873119899

119886)

for (int 119895 = 0 119895 lt FVOCount() 119895++) FW119895= (freq(FVO[119895])FVOCount()) lowast log(119873119899

119891)

SemRel(119894 119895) =Get SemRel(119886119896val FVO[119895])

Sum+= FW119895lowast SemRel(119894 119895)

return AVW119896= AVW

119896+ Sum

Algorithm 4 IOAC-QL algorithm

where 119903 is the reward of the selected action 119886119894 AVW

119894119896is the

weight of action value 119894 in document 119896 FW119895119896

is the weightof each feature object FVO

119895in document 119896 and SemRel(119894 119895)

is the semantic relatedness between the action value 119894 andfeature object FVO

119895

The most popular weighting scheme is the normalizedword frequency TFIDF [34] used to measure AVW

119894119896and

FW119895119896The FW

119895119896in (8) is calculated likewise AVW

119894119896is based

on (9) and (10)Action value weight (119860119881119882

119894119896) is a measure to calculate

the weight of the action value that is the scalar product ofaction value frequency and inverse document frequency as in(9)The action value frequency (119865) measures the importanceof this value in a document Inverse document frequency(IDF) measures the general importance of this action valuein a corpus of documents

AVW119894119896

=119865119894119896

sum119896119865119894119896

lowast IDF119894 (9)

where 119865119894119896

represents the number of this action value 119894

cocontributions in document 119889119896 normalized by the number

of cocontributions of all MFs in document 119889119896 normalized to

prevent a bias towards longer documents IDF is performedby dividing the number of all documents by the number ofdocuments containing this action value defined as follows

IDF119894= log( |119863|

1003816100381610038161003816119889119896 V119894 isin 1198891198961003816100381610038161003816

) (10)

where |119863| is the total number of documents in the corpusand |119889

119896 V119894isin 119889119896| is the number of documents containing

action value 119881119894

Thus in the IOAC-QL algorithm (see Algorithm 4)implementation shown in Figure 4 the SALA selects theoptimal action with the higher reward Hence each SALAretains effective actions in the FAss list object of each FVO Inthis list each object hasRel-Type object andRel vertex objectTheRel-Type object value equals the action type object valueand Rel vertex object value equals the related vertex indexThe SALA then links the related FVOs with the bidirectionaleffect using the SALA links algorithm and sets the FLW valueof the linkage weight in the adjacency matrix with the rewardof the selected action

An example of monitoring the SALA decision throughthe practical implementation of the IOAC-QL algorithmactions is illustrated in Figure 4

For state FVO1 one has the following

action similarity ActionResult State FVO1FVO7

FVO9FVO22FVO75FVO92FVO112

FVO125

Prob 8reward 184 PrevState FVO

1 119876(119904 119886) = 91

action contiguity ActionResult State FVO103

FVO17

FVO44 Prob 3 reward 74 PrevState FVO

1 119876(119904 119886) =

61action contrast ActionResult State FVO

89 Prob 1

reward 22 PrevState FVO1 119876(119904 119886) = 329

action causality ActionResult State FVO63

FVO249

Prob 2 reward 25 PrevState FVO

1 119876(119904 119886) = 429

As revealed in the case FVO1 the SALA receives the

environmental information data FVO1as a state and then

the SALA starts the implementation to select the actionwith the highest expected future reward among the otherFVOs The result of each action is illustrated the action with

The Scientific World Journal 11

S0

S0

S1

S1S2

a0 a0a0a4 a4 a4

DSO DSO

Sm

Sm

Object model construction

FVO

Vertex [V] Adjacent [V V]

FLWFval

SFKeyFsyn[]

FAss[]FW

The feature linkage graph (FLG)

Descriptive document object

Descriptive sentence object

Action selection

Q-learning mechanism

Assign Q-values with FLG environment

Reinforcement learning model for semanticrelational graph construction

SALA inference engine

DSO DSO DSO DSO DSO

DDO

Sem

antic

actio

nabl

e lea

rnin

g ag

ent (

SALA

)

F1

F1

F1

F2

F3

F4

F2

F2 F3 F4 middot middot middot Fn

DSO

Fn

Fn

middot middot middot

middot middot middot

middot middot middot

middot middot middot middot middot middotmiddot middot middot

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

middot middot middot

middot middot middot

Q(S0 a0) Q(S0 a4)Q(S1 a0) Q(S1 a4)

Q(Sm a0) Q(Sm a4)

WordNet

Figure 4 The IOAC-QL algorithm implementation

the highest reward is selected and the elected action isexecutedTheSALAdecision is accomplished in FLGwith theeffective extracted relationships according to 119876(119904 119886) value asshown in Table 1

33 Document Linkage Conducting Stage This section pre-sents an efficient linkage process at the corpus level for manydocuments Due to the SHGRS scheme of each documentmeasuring the commonality between the FLG of the doc-uments through finding all distinct common similarity orrelatedness subgraphs is required The objective of findingcommon similar or related subgraphs as a relatedness mea-sure is to be utilized in document mining and managingprocesses such as clustering classification and retrieval Itis essential to establish that the measure is metric A metricmeasure ascertains the order of objects in a set and henceoperations such as comparing searching and indexing of theobjects can be performed

Graph relatedness [35, 36] involves determining the degree of similarity between two graphs (a number between 0 and 1), so the same node in both graphs would be similar if its related nodes were similar. The selection of subgraphs for matching depends on detecting the more important nodes in the FLG using the FVO.Fw value. The vertex with the highest weight is the first node selected, which then detects all related vertices from the adjacency matrix as a subgraph. This process depends on computing the maximum common subgraphs. The inclusion of all common relatedness subgraphs is accomplished through recursive iterations and holding onto the value of the relatedness index of the matched subgraphs. Eliminating the overlap is performed using a dynamic programming technique that keeps track of the nodes visited and their attribute relatedness values.

The function relatedness subgraphs utilizes an array RelNodes[] of the related vertices for each document. The function SimRel(x_i, y_i) computes the similarity between the two nodes x_i and y_i, as defined in (11). The subgraph matching process is measured using the following formula:

Sub-graph-Rel(SG1, SG2) = \frac{\sum_{i \in SG1} \sum_{j \in SG2} \mathrm{SemRel}(N_i, N_j)}{\sum_{i \in SG1} N_i + \sum_{j \in SG2} N_j},  (11)

where SG1 is the selected subgraph from the FLG of one document, SG2 is the selected subgraph from the FLG of another document, and SemRel(N_i, N_j) is the degree of matching for the two nodes N_i and N_j. A threshold value for node matching is also utilized; the threshold is set to a 50% match.
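A minimal sketch of this measure follows, under the simplifying assumptions that each node contributes a unit weight to the denominator and that sem_rel is a caller-supplied node-to-node relatedness in [0, 1] (both names are illustrative, not the paper's exact implementation):

    MATCH_THRESHOLD = 0.5  # the 50% node-matching threshold

    def subgraph_rel(sg1, sg2, sem_rel):
        """Eq. (11): summed node matches over the combined subgraph size."""
        matched = 0.0
        for a in sg1:
            for b in sg2:
                r = sem_rel(a, b)
                if r >= MATCH_THRESHOLD:  # counted as a node match
                    matched += r
        # nodes below the threshold contribute nothing to the numerator
        return matched / (len(sg1) + len(sg2))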

4. Evaluation of the Experiments and Results

This evaluation is especially vital, as the aim of building the SHGRS is to use it efficiently and effectively for the further mining process. To explore the effectiveness of the proposed SHGRS scheme, examining the correlation of the proposed semantic relatedness measure compared with other previous relatedness measures is required. The impact of the proposed MFsSR for detecting the closest synonym is studied and compared to PMI [25, 37] and independent component analysis (ICA) [38] for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations (IOAC-QL), implicit relation extraction with conditional random fields (CRFs) [6, 39], and term frequency and inverse cluster frequency (TFICF) [40] is computed.

Table 1: Action values table with the SALA decision.

State        | FVO1                                                        | FVO1                 | FVO1  | FVO1
Action       | Similarity                                                  | Contiguity           | Contrast | Causality
ActionResult | State FVO1: FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125 | FVO103, FVO17, FVO44 | FVO89 | FVO63, FVO249
Total reward | 18.4                                                        | 7.4                  | 2.2   | 2.5
Q(s, a)      | 9.4                                                         | 6.1                  | 3.29  | 4.29
Decision     | Yes                                                         | No                   | No    | No

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task. Some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering methods. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over the real corpus of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure, which is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table presenting human judgment and system judgment.

                       | Human judgment: True | Human judgment: False
System judgment: True  | TP                   | FP
System judgment: False | FN                   | TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} is defined by

r = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{(\sum_{i=1}^{n} X_i^2)(\sum_{i=1}^{n} Y_i^2)}}.  (12)

Acceptance Rate (AR). The acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

\mathrm{AR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.  (13)

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}},  (14)

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
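For concreteness, a small Python rendering of (12)-(14) follows; the function names are illustrative only:

    import math

    def correlation(x, y):
        """Uncentered correlation coefficient of Eq. (12)."""
        num = sum(a * b for a, b in zip(x, y))
        den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
        return num / den

    def acceptance_rate(tp, fn):
        return tp / (tp + fn)                    # Eq. (13)

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + fn + fp + tn)   # Eq. (14)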

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, a precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.

Table 3: Dataset details for the experimental setup.

DS1 (Experiment 1: content-based evaluation) | Miller and Charles (MC)^1: RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 (Experiment 1: content-based evaluation) | Microsoft Research Paraphrase Corpus (MRPC)^2: the corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2: coselection-based evaluation, closest-synonym detection) | British National Corpus (BNC)^3: BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2: coselection-based evaluation, closest-synonym detection) | SN (semantic neighbors)^4: SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2: coselection-based evaluation, semantic relationships exploration) | BLESS^6: BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2: coselection-based evaluation, semantic relationships exploration) | TREC^5: TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2: coselection-based evaluation, semantic relationships exploration) | IJCNLP 2011-New York Times (NYT)^6: NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2: coselection-based evaluation, semantic relationships exploration) | IJCNLP 2011-Wikipedia^6: the Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3: task-based evaluation) | Reuters-21578^7: Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3: task-based evaluation) | 20 Newsgroups^8: the 20 Newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

1. Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
2. Available at http://research.microsoft.com/en-us/downloads
3. Available at http://corpus.byu.edu/bnc
4. Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
5. Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
6. Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
7. Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
8. Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but undetected closest synonyms or semantic relationships in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and its worst value at 0. Precision, recall, and F-measure are defined as follows:

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
F\text{-measure} = \frac{2 (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}}.  (15)
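These reduce to a few lines of Python; the following is a hedged sketch with illustrative names:

    def precision_recall_f_measure(tp, fp, fn):
        """Coselection measures of Eq. (15)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * recall * precision / (recall + precision)
        return precision, recall, f_measure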


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).
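As an illustration of feeding such pairwise relatedness into a standard algorithm, the following sketch runs HAC over a document-relatedness matrix using SciPy; converting relatedness to distance as 1 - relatedness is a simplifying assumption for the example, not the paper's prescription:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def cluster_from_relatedness(rel, n_clusters):
        """Agglomerative clustering from a symmetric relatedness matrix."""
        dist = 1.0 - np.asarray(rel, dtype=float)
        np.fill_diagonal(dist, 0.0)                  # zero self-distance
        condensed = squareform(dist, checks=False)   # condensed vector form
        tree = linkage(condensed, method="average")  # HAC merge tree
        return fcluster(tree, t=n_clusters, criterion="maxclust")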

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure, which combines the precision and recall measures, is used as the quality measure. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

P(L_r, S_i) = \frac{N_{ri}}{N_i}, \quad R(L_r, S_i) = \frac{N_{ri}}{N_r},  (16)

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is the harmonic mean of precision and recall, is defined as

F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}.  (17)

A per-class F score is calculated as follows:

F(L_r) = \max_{S_i \in C} F(L_r, S_i).  (18)

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

F_C = \frac{\sum_{L_r} |N_r| \cdot F(L_r)}{\sum_{L_r} |N_r|},  (19)

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
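Putting (16)-(19) together, a compact Python sketch that scores a clustering against gold classes follows (names illustrative):

    from collections import Counter

    def overall_f_measure(labels, clusters):
        """Macro-averaged clustering F-measure F_C of Eqs. (16)-(19)."""
        n_ri = Counter(zip(labels, clusters))  # docs of class r in cluster i
        n_i = Counter(clusters)                # cluster sizes
        n_r = Counter(labels)                  # class sizes
        total = 0.0
        for r, size_r in n_r.items():
            best = 0.0
            for i, size_i in n_i.items():
                p = n_ri[(r, i)] / size_i      # precision, Eq. (16)
                rc = n_ri[(r, i)] / size_r     # recall, Eq. (16)
                if p + rc > 0:
                    best = max(best, 2 * p * rc / (p + rc))  # Eqs. (17)-(18)
            total += size_r * best
        return total / len(labels)             # Eq. (19)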

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over the real corpus of documents; the results are very promising. The experimental setup consists of several datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses Reuters-21578 and 20 Newsgroups.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation: sentence-based in experiment 1, document-based in experiment 2, and corpus-based in experiment 3. In experiment 1, DS1 and DS2 investigated the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 investigated the effectiveness of the closest-synonym detection using MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 investigated the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relations judgments. In experiment 3, DS9 and DS10 investigated the effectiveness of the document linkage process for the clustering case study against previous studies, word-based [16] and concept-based [17], based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as more attempts are made to improve the accuracy of measuring the relatedness between documents using the semantic information in the SHGRS. The relatedness subgraphs measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of the semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

Table 4: Comparison of the correlation coefficient with human judgment for some relatedness measures and the proposed semantic relatedness measure.

Measure                            | Relevance correlation with M&C
Distance-based measures
  Rada                             | 0.688
  Wu and Palmer                    | 0.765
Information-based measures
  Resnik                           | 0.77
  Jiang and Conrath                | 0.848
  Lin                              | 0.853
Hybrid measures
  T. Hong and D. Smith             | 0.879
  Zili Zhou                        | 0.882
Information/feature-based measures
  The proposed MFsSR               | 0.937

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR based on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness are comparable with other previous studies in the same field. In this experiment, we attempted to compute the correlation between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between these componential approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds instead of under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of sentences' MFs.

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest-synonym detection and of the IOAC-QL for the semantic relationship exploration algorithms. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the threshold of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision. Thus, the MFsSR between two concepts considers the information content along with most of the WordNet relationships.

Table 5: Comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7] on the MRPC dataset.

Training subset (4076 pairs), human judgment (TP + FN): 2753 true
Threshold | Islam and Inkpen [7]: Acc, AR | Proposed MFsSR: Acc, AR
0.1 | 0.67, 1    | 0.68, 1
0.2 | 0.67, 1    | 0.68, 1
0.3 | 0.67, 1    | 0.68, 1
0.4 | 0.67, 1    | 0.68, 1
0.5 | 0.69, 0.98 | 0.68, 1
0.6 | 0.72, 0.89 | 0.68, 1
0.7 | 0.68, 0.78 | 0.70, 0.98
0.8 | 0.56, 0.4  | 0.72, 0.86
0.9 | 0.37, 0.09 | 0.60, 0.49
1   | 0.33, 0    | 0.34, 0.02

Test subset (1725 pairs), human judgment (TP + FN): 1147 true
Threshold | Islam and Inkpen [7]: Acc, AR | Proposed MFsSR: Acc, AR
0.1 | 0.66, 1    | 0.67, 1
0.2 | 0.66, 1    | 0.67, 1
0.3 | 0.66, 1    | 0.67, 1
0.4 | 0.66, 1    | 0.67, 1
0.5 | 0.68, 0.98 | 0.67, 1
0.6 | 0.72, 0.89 | 0.66, 1
0.7 | 0.68, 0.78 | 0.66, 0.98
0.8 | 0.56, 0.4  | 0.64, 0.85
0.9 | 0.38, 0.09 | 0.49, 0.5
1   | 0.33, 0    | 0.34, 0.03

Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Achieving better precision values, with a high percentage, is clear in different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationships) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, which was conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. Thus, IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm were up to approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

Table 6: Results of the C-SynD algorithm based on the MFsSR compared to PMI and ICA for detecting the closest synonym.

Datasets | PMI: P, R, F         | ICA: P, R, F         | C-SynD: P, R, F
SN       | 60.6, 60.6, 61       | 79.5, 71.6, 75.3     | 85, 80, 82
BNC      | 74.5, 67.9, 71       | 76, 67.8, 72         | 82, 71.9, 77

Table 7: Results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Datasets  | CRFs: P, R, F        | TFICF: P, R, F       | IOAC-QL: P, R, F
TREC      | 75.08, 60.2, 66.28   | 89.3, 71.4, 78.7     | 89.8, 88.1, 88.6
BLESS     | 73.04, 62.66, 67.03  | 73.8, 69.5, 71.6     | 95.0, 83.5, 88.9
NYT       | 68.46, 54.02, 60.38  | 86.0, 65.0, 74.0     | 90.0, 74.0, 81.2
Wikipedia | 56.0, 42.0, 48.0     | 64.6, 54.88, 59.34   | 70.0, 44.0, 54.0

Table 8: Results of the graph-based approach compared to the term-based and concept-based approaches using the K-means and HAC clustering algorithms.

Datasets      | Algorithm | Term-based FC | Concept-based FC | Graph-based FC
Reuters-21578 | HAC       | 73            | 85.7             | 90.2
Reuters-21578 | K-means   | 51            | 84.6             | 91.8
20 Newsgroups | HAC       | 53            | 82               | 89
20 Newsgroups | K-means   | 48            | 79.3             | 87

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments demonstrate the extensive comparisons between the MF-based analysis and the traditional analysis. Experimental results demonstrate the substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct common similarity subgraphs. The semantic representations of textual documents and the relatedness determined between them are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MF-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MF-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 Newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach were up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to approximately 36% by HAC and 39% by K-means for the 20 Newsgroups dataset, over the base case (term-based) approach. Furthermore, improvements were also achieved by the graph-based approach of up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to approximately 7% by HAC and 8% by K-means for the 20 Newsgroups dataset, over the concept-based approach.


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, extensive experiments were conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241-250, 2008.
[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.
[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217-220, 2007.
[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871-882, 2003.
[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834-843, Springer, Berlin, Germany, 2006.
[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296-303, June 2006.
[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.
[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.
[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.
[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76-92, 2007.
[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172-212, Springer, New York, NY, USA, 2005.
[12] WordNet, http://wordnet.princeton.edu.
[13] OpenNLP, http://opennlp.apache.org.
[14] AlchemyAPI, http://www.alchemyapi.com.
[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151-1162, 2006.
[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477-482, Miami, Fla, USA, December 2009.
[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360-1371, 2010.
[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21-33, 2012.
[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784-797, 2010.
[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.
[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference on the Semantic Web (ISWC '10), vol. 6496, pp. 615-630, 2010.
[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141-166, 2012.
[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.
[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.
[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307-1322, 2013.
[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172-181, Springer, Berlin, Germany, 2008.
[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390-405, 2011.
[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9-12, 2006.
[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261-275, 2008.
[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027-1035, 2012.
[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85-101, 2010.
[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web - ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583-596, Springer, Berlin, Germany, 2006.
[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721-735, 2009.
[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548-1553, 2011.
[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038-1051, 2004.
[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254-1262, 2010.
[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146-155, 2013.
[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013-1023, MIT, Cambridge, Mass, USA, October 2010.
[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749-761, 2012.
[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217-1229, 2008.
[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1-28, 1991.
[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining, August 2000.


Page 10: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

10 The Scientific World Journal

Input Feature Vertex Object(FVO) WordnetOutput The optimal action for FVO relationshipsProcedure

intialize Q[ ]lowastimplement 119876-learning algorithm to get the action with the highest 119876-value lowast119886119896isin action space[] = similarity contiguity contrast causality

lowastParallelforeach used for parallel processing of all actions lowastParallelfor (0 action spacecount() 119896 =gt 119877[119896] = Get Action Reward(119886[119896]) lowastFunction to implement (8) lowast119876[119894 119896] = (1 minus 120572) lowast 119876[119894 119896] + 120572 lowast (119877[119896] + 120574 lowastmax

119886(119876[119894 119896 + 1]))

) ParallelForreturn Get highest 119876-value action(119876[ ]) lowastselect action with the maximum summed value of 119876-values lowast

lowastexecute (8) to get reward value lowastGet Action Reward(ak)AVW

119896= (freq(119886

119896val)FVOCount()) lowast log(119873119899

119886)

for (int 119895 = 0 119895 lt FVOCount() 119895++) FW119895= (freq(FVO[119895])FVOCount()) lowast log(119873119899

119891)

SemRel(119894 119895) =Get SemRel(119886119896val FVO[119895])

Sum+= FW119895lowast SemRel(119894 119895)

return AVW119896= AVW

119896+ Sum

Algorithm 4 IOAC-QL algorithm

where 119903 is the reward of the selected action 119886119894 AVW

119894119896is the

weight of action value 119894 in document 119896 FW119895119896

is the weightof each feature object FVO

119895in document 119896 and SemRel(119894 119895)

is the semantic relatedness between the action value 119894 andfeature object FVO

119895

The most popular weighting scheme is the normalizedword frequency TFIDF [34] used to measure AVW

119894119896and

FW119895119896The FW

119895119896in (8) is calculated likewise AVW

119894119896is based

on (9) and (10)Action value weight (119860119881119882

119894119896) is a measure to calculate

the weight of the action value that is the scalar product ofaction value frequency and inverse document frequency as in(9)The action value frequency (119865) measures the importanceof this value in a document Inverse document frequency(IDF) measures the general importance of this action valuein a corpus of documents

AVW119894119896

=119865119894119896

sum119896119865119894119896

lowast IDF119894 (9)

where 119865119894119896

represents the number of this action value 119894

cocontributions in document 119889119896 normalized by the number

of cocontributions of all MFs in document 119889119896 normalized to

prevent a bias towards longer documents IDF is performedby dividing the number of all documents by the number ofdocuments containing this action value defined as follows

IDF119894= log( |119863|

1003816100381610038161003816119889119896 V119894 isin 1198891198961003816100381610038161003816

) (10)

where |119863| is the total number of documents in the corpusand |119889

119896 V119894isin 119889119896| is the number of documents containing

action value 119881119894

Thus in the IOAC-QL algorithm (see Algorithm 4)implementation shown in Figure 4 the SALA selects theoptimal action with the higher reward Hence each SALAretains effective actions in the FAss list object of each FVO Inthis list each object hasRel-Type object andRel vertex objectTheRel-Type object value equals the action type object valueand Rel vertex object value equals the related vertex indexThe SALA then links the related FVOs with the bidirectionaleffect using the SALA links algorithm and sets the FLW valueof the linkage weight in the adjacency matrix with the rewardof the selected action

An example of monitoring the SALA decision throughthe practical implementation of the IOAC-QL algorithmactions is illustrated in Figure 4

For state FVO1 one has the following

action similarity ActionResult State FVO1FVO7

FVO9FVO22FVO75FVO92FVO112

FVO125

Prob 8reward 184 PrevState FVO

1 119876(119904 119886) = 91

action contiguity ActionResult State FVO103

FVO17

FVO44 Prob 3 reward 74 PrevState FVO

1 119876(119904 119886) =

61action contrast ActionResult State FVO

89 Prob 1

reward 22 PrevState FVO1 119876(119904 119886) = 329

action causality ActionResult State FVO63

FVO249

Prob 2 reward 25 PrevState FVO

1 119876(119904 119886) = 429

As revealed in the case FVO1 the SALA receives the

environmental information data FVO1as a state and then

the SALA starts the implementation to select the actionwith the highest expected future reward among the otherFVOs The result of each action is illustrated the action with

The Scientific World Journal 11

S0

S0

S1

S1S2

a0 a0a0a4 a4 a4

DSO DSO

Sm

Sm

Object model construction

FVO

Vertex [V] Adjacent [V V]

FLWFval

SFKeyFsyn[]

FAss[]FW

The feature linkage graph (FLG)

Descriptive document object

Descriptive sentence object

Action selection

Q-learning mechanism

Assign Q-values with FLG environment

Reinforcement learning model for semanticrelational graph construction

SALA inference engine

DSO DSO DSO DSO DSO

DDO

Sem

antic

actio

nabl

e lea

rnin

g ag

ent (

SALA

)

F1

F1

F1

F2

F3

F4

F2

F2 F3 F4 middot middot middot Fn

DSO

Fn

Fn

middot middot middot

middot middot middot

middot middot middot

middot middot middot middot middot middotmiddot middot middot

middot middot middot

middot middot middotmiddot middot middot

middot middot middot

middot middot middot

middot middot middot

Q(S0 a0) Q(S0 a4)Q(S1 a0) Q(S1 a4)

Q(Sm a0) Q(Sm a4)

WordNet

Figure 4 The IOAC-QL algorithm implementation

the highest reward is selected and the elected action isexecutedTheSALAdecision is accomplished in FLGwith theeffective extracted relationships according to 119876(119904 119886) value asshown in Table 1

33 Document Linkage Conducting Stage This section pre-sents an efficient linkage process at the corpus level for manydocuments Due to the SHGRS scheme of each documentmeasuring the commonality between the FLG of the doc-uments through finding all distinct common similarity orrelatedness subgraphs is required The objective of findingcommon similar or related subgraphs as a relatedness mea-sure is to be utilized in document mining and managingprocesses such as clustering classification and retrieval Itis essential to establish that the measure is metric A metricmeasure ascertains the order of objects in a set and henceoperations such as comparing searching and indexing of theobjects can be performed

Graph relatedness [35 36] involves determining thedegree of similarity between these two graphs (a numberbetween 0 and 1) so the same node in both graphs wouldbe similar if its related nodes were similar The selectionof subgraphs for matching depends on detecting the moreimportant nodes in FLG using FVOFw value The vertexwith the highest weight is the first node selected which thendetects all related vertices from an adjacency matrix as asubgraphThis process depends on computing the maximumcommon subgraphsThe inclusion of all common relatednesssubgraphs is accomplished through recursive iterations andholding onto the value of the relatedness index of thematched

subgraphs Eliminating the overlap is performed using a dy-namic programming technique that keeps track of the nodesvisited and their attribute relatedness values

The function relatedness subgraphs utilize an array Rel-Nodes[] of the related vertices for each document Thefunction SimRel(119909

119894 119910119894) computes the similarity between the

two nodes 119909119894 119910119894as defined in (11) The subgraph matching

process is measured using the following formula

Sub-graph-Rel (SG1 SG2)

=sum

SG1119894

sumSG2119895

SemRel (119873119894 119873119895)

sumSG1119894

119873119894+ sum

SG2119895

119873119895

(11)

where SG1 is the selected subgraph from the FLG of onedocument SG2 is the selected subgraph from the FLG ofanother document and SemRel(119873

119894 119873119895) is the degree of

matching for both nodes 119873119894and 119873

119895 A threshold value for

node matching is also utilized the threshold is set to a 50match

4 Evaluation of the Experiments and Results

This evaluation is especially vital as the aim of building theSHGRS is to use them efficiently and effectively for the fur-ther mining process To explore the effectiveness of the pro-posed SHGRS scheme examining the correlation of the pro-posed semantic relatedness measure compared with otherprevious relatedness measures is required The impact ofthe proposed MFsSR for detecting the closest synonym is

12 The Scientific World Journal

Table 1 Action values table with the SALA decision

State FVO1 FVO1 FVO1 FVO1

Action Similarity Contiguity Contrast CausalityActionResult State FVO1 FVO7 FVO9 FVO22 FVO75 FVO92 FVO112 FVO125 FVO103 FVO17 FVO44 FVO89 FVO63 FVO249

Total reward 184 74 22 25119876(119904 119886) 94 61 329 429Decision Yes No No No

studied and compared to the PMI [25 37] and independentcomponent analysis (ICA) [38] for detecting the best near-synonym The difference between the proposed discoveringsemantic relationships or associations IOAC-QL implicitrelation extraction conditional random fields (CRFs) [6 39]and term frequency and inverse cluster frequency (TFICF)[40] is computed

A case study of amanaging andmining task (ie semanticclustering of text documents) is consideredThis study showshow the proposed approach is workable and how it canbe applied to mining and managing tasks Experimentallysignificant improvements in all performances are achieved inthe document clustering processThis approach outperformsconventional word-based [16] and concept-based [17 41]clusteringmethodsMoreover performance tends to improveas more semantic analysis is achieved in the representation

41 Evaluation Measures Evaluation measures are subcat-egorized into text quality-based evaluation content-basedevaluation coselection-based evaluation and task-basedevaluationsThefirst category of evaluationmeasures is basedon text quality using aspects such as grammaticality nonre-dundancy referential clarity and coherence For content-based evaluations measures such as similarity semanticrelatedness longest common subsequence and other scoresare used The third category is based on coselection evalua-tion using precision recall and 119865-measure values The lastcategory task-based evaluation is based on the performanceof accomplishing the given task Some of these tasks areinformation retrieval question answering document classi-fication document summarizing and document clusteringmethods In this paper the content-based the coselection-based and the task-based evaluations are used to validate theimplementation of the proposed scheme over the real corpusof documents

411 Measure for Content-Based Evaluation The relatednessor similarity measures are inherited from probability theoryand known as the correlation coefficient [42]The correlationcoefficient is one of the most widely used measures todescribe the relatedness 119903 between two vectors119883 and 119884

Correlation Coefficient 119903 The correlation coefficient is a rela-tively efficient relatedness measure which is a symmetricalmeasure of the linear dependence between two randomvariables

Table 2 Contingency table presents human judgment and systemjudgment

Human judgmentTrue False

System judgmentTrue TP FPFalse FN TN

Therefore the correlation coefficient can be consideredas the coefficient for the linear relationship between corre-sponding values of 119883 and 119884 The correlation coefficient 119903between sequences 119883 = 119909

119894 119894 = 1 119899 and 119884 = 119910

119894

119894 = 1 119899 is defined by

119903 =sum119899

119894=1119883119894119884119894

radic(sum119899

119894=11198832119894) lowast (sum

119899

119894=11198842119894)

(12)

Acceptance Rate (AR) Acceptance rate is a proportion ofcorrectly predicted similar or related sentences compared toall related sentences High acceptance ratemeans recognizingalmost all similar or related sentences

AR =TP

(TP + FN) (13)

Accuracy (Acc) Accuracy is a proportion of all correctlypredicted sentences compared to all sentences

Acc = TP + TN(TP + FN + FP + TN)

(14)

where TP TN FP and FN stand for true positive (the num-ber of pairs correctly labeled as similar) true negative (thenumber of pairs correctly labeled as dissimilar) false positive(the number of pairs incorrectly labeled as similar) andfalse negative (the number of pairs incorrectly labeled asdissimilar) as shown in Table 2

412 Measure for Coselection-Based Evaluation For all thedomains a precision 119875 recall 119877 and 119865-measure are utilizedas the measures of performance in the coselection-basedevaluation These measures may be defined via computingthe correlation between the extracted correct and wrongclosest synonyms or semantic relationshipsdependencies

The Scientific World Journal 13

Table 3 Datasets details for the experimental setup

Dataset Experimenting Dataset Name Description

DS1 Experiment 1content-basedevaluation

Miller and Charles (MC)1 RG consists of 65 pairs of nouns extracted from the WordNet rated bymultiple human annotators

DS2Microsoft ResearchParaphrase Corpus

(MRPC)2

The corpus consists of 5801 sentence pairs collected from newswire articles3900 of which were labeled as relatedness by human annotators The wholeset is divided into a training subset (4076 sentences of which 2753 are true)and a test subset (1725 pairs of which 1147 are true)

DS3 Experiment 2coselection-basedevaluation(closest-synonymdetection)

British National Corpus(BNC)3

BNC is a carefully selected collection of 4124 contemporary written andspoken English texts contains 100-million-word text corpus of samples ofwritten and spoken English with the near-synonym collocations

DS4 SN (semantic neighbors)4SN relates 462 target terms (nouns) to 5910 relatum terms with 14682semantic relations (7341 are meaningful and 7341 are random) The SNcontains synonyms coming from three sources WordNet 30 Rogetrsquosthesaurus and a synonyms database

DS5

Experiment 2coselection-basedevaluation (semanticrelationshipsexploration)

BLESS6BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625relatum terms with 26554 semantic relations (14440 are meaningful (correct)and 12154 are random) Every relation has one of the following typeshypernymy cohypernymy meronymy attribute event or random

DS6 TREC5

TREC includes 1437 sentences annotated with entities and relations at leastone relation There are three types of entities person (1685) location (1968)and organization (978) in addition there is a fourth type others (705) whichindicates that the candidate entity is none of the three types There are fivetypes of relations located in (406) indicates that one location is located insideanother location work for (394) indicates that a person works for anorganization OrgBased in (451) indicates that an organization is based in alocation live in (521) indicates that a person lives at a location and kill (268)indicates that a person killed another person There are 17007 pairs of entitiesthat are not related by any of the five relations and hence have the NR relationbetween them which thus significantly outnumbers other relations

DS7 IJCNLP 2011-New YorkTimes (NYT)6

NYT contains 150 business articles from NYT There are 536 instances (208positive 328 negative) with 140 distinct descriptors in NYT dataset

DS8 IJCNLP 2011-Wikipedia8Wikipedia personalsocial relation dataset was previously used in Culotta etal [6] There are 700 instances (122 positive 578 negative) with 70 distinctdescriptors in Wikipedia dataset

DS9 Experiment 3task-basedevaluation

Reuters 215787 Reuters-21578 contains 21578 documents (12902 are used) categorized to 10categories

DS10 20 Newsgroups8 20 newsgroups dataset contains 20000 documents (18846 are used)categorized to 20 categories

1Available at httpwwwcscmuedusimmfaruquisuitehtml2Available at httpresearchmicrosoftcomen-usdownloads3Available at httpcorpusbyuedubnc4Available at httpcentalfltruclacbeteamsimpanchenkosre-evalsncsv5Available at httpl2rcsuiucedusimcogcompDataERconll04corp6Available at httpwwwmysmuedufacultyjingjiangdataIJCNLP2011zip7Available at httpmlrcsumassedumldatasetsReuters-21578+Text+Categorization+Collection8Available at httpwwwcsminingorgindexphpid-20-newsgroupshtml

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected ones, and let FN be the number of correct ones that were not detected in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and worst value at 0. Precision, recall, and F-measure are defined as follows:

\[
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad
F\text{-measure} = \frac{2\,(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}}. \tag{15}
\]
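For concreteness, the coselection metrics in (15) can be computed directly from the sets of system-detected and gold-standard items. The following minimal sketch is ours, not the paper's code; it assumes both collections are given as Python sets of comparable items:

```python
def precision_recall_f_measure(detected, gold):
    """Coselection metrics of equation (15).

    detected: set of items (closest synonyms or explored relationships)
              produced by a system.
    gold:     set of items judged correct by human annotators.
    """
    tp = len(detected & gold)   # correctly detected
    fp = len(detected - gold)   # incorrectly detected
    fn = len(gold - detected)   # correct but not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return precision, recall, f_measure
```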


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for the task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as follows [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other; also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). These algorithms are agglomerative clustering (hierarchical), K-means (flat, hard), and the EM algorithm (flat, soft).
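Because these algorithms need only pairwise (dis)similarities, HAC in particular can be run directly on a precomputed document relatedness matrix. The sketch below is illustrative rather than the paper's implementation (the matrix values are invented); it converts relatedness to distance and clusters with SciPy's standard hierarchical-clustering routines:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric relatedness matrix for 5 documents, values in [0, 1],
# e.g., as produced by a document linkage step.
rel = np.array([
    [1.0, 0.9, 0.2, 0.1, 0.3],
    [0.9, 1.0, 0.3, 0.2, 0.2],
    [0.2, 0.3, 1.0, 0.8, 0.7],
    [0.1, 0.2, 0.8, 1.0, 0.9],
    [0.3, 0.2, 0.7, 0.9, 1.0],
])

dist = 1.0 - rel                     # turn relatedness into dissimilarity
np.fill_diagonal(dist, 0.0)          # a valid distance matrix has a zero diagonal
condensed = squareform(dist)         # condensed form expected by linkage()
Z = linkage(condensed, method="average")          # agglomerative merge tree (HAC)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)                        # e.g., [1 1 2 2 2]
```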

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used as the quality measure, combining the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[
P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r}, \tag{16}
\]

where N_ri is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

\[
F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}. \tag{17}
\]

A per-class F score is calculated as follows:

\[
F(L_r) = \max_{S_i \in C} F(L_r, S_i). \tag{18}
\]

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, where C is the set of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

\[
F_C = \frac{\sum_{L_r} \left( |N_r| \cdot F(L_r) \right)}{\sum_{L_r} |N_r|}, \tag{19}
\]

where |N_r| is the number of objects in class L_r. The higher the overall F-measure, the better the clustering, owing to the higher accuracy of the clusters mapping to the original classes.
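Equations (16)-(19) translate directly into code. The sketch below (function and variable names are ours) computes the overall F-measure F_C from two parallel lists holding, for each document, its true class label and its assigned cluster id:

```python
from collections import Counter

def overall_f_measure(classes, clusters):
    """Overall clustering F-measure F_C of equations (16)-(19)."""
    n_ri = Counter(zip(classes, clusters))   # docs of class r inside cluster i
    n_r = Counter(classes)                   # docs per class
    n_i = Counter(clusters)                  # docs per cluster

    weighted = 0.0
    for r in n_r:
        best = 0.0                           # per-class score F(L_r), eq. (18)
        for i in n_i:
            p = n_ri[(r, i)] / n_i[i]        # precision P(L_r, S_i), eq. (16)
            rec = n_ri[(r, i)] / n_r[r]      # recall R(L_r, S_i), eq. (16)
            if p + rec > 0:
                best = max(best, 2 * p * rec / (p + rec))   # eq. (17)
        weighted += n_r[r] * best            # weight by class size, eq. (19)
    return weighted / len(classes)

print(overall_f_measure(["a", "a", "b", "b"], [1, 1, 2, 1]))  # 0.733...
```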

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the proposed scheme over real corpora of documents, and the results are very promising. The experimental setup consists of ten datasets of textual documents, as detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 Newsgroups datasets.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation at the sentence level in experiment 1, the document level in experiment 2, and the corpus level in experiment 3. In experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared with previous studies, based on the availability of human judgments of semantic relatedness. In experiment 2, DS3 and DS4 are used to investigate the effectiveness of closest-synonym detection using the MFsSR compared with PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared with implicit relation extraction using CRFs [6, 39] and TFICF [40], based on the availability of semantic relation judgments. In experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process for the clustering case study compared with previous word-based [16] and concept-based [17] approaches, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct relevance and indirect relevance are included in the process of measuring the semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationships exploration is proportionally improved, and a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as more semantic information is used to improve the accuracy of measuring the relatedness between documents in SHGRS.


Table 4: The results of the comparison of the correlation coefficient between human judgment and some relatedness measures, including the proposed semantic relatedness measure.

  Measure                                Correlation with M&C
  Distance-based measures
    Rada                                 0.688
    Wu and Palmer                        0.765
  Information-based measures
    Resnik                               0.77
    Jiang and Conrath                    0.848
    Lin                                  0.853
  Hybrid measures
    T. Hong and D. Smith                 0.879
    Zili Zhou                            0.882
  Information/feature-based measures
    The proposed MFsSR                   0.937

The relatedness subgraph measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment evaluates the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness measure are comparable with other previous studies in the same field. In this experiment, the correlation between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3) is computed. Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between eight componential approaches and the Miller and Charles mean ratings. The semantic relatedness for the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts, and the indirect relationships between the attributes of each concept, when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous result. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed below were based on the best Acc and AR across all thresholds, instead of under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonyms and the relationships of sentence MFs.
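As a hedged illustration of this threshold sweep (the scores and labels below are invented placeholders for scored MRPC pairs, not the paper's data), Acc and AR at each threshold follow directly from their definitions in Section 4.1.1:

```python
import numpy as np

def acc_ar(scores, labels, threshold):
    """Accuracy and acceptance rate at one relatedness threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)   # proportion of correct predictions
    ar = tp / (tp + fn)                     # proportion of related pairs found
    return acc, ar

scores = np.array([0.91, 0.45, 0.78, 0.30, 0.66])   # hypothetical relatedness
labels = np.array([1, 0, 1, 0, 1])                  # hypothetical human judgments
for t in np.arange(0.1, 1.01, 0.1):
    acc, ar = acc_ar(scores, labels, t)
    print(f"threshold={t:.1f}  Acc={acc:.2f}  AR={ar:.2f}")
```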

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment studies the impact of the MFsSR on inferring unknown dependencies and relationships among the MFs. This impact is realized through the closest-synonym detection and semantic relationship exploration algorithms, so the results are comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is also examined. This experiment measures the performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationship exploration. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parses from the OpenNLP parser and AlchemyAPI. According to the preliminary study, the accuracy of the parsing tools clearly affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without


Table 5: The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islam and Inkpen [7].

Training subset (4076 pairs; human judgment: 2753 true):

  Threshold   Islam and Inkpen [7]   The proposed MFsSR
              Acc     AR             Acc     AR
  0.1         0.67    1.00           0.68    1.00
  0.2         0.67    1.00           0.68    1.00
  0.3         0.67    1.00           0.68    1.00
  0.4         0.67    1.00           0.68    1.00
  0.5         0.69    0.98           0.68    1.00
  0.6         0.72    0.89           0.68    1.00
  0.7         0.68    0.78           0.70    0.98
  0.8         0.56    0.40           0.72    0.86
  0.9         0.37    0.09           0.60    0.49
  1.0         0.33    0.00           0.34    0.02

Test subset (1725 pairs; human judgment: 1147 true):

  Threshold   Islam and Inkpen [7]   The proposed MFsSR
              Acc     AR             Acc     AR
  0.1         0.66    1.00           0.67    1.00
  0.2         0.66    1.00           0.67    1.00
  0.3         0.66    1.00           0.67    1.00
  0.4         0.66    1.00           0.67    1.00
  0.5         0.68    0.98           0.67    1.00
  0.6         0.72    0.89           0.66    1.00
  0.7         0.68    0.78           0.66    0.98
  0.8         0.56    0.40           0.64    0.85
  0.9         0.38    0.09           0.49    0.50
  1.0         0.33    0.00           0.34    0.03

disturbing the precision. Thus, the MFsSR between two concepts considers the information content along with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are also clearly achieved, with a high percentage, in different datasets and domains.
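As a simplified illustration of the score-thresholding step only (one ingredient of C-SynD; the scores and the threshold value here are invented), candidate synonyms of an MF can be filtered and ranked by their MFsSR scores:

```python
def closest_synonyms(candidates, threshold=0.7):
    """Keep and rank candidate synonyms whose MFsSR score passes a threshold.

    candidates: dict mapping candidate synonym -> MFsSR score with the target MF.
    """
    kept = {w: s for w, s in candidates.items() if s >= threshold}
    return sorted(kept, key=kept.get, reverse=True)   # closest synonym first

print(closest_synonyms({"vehicle": 0.83, "carriage": 0.71, "boat": 0.42}))
# ['vehicle', 'carriage']
```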

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonyms. In addition, the IOAC-QL algorithm, the CRFs, and the TFICF approaches are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. This is because IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.
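The selection mechanism described above is standard tabular Q-learning [33]; the paper's own IOAC-QL operates over the feature linkage graph with AVW-derived rewards. The toy sketch below is only an assumption-laden stand-in: integer ids replace MF states, and a random reward replaces the AVW-based interdependence gain.

```python
import random

ACTIONS = ["similarity", "contiguity", "contrast", "causality"]  # relation types
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1    # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}   # Q-table over (MF, action)

def reward(state, action):
    return random.random()   # invented stand-in for the AVW interdependence gain

def step(state):
    if random.random() < EPSILON:                        # epsilon-greedy selection
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    r = reward(state, action)
    nxt = (state + 1) % 5                                # toy transition to next MF
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    # Standard Q-learning update [33]: Q(s,a) += alpha*(r + gamma*max Q(s',.) - Q(s,a))
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    return nxt

s = 0
for _ in range(1000):
    s = step(s)
print(max(ACTIONS, key=lambda a: Q[(0, a)]))   # chosen relationship for MF 0
```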

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the


Table 6: The results of the C-SynD algorithm based on the MFsSR compared to PMI and ICA for detecting the closest synonym (%).

  Dataset   PMI                       ICA                       C-SynD
            Precision Recall F        Precision Recall F        Precision Recall F
  SN        60.6      60.6   61       79.5      71.6   75.3     85        80     82
  BNC       74.5      67.9   71       76        67.8   72       82        71.9   77

Table 7: The results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches (%).

  Dataset     CRFs                      TFICF                     IOAC-QL
              Precision Recall F        Precision Recall F        Precision Recall F
  TREC        75.08     60.2   66.28    89.3      71.4   78.7     89.8      88.1   88.6
  BLESS       73.04     62.66  67.03    73.8      69.5   71.6     95.0      83.5   88.9
  NYT         68.46     54.02  60.38    86.0      65.0   74.0     90.0      74.0   81.2
  Wikipedia   56.0      42.0   48.0     64.6      54.88  59.34    70.0      44.0   54.0

Table 8: The results of the graph-based approach compared to the term-based and concept-based approaches using K-means and the HAC clustering algorithms (FC values, %).

  Dataset          Algorithm   Term-based   Concept-based   Graph-based
  Reuters-21578    HAC         73           85.7            90.2
  Reuters-21578    K-means     51           84.6            91.8
  20 Newsgroups    HAC         53           82              89
  20 Newsgroups    K-means     48           79.3            87

proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis. Experimental results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similar common subgraphs. The semantic representations of textual documents, and the relatedness determined between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MFs-based analysis and subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MFs-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of the clusters, to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 Newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in the clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach reach approximately 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and 36% by HAC and 39% by K-means for the 20 Newsgroups dataset, over the base case (term-based) approach. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach reach approximately 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and 7% by HAC and 8% by K-means for the 20 Newsgroups dataset.
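To give the flavor of the document linkage computation, the sketch below implements a normalized node-pair relatedness sum in the spirit of the subgraph relatedness measure of Section 3.3, including its 50% node-match threshold; the helper names and the toy relatedness function are ours, not the paper's:

```python
def subgraph_relatedness(sg1, sg2, sem_rel, node_threshold=0.5):
    """Relatedness of two feature subgraphs: the sum of node-pair relatedness
    values above the match threshold, normalized by both subgraph sizes."""
    total = 0.0
    for n1 in sg1:
        for n2 in sg2:
            r = sem_rel(n1, n2)
            if r >= node_threshold:   # 50% node-match threshold
                total += r
    return total / (len(sg1) + len(sg2))

# Toy usage with exact-match relatedness over word nodes.
rel = subgraph_relatedness(["bank", "loan"], ["loan", "credit"],
                           lambda a, b: 1.0 if a == b else 0.0)
print(rel)   # 0.25
```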


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, with extensive experiments conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu.

[13] OpenNLP, http://opennlp.apache.org.

[14] AlchemyAPI, http://www.alchemyapi.com.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.



proposed semantic representation scheme in document min-ing and managing operations such as gathering filteringsearching retrieving extracting classifying clustering andsummarizing documents Many sets of experiments usingthe proposed scheme on different datasets in text clusteringhave been conducted The experiments demonstrate theextensive comparisons between the MF-based analysis andthe traditional analysis Experimental results demonstratethe substantial enhancement of the document clusteringquality In this experiment the application of the semanticunderstanding of the textual documents is presented Textualdocuments are parsed to produce the rich syntactic andsemantic representations of SHGRS Based on these semanticrepresentations document linkages are estimated using thesubgraph matching process that is based on finding alldistinct similarity common subgraphs The semantic repre-sentations of textual documents determining the relatednessbetween themare evaluated by allowing clustering algorithmsto use the produced relatedness matrix of the documentset To test the effectiveness of subgraph matching in deter-mining an accurate measure of the relatedness betweendocuments an experiment using the MFs-based analysisand subgraphs relatedness measures was conducted Twostandard document clustering techniques were chosen fortesting the effect of the document linkages on clustering [43](1) HAC and (2) 119870-means The MF-based weighting is themain factor that captures the importance of a concept in

a document Furthermore document linkages using the sub-graph relatedness measure are among the main factors thataffect clustering techniques Thus to study the effect of theMFs-based weighting and the subgraph relatedness measure(graph-based approach) on the clustering techniques thisexperiment is repeated for the different clustering techniquesas shown in Table 8 Basically the aim is to maximize theweighted average of the 119865-measure for each class (FC) ofclusters to achieve high-quality clustering

Table 8 shows that the graph-based approach achieveshigher performance than term-based and concept-basedapproachesThis comparison was conducted for two corporaof text documents that are the Reuters-21578 dataset and the20 newsgroups dataset (DS9 and DS10 in Table 3)The resultsshown in Table 8 illustrate improvement in the clusteringquality when more attributes are included in graph-basedrelatedness measure The improvements were achieved at thegraph-based approach up to 17 by HAC and 40 by 119870-means for Reuters-21578 dataset while up to 36 byHAC and39 by 119870-means for approximately 20 newsgroup datasetsfrom the base case (term-based) approach Furthermorethe improvements also were achieved at the graph-basedapproach up to 5 by HAC and 7 by119870-Means for Reuters-21578 dataset while up to 7 byHAC and 8 by119870-Means forapproximately 20 Newsgroup datasets from Concept-basedapproach

18 The Scientific World Journal

5 Conclusions

This paper proposed the SHGRS to improve the effectivenessof themining andmanaging operations in textual documentsSpecifically a three-stage approach was proposed for con-structing the scheme First the MFs with their attributesare extracted and hierarchy-based structure is built to rep-resent sentences by the MFs and their attributes Second anovel semantic relatedness computing method was proposedfor inferring relationships and dependencies of MFs andthe relationship between MFs is represented by a graph-based structure Third a subgraph matching algorithm wasconducted for document linkage process The documentsclustering is used as a case study and conducted extensiveexperiments on 10 different datasets and the results showedthat the scheme can result in a better mining performancecompared with traditional approach Future work will focuson conducting other case studies to processes such asgathering filtering retrieving classifying and summarizinginformation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] I T Fatudimu A G Musa C K Ayo and A B SofoluweldquoKnowledge discovery in online repositories a text miningapproachrdquo European Journal of Scientific Research vol 22 no2 pp 241ndash250 2008

[2] A Hotho A Nurnberger and G Paaszlig ldquoA brief survey of textminingrdquo Journal for Computational Linguistics and LanguageTechnology vol 20 2005

[3] M Bernotas K Karklius R Laurutis and A Slotkiene ldquoThepeculiarities of the text document representation using ontol-ogy and tagging-based clustering techniquerdquo Journal of Infor-mation Technology and Control vol 36 pp 217ndash220 2007

[4] Y Li Z A Bandar andDMcLean ldquoAn approach formeasuringsemantic similarity between words using multiple informationsourcesrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 15 no 4 pp 871ndash882 2003

[5] K Shaban O Basir and M Kamel ldquoDocument mining basedon semantic understanding of textrdquo in Progress in Pattern Rec-ognition Image Analysis and Applications vol 4225 of LectureNotes in Computer Science pp 834ndash843 Springer BerlinGermany 2006

[6] A Culotta A McCallum and J Betz ldquoIntegrating probabilisticextraction models and data mining to discover relations andpatterns in textrdquo in Proceedings of theHuman Language Technol-ogy ConferencemdashNorth American Chapter of the Association forComputational Linguistics Annual Meeting (HLT-NAACL rsquo06)pp 296ndash303 June 2006

[7] A Islam and D Inkpen ldquoSemantic text similarity using corpus-based word similarity and string similarityrdquo ACM Transactionson Knowledge Discovery fromData vol 2 no 2 article 10 2008

[8] R Feldman and J SangerTheTextMining Handbook AdvancedApproaches in Analyzing Unstructured Data Cambridge Uni-versity Press 2006

[9] R Navigli ldquoWord sense disambiguation a surveyrdquo ACM Com-puting Surveys vol 41 no 2 article 10 2009

[10] W Mao andWW Chu ldquoThe phrase-based vector space modelfor automatic retrieval of free-text medical documentsrdquo Dataand Knowledge Engineering vol 61 no 1 pp 76ndash92 2007

[11] C Siefkes and P Siniakov ldquoAn overview and classification ofadaptive approaches to information extractionrdquo in Journal onData Semantics IV vol 3730 of Lecture Notes in ComputerScience pp 172ndash212 Springer New York NY USA 2005

[12] httpwordnetprincetonedu[13] OpenNLP httpopennlpapacheorg[14] httpwwwalchemyapicom[15] W V Siricharoen ldquoOntologies and object models in object ori-

ented software engineeringrdquo International Association of Engi-neers (IAENG) Journal of Computer Science (JICS) vol 33 no 1pp 1151ndash1162 2006

[16] S Shehata ldquoA WordNet-based semantic model for enhancingtext clusteringrdquo in Proceedings of the IEEE International Con-ference on Data Mining Workshops (ICDMW rsquo09) pp 477ndash482Miami Fla USA December 2009

[17] S Shehata F Karray andM Kamel ldquoAn efficient concept-basedminingmodel for enhancing text clusteringrdquo IEEE Transactionson Knowledge and Data Engineering vol 22 no 10 pp 1360ndash1371 2010

[18] E Zemmouri H Behja A Marzak and B Trousse ldquoOntology-based knowledge model for multi-view KDD processrdquo Interna-tional Journal of Mobile Computing and Multimedia Communi-cations vol 4 no 3 pp 21ndash33 2012

[19] C Marinica and F Guillet ldquoKnowledge-based interactive post-mining of association rules using ontologiesrdquo IEEETransactionson Knowledge and Data Engineering vol 22 no 6 pp 784ndash7972010

[20] B Choudhary and P Bhattacharyya ldquoText clustering usingUni-versal Networking Language representationrdquo in Proceedings ofthe 11th International World Wide Web Conference 2003

[21] G Pirro and J Euzenat ldquoA feature and information theoreticframework for semantic similarity and relatednessrdquo in Proceed-ings of the 9th International Semantic Web Conference on theSemantic Web (ISWC rsquo10) vol 6496 pp 615ndash630 2010

[22] Z Lin M R Lyu and I King ldquoMatchSim a novel similaritymeasure based on maximum neighborhood matchingrdquo Knowl-edge and Information Systems vol 32 no 1 pp 141ndash166 2012

[23] D Lin ldquoAn information-theoretic definition of similarityrdquo inProceedings of the 15th International Conference on MachineLearning Madison Wis USA July 1998

[24] D Inkpen ldquoA statistical model for near-synonym choicerdquo ACMTransactions on Speech and Language Processing vol 4 no 1article 2 2007

[25] LHan T Finin PMcNameeA Joshi andYYesha ldquoImprovingword similarity by augmenting PMIwith estimates of word pol-ysemyrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1307ndash1322 2013

[26] J OrsquoShea Z Bandar K Crockett and D McLean ldquoA compar-ative study of two short text semantic similarity measuresrdquo inAgent and Multi-Agent Systems Technologies and Applicationsvol 4953 of Lecture Notes in Artificial Intelligence pp 172ndash181Springer Berlin Germany 2008

[27] J Oliva J I SerranoMDDel Castillo and A Iglesias ldquoSyMSSa syntax-basedmeasure for short-text semantic similarityrdquoDataand Knowledge Engineering vol 70 no 4 pp 390ndash405 2011

The Scientific World Journal 19

[28] I Yoo and X Hu ldquoClustering large collection of biomedicalliterature based on ontology-enriched bipartite graph represen-tation andmutual refinement strategyrdquo inProceedings of the 10thPAKDD pp 9ndash12 2006

[29] S-T Yuan and Y-C Chen ldquoSemantic ideation learning foragent-based e-brainstormingrdquo IEEE Transactions on Knowledgeand Data Engineering vol 20 no 2 pp 261ndash275 2008

[30] D Liu Z Liu and Q Dong ldquoA dependency grammar andWordNet based sentence similarity measurerdquo Journal of Com-putational Information Systems vol 8 no 3 pp 1027ndash1035 2012

[31] A Zouaq M Gagnon and O Benoit ldquoSemantic analysis usingdependency-based grammars and upper-level ontologiesrdquoInternational Journal of Computational Linguistics and Applica-tions vol 1 no 1-2 pp 85ndash101 2010

[32] C Ramakrishnan K J Kochut and A P Sheth ldquoA frameworkfor schema-driven relationship discovery from unstructuredtextrdquo in The Semantic WebmdashISWC 2006 vol 4273 of LectureNotes in Computer Science pp 583ndash596 Springer BerlinGermany 2006

[33] C J C H Watkins and P Dayan ldquoTechnical note Q-learningrdquoMachine Learning vol 8 no 3-4 pp 279ndash292 1992

[34] M Lan C L Tan J Su and Y Lu ldquoSupervised and traditionalterm weighting methods for automatic text categorizationrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 31 no 4 pp 721ndash735 2009

[35] L Zhang C Li J Liu and HWang ldquoGraph-based text similar-itymeasurement by exploiting wikipedia as background knowl-edgerdquo World Academy of Science Engineering and Technologyvol 59 pp 1548ndash1553 2011

[36] M Kuramochi and G Karypis ldquoAn efficient algorithm for dis-covering frequent subgraphsrdquo IEEE Transactions on Knowledgeand Data Engineering vol 16 no 9 pp 1038ndash1051 2004

[37] L-C Yu H-M Shih Y-L Lai J-F Yeh and C-H Wu ldquoDis-criminative training for near-synonym substitutionrdquo in Pro-ceedings of the 23rd International Conference on ComputationalLinguistics (COLING rsquo10) pp 1254ndash1262 2010

[38] L-C Yu and W-N Chien ldquoIndependent component analysisfor near-synonym choicerdquoDecision Support Systems vol 55 no1 pp 146ndash155 2013

[39] L Yao S Riedel and A Mccallum ldquoCollective cross-documentrelation extraction without labelled datardquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing pp 1013ndash1023MITCambridgeMassUSAOctober2010

[40] M Shen D-R Liu and Y-S Huang ldquoExtracting semantic rela-tions to enrich domain ontologiesrdquo Journal of Intelligent Infor-mation Systems vol 39 no 3 pp 749ndash761 2012

[41] H Chim and X Deng ldquoEfficient phrase-based documentsimilarity for clusteringrdquo IEEE Transactions on Knowledge andData Engineering vol 20 no 9 pp 1217ndash1229 2008

[42] GAMiller andWGCharles ldquoContextual correlates of seman-tic similarityrdquo Language and Cognitive Processes vol 6 no 1 pp1ndash28 1991

[43] M Steinbach G Karypis and V Kumar ldquoA comparison ofdocument clustering techniquesrdquo in Proceedings of the Knowl-edge Discovery and Data Mining (KDD) Workshop Text MiningAugust 2000

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

12 The Scientific World Journal

Table 1: Action values table with the SALA decision.

State                 FVO1         FVO1         FVO1       FVO1
Action                Similarity   Contiguity   Contrast   Causality
Action result states  FVO1, FVO7, FVO9, FVO22, FVO75, FVO92, FVO112, FVO125, FVO103, FVO17, FVO44, FVO89, FVO63, FVO249 (distributed across the four actions)
Total reward          184          74           22         25
Q(s, a)               94           61           329        429
Decision              Yes          No           No         No

studied and compared to the PMI [25, 37] and independent component analysis (ICA) [38] approaches for detecting the best near-synonym. The difference between the proposed discovery of semantic relationships or associations, IOAC-QL, and implicit relation extraction with conditional random fields (CRFs) [6, 39] and with term frequency and inverse cluster frequency (TFICF) [40] is computed.

A case study of a managing and mining task (i.e., semantic clustering of text documents) is considered. This study shows how the proposed approach is workable and how it can be applied to mining and managing tasks. Experimentally, significant improvements in all performances are achieved in the document clustering process. This approach outperforms conventional word-based [16] and concept-based [17, 41] clustering methods. Moreover, performance tends to improve as more semantic analysis is achieved in the representation.

4.1. Evaluation Measures. Evaluation measures are subcategorized into text quality-based evaluation, content-based evaluation, coselection-based evaluation, and task-based evaluation. The first category of evaluation measures is based on text quality, using aspects such as grammaticality, nonredundancy, referential clarity, and coherence. For content-based evaluations, measures such as similarity, semantic relatedness, longest common subsequence, and other scores are used. The third category is based on coselection evaluation, using precision, recall, and F-measure values. The last category, task-based evaluation, is based on the performance of accomplishing the given task. Some of these tasks are information retrieval, question answering, document classification, document summarizing, and document clustering methods. In this paper, the content-based, the coselection-based, and the task-based evaluations are used to validate the implementation of the proposed scheme over real corpora of documents.

4.1.1. Measure for Content-Based Evaluation. The relatedness or similarity measures are inherited from probability theory and known as the correlation coefficient [42]. The correlation coefficient is one of the most widely used measures to describe the relatedness r between two vectors X and Y.

Correlation Coefficient r. The correlation coefficient is a relatively efficient relatedness measure; it is a symmetrical measure of the linear dependence between two random variables.

Table 2: Contingency table of human judgment versus system judgment.

                         Human judgment
                         True    False
System judgment   True   TP      FP
                  False  FN      TN

Therefore, the correlation coefficient can be considered as the coefficient for the linear relationship between corresponding values of X and Y. The correlation coefficient r between sequences X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) is defined by

r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right) \cdot \left(\sum_{i=1}^{n} y_i^2\right)}}.  (12)

Acceptance Rate (AR). Acceptance rate is the proportion of correctly predicted similar or related sentences compared to all related sentences. A high acceptance rate means recognizing almost all similar or related sentences:

AR = \frac{TP}{TP + FN}.  (13)

Accuracy (Acc). Accuracy is the proportion of all correctly predicted sentences compared to all sentences:

Acc = \frac{TP + TN}{TP + FN + FP + TN},  (14)

where TP, TN, FP, and FN stand for true positive (the number of pairs correctly labeled as similar), true negative (the number of pairs correctly labeled as dissimilar), false positive (the number of pairs incorrectly labeled as similar), and false negative (the number of pairs incorrectly labeled as dissimilar), as shown in Table 2.
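To make these definitions concrete, the following minimal Python sketch (illustrative only, not the authors' evaluation code; all function and variable names are ours) computes the correlation coefficient of equation (12) and the Acc and AR of equations (13) and (14) from paired judgment lists:

```python
import math

def correlation(x, y):
    # Equation (12): r = sum(x_i * y_i) / sqrt(sum(x_i^2) * sum(y_i^2))
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = math.sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))
    return num / den

def acc_and_ar(system, human):
    # system, human: parallel lists of booleans (True = "related"),
    # yielding the TP/TN/FP/FN cells of Table 2
    tp = sum(s and h for s, h in zip(system, human))
    tn = sum(not s and not h for s, h in zip(system, human))
    fp = sum(s and not h for s, h in zip(system, human))
    fn = sum(not s and h for s, h in zip(system, human))
    acc = (tp + tn) / (tp + fn + fp + tn)  # Equation (14)
    ar = tp / (tp + fn)                    # Equation (13)
    return acc, ar
```

For example, correlation applied to a measure's ratings and the mean human ratings of a benchmark reproduces the kind of scores reported later in Table 4.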

4.1.2. Measure for Coselection-Based Evaluation. For all the domains, precision P, recall R, and F-measure are utilized as the measures of performance in the coselection-based evaluation. These measures may be defined via computing the correlation between the extracted correct and wrong closest synonyms or semantic relationships/dependencies.


Table 3: Dataset details for the experimental setup.

DS1 (Experiment 1, content-based evaluation): Miller and Charles (MC)^1. Consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2 (Experiment 1, content-based evaluation): Microsoft Research Paraphrase Corpus (MRPC)^2. The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3 (Experiment 2, coselection-based evaluation, closest-synonym detection): British National Corpus (BNC)^3. BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it contains a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4 (Experiment 2, coselection-based evaluation, closest-synonym detection): SN (semantic neighbors)^4. SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 are meaningful and 7341 are random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5 (Experiment 2, coselection-based evaluation, semantic relationships exploration): BLESS. BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 are meaningful (correct) and 12154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6 (Experiment 2, coselection-based evaluation, semantic relationships exploration): TREC^5. TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location, work for (394) indicates that a person works for an organization, OrgBased in (451) indicates that an organization is based in a location, live in (521) indicates that a person lives at a location, and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7 (Experiment 2, coselection-based evaluation, semantic relationships exploration): IJCNLP 2011 New York Times (NYT)^6. NYT contains 150 business articles from the NYT. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8 (Experiment 2, coselection-based evaluation, semantic relationships exploration): IJCNLP 2011 Wikipedia^6. The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9 (Experiment 3, task-based evaluation): Reuters-21578^7. Reuters-21578 contains 21578 documents (12902 are used) categorized into 10 categories.

DS10 (Experiment 3, task-based evaluation): 20 Newsgroups^8. The 20 newsgroups dataset contains 20000 documents (18846 are used) categorized into 20 categories.

^1 Available at http://www.cs.cmu.edu/~mfaruqui/suite.html
^2 Available at http://research.microsoft.com/en-us/downloads
^3 Available at http://corpus.byu.edu/bnc
^4 Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv
^5 Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp
^6 Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip
^7 Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
^8 Available at http://www.csmining.org/index.php/id-20-newsgroups.html

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected ones, and let FN be the number of correct but not detected ones in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where the F-measure reaches its best value at 1 and its worst value at 0. Precision, recall, and F-measure are defined as follows:

Recall = \frac{TP}{TP + FN},

Precision = \frac{TP}{TP + FP},

F\text{-measure} = \frac{2 (Recall \cdot Precision)}{Recall + Precision}.  (15)
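As a sketch (with hypothetical set names), these coselection metrics can be computed directly from the detected and gold-standard item sets:

```python
def coselection_scores(detected, gold):
    # detected, gold: sets of closest synonyms or of explored relationships
    tp = len(detected & gold)   # correctly detected
    fp = len(detected - gold)   # incorrectly detected
    fn = len(gold - detected)   # correct but not detected
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)  # Equation (15)
    return precision, recall, f_measure
```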


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for the task-based evaluation. The test is performed with standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as follows [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering), (b) hierarchical clustering (which creates a hierarchy of clusters), (c) hard clustering (which assigns each document/object as a member of exactly one cluster), and (d) soft clustering (which distributes the documents/objects over all clusters). The algorithms used here are agglomerative clustering (hierarchical), K-means (flat, hard), and the EM algorithm (flat, soft).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure is used as the quality measure, which combines the precision and recall measures. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad R(L_r, S_i) = \frac{N_{ri}}{N_r},  (16)

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}.  (17)

A per-class F score is calculated as follows:

F(L_r) = \max_{S_i \in C} F(L_r, S_i).  (18)

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the total number of clusters. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

F_C = \frac{\sum_{L_r} \left( |N_r| \cdot F(L_r) \right)}{\sum_{L_r} |N_r|},  (19)

where |N_r| is the number of objects in class L_r. The higher the overall F-measure is, the better the clustering is, due to the higher accuracy of the clusters mapping to the original classes.
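The following sketch (names and the contingency-table layout are illustrative assumptions) derives the overall F-measure of equations (16)-(19) from the class-by-cluster counts N_ri:

```python
def overall_f_measure(counts):
    # counts[r][i] = N_ri, the number of documents of class L_r in cluster S_i
    classes = list(counts)
    clusters = range(len(next(iter(counts.values()))))
    n_i = [sum(counts[r][i] for r in classes) for i in clusters]  # cluster sizes N_i
    weighted_sum, total = 0.0, 0
    for r in classes:
        n_r = sum(counts[r])  # class size N_r
        best = 0.0
        for i in clusters:
            if counts[r][i] == 0:
                continue
            p = counts[r][i] / n_i[i]   # Equation (16), precision
            rec = counts[r][i] / n_r    # Equation (16), recall
            best = max(best, 2 * p * rec / (p + rec))  # Equations (17) and (18)
        weighted_sum += n_r * best      # Equation (19), weighted by |N_r|
        total += n_r
    return weighted_sum / total
```

For instance, overall_f_measure({"L1": [40, 10], "L2": [5, 45]}) scores a two-class, two-cluster result at about 0.85.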

4.2. Evaluation Setup (Datasets). Content-based, coselection-based, and task-based evaluations are used to validate the experiments over real corpora of documents, and the results are very promising. The experimental setup consists of the datasets of textual documents detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) dataset and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 newsgroups datasets.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation at the sentence level in experiment 1, the document level in experiment 2, and the corpus level in experiment 3. In experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 are used to investigate the effectiveness of closest-synonym detection using the MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relation judgments. In experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process for the clustering case study against previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for accurate understanding of the underlying meaning. The proposed SHGRS is exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements obtained when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationships exploration improves proportionally as a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved as more attempts are made to improve the accuracy of measuring the relatedness between documents using the semantic information in the SHGRS.


Table 4: Comparison of the correlation coefficient with human judgment (M&C) for several relatedness measures and the proposed semantic relatedness measure.

Measure                                Relevance correlation with M&C
Distance-based measures
  Rada                                 0.688
  Wu and Palmer                        0.765
Information-based measures
  Resnik                               0.77
  Jiang and Conrath                    0.848
  Lin                                  0.853
Hybrid measures
  T. Hong and D. Smith                 0.879
  Zili Zhou                            0.882
Information/feature-based measures
  The proposed MFsSR                   0.937

The relatedness subgraph measure affects the performance of the output of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment shows the necessity of evaluating the performance of the proposed MFsSR on a benchmark dataset of human judgments, so the results of the proposed semantic relatedness measure would be comparable with other previous studies in the same field. This experiment computes the correlation between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against other semantic similarity approaches, namely, Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results of the compared approaches against the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level and to the document linkage process at the corpus level.

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure MFsSR and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed were based on the best Acc and AR across all thresholds rather than under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.
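A minimal sketch of this threshold sweep, assuming a relatedness function such as the MFsSR (the function and variable names are illustrative) and reusing the acc_and_ar helper shown earlier:

```python
def sweep_thresholds(pairs, human, relatedness):
    # pairs: list of (sentence1, sentence2); human: parallel gold binary labels
    scores = [relatedness(s1, s2) for s1, s2 in pairs]
    results = {}
    for k in range(11):                          # thresholds 0.0, 0.1, ..., 1.0
        t = k / 10
        system = [score >= t for score in scores]
        results[t] = acc_and_ar(system, human)   # (Acc, AR) at this threshold
    return results
```

The best Acc and AR are then read off across all thresholds, matching how the figures in Table 5 are compared.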

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment shows the necessity of studying the impact of the MFsSR for inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so the results would be comparable with other previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed discovery of semantic relationships or associations, IOAC-QL, the implicit relation extraction CRFs approach, and the TFICF approach is examined. This experiment attempts to measure the performance of the MFsSR for the closest-synonym detection algorithm and of the IOAC-QL for the semantic relationship exploration algorithm. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in recall without disturbing the precision. Thus, the MFsSR between two concepts considers the information content with most of the WordNet relationships.

Table 5: Comparison of the accuracy (Acc) and acceptance rate (AR) between the proposed semantic relatedness measure and Islam and Inkpen [7] on the MRPC dataset. Human judgment gives the number of true (TP + FN) pairs.

Training subset (4076 pairs, 2753 true):
Relatedness threshold   Islam and Inkpen [7]: Acc, AR   The proposed MFsSR: Acc, AR
0.1                     0.67, 1.00                      0.68, 1.00
0.2                     0.67, 1.00                      0.68, 1.00
0.3                     0.67, 1.00                      0.68, 1.00
0.4                     0.67, 1.00                      0.68, 1.00
0.5                     0.69, 0.98                      0.68, 1.00
0.6                     0.72, 0.89                      0.68, 1.00
0.7                     0.68, 0.78                      0.70, 0.98
0.8                     0.56, 0.40                      0.72, 0.86
0.9                     0.37, 0.09                      0.60, 0.49
1.0                     0.33, 0.00                      0.34, 0.02

Test subset (1725 pairs, 1147 true):
Relatedness threshold   Islam and Inkpen [7]: Acc, AR   The proposed MFsSR: Acc, AR
0.1                     0.66, 1.00                      0.67, 1.00
0.2                     0.66, 1.00                      0.67, 1.00
0.3                     0.66, 1.00                      0.67, 1.00
0.4                     0.66, 1.00                      0.67, 1.00
0.5                     0.68, 0.98                      0.67, 1.00
0.6                     0.72, 0.89                      0.66, 1.00
0.7                     0.68, 0.78                      0.66, 0.98
0.8                     0.56, 0.40                      0.64, 0.85
0.9                     0.38, 0.09                      0.49, 0.50
1.0                     0.33, 0.00                      0.34, 0.03

Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Achieving better precision values is clear, with a high percentage across different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependences among the MFs resulting from the measure of the AVW.
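The mechanics underlying this selection follow the classical Q-learning update of Watkins and Dayan [33]. The sketch below is a simplified illustration: the epsilon-greedy policy, learning rate, discount factor, and state names are assumptions, not the exact IOAC-QL settings. It shows how action values such as those in Table 1 would be maintained over the four relationship actions:

```python
import random
from collections import defaultdict

ACTIONS = ["similarity", "contiguity", "contrast", "causality"]
Q = defaultdict(float)  # Q[(state, action)], e.g., Q[("FVO1", "similarity")] as in Table 1

def update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))  [33]
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy choice over the relationship actions; the greedy case
    # mirrors the "Decision" row of Table 1 (pick the action with maximal Q)
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])
```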

Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonyms. In addition, the IOAC-QL algorithm, the CRFs approach, and the TFICF approach are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. This is because IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements of the F-measure achieved through the IOAC-QL algorithm are approximately up to 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset over the CRFs approach. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation).


Table 6: Results of the C-SynD algorithm based on the MFsSR compared to PMI and ICA for detecting the closest synonym (values in %).

Dataset   PMI: Precision, Recall, F-measure   ICA: Precision, Recall, F-measure   C-SynD: Precision, Recall, F-measure
SN        60.6, 60.6, 61                      79.5, 71.6, 75.3                    85, 80, 82
BNC       74.5, 67.9, 71                      76, 67.8, 72                        82, 71.9, 77

Table 7: Results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches (values in %).

Dataset     CRFs: Precision, Recall, F-measure   TFICF: Precision, Recall, F-measure   IOAC-QL: Precision, Recall, F-measure
TREC        75.08, 60.2, 66.28                   89.3, 71.4, 78.7                      89.8, 88.1, 88.6
BLESS       73.04, 62.66, 67.03                  73.8, 69.5, 71.6                      95.0, 83.5, 88.9
NYT         68.46, 54.02, 60.38                  86.0, 65.0, 74.0                      90.0, 74.0, 81.2
Wikipedia   56.0, 42.0, 48.0                     64.6, 54.88, 59.34                    70.0, 44.0, 54.0

Table 8: Results of the graph-based approach compared to the term-based and concept-based approaches using the K-means and HAC clustering algorithms (FC values in %).

Dataset         Algorithm   Term-based FC   Concept-based FC   Graph-based FC
Reuters-21578   HAC         73              85.7               90.2
                K-means     51              84.6               91.8
20 Newsgroups   HAC         53              82                 89
                K-means     48              79.3               87

This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis, and the results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of the textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similar common subgraphs. The semantic representations of textual documents, and the relatedness determined between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MF-based analysis and the subgraph relatedness measure was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MF-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment is repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. Relative to the base case (term-based) approach, the improvements achieved by the graph-based approach are up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to 36% by HAC and 39% by K-means for the 20 newsgroups dataset, approximately. Furthermore, relative to the concept-based approach, the improvements achieved by the graph-based approach are up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to 7% by HAC and 8% by K-means for the 20 newsgroups dataset, approximately.
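As an illustrative sketch of this pipeline (not the authors' implementation), a relatedness matrix produced by the subgraph matching process can be handed to standard HAC by converting relatedness to distance; the matrix name and linkage choice below are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_documents(relatedness, num_clusters):
    # relatedness: symmetric (n x n) document-linkage matrix with values in [0, 1]
    distance = 1.0 - np.asarray(relatedness, dtype=float)
    np.fill_diagonal(distance, 0.0)                 # a document is identical to itself
    condensed = squareform(distance, checks=False)  # condensed pairwise distances
    tree = linkage(condensed, method="average")     # agglomerative (HAC) clustering
    return fcluster(tree, t=num_clusters, criterion="maxclust")
```

The resulting cluster labels can then be scored against the gold categories with the overall F-measure of equation (19).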


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations in textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was conducted for the document linkage process. Document clustering was used as a case study, extensive experiments were conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies for processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation, using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu.

[13] OpenNLP, http://opennlp.apache.org.

[14] AlchemyAPI, http://www.alchemyapi.com.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.


Page 13: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

The Scientific World Journal 13

Table 3: Datasets details for the experimental setup.

DS1, Experiment 1 (content-based evaluation): Miller and Charles (MC) (1). RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.

DS2, Experiment 1 (content-based evaluation): Microsoft Research Paraphrase Corpus (MRPC) (2). The corpus consists of 5801 sentence pairs collected from newswire articles, 3900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4076 pairs, of which 2753 are true) and a test subset (1725 pairs, of which 1147 are true).

DS3, Experiment 2 (coselection-based evaluation: closest-synonym detection): British National Corpus (BNC) (3). BNC is a carefully selected collection of 4124 contemporary written and spoken English texts; it is a 100-million-word text corpus of samples of written and spoken English with the near-synonym collocations.

DS4, Experiment 2 (coselection-based evaluation: closest-synonym detection): SN (semantic neighbors) (4). SN relates 462 target terms (nouns) to 5910 relatum terms with 14682 semantic relations (7341 meaningful and 7341 random). The SN contains synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.

DS5, Experiment 2 (coselection-based evaluation: semantic relationships exploration): BLESS. BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8625 relatum terms with 26554 semantic relations (14440 meaningful (correct) and 12154 random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.

DS6, Experiment 2 (coselection-based evaluation: semantic relationships exploration): TREC (5). TREC includes 1437 sentences annotated with entities and relations, each with at least one relation. There are three types of entities: person (1685), location (1968), and organization (978); in addition, there is a fourth type, others (705), which indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location; work for (394) indicates that a person works for an organization; OrgBased in (451) indicates that an organization is based in a location; live in (521) indicates that a person lives at a location; and kill (268) indicates that a person killed another person. There are 17007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which thus significantly outnumbers the other relations.

DS7, Experiment 2 (coselection-based evaluation: semantic relationships exploration): IJCNLP 2011 New York Times (NYT) (6). NYT contains 150 business articles from the New York Times. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.

DS8, Experiment 2 (coselection-based evaluation: semantic relationships exploration): IJCNLP 2011 Wikipedia (6). The Wikipedia personal/social relation dataset was previously used in Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.

DS9, Experiment 3 (task-based evaluation): Reuters-21578 (7). Reuters-21578 contains 21578 documents (12902 are used), categorized into 10 categories.

DS10, Experiment 3 (task-based evaluation): 20 Newsgroups (8). The 20 Newsgroups dataset contains 20000 documents (18846 are used), categorized into 20 categories.

Sources: (1) http://www.cs.cmu.edu/~mfaruqui/suite.html; (2) http://research.microsoft.com/en-us/downloads; (3) http://corpus.byu.edu/bnc; (4) http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv; (5) http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp; (6) http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip; (7) http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection; (8) http://www.csmining.org/index.php/id-20-newsgroups.html.

Let TP denote the number of correctly detected closest synonyms or semantic relationships explored, let FP be the number of incorrectly detected closest synonyms or semantic relationships explored, and let FN be the number of correct but undetected closest synonyms or semantic relationships in a dataset. The F-measure combines the precision and recall in one metric and is often used to show the efficiency. This measurement can be interpreted as a weighted average of the precision and recall, where an F-measure reaches its best value at 1 and its worst value at 0. Precision, recall, and F-measure are defined as follows:

\[
\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
F\text{-measure} = \frac{2\,(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}}.
\tag{15}
\]
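For concreteness, the following minimal Python sketch computes the three quantities of (15) from raw TP/FP/FN counts; the example counts are hypothetical and serve only to illustrate the formulas:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Precision, recall, and F-measure from raw counts, as in (15)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if (recall + precision) else 0.0)
    return precision, recall, f_measure

# Hypothetical detection counts: 80 correct, 15 spurious, 20 missed.
print(precision_recall_f(tp=80, fp=15, fn=20))  # ~ (0.842, 0.800, 0.821)
```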


4.1.3. Measure for Task-Based Evaluation. A case study of a managing and mining task (i.e., semantic clustering of text documents) is studied for task-based evaluation. The test is performed with the standard clustering algorithms that accept the pairwise (dis)similarity between documents rather than the internal feature space of the documents. The clustering algorithms may be classified as listed below [43]: (a) flat clustering (which creates a set of clusters without any explicit structure that would relate the clusters to each other, also called exclusive clustering); (b) hierarchical clustering (which creates a hierarchy of clusters); (c) hard clustering (which assigns each document/object as a member of exactly one cluster); and (d) soft clustering (which distributes the documents/objects over all clusters). The algorithms used here are agglomerative clustering (hierarchical clustering), K-means (flat clustering, hard clustering), and the EM algorithm (flat clustering, soft clustering).

Hierarchical agglomerative clustering (HAC) and the K-means algorithm have been applied to text clustering in a straightforward way [16, 17]. The algorithms employed in the experiment included K-means clustering and HAC. To evaluate the quality of the clustering, the F-measure, which combines the precision and recall measures, is used. These metrics are computed for every (class, cluster) pair. The precision P and recall R of a cluster S_i with respect to a class L_r are defined as

\[
P(L_r, S_i) = \frac{N_{ri}}{N_i}, \qquad
R(L_r, S_i) = \frac{N_{ri}}{N_r},
\tag{16}
\]

where N_{ri} is the number of documents of class L_r in cluster S_i, N_i is the number of documents of cluster S_i, and N_r is the number of documents of class L_r.

The F-measure, which is a harmonic mean of precision and recall, is defined as

\[
F(L_r, S_i) = \frac{2 \cdot P(L_r, S_i) \cdot R(L_r, S_i)}{P(L_r, S_i) + R(L_r, S_i)}.
\tag{17}
\]

A per-class F score is calculated as follows:

\[
F(L_r) = \max_{S_i \in C} F(L_r, S_i).
\tag{18}
\]

With respect to class L_r, the cluster with the highest F-measure is considered to be the cluster that maps to class L_r; that F-measure becomes the score for class L_r, and C is the set of clusters produced. The overall F-measure for the clustering result C is the weighted average of the F-measure for each class L_r (macroaverage):

\[
F_C = \frac{\sum_{L_r} \left( |N_r| \cdot F(L_r) \right)}{\sum_{L_r} |N_r|},
\tag{19}
\]

where |N_r| is the number of objects in class L_r. The higher the overall F-measure, the better the clustering, due to the higher accuracy of the clusters' mapping to the original classes.
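The per-class maximization in (18) and the class-size weighting in (19) can be implemented directly from two label assignments. The sketch below assumes the gold class labels and predicted cluster labels are given as parallel lists (the example inputs are hypothetical) and follows (16)-(19):

```python
from collections import Counter

def overall_f_measure(classes, clusters):
    """Overall clustering F-measure F_C, following (16)-(19).

    classes[i] is the gold class and clusters[i] the assigned cluster
    of document i; returns the class-size-weighted macroaverage.
    """
    n_ri = Counter(zip(classes, clusters))  # documents of class r in cluster i
    n_r = Counter(classes)                  # class sizes
    n_i = Counter(clusters)                 # cluster sizes
    weighted_sum = 0.0
    for label, size_r in n_r.items():
        best_f = 0.0
        for cluster, size_i in n_i.items():
            overlap = n_ri.get((label, cluster), 0)
            if overlap == 0:
                continue
            p, r = overlap / size_i, overlap / size_r   # (16)
            best_f = max(best_f, 2 * p * r / (p + r))   # (17), (18)
        weighted_sum += size_r * best_f                 # numerator of (19)
    return weighted_sum / len(classes)

print(overall_f_measure(["a", "a", "b", "b"], [0, 0, 0, 1]))  # ~0.733
```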

4.2. Evaluation Setup (Dataset). Content-based, coselection-based, and task-based evaluations are used to validate the approach over real corpora of documents, and the results are very promising. The experimental setup consists of the datasets of textual documents detailed in Table 3. Experiment 1 uses the Miller and Charles (MC) set and the Microsoft Research Paraphrase Corpus. Experiment 2 uses the British National Corpus, TREC, IJCNLP 2011 (NYT and Wikipedia), SN, and BLESS. Experiment 3 uses the Reuters-21578 and 20 Newsgroups datasets.

4.3. Evaluation Results. This section reports on the results of three experiments conducted using the evaluation datasets outlined in the previous section. The SHGRS is implemented and evaluated based on concept analysis and annotation at the sentence level in experiment 1, the document level in experiment 2, and the corpus level in experiment 3. In experiment 1, DS1 and DS2 are used to investigate the effectiveness of the proposed MFsSR compared to previous studies, based on the availability of semantic relatedness values from human judgments and previous studies. In experiment 2, DS3 and DS4 are used to investigate the effectiveness of closest-synonym detection using the MFsSR compared to using PMI [25, 37] and ICA [38], based on the availability of synonyms coming from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. Furthermore, DS5, DS6, DS7, and DS8 are used to investigate the effectiveness of the IOAC-QL for semantic relationships exploration compared to implicit relation extraction with CRFs [6, 39] and TFICF [40], based on the availability of semantic relation judgments. In experiment 3, DS9 and DS10 are used to investigate the effectiveness of the document linkage process for the clustering case study compared with previous word-based [16] and concept-based [17] studies, based on the availability of human judgments.

These experiments revealed some interesting trends in terms of text organization and a representation scheme based on semantic annotation and reinforcement learning. The results show the effectiveness of the proposed scheme for an accurate understanding of the underlying meaning. The proposed SHGRS can be exploited in mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents. In addition, the experiments illustrate the improvements when direct relevance and indirect relevance are included in the process of measuring semantic relatedness between the MFs of documents. The main purpose of the semantic relatedness measures is to infer unknown dependencies and relationships among the MFs. The proposed MFsSR is the backbone of the closest-synonym detection and semantic relationships exploration algorithms. The performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationships exploration improves proportionally, and a semantic understanding of the text content is achieved. The relatedness subgraph measure is used for the document linkage process at the corpus level. Through this process, better clustering results can be achieved, as the accuracy of measuring the relatedness between documents improves by using the semantic information in the SHGRS.


Table 4: Comparison of the correlation coefficient with human judgment (M&C) for several relatedness measures and the proposed semantic relatedness measure.

Measure                               Correlation with M&C
Distance-based measures
  Rada                                0.688
  Wu and Palmer                       0.765
Information-based measures
  Resnik                              0.77
  Jiang and Conrath                   0.848
  Lin                                 0.853
Hybrid measures
  T. Hong and D. Smith                0.879
  Zili Zhou                           0.882
Information/feature-based measures
  The proposed MFsSR                  0.937

The relatedness subgraph measure affects the performance of the document managing and mining process relative to how much of the semantic information it considers. The most accurate relatedness can be achieved when the maximum amount of semantic information is provided. Consequently, the quality of clustering is improved, and when semantic information is considered, the approach surpasses the word-based and concept-based techniques.

4.3.1. Experiment 1: Comparative Study (Content-Based Evaluation). This experiment addresses the need to evaluate the performance of the proposed MFsSR on a benchmark dataset of human judgments, so that the results of the proposed semantic relatedness measure are comparable with previous studies in the same field. In this experiment, the correlation is computed between the ratings of the proposed semantic relatedness approach and the mean ratings reported by Miller and Charles (DS1 in Table 3). Furthermore, the results produced are compared against seven other semantic similarity approaches, namely, those of Rada, Wu and Palmer, Resnik, Jiang and Conrath, Lin, T. Hong and D. Smith, and Zili Zhou [21].

Table 4 shows the correlation coefficient results between these eight approaches and the Miller and Charles mean ratings. The semantic relatedness of the proposed approach outperformed all the listed approaches. Unlike all the listed methods, the proposed semantic relatedness considers different properties. Furthermore, the good correlation value of the approach also results from considering all available relationships between concepts and the indirect relationships between the attributes of each concept when measuring the semantic relatedness. In this experiment, the proposed approach achieved a good correlation value with the human-subject ratings reported by Miller and Charles. Based on the results of this study, the MFsSR correlation value has proven that considering the direct/indirect relevance is specifically important, yielding at least a 6% improvement over the best previous results. This improvement contributes to the achievement of the largest amount of relationships and interdependence among MFs and their attributes at the document level, and to the document linkage process at the corpus level.
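The content-based evaluation itself reduces to a correlation computation between the measure's scores and the human ratings. A minimal sketch follows; the two score vectors are hypothetical stand-ins for the noun pairs of DS1:

```python
from scipy.stats import pearsonr

human_ratings = [3.92, 3.84, 3.05, 1.68, 0.55]   # hypothetical M&C-style means
measure_scores = [0.91, 0.88, 0.74, 0.35, 0.12]  # hypothetical relatedness scores

r, p_value = pearsonr(human_ratings, measure_scores)
print(f"Correlation with human judgment: {r:.3f}")
```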

Table 5 summarizes the characteristics of the MRPC dataset (DS2 in Table 3) and presents a comparison of the Acc and AR values between the proposed semantic relatedness measure (MFsSR) and Islam and Inkpen [7]. Different relatedness thresholds, ranging from 0 to 1 with an interval of 0.1, are used to validate the MFsSR against Islam and Inkpen [7]. After evaluation, the best relatedness thresholds for Acc and AR are 0.6, 0.7, and 0.8. These results indicate that the proposed MFsSR surpasses Islam and Inkpen [7] in terms of Acc and AR. The results of each approach listed in Table 5 were based on the best Acc and AR over all thresholds, rather than under the same relatedness threshold. This improvement in Acc and AR values is due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance. This relevance takes into account the closest synonym and the relationships of the sentences' MFs.
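Assuming that Acc is the fraction of pairs whose accept/reject decision at a given threshold matches the human label, and that AR is the fraction of pairs whose score clears the threshold (both are assumptions, since the definitions are not spelled out here), the threshold sweep behind Table 5 can be sketched as:

```python
def acc_and_ar(scores, labels, threshold):
    """Accuracy and acceptance rate at a relatedness threshold (assumed defs)."""
    accepted = [s >= threshold for s in scores]
    ar = sum(accepted) / len(scores)
    acc = sum(a == l for a, l in zip(accepted, labels)) / len(scores)
    return acc, ar

pair_scores = [0.95, 0.80, 0.62, 0.30]     # hypothetical relatedness scores
human_labels = [True, True, False, False]  # hypothetical paraphrase judgments
for t in (0.6, 0.7, 0.8):
    print(t, acc_and_ar(pair_scores, human_labels, t))
```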

4.3.2. Experiment 2: Comparative Study of Closest-Synonym Detection and Semantic Relationships Exploration. This experiment studies the impact of the MFsSR on inferring unknown dependencies and relationships among the MFs. This impact is assessed through the closest-synonym detection and semantic relationship exploration algorithms, so that the results are comparable with previous studies in the same field. Throughout this experiment, the impact of the proposed MFsSR for detecting the closest synonym is studied and compared to the PMI approach and the ICA approach for detecting the best near-synonym. The difference among the proposed approach for discovering semantic relationships or associations (IOAC-QL), the implicit relation extraction CRFs approach, and the TFICF approach is also examined. This experiment attempts to measure the performance of the MFsSR for closest-synonym detection and of the IOAC-QL for semantic relationship exploration. Hence, more semantic understanding of the text content is achieved by inferring unknown dependencies and relationships among the MFs. Considering the direct relevance among the MFs and the indirect relevance among the attributes of the MFs in the proposed MFsSR constitutes a certain advantage over previous measures. However, most of the time, the incorrect items are due to wrong syntactic parsing from the OpenNLP parser and AlchemyAPI. According to the preliminary study, it is certain that the accuracy of the parsing tools affects the performance of the MFsSR.
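The PMI baseline scores a candidate near-synonym by how strongly it co-occurs with a target word. A minimal sketch of the underlying quantity, with hypothetical corpus counts, is given below (this is the generic PMI definition, not the polysemy-augmented variant of [25]):

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) from co-occurrence counts."""
    p_xy, p_x, p_y = count_xy / total, count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts over 10**6 context windows:
print(pmi(count_xy=300, count_x=2000, count_y=5000, total=10**6))  # ~4.9
```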

Detecting the closest synonym is a process of MF disambiguation resulting from the implementation of the MFsSR technique and the thresholding of the synonym scores through the C-SynD algorithm. In the C-SynD algorithm, the MFsSR takes into account the direct relevance among MFs and the indirect relevance among MFs and their attributes in a context, which gives a significant increase in the recall without disturbing the precision.


Table 5: Comparison of the accuracy (Acc) and acceptance rate (AR) between the proposed semantic relatedness measure (MFsSR) and Islam and Inkpen [7] on the MRPC dataset.

Training subset (4076 pairs; human judgment TP + FN: 2753 true)

Threshold   Islam and Inkpen [7]   The proposed MFsSR
            Acc      AR            Acc      AR
0.1         0.67     1             0.68     1
0.2         0.67     1             0.68     1
0.3         0.67     1             0.68     1
0.4         0.67     1             0.68     1
0.5         0.69     0.98          0.68     1
0.6         0.72     0.89          0.68     1
0.7         0.68     0.78          0.70     0.98
0.8         0.56     0.4           0.72     0.86
0.9         0.37     0.09          0.60     0.49
1           0.33     0             0.34     0.02

Test subset (1725 pairs; human judgment TP + FN: 1147 true)

Threshold   Islam and Inkpen [7]   The proposed MFsSR
            Acc      AR            Acc      AR
0.1         0.66     1             0.67     1
0.2         0.66     1             0.67     1
0.3         0.66     1             0.67     1
0.4         0.66     1             0.67     1
0.5         0.68     0.98          0.67     1
0.6         0.72     0.89          0.66     1
0.7         0.68     0.78          0.66     0.98
0.8         0.56     0.4           0.64     0.85
0.9         0.38     0.09          0.49     0.5
1           0.33     0             0.34     0.03

Thus, the MFsSR between two concepts considers the information content together with most of the WordNet relationships. Table 6 illustrates the performance of the C-SynD algorithm based on the MFsSR compared to the best near-synonym algorithm using the PMI and ICA approaches. This comparison was conducted for two corpora of text documents, BNC and SN (DS3 and DS4 in Table 3). As indicated in Table 6, the C-SynD algorithm yielded higher average precision, recall, and F-measure values than the PMI approach by 25%, 20%, and 21% on the SN dataset, respectively, and by 8%, 4%, and 7% on the BNC dataset. Furthermore, the C-SynD algorithm also yielded higher average precision, recall, and F-measure values than the ICA approach by 6%, 9%, and 7% on the SN dataset, respectively, and by 6%, 4%, and 5% on the BNC dataset. The improvements achieved in the performance values are due to the increase in the number of pairs predicted correctly and the decrease in the number of pairs predicted incorrectly through implementing the MFsSR. The important observation from this table is the improvement achieved in recall, which measures the effectiveness of the C-SynD algorithm. Better precision values are clearly achieved, with a high percentage, on different datasets and domains.

Inferring the MFs' relationships and dependencies is a process of achieving a large number of interdependences among MFs, resulting from the implementation of the IOAC-QL algorithm. In the IOAC-QL algorithm, Q-learning is used to select the optimal action (relationship) that gains the largest amount of interdependences among the MFs, resulting from the measure of the AVW. Table 7 illustrates the performance of the IOAC-QL algorithm based on AVWs compared to the implicit relationship extraction based on the CRFs and TFICF approaches, conducted over the TREC, BLESS, NYT, and Wikipedia corpora (DS5, DS6, DS7, and DS8 in Table 3). The data in this table show increases in precision, recall, and F-measure values due to the increase in the number of pairs predicted correctly after considering direct/indirect relevance through the expansion of the closest synonym. In addition, the IOAC-QL algorithm, the CRFs approach, and the TFICF approach are all beneficial to the extraction performance, but the IOAC-QL contributes more than CRFs and TFICF. This is because IOAC-QL considers similarity, contiguity, contrast, and causality relationships between MFs and their closest synonyms, while the CRFs and TFICF consider only is-a and part-of relationships between concepts. The improvements in the F-measure achieved by the IOAC-QL algorithm over the CRFs approach reach approximately 23% for the TREC dataset, 22% for the BLESS dataset, 21% for the NYT dataset, and 6% for the Wikipedia dataset. Furthermore, the IOAC-QL algorithm also yielded higher F-measure values than the TFICF approach, by 10% for the TREC dataset, 17% for the BLESS dataset, and 7% for the NYT dataset, respectively.
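The Q-learning core that IOAC-QL builds on can be pictured with the standard tabular update of Watkins [33]. The sketch below is only an illustration under assumed inputs: the state names, the reward signal (standing in for the AVW measure), and the hyperparameters are placeholders, not the paper's actual formulation; the action set reuses the four relationship types named above:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # assumed hyperparameters
ACTIONS = ["similarity", "contiguity", "contrast", "causality"]
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state):
    """Epsilon-greedy selection over candidate relationship types."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)       # explore
    return max(Q[state], key=Q[state].get)  # exploit

def update(state, action, reward, next_state):
    """One Watkins Q-learning step: Q <- Q + a*(r + g*max Q' - Q)."""
    best_next = max(Q[next_state].values())
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# Hypothetical usage: reward would come from the AVW measure.
update("mf_pair_42", "similarity", reward=0.8, next_state="mf_pair_43")
```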

4.3.3. Experiment 3: Comparative Study (Task-Based Evaluation). This experiment shows the effectiveness of the proposed semantic representation scheme in document mining and managing operations such as gathering, filtering, searching, retrieving, extracting, classifying, clustering, and summarizing documents.


Table 6: Results of the C-SynD algorithm based on the MFsSR compared to PMI and ICA for detecting the closest synonym.

Dataset   PMI                              ICA                              C-SynD
          Precision  Recall  F-measure     Precision  Recall  F-measure     Precision  Recall  F-measure
SN        60.6       60.6    61            79.5       71.6    75.3          85         80      82
BNC       74.5       67.9    71            76         67.8    72            82         71.9    77

Table 7: Results of the IOAC-QL algorithm based on AVWs compared to the CRFs and TFICF approaches.

Dataset     CRFs                             TFICF                            IOAC-QL
            Precision  Recall  F-measure    Precision  Recall  F-measure    Precision  Recall  F-measure
TREC        75.08      60.2    66.28        89.3       71.4    78.7         89.8       88.1    88.6
BLESS       73.04      62.66   67.03        73.8       69.5    71.6         95.0       83.5    88.9
NYT         68.46      54.02   60.38        86.0       65.0    74.0         90.0       74.0    81.2
Wikipedia   56.0       42.0    48.0         64.6       54.88   59.34        70.0       44.0    54.0

Table 8: Results of the graph-based approach compared to the term-based and concept-based approaches using the K-means and HAC clustering algorithms (FC = weighted average F-measure per class).

Dataset         Algorithm   Term-based FC   Concept-based FC   Graph-based FC
Reuters-21578   HAC         73              85.7               90.2
                K-means     51              84.6               91.8
20 Newsgroups   HAC         53              82                 89
                K-means     48              79.3               87

Many sets of experiments using the proposed scheme on different datasets in text clustering have been conducted. The experiments provide extensive comparisons between the MF-based analysis and the traditional analysis. The experimental results demonstrate a substantial enhancement of the document clustering quality. In this experiment, the application of the semantic understanding of textual documents is presented. Textual documents are parsed to produce the rich syntactic and semantic representations of the SHGRS. Based on these semantic representations, document linkages are estimated using the subgraph matching process, which is based on finding all distinct similar common subgraphs. The semantic representations of the textual documents, and the relatedness determined between them, are evaluated by allowing clustering algorithms to use the produced relatedness matrix of the document set. To test the effectiveness of subgraph matching in determining an accurate measure of the relatedness between documents, an experiment using the MF-based analysis and the subgraph relatedness measures was conducted. Two standard document clustering techniques were chosen for testing the effect of the document linkages on clustering [43]: (1) HAC and (2) K-means. The MF-based weighting is the main factor that captures the importance of a concept in a document. Furthermore, document linkages using the subgraph relatedness measure are among the main factors that affect clustering techniques. Thus, to study the effect of the MF-based weighting and the subgraph relatedness measure (the graph-based approach) on the clustering techniques, this experiment was repeated for the different clustering techniques, as shown in Table 8. Basically, the aim is to maximize the weighted average of the F-measure for each class (FC) of clusters to achieve high-quality clustering.

Table 8 shows that the graph-based approach achieves higher performance than the term-based and concept-based approaches. This comparison was conducted for two corpora of text documents, the Reuters-21578 dataset and the 20 Newsgroups dataset (DS9 and DS10 in Table 3). The results shown in Table 8 illustrate the improvement in clustering quality when more attributes are included in the graph-based relatedness measure. The improvements achieved by the graph-based approach over the base case (term-based) approach are up to 17% by HAC and 40% by K-means for the Reuters-21578 dataset, and up to approximately 36% by HAC and 39% by K-means for the 20 Newsgroups dataset. Furthermore, the improvements achieved by the graph-based approach over the concept-based approach are up to 5% by HAC and 7% by K-means for the Reuters-21578 dataset, and up to approximately 7% by HAC and 8% by K-means for the 20 Newsgroups dataset.
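Because the clustering algorithms consume pairwise document relatedness rather than a feature space, the document linkage step can feed HAC directly through a precomputed distance matrix. A minimal sketch with a hypothetical 4-document relatedness matrix follows (scikit-learn >= 1.2 names the parameter metric; older versions use affinity):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

relatedness = np.array([        # hypothetical pairwise document relatedness
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
distance = 1.0 - relatedness    # convert relatedness to dissimilarity

hac = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                              linkage="average")
print(hac.fit_predict(distance))  # e.g. [0 0 1 1]
```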


5. Conclusions

This paper proposed the SHGRS to improve the effectiveness of the mining and managing operations on textual documents. Specifically, a three-stage approach was proposed for constructing the scheme. First, the MFs with their attributes are extracted, and a hierarchy-based structure is built to represent sentences by the MFs and their attributes. Second, a novel semantic relatedness computing method was proposed for inferring the relationships and dependencies of MFs, and the relationships between MFs are represented by a graph-based structure. Third, a subgraph matching algorithm was applied for the document linkage process. Document clustering was used as a case study, extensive experiments were conducted on 10 different datasets, and the results showed that the scheme can yield a better mining performance compared with traditional approaches. Future work will focus on conducting other case studies on processes such as gathering, filtering, retrieving, classifying, and summarizing information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nurnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu.

[13] OpenNLP, http://opennlp.apache.org.

[14] AlchemyAPI, http://www.alchemyapi.com.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.


[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web: ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining, August 2000.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

14 The Scientific World Journal

413 Measure for Task-Based Evaluation A case study ofa managing and mining task (ie semantic clustering oftext documents) is studied for task-based evaluation Thetest is performed with the standard clustering algorithmsthat accept the pairwise (dis)similarity between documentsrather than the internal feature space of the documents Theclustering algorithms may be classified as listed below [43](a) flat clustering (which creates a set of clusters without anyexplicit structure that would relate the clusters to each otheralso called exclusive clustering) (b) hierarchical clustering(which creates a hierarchy of clusters) (c) hard clustering(which assigns each documentobject as a member of exactlyone cluster) and (d) soft clustering (which distributes thedocumentsobjects over all clusters) These algorithms areagglomerative (hierarchical clustering) 119870-means (flat clus-tering hard clustering) and EM algorithm (flat clusteringsoft clustering)

Hierarchical agglomerative clustering (HAC) and the 119870-means algorithm have been applied to text clustering ina straightforward way [16 17] The algorithms employedin the experiment included 119870-means clustering and HACTo evaluate the quality of the clustering the 119865-measure isused for quality measures which combine the precision andrecall measures These metrics are computed for every (classcluster) pair The precision 119875 and recall 119877 of a cluster 119878

119894with

respect to a class 119871119903are defined as

119875 (119871119903 119878119894) =

119873119903119894

119873119894

119877 (119871119903 119878119894) =

119873119903119894

119873119903

(16)

where 119873119903119894is the number of documents of class 119871

119903in cluster

119878119894119873119894is the number of documents of cluster 119878

119894 and119873

119903is the

number of documents of class 119871119903

The 119865-measure which is a harmonic mean of precisionand recall is defined as

119865 (119871119903 119878119894) =

2 lowast 119875 (119871119903 119878119894) lowast 119877 (119871

119903 119878119894)

119875 (119871119903 119878119894) + 119877 (119871

119903 119878119894)

(17)

A per-class 119865 score is calculated as follows

119865 (119871119903) = max119878119894isin119862

119865 (119871119903 119878119894) (18)

With respect to class 119871119903 the cluster with the highest 119865-

measure is considered to be the cluster that maps to class 119871119903

that 119865-measure becomes the score for class 119871119903 and 119862 is total

number of clusters The overall 119865-measure for the clusteringresult 119862 is the weighted average of the 119865-measure for eachclass 119871

119903(macroaverage)

119865119862 =sum119871119903

(1003816100381610038161003816119873119903

1003816100381610038161003816 lowast 119865 (119871119903))

sum119871119903

10038161003816100381610038161198731199031003816100381610038161003816

(19)

where |119873119903| is the number of objects in class 119871

119903 The higher

the overall 119865-measure is the better the clustering is due tothe higher accuracy of the clusters mapping to the originalclasses

42 Evaluation Setup (Dataset) Content-based coselection-based and task-based evaluations are used to validateexperimenting over the real corpus of documents theresults are very promising The experimental setup con-sists of some datasets of textual documents as detailed inTable 3 Experiment 1 uses the Miller and Charles (MC) andMicrosoft Research Paraphrase Corpus Experiment 2 usesthe British National Corpus TREC IJCNLP 2011 (NYT andWikipedia) SN and BLESS Experiment 3 usesThe Reuters-21578 and 20 newsgroups

43 Evaluation Results This section reports on the results ofthree experiments conducted using the evaluation datasetsoutlined in the previous section The SHGRS is imple-mented and evaluated based on concept analysis and anno-tation as sentence based in experiment 1 document basedin experiment 2 and corpus based in experiment 3 Inexperiment 1 the DS1 and DS2 investigated the effectivenessof the proposed MFsSR compared to previous studies basedon the availability of semantic relatedness values of humanjudgments and previous studies In experiment 2 the DS3andDS4 investigated the effectiveness of the closest synonymdetection using MFsSR compared to using PMI [25 37]and ICA [38] based on the availability of synonyms comingfrom three sources WordNet 30 Rogetrsquos thesaurus anda synonyms database Furthermore the DS5 DS6 DS7and DS8 investigated the effectiveness of the IOAC-QL forsemantic relationships exploration compared to the implicitrelation extraction CRFs [6 39] and TFICF [40] based on theavailability of semantic relations judgments In experiment 3the DS9 and DS10 investigated the effectiveness of thedocument linkage process for clustering case study thanprevious studies as word-based [16] and concept-based [17]based on the availability of human judgments

These experiments revealed some interesting trends interms of text organization and a representation schemebased on semantic annotation and reinforcement learningThe results show the effectiveness of the proposed schemefor accurate understanding of the underlying meaning Theproposed SHGRS is exploited in mining and managingoperations such as gathering filtering searching retrievingextracting classifying clustering and summarizing docu-ments In addition the experiments illustrate the improve-ments when direct relevance and indirect relevance areincluded in the process of measuring semantic relatednessbetween MFs of documents The main purpose of semanticrelatedness measures is to infer unknown dependencies andrelationships among the MFs The proposed MFsSR is abackbone of the closest synonym detection and semanticrelationships exploration algorithms The performance ofMFsSR for closest synonym detection and IOAC-QL forsemantic relationships exploration algorithms is proportion-ally improved as well as having a semantic understanding oftext content achieved The relatedness subgraph measure isused for the document linkage process at the corpus levelThrough this process better clustering results can be achievedas more attempt to improve the accuracy of measuring therelatedness between documents using semantic information

The Scientific World Journal 15

Table 4The results of the comparison of the correlation coefficientbetween human judgment with some relatedness measures and theproposed semantic relatedness measure

MeasureRelevance

correlation withMampC

Distance-based measuresRada 0688Wu and Palmer 0765

Information-based measuresResnik 077Jiang and Conrath 0848Lin 0853

Hybrid measuresT Hong and D smith 0879Zili Zhou 0882

Informationfeature-base measuresThe proposed MFsSR 0937

in SHGRSThe relatedness subgraphsmeasure affects the per-formance of the output of the document managing and min-ing process relative to howmuch of the semantic informationit considers The most accurate relatedness can be achievedwhen the maximum amount of the semantic information isprovided Consequently the quality of clustering is improvedand when semantic information is considered the approachsurpasses the word-based and concept-based techniques

431 Experiment 1 Comparative Study (Content-Based Eval-uation) This experiment shows the necessity to evaluate theperformance of the proposed MFsSR based on a benchmarkdataset of human judgments so the results of the proposedsemantic relatedness would be comparable with other pre-vious studies in the same field In this experiment theyattempted to compute the correlation between the ratings ofthe proposed semantic relatedness approach and the meanratings reported by Miller and Charles (DS1 in Table 3)Furthermore the results produced are compared againsteight other semantic similarities approaches namely RadaWu and Palmer Resnik Jiang and Conrath Lin T Hong andD Smith and Zili Zhou [21]

Table 4 shows the correlation coefficient results betweennine componential approaches and the Miller and Charlesratings mean The semantic relatedness for the proposedapproach outperformed all the listed approaches Unlike allthe listed methods in the proposed semantic relatednessdifferent properties are considered Furthermore the goodcorrelation value of the approach also results from consider-ing all available relationships between concepts and the indi-rect relationships between attributes of each concept whenmeasuring the semantic relatedness In this experiment theproposed approach achieved a good correlation value withthe human-subject rating reported by Miller and CharlesBased on the results of this study theMFsSR correlation valuehas proven that considering the directindirect relevance

is specifically important for at least 6 improvement overthe best previous results This improvement contributes tothe achievement of the largest amount of relationships andinterdependence among MFs and their attributes at thedocument level and to the document linkage process at acorpus level

Table 5 summarizes the characteristics of the MRPCdataset (DS2 in Table 3) and presents comparison of theAcc and AR values between the proposed semantic related-ness measure MFsSR and Islam and Inkpen [7] Differentrelatedness thresholds ranging from 0 to 1 with interval01 are used to validate the MFsSR with Islam and Inkpen[7] After evaluation the best relatedness thresholds of Accand AR are 06 07 and 08 These results indicate thatthe proposed MFsSR surpasses Islam and Inkpen [7] interms of Acc and AR The results of each approach listedbelow were based on the best Acc and AR through allthresholds instead of under the same relatedness thresholdThis improvement in Acc andAR values is due to the increasein the numbers of pairs predicted correctly after consideringdirectindirect relevance This relevance takes into accountthe closest synonym and the relationships of sentences MFs

432 Experiment 2 Comparative Study of Closest-SynonymDetection and Semantic Relationships Exploration Thisexperiment shows the necessity to study the impact of theMFsSR for inferring unknown dependencies and relation-ships among the MFs This impact is achieved through theclosest-synonym detection and semantic relationship explo-ration algorithms so the results would be comparable withother previous studies in the same field Throughout thisexperiment the impact of the proposed MFsSR for detectingthe closest synonym is studied and compared to the PMIapproach and ICA approach for detecting the best near-synonym The difference among the proposed discoveringsemantic relationships or associations IOAC-QL the implicitrelation extraction CRFs approach and TFICF approachis examined This experiment attempts to measure theperformance of theMFsSR for the closest synonym detectionand the IOAC-QL for semantic relationship explorationalgorithms Hence more semantic understanding of the textcontent is achieved by inferring unknown dependenciesand relationships among the MFs Considering the directrelevance among the MFs and the indirect relevance amongthe attributes of the MFs in the proposed MFsSR constitutesa certain advantage over previous measures However mostof the time the incorrect items are due to a wrong syntacticparsing from the OpenNLP parser and AlchemyAPIAccording to the preliminary study it is certain that theaccuracy of parsing toolsrsquo effects is on the performance of theMFsSR

Detecting the closest synonym is a process of MF disam-biguation resulting from the implementation of the MFsSRtechnique and the threshold of the synonyms scores throughthe C-SynD algorithm In the C-SynD algorithm the MFsSRtakes into account the direct relevance among MFs andthe indirect relevance among MFs and their attributes in acontext that gives a significant increase in the recall without

16 The Scientific World Journal

Table 5 The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islamand Inkpen [7]

MRPC dataset Relatednessthreshold

Human judgment(TP + FN)

Islam and Inkpen [7] The proposed MFsSRAcc AR Acc AR

Trainingsubset (4076)

01

2753 true

067 1 068 102 067 1 068 103 067 1 068 104 067 1 068 105 069 098 068 106 072 089 068 107 068 078 070 09808 056 04 072 08609 037 009 060 0491 033 0 034 002

Test subset(1725)

01

1147 true

066 1 067 102 066 1 067 103 066 1 067 104 066 1 067 105 068 098 067 106 072 089 066 107 068 078 066 09808 056 04 064 08509 038 009 049 051 033 0 034 003

disturbing the precision Thus the MFsSR between two con-cepts considers the information content with most of Word-Net relationships Table 6 illustrates the performance of theC-SynD algorithm based on theMFsSR compared to the bestnear-synonym algorithmusing the PMI and ICA approachesThis comparison was conducted for two corpora of textdocuments that are BNC and NS (DS3 and DS4 in Table 3)As indicated in Table 6 the C-SynD algorithm yielded higheraverage precision recall and 119865-measure values than PMIapproach by 25 20 and 21 on SN dataset respectivelyand also by 8 4 and 7 on BNC dataset Furthermorethe C-SynD algorithm also yielded higher average precisionrecall and 119865-measure values than ICA approach by 6 9and 7 on SN dataset respectively and also by 6 4and 5 on BNC dataset The improvements achieved in theperformance values are due to the increase of the numberof pairs predicted correctly and the decrease of the numberof pairs predicted incorrectly through implementing MFsSRThe important observation from this table is the improve-ments achieved in recall which measures effectiveness of theC-SynD algorithm Achieving better precision values is clearwith a high percentage in different datasets and domains

Inferring the MFs relationships and dependencies is aprocess of achieving a large number of interdependencesamongMFs resulting from the implementation of the IOAC-QL algorithm In the IOAC-QL algorithm the 119876-learning isused to select the optimal action (relationships) that gainsthe largest amount of interdependences among the MFs

resulting from the measure of the AVW Table 7 illustratesthe performance of the IOAC-QL algorithm based on AVWscompared to the implicit relationship extraction based onCRFs and TFICF approaches which was conducted overthe TREC Bless the NYT and the Wikipedia corpus (DS5DS6 DS7 and DS8 in Table 3) The data in this table showsincreases in precision recall and 119865-measure values due tothe increase in the number of pairs predicted correctly afterconsidering directindirect relevance through the expansionof closest synonym In addition the IOAC-QL algorithmthe CRFs and the TFICF approaches are beneficial tothe extraction performance but the IOAC-QL contributesmore than CRFs and TFICF Thus IOAC-QL considerssimilarity contiguity contrast and causality relationshipsbetween MFs and its closest-synonyms while the CRFs andTFICF consider is-a and part-of relationships only betweenconceptsThe improvements of the 119865-measure were achievedthrough an IOAC-QL algorithm up to 23 for TREC dataset22 for BLESS dataset 21 for NYT dataset and 6 forWikipedia dataset approximately from the CRFs approachFurthermore the IOAC-QL algorithm also yielded higher119865-measure values than TFICF approach by 10 for TRECdataset 17 for BLESS dataset and 7 for NYT datasetrespectively

433 Experiment 3 Comparative Study (Task-Based Eval-uation) This experiment shows the effectiveness of the

The Scientific World Journal 17

Table 6 The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym

Datasets PMI ICA C-SynDPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

SN 606 606 061 795 716 753 85 80 82BNC 745 679 71 76 678 72 82 719 77

Table 7 The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches

Datasets CRFs TFICF IOAC-QLPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

TREC 7508 602 6628 893 714 787 898 881 886BLESS 7304 6266 6703 738 695 716 950 835 889NYT 6846 5402 6038 860 650 740 900 740 812Wikipedia 560 420 480 646 5488 5934 700 440 540

Table 8The results of graph-based approach compared to term-based and concept-based approach using119870-means and the HAC clusteringalgorithms

Datasets Algorithm Term-based Concept-based Graph-basedFC FC FC

Reuters-21578 HAC 73 857 902119870-means 51 846 918

20 Newsgroups HAC 53 82 89119870-means 48 793 87

proposed semantic representation scheme in document min-ing and managing operations such as gathering filteringsearching retrieving extracting classifying clustering andsummarizing documents Many sets of experiments usingthe proposed scheme on different datasets in text clusteringhave been conducted The experiments demonstrate theextensive comparisons between the MF-based analysis andthe traditional analysis Experimental results demonstratethe substantial enhancement of the document clusteringquality In this experiment the application of the semanticunderstanding of the textual documents is presented Textualdocuments are parsed to produce the rich syntactic andsemantic representations of SHGRS Based on these semanticrepresentations document linkages are estimated using thesubgraph matching process that is based on finding alldistinct similarity common subgraphs The semantic repre-sentations of textual documents determining the relatednessbetween themare evaluated by allowing clustering algorithmsto use the produced relatedness matrix of the documentset To test the effectiveness of subgraph matching in deter-mining an accurate measure of the relatedness betweendocuments an experiment using the MFs-based analysisand subgraphs relatedness measures was conducted Twostandard document clustering techniques were chosen fortesting the effect of the document linkages on clustering [43](1) HAC and (2) 119870-means The MF-based weighting is themain factor that captures the importance of a concept in

a document Furthermore document linkages using the sub-graph relatedness measure are among the main factors thataffect clustering techniques Thus to study the effect of theMFs-based weighting and the subgraph relatedness measure(graph-based approach) on the clustering techniques thisexperiment is repeated for the different clustering techniquesas shown in Table 8 Basically the aim is to maximize theweighted average of the 119865-measure for each class (FC) ofclusters to achieve high-quality clustering

Table 8 shows that the graph-based approach achieveshigher performance than term-based and concept-basedapproachesThis comparison was conducted for two corporaof text documents that are the Reuters-21578 dataset and the20 newsgroups dataset (DS9 and DS10 in Table 3)The resultsshown in Table 8 illustrate improvement in the clusteringquality when more attributes are included in graph-basedrelatedness measure The improvements were achieved at thegraph-based approach up to 17 by HAC and 40 by 119870-means for Reuters-21578 dataset while up to 36 byHAC and39 by 119870-means for approximately 20 newsgroup datasetsfrom the base case (term-based) approach Furthermorethe improvements also were achieved at the graph-basedapproach up to 5 by HAC and 7 by119870-Means for Reuters-21578 dataset while up to 7 byHAC and 8 by119870-Means forapproximately 20 Newsgroup datasets from Concept-basedapproach

18 The Scientific World Journal

5 Conclusions

This paper proposed the SHGRS to improve the effectivenessof themining andmanaging operations in textual documentsSpecifically a three-stage approach was proposed for con-structing the scheme First the MFs with their attributesare extracted and hierarchy-based structure is built to rep-resent sentences by the MFs and their attributes Second anovel semantic relatedness computing method was proposedfor inferring relationships and dependencies of MFs andthe relationship between MFs is represented by a graph-based structure Third a subgraph matching algorithm wasconducted for document linkage process The documentsclustering is used as a case study and conducted extensiveexperiments on 10 different datasets and the results showedthat the scheme can result in a better mining performancecompared with traditional approach Future work will focuson conducting other case studies to processes such asgathering filtering retrieving classifying and summarizinginformation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu/.

[13] OpenNLP, http://opennlp.apache.org/.

[14] AlchemyAPI, http://www.alchemyapi.com/.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference on the Semantic Web (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.

[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web - ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the Knowledge Discovery and Data Mining (KDD) Workshop on Text Mining, August 2000.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 15: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

The Scientific World Journal 15

Table 4The results of the comparison of the correlation coefficientbetween human judgment with some relatedness measures and theproposed semantic relatedness measure

MeasureRelevance

correlation withMampC

Distance-based measuresRada 0688Wu and Palmer 0765

Information-based measuresResnik 077Jiang and Conrath 0848Lin 0853

Hybrid measuresT Hong and D smith 0879Zili Zhou 0882

Informationfeature-base measuresThe proposed MFsSR 0937

in SHGRSThe relatedness subgraphsmeasure affects the per-formance of the output of the document managing and min-ing process relative to howmuch of the semantic informationit considers The most accurate relatedness can be achievedwhen the maximum amount of the semantic information isprovided Consequently the quality of clustering is improvedand when semantic information is considered the approachsurpasses the word-based and concept-based techniques

431 Experiment 1 Comparative Study (Content-Based Eval-uation) This experiment shows the necessity to evaluate theperformance of the proposed MFsSR based on a benchmarkdataset of human judgments so the results of the proposedsemantic relatedness would be comparable with other pre-vious studies in the same field In this experiment theyattempted to compute the correlation between the ratings ofthe proposed semantic relatedness approach and the meanratings reported by Miller and Charles (DS1 in Table 3)Furthermore the results produced are compared againsteight other semantic similarities approaches namely RadaWu and Palmer Resnik Jiang and Conrath Lin T Hong andD Smith and Zili Zhou [21]

Table 4 shows the correlation coefficient results betweennine componential approaches and the Miller and Charlesratings mean The semantic relatedness for the proposedapproach outperformed all the listed approaches Unlike allthe listed methods in the proposed semantic relatednessdifferent properties are considered Furthermore the goodcorrelation value of the approach also results from consider-ing all available relationships between concepts and the indi-rect relationships between attributes of each concept whenmeasuring the semantic relatedness In this experiment theproposed approach achieved a good correlation value withthe human-subject rating reported by Miller and CharlesBased on the results of this study theMFsSR correlation valuehas proven that considering the directindirect relevance

is specifically important for at least 6 improvement overthe best previous results This improvement contributes tothe achievement of the largest amount of relationships andinterdependence among MFs and their attributes at thedocument level and to the document linkage process at acorpus level

Table 5 summarizes the characteristics of the MRPCdataset (DS2 in Table 3) and presents comparison of theAcc and AR values between the proposed semantic related-ness measure MFsSR and Islam and Inkpen [7] Differentrelatedness thresholds ranging from 0 to 1 with interval01 are used to validate the MFsSR with Islam and Inkpen[7] After evaluation the best relatedness thresholds of Accand AR are 06 07 and 08 These results indicate thatthe proposed MFsSR surpasses Islam and Inkpen [7] interms of Acc and AR The results of each approach listedbelow were based on the best Acc and AR through allthresholds instead of under the same relatedness thresholdThis improvement in Acc andAR values is due to the increasein the numbers of pairs predicted correctly after consideringdirectindirect relevance This relevance takes into accountthe closest synonym and the relationships of sentences MFs

432 Experiment 2 Comparative Study of Closest-SynonymDetection and Semantic Relationships Exploration Thisexperiment shows the necessity to study the impact of theMFsSR for inferring unknown dependencies and relation-ships among the MFs This impact is achieved through theclosest-synonym detection and semantic relationship explo-ration algorithms so the results would be comparable withother previous studies in the same field Throughout thisexperiment the impact of the proposed MFsSR for detectingthe closest synonym is studied and compared to the PMIapproach and ICA approach for detecting the best near-synonym The difference among the proposed discoveringsemantic relationships or associations IOAC-QL the implicitrelation extraction CRFs approach and TFICF approachis examined This experiment attempts to measure theperformance of theMFsSR for the closest synonym detectionand the IOAC-QL for semantic relationship explorationalgorithms Hence more semantic understanding of the textcontent is achieved by inferring unknown dependenciesand relationships among the MFs Considering the directrelevance among the MFs and the indirect relevance amongthe attributes of the MFs in the proposed MFsSR constitutesa certain advantage over previous measures However mostof the time the incorrect items are due to a wrong syntacticparsing from the OpenNLP parser and AlchemyAPIAccording to the preliminary study it is certain that theaccuracy of parsing toolsrsquo effects is on the performance of theMFsSR

Detecting the closest synonym is a process of MF disam-biguation resulting from the implementation of the MFsSRtechnique and the threshold of the synonyms scores throughthe C-SynD algorithm In the C-SynD algorithm the MFsSRtakes into account the direct relevance among MFs andthe indirect relevance among MFs and their attributes in acontext that gives a significant increase in the recall without

16 The Scientific World Journal

Table 5 The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islamand Inkpen [7]

MRPC dataset Relatednessthreshold

Human judgment(TP + FN)

Islam and Inkpen [7] The proposed MFsSRAcc AR Acc AR

Trainingsubset (4076)

01

2753 true

067 1 068 102 067 1 068 103 067 1 068 104 067 1 068 105 069 098 068 106 072 089 068 107 068 078 070 09808 056 04 072 08609 037 009 060 0491 033 0 034 002

Test subset(1725)

01

1147 true

066 1 067 102 066 1 067 103 066 1 067 104 066 1 067 105 068 098 067 106 072 089 066 107 068 078 066 09808 056 04 064 08509 038 009 049 051 033 0 034 003

disturbing the precision Thus the MFsSR between two con-cepts considers the information content with most of Word-Net relationships Table 6 illustrates the performance of theC-SynD algorithm based on theMFsSR compared to the bestnear-synonym algorithmusing the PMI and ICA approachesThis comparison was conducted for two corpora of textdocuments that are BNC and NS (DS3 and DS4 in Table 3)As indicated in Table 6 the C-SynD algorithm yielded higheraverage precision recall and 119865-measure values than PMIapproach by 25 20 and 21 on SN dataset respectivelyand also by 8 4 and 7 on BNC dataset Furthermorethe C-SynD algorithm also yielded higher average precisionrecall and 119865-measure values than ICA approach by 6 9and 7 on SN dataset respectively and also by 6 4and 5 on BNC dataset The improvements achieved in theperformance values are due to the increase of the numberof pairs predicted correctly and the decrease of the numberof pairs predicted incorrectly through implementing MFsSRThe important observation from this table is the improve-ments achieved in recall which measures effectiveness of theC-SynD algorithm Achieving better precision values is clearwith a high percentage in different datasets and domains

Inferring the MFs relationships and dependencies is aprocess of achieving a large number of interdependencesamongMFs resulting from the implementation of the IOAC-QL algorithm In the IOAC-QL algorithm the 119876-learning isused to select the optimal action (relationships) that gainsthe largest amount of interdependences among the MFs

resulting from the measure of the AVW Table 7 illustratesthe performance of the IOAC-QL algorithm based on AVWscompared to the implicit relationship extraction based onCRFs and TFICF approaches which was conducted overthe TREC Bless the NYT and the Wikipedia corpus (DS5DS6 DS7 and DS8 in Table 3) The data in this table showsincreases in precision recall and 119865-measure values due tothe increase in the number of pairs predicted correctly afterconsidering directindirect relevance through the expansionof closest synonym In addition the IOAC-QL algorithmthe CRFs and the TFICF approaches are beneficial tothe extraction performance but the IOAC-QL contributesmore than CRFs and TFICF Thus IOAC-QL considerssimilarity contiguity contrast and causality relationshipsbetween MFs and its closest-synonyms while the CRFs andTFICF consider is-a and part-of relationships only betweenconceptsThe improvements of the 119865-measure were achievedthrough an IOAC-QL algorithm up to 23 for TREC dataset22 for BLESS dataset 21 for NYT dataset and 6 forWikipedia dataset approximately from the CRFs approachFurthermore the IOAC-QL algorithm also yielded higher119865-measure values than TFICF approach by 10 for TRECdataset 17 for BLESS dataset and 7 for NYT datasetrespectively

433 Experiment 3 Comparative Study (Task-Based Eval-uation) This experiment shows the effectiveness of the

The Scientific World Journal 17

Table 6 The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym

Datasets PMI ICA C-SynDPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

SN 606 606 061 795 716 753 85 80 82BNC 745 679 71 76 678 72 82 719 77

Table 7 The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches

Datasets CRFs TFICF IOAC-QLPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

TREC 7508 602 6628 893 714 787 898 881 886BLESS 7304 6266 6703 738 695 716 950 835 889NYT 6846 5402 6038 860 650 740 900 740 812Wikipedia 560 420 480 646 5488 5934 700 440 540

Table 8The results of graph-based approach compared to term-based and concept-based approach using119870-means and the HAC clusteringalgorithms

Datasets Algorithm Term-based Concept-based Graph-basedFC FC FC

Reuters-21578 HAC 73 857 902119870-means 51 846 918

20 Newsgroups HAC 53 82 89119870-means 48 793 87

proposed semantic representation scheme in document min-ing and managing operations such as gathering filteringsearching retrieving extracting classifying clustering andsummarizing documents Many sets of experiments usingthe proposed scheme on different datasets in text clusteringhave been conducted The experiments demonstrate theextensive comparisons between the MF-based analysis andthe traditional analysis Experimental results demonstratethe substantial enhancement of the document clusteringquality In this experiment the application of the semanticunderstanding of the textual documents is presented Textualdocuments are parsed to produce the rich syntactic andsemantic representations of SHGRS Based on these semanticrepresentations document linkages are estimated using thesubgraph matching process that is based on finding alldistinct similarity common subgraphs The semantic repre-sentations of textual documents determining the relatednessbetween themare evaluated by allowing clustering algorithmsto use the produced relatedness matrix of the documentset To test the effectiveness of subgraph matching in deter-mining an accurate measure of the relatedness betweendocuments an experiment using the MFs-based analysisand subgraphs relatedness measures was conducted Twostandard document clustering techniques were chosen fortesting the effect of the document linkages on clustering [43](1) HAC and (2) 119870-means The MF-based weighting is themain factor that captures the importance of a concept in

a document Furthermore document linkages using the sub-graph relatedness measure are among the main factors thataffect clustering techniques Thus to study the effect of theMFs-based weighting and the subgraph relatedness measure(graph-based approach) on the clustering techniques thisexperiment is repeated for the different clustering techniquesas shown in Table 8 Basically the aim is to maximize theweighted average of the 119865-measure for each class (FC) ofclusters to achieve high-quality clustering

Table 8 shows that the graph-based approach achieveshigher performance than term-based and concept-basedapproachesThis comparison was conducted for two corporaof text documents that are the Reuters-21578 dataset and the20 newsgroups dataset (DS9 and DS10 in Table 3)The resultsshown in Table 8 illustrate improvement in the clusteringquality when more attributes are included in graph-basedrelatedness measure The improvements were achieved at thegraph-based approach up to 17 by HAC and 40 by 119870-means for Reuters-21578 dataset while up to 36 byHAC and39 by 119870-means for approximately 20 newsgroup datasetsfrom the base case (term-based) approach Furthermorethe improvements also were achieved at the graph-basedapproach up to 5 by HAC and 7 by119870-Means for Reuters-21578 dataset while up to 7 byHAC and 8 by119870-Means forapproximately 20 Newsgroup datasets from Concept-basedapproach

18 The Scientific World Journal

5 Conclusions

This paper proposed the SHGRS to improve the effectivenessof themining andmanaging operations in textual documentsSpecifically a three-stage approach was proposed for con-structing the scheme First the MFs with their attributesare extracted and hierarchy-based structure is built to rep-resent sentences by the MFs and their attributes Second anovel semantic relatedness computing method was proposedfor inferring relationships and dependencies of MFs andthe relationship between MFs is represented by a graph-based structure Third a subgraph matching algorithm wasconducted for document linkage process The documentsclustering is used as a case study and conducted extensiveexperiments on 10 different datasets and the results showedthat the scheme can result in a better mining performancecompared with traditional approach Future work will focuson conducting other case studies to processes such asgathering filtering retrieving classifying and summarizinginformation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] I T Fatudimu A G Musa C K Ayo and A B SofoluweldquoKnowledge discovery in online repositories a text miningapproachrdquo European Journal of Scientific Research vol 22 no2 pp 241ndash250 2008

[2] A Hotho A Nurnberger and G Paaszlig ldquoA brief survey of textminingrdquo Journal for Computational Linguistics and LanguageTechnology vol 20 2005

[3] M Bernotas K Karklius R Laurutis and A Slotkiene ldquoThepeculiarities of the text document representation using ontol-ogy and tagging-based clustering techniquerdquo Journal of Infor-mation Technology and Control vol 36 pp 217ndash220 2007

[4] Y Li Z A Bandar andDMcLean ldquoAn approach formeasuringsemantic similarity between words using multiple informationsourcesrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 15 no 4 pp 871ndash882 2003

[5] K Shaban O Basir and M Kamel ldquoDocument mining basedon semantic understanding of textrdquo in Progress in Pattern Rec-ognition Image Analysis and Applications vol 4225 of LectureNotes in Computer Science pp 834ndash843 Springer BerlinGermany 2006

[6] A Culotta A McCallum and J Betz ldquoIntegrating probabilisticextraction models and data mining to discover relations andpatterns in textrdquo in Proceedings of theHuman Language Technol-ogy ConferencemdashNorth American Chapter of the Association forComputational Linguistics Annual Meeting (HLT-NAACL rsquo06)pp 296ndash303 June 2006

[7] A Islam and D Inkpen ldquoSemantic text similarity using corpus-based word similarity and string similarityrdquo ACM Transactionson Knowledge Discovery fromData vol 2 no 2 article 10 2008

[8] R Feldman and J SangerTheTextMining Handbook AdvancedApproaches in Analyzing Unstructured Data Cambridge Uni-versity Press 2006

[9] R Navigli ldquoWord sense disambiguation a surveyrdquo ACM Com-puting Surveys vol 41 no 2 article 10 2009

[10] W Mao andWW Chu ldquoThe phrase-based vector space modelfor automatic retrieval of free-text medical documentsrdquo Dataand Knowledge Engineering vol 61 no 1 pp 76ndash92 2007

[11] C Siefkes and P Siniakov ldquoAn overview and classification ofadaptive approaches to information extractionrdquo in Journal onData Semantics IV vol 3730 of Lecture Notes in ComputerScience pp 172ndash212 Springer New York NY USA 2005

[12] httpwordnetprincetonedu[13] OpenNLP httpopennlpapacheorg[14] httpwwwalchemyapicom[15] W V Siricharoen ldquoOntologies and object models in object ori-

ented software engineeringrdquo International Association of Engi-neers (IAENG) Journal of Computer Science (JICS) vol 33 no 1pp 1151ndash1162 2006

[16] S Shehata ldquoA WordNet-based semantic model for enhancingtext clusteringrdquo in Proceedings of the IEEE International Con-ference on Data Mining Workshops (ICDMW rsquo09) pp 477ndash482Miami Fla USA December 2009

[17] S Shehata F Karray andM Kamel ldquoAn efficient concept-basedminingmodel for enhancing text clusteringrdquo IEEE Transactionson Knowledge and Data Engineering vol 22 no 10 pp 1360ndash1371 2010

[18] E Zemmouri H Behja A Marzak and B Trousse ldquoOntology-based knowledge model for multi-view KDD processrdquo Interna-tional Journal of Mobile Computing and Multimedia Communi-cations vol 4 no 3 pp 21ndash33 2012

[19] C Marinica and F Guillet ldquoKnowledge-based interactive post-mining of association rules using ontologiesrdquo IEEETransactionson Knowledge and Data Engineering vol 22 no 6 pp 784ndash7972010

[20] B Choudhary and P Bhattacharyya ldquoText clustering usingUni-versal Networking Language representationrdquo in Proceedings ofthe 11th International World Wide Web Conference 2003

[21] G Pirro and J Euzenat ldquoA feature and information theoreticframework for semantic similarity and relatednessrdquo in Proceed-ings of the 9th International Semantic Web Conference on theSemantic Web (ISWC rsquo10) vol 6496 pp 615ndash630 2010

[22] Z Lin M R Lyu and I King ldquoMatchSim a novel similaritymeasure based on maximum neighborhood matchingrdquo Knowl-edge and Information Systems vol 32 no 1 pp 141ndash166 2012

[23] D Lin ldquoAn information-theoretic definition of similarityrdquo inProceedings of the 15th International Conference on MachineLearning Madison Wis USA July 1998

[24] D Inkpen ldquoA statistical model for near-synonym choicerdquo ACMTransactions on Speech and Language Processing vol 4 no 1article 2 2007

[25] LHan T Finin PMcNameeA Joshi andYYesha ldquoImprovingword similarity by augmenting PMIwith estimates of word pol-ysemyrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1307ndash1322 2013

[26] J OrsquoShea Z Bandar K Crockett and D McLean ldquoA compar-ative study of two short text semantic similarity measuresrdquo inAgent and Multi-Agent Systems Technologies and Applicationsvol 4953 of Lecture Notes in Artificial Intelligence pp 172ndash181Springer Berlin Germany 2008

[27] J Oliva J I SerranoMDDel Castillo and A Iglesias ldquoSyMSSa syntax-basedmeasure for short-text semantic similarityrdquoDataand Knowledge Engineering vol 70 no 4 pp 390ndash405 2011

The Scientific World Journal 19

[28] I Yoo and X Hu ldquoClustering large collection of biomedicalliterature based on ontology-enriched bipartite graph represen-tation andmutual refinement strategyrdquo inProceedings of the 10thPAKDD pp 9ndash12 2006

[29] S-T Yuan and Y-C Chen ldquoSemantic ideation learning foragent-based e-brainstormingrdquo IEEE Transactions on Knowledgeand Data Engineering vol 20 no 2 pp 261ndash275 2008

[30] D Liu Z Liu and Q Dong ldquoA dependency grammar andWordNet based sentence similarity measurerdquo Journal of Com-putational Information Systems vol 8 no 3 pp 1027ndash1035 2012

[31] A Zouaq M Gagnon and O Benoit ldquoSemantic analysis usingdependency-based grammars and upper-level ontologiesrdquoInternational Journal of Computational Linguistics and Applica-tions vol 1 no 1-2 pp 85ndash101 2010

[32] C Ramakrishnan K J Kochut and A P Sheth ldquoA frameworkfor schema-driven relationship discovery from unstructuredtextrdquo in The Semantic WebmdashISWC 2006 vol 4273 of LectureNotes in Computer Science pp 583ndash596 Springer BerlinGermany 2006

[33] C J C H Watkins and P Dayan ldquoTechnical note Q-learningrdquoMachine Learning vol 8 no 3-4 pp 279ndash292 1992

[34] M Lan C L Tan J Su and Y Lu ldquoSupervised and traditionalterm weighting methods for automatic text categorizationrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 31 no 4 pp 721ndash735 2009

[35] L Zhang C Li J Liu and HWang ldquoGraph-based text similar-itymeasurement by exploiting wikipedia as background knowl-edgerdquo World Academy of Science Engineering and Technologyvol 59 pp 1548ndash1553 2011

[36] M Kuramochi and G Karypis ldquoAn efficient algorithm for dis-covering frequent subgraphsrdquo IEEE Transactions on Knowledgeand Data Engineering vol 16 no 9 pp 1038ndash1051 2004

[37] L-C Yu H-M Shih Y-L Lai J-F Yeh and C-H Wu ldquoDis-criminative training for near-synonym substitutionrdquo in Pro-ceedings of the 23rd International Conference on ComputationalLinguistics (COLING rsquo10) pp 1254ndash1262 2010

[38] L-C Yu and W-N Chien ldquoIndependent component analysisfor near-synonym choicerdquoDecision Support Systems vol 55 no1 pp 146ndash155 2013

[39] L Yao S Riedel and A Mccallum ldquoCollective cross-documentrelation extraction without labelled datardquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing pp 1013ndash1023MITCambridgeMassUSAOctober2010

[40] M Shen D-R Liu and Y-S Huang ldquoExtracting semantic rela-tions to enrich domain ontologiesrdquo Journal of Intelligent Infor-mation Systems vol 39 no 3 pp 749ndash761 2012

[41] H Chim and X Deng ldquoEfficient phrase-based documentsimilarity for clusteringrdquo IEEE Transactions on Knowledge andData Engineering vol 20 no 9 pp 1217ndash1229 2008

[42] GAMiller andWGCharles ldquoContextual correlates of seman-tic similarityrdquo Language and Cognitive Processes vol 6 no 1 pp1ndash28 1991

[43] M Steinbach G Karypis and V Kumar ldquoA comparison ofdocument clustering techniquesrdquo in Proceedings of the Knowl-edge Discovery and Data Mining (KDD) Workshop Text MiningAugust 2000

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 16: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

16 The Scientific World Journal

Table 5 The results of the comparison of the accuracy and acceptance rate between the proposed semantic relatedness measure and Islamand Inkpen [7]

MRPC dataset Relatednessthreshold

Human judgment(TP + FN)

Islam and Inkpen [7] The proposed MFsSRAcc AR Acc AR

Trainingsubset (4076)

01

2753 true

067 1 068 102 067 1 068 103 067 1 068 104 067 1 068 105 069 098 068 106 072 089 068 107 068 078 070 09808 056 04 072 08609 037 009 060 0491 033 0 034 002

Test subset(1725)

01

1147 true

066 1 067 102 066 1 067 103 066 1 067 104 066 1 067 105 068 098 067 106 072 089 066 107 068 078 066 09808 056 04 064 08509 038 009 049 051 033 0 034 003

disturbing the precision Thus the MFsSR between two con-cepts considers the information content with most of Word-Net relationships Table 6 illustrates the performance of theC-SynD algorithm based on theMFsSR compared to the bestnear-synonym algorithmusing the PMI and ICA approachesThis comparison was conducted for two corpora of textdocuments that are BNC and NS (DS3 and DS4 in Table 3)As indicated in Table 6 the C-SynD algorithm yielded higheraverage precision recall and 119865-measure values than PMIapproach by 25 20 and 21 on SN dataset respectivelyand also by 8 4 and 7 on BNC dataset Furthermorethe C-SynD algorithm also yielded higher average precisionrecall and 119865-measure values than ICA approach by 6 9and 7 on SN dataset respectively and also by 6 4and 5 on BNC dataset The improvements achieved in theperformance values are due to the increase of the numberof pairs predicted correctly and the decrease of the numberof pairs predicted incorrectly through implementing MFsSRThe important observation from this table is the improve-ments achieved in recall which measures effectiveness of theC-SynD algorithm Achieving better precision values is clearwith a high percentage in different datasets and domains

Inferring the MFs relationships and dependencies is aprocess of achieving a large number of interdependencesamongMFs resulting from the implementation of the IOAC-QL algorithm In the IOAC-QL algorithm the 119876-learning isused to select the optimal action (relationships) that gainsthe largest amount of interdependences among the MFs

resulting from the measure of the AVW Table 7 illustratesthe performance of the IOAC-QL algorithm based on AVWscompared to the implicit relationship extraction based onCRFs and TFICF approaches which was conducted overthe TREC Bless the NYT and the Wikipedia corpus (DS5DS6 DS7 and DS8 in Table 3) The data in this table showsincreases in precision recall and 119865-measure values due tothe increase in the number of pairs predicted correctly afterconsidering directindirect relevance through the expansionof closest synonym In addition the IOAC-QL algorithmthe CRFs and the TFICF approaches are beneficial tothe extraction performance but the IOAC-QL contributesmore than CRFs and TFICF Thus IOAC-QL considerssimilarity contiguity contrast and causality relationshipsbetween MFs and its closest-synonyms while the CRFs andTFICF consider is-a and part-of relationships only betweenconceptsThe improvements of the 119865-measure were achievedthrough an IOAC-QL algorithm up to 23 for TREC dataset22 for BLESS dataset 21 for NYT dataset and 6 forWikipedia dataset approximately from the CRFs approachFurthermore the IOAC-QL algorithm also yielded higher119865-measure values than TFICF approach by 10 for TRECdataset 17 for BLESS dataset and 7 for NYT datasetrespectively

433 Experiment 3 Comparative Study (Task-Based Eval-uation) This experiment shows the effectiveness of the

The Scientific World Journal 17

Table 6 The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym

Datasets PMI ICA C-SynDPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

SN 606 606 061 795 716 753 85 80 82BNC 745 679 71 76 678 72 82 719 77

Table 7 The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches

Datasets CRFs TFICF IOAC-QLPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

TREC 7508 602 6628 893 714 787 898 881 886BLESS 7304 6266 6703 738 695 716 950 835 889NYT 6846 5402 6038 860 650 740 900 740 812Wikipedia 560 420 480 646 5488 5934 700 440 540

Table 8The results of graph-based approach compared to term-based and concept-based approach using119870-means and the HAC clusteringalgorithms

Datasets Algorithm Term-based Concept-based Graph-basedFC FC FC

Reuters-21578 HAC 73 857 902119870-means 51 846 918

20 Newsgroups HAC 53 82 89119870-means 48 793 87

proposed semantic representation scheme in document min-ing and managing operations such as gathering filteringsearching retrieving extracting classifying clustering andsummarizing documents Many sets of experiments usingthe proposed scheme on different datasets in text clusteringhave been conducted The experiments demonstrate theextensive comparisons between the MF-based analysis andthe traditional analysis Experimental results demonstratethe substantial enhancement of the document clusteringquality In this experiment the application of the semanticunderstanding of the textual documents is presented Textualdocuments are parsed to produce the rich syntactic andsemantic representations of SHGRS Based on these semanticrepresentations document linkages are estimated using thesubgraph matching process that is based on finding alldistinct similarity common subgraphs The semantic repre-sentations of textual documents determining the relatednessbetween themare evaluated by allowing clustering algorithmsto use the produced relatedness matrix of the documentset To test the effectiveness of subgraph matching in deter-mining an accurate measure of the relatedness betweendocuments an experiment using the MFs-based analysisand subgraphs relatedness measures was conducted Twostandard document clustering techniques were chosen fortesting the effect of the document linkages on clustering [43](1) HAC and (2) 119870-means The MF-based weighting is themain factor that captures the importance of a concept in

a document Furthermore document linkages using the sub-graph relatedness measure are among the main factors thataffect clustering techniques Thus to study the effect of theMFs-based weighting and the subgraph relatedness measure(graph-based approach) on the clustering techniques thisexperiment is repeated for the different clustering techniquesas shown in Table 8 Basically the aim is to maximize theweighted average of the 119865-measure for each class (FC) ofclusters to achieve high-quality clustering

Table 8 shows that the graph-based approach achieveshigher performance than term-based and concept-basedapproachesThis comparison was conducted for two corporaof text documents that are the Reuters-21578 dataset and the20 newsgroups dataset (DS9 and DS10 in Table 3)The resultsshown in Table 8 illustrate improvement in the clusteringquality when more attributes are included in graph-basedrelatedness measure The improvements were achieved at thegraph-based approach up to 17 by HAC and 40 by 119870-means for Reuters-21578 dataset while up to 36 byHAC and39 by 119870-means for approximately 20 newsgroup datasetsfrom the base case (term-based) approach Furthermorethe improvements also were achieved at the graph-basedapproach up to 5 by HAC and 7 by119870-Means for Reuters-21578 dataset while up to 7 byHAC and 8 by119870-Means forapproximately 20 Newsgroup datasets from Concept-basedapproach

18 The Scientific World Journal

5 Conclusions

This paper proposed the SHGRS to improve the effectivenessof themining andmanaging operations in textual documentsSpecifically a three-stage approach was proposed for con-structing the scheme First the MFs with their attributesare extracted and hierarchy-based structure is built to rep-resent sentences by the MFs and their attributes Second anovel semantic relatedness computing method was proposedfor inferring relationships and dependencies of MFs andthe relationship between MFs is represented by a graph-based structure Third a subgraph matching algorithm wasconducted for document linkage process The documentsclustering is used as a case study and conducted extensiveexperiments on 10 different datasets and the results showedthat the scheme can result in a better mining performancecompared with traditional approach Future work will focuson conducting other case studies to processes such asgathering filtering retrieving classifying and summarizinginformation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] I T Fatudimu A G Musa C K Ayo and A B SofoluweldquoKnowledge discovery in online repositories a text miningapproachrdquo European Journal of Scientific Research vol 22 no2 pp 241ndash250 2008

[2] A Hotho A Nurnberger and G Paaszlig ldquoA brief survey of textminingrdquo Journal for Computational Linguistics and LanguageTechnology vol 20 2005

[3] M Bernotas K Karklius R Laurutis and A Slotkiene ldquoThepeculiarities of the text document representation using ontol-ogy and tagging-based clustering techniquerdquo Journal of Infor-mation Technology and Control vol 36 pp 217ndash220 2007

[4] Y Li Z A Bandar andDMcLean ldquoAn approach formeasuringsemantic similarity between words using multiple informationsourcesrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 15 no 4 pp 871ndash882 2003

[5] K Shaban O Basir and M Kamel ldquoDocument mining basedon semantic understanding of textrdquo in Progress in Pattern Rec-ognition Image Analysis and Applications vol 4225 of LectureNotes in Computer Science pp 834ndash843 Springer BerlinGermany 2006

[6] A Culotta A McCallum and J Betz ldquoIntegrating probabilisticextraction models and data mining to discover relations andpatterns in textrdquo in Proceedings of theHuman Language Technol-ogy ConferencemdashNorth American Chapter of the Association forComputational Linguistics Annual Meeting (HLT-NAACL rsquo06)pp 296ndash303 June 2006

[7] A Islam and D Inkpen ldquoSemantic text similarity using corpus-based word similarity and string similarityrdquo ACM Transactionson Knowledge Discovery fromData vol 2 no 2 article 10 2008

[8] R Feldman and J SangerTheTextMining Handbook AdvancedApproaches in Analyzing Unstructured Data Cambridge Uni-versity Press 2006

[9] R Navigli ldquoWord sense disambiguation a surveyrdquo ACM Com-puting Surveys vol 41 no 2 article 10 2009

[10] W Mao andWW Chu ldquoThe phrase-based vector space modelfor automatic retrieval of free-text medical documentsrdquo Dataand Knowledge Engineering vol 61 no 1 pp 76ndash92 2007

[11] C Siefkes and P Siniakov ldquoAn overview and classification ofadaptive approaches to information extractionrdquo in Journal onData Semantics IV vol 3730 of Lecture Notes in ComputerScience pp 172ndash212 Springer New York NY USA 2005

[12] httpwordnetprincetonedu[13] OpenNLP httpopennlpapacheorg[14] httpwwwalchemyapicom[15] W V Siricharoen ldquoOntologies and object models in object ori-

ented software engineeringrdquo International Association of Engi-neers (IAENG) Journal of Computer Science (JICS) vol 33 no 1pp 1151ndash1162 2006

[16] S Shehata ldquoA WordNet-based semantic model for enhancingtext clusteringrdquo in Proceedings of the IEEE International Con-ference on Data Mining Workshops (ICDMW rsquo09) pp 477ndash482Miami Fla USA December 2009

[17] S Shehata F Karray andM Kamel ldquoAn efficient concept-basedminingmodel for enhancing text clusteringrdquo IEEE Transactionson Knowledge and Data Engineering vol 22 no 10 pp 1360ndash1371 2010

[18] E Zemmouri H Behja A Marzak and B Trousse ldquoOntology-based knowledge model for multi-view KDD processrdquo Interna-tional Journal of Mobile Computing and Multimedia Communi-cations vol 4 no 3 pp 21ndash33 2012

[19] C Marinica and F Guillet ldquoKnowledge-based interactive post-mining of association rules using ontologiesrdquo IEEETransactionson Knowledge and Data Engineering vol 22 no 6 pp 784ndash7972010

[20] B Choudhary and P Bhattacharyya ldquoText clustering usingUni-versal Networking Language representationrdquo in Proceedings ofthe 11th International World Wide Web Conference 2003

[21] G Pirro and J Euzenat ldquoA feature and information theoreticframework for semantic similarity and relatednessrdquo in Proceed-ings of the 9th International Semantic Web Conference on theSemantic Web (ISWC rsquo10) vol 6496 pp 615ndash630 2010

[22] Z Lin M R Lyu and I King ldquoMatchSim a novel similaritymeasure based on maximum neighborhood matchingrdquo Knowl-edge and Information Systems vol 32 no 1 pp 141ndash166 2012

[23] D Lin ldquoAn information-theoretic definition of similarityrdquo inProceedings of the 15th International Conference on MachineLearning Madison Wis USA July 1998

[24] D Inkpen ldquoA statistical model for near-synonym choicerdquo ACMTransactions on Speech and Language Processing vol 4 no 1article 2 2007

[25] LHan T Finin PMcNameeA Joshi andYYesha ldquoImprovingword similarity by augmenting PMIwith estimates of word pol-ysemyrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1307ndash1322 2013

[26] J OrsquoShea Z Bandar K Crockett and D McLean ldquoA compar-ative study of two short text semantic similarity measuresrdquo inAgent and Multi-Agent Systems Technologies and Applicationsvol 4953 of Lecture Notes in Artificial Intelligence pp 172ndash181Springer Berlin Germany 2008

[27] J Oliva J I SerranoMDDel Castillo and A Iglesias ldquoSyMSSa syntax-basedmeasure for short-text semantic similarityrdquoDataand Knowledge Engineering vol 70 no 4 pp 390ndash405 2011

The Scientific World Journal 19

[28] I Yoo and X Hu ldquoClustering large collection of biomedicalliterature based on ontology-enriched bipartite graph represen-tation andmutual refinement strategyrdquo inProceedings of the 10thPAKDD pp 9ndash12 2006

[29] S-T Yuan and Y-C Chen ldquoSemantic ideation learning foragent-based e-brainstormingrdquo IEEE Transactions on Knowledgeand Data Engineering vol 20 no 2 pp 261ndash275 2008

[30] D Liu Z Liu and Q Dong ldquoA dependency grammar andWordNet based sentence similarity measurerdquo Journal of Com-putational Information Systems vol 8 no 3 pp 1027ndash1035 2012

[31] A Zouaq M Gagnon and O Benoit ldquoSemantic analysis usingdependency-based grammars and upper-level ontologiesrdquoInternational Journal of Computational Linguistics and Applica-tions vol 1 no 1-2 pp 85ndash101 2010

[32] C Ramakrishnan K J Kochut and A P Sheth ldquoA frameworkfor schema-driven relationship discovery from unstructuredtextrdquo in The Semantic WebmdashISWC 2006 vol 4273 of LectureNotes in Computer Science pp 583ndash596 Springer BerlinGermany 2006

[33] C J C H Watkins and P Dayan ldquoTechnical note Q-learningrdquoMachine Learning vol 8 no 3-4 pp 279ndash292 1992

[34] M Lan C L Tan J Su and Y Lu ldquoSupervised and traditionalterm weighting methods for automatic text categorizationrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 31 no 4 pp 721ndash735 2009

[35] L Zhang C Li J Liu and HWang ldquoGraph-based text similar-itymeasurement by exploiting wikipedia as background knowl-edgerdquo World Academy of Science Engineering and Technologyvol 59 pp 1548ndash1553 2011

[36] M Kuramochi and G Karypis ldquoAn efficient algorithm for dis-covering frequent subgraphsrdquo IEEE Transactions on Knowledgeand Data Engineering vol 16 no 9 pp 1038ndash1051 2004

[37] L-C Yu H-M Shih Y-L Lai J-F Yeh and C-H Wu ldquoDis-criminative training for near-synonym substitutionrdquo in Pro-ceedings of the 23rd International Conference on ComputationalLinguistics (COLING rsquo10) pp 1254ndash1262 2010

[38] L-C Yu and W-N Chien ldquoIndependent component analysisfor near-synonym choicerdquoDecision Support Systems vol 55 no1 pp 146ndash155 2013

[39] L Yao S Riedel and A Mccallum ldquoCollective cross-documentrelation extraction without labelled datardquo in Proceedings ofthe Conference on Empirical Methods in Natural LanguageProcessing pp 1013ndash1023MITCambridgeMassUSAOctober2010

[40] M Shen D-R Liu and Y-S Huang ldquoExtracting semantic rela-tions to enrich domain ontologiesrdquo Journal of Intelligent Infor-mation Systems vol 39 no 3 pp 749ndash761 2012

[41] H Chim and X Deng ldquoEfficient phrase-based documentsimilarity for clusteringrdquo IEEE Transactions on Knowledge andData Engineering vol 20 no 9 pp 1217ndash1229 2008

[42] GAMiller andWGCharles ldquoContextual correlates of seman-tic similarityrdquo Language and Cognitive Processes vol 6 no 1 pp1ndash28 1991

[43] M Steinbach G Karypis and V Kumar ldquoA comparison ofdocument clustering techniquesrdquo in Proceedings of the Knowl-edge Discovery and Data Mining (KDD) Workshop Text MiningAugust 2000

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 17: Research Article Exploiting Semantic Annotations and ...downloads.hindawi.com/journals/tswj/2015/136172.pdf · is bene cial for text categorization [ ], text summarization [ ], word

The Scientific World Journal 17

Table 6 The results of the C-SynD algorithm based on the MFsSR compared to the PMI and ICA for detecting the closest synonym

Datasets PMI ICA C-SynDPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

SN 606 606 061 795 716 753 85 80 82BNC 745 679 71 76 678 72 82 719 77

Table 7 The results of the IOAC-QL algorithm based on AVWs compared to CRFs and TFICF approaches

Datasets CRFs TFICF IOAC-QLPrecision Recall 119865-measure Precision Recall 119865-measure Precision Recall 119865-measure

TREC 7508 602 6628 893 714 787 898 881 886BLESS 7304 6266 6703 738 695 716 950 835 889NYT 6846 5402 6038 860 650 740 900 740 812Wikipedia 560 420 480 646 5488 5934 700 440 540

Table 8The results of graph-based approach compared to term-based and concept-based approach using119870-means and the HAC clusteringalgorithms

Datasets Algorithm Term-based Concept-based Graph-basedFC FC FC

Reuters-21578 HAC 73 857 902119870-means 51 846 918

20 Newsgroups HAC 53 82 89119870-means 48 793 87

proposed semantic representation scheme in document min-ing and managing operations such as gathering filteringsearching retrieving extracting classifying clustering andsummarizing documents Many sets of experiments usingthe proposed scheme on different datasets in text clusteringhave been conducted The experiments demonstrate theextensive comparisons between the MF-based analysis andthe traditional analysis Experimental results demonstratethe substantial enhancement of the document clusteringquality In this experiment the application of the semanticunderstanding of the textual documents is presented Textualdocuments are parsed to produce the rich syntactic andsemantic representations of SHGRS Based on these semanticrepresentations document linkages are estimated using thesubgraph matching process that is based on finding alldistinct similarity common subgraphs The semantic repre-sentations of textual documents determining the relatednessbetween themare evaluated by allowing clustering algorithmsto use the produced relatedness matrix of the documentset To test the effectiveness of subgraph matching in deter-mining an accurate measure of the relatedness betweendocuments an experiment using the MFs-based analysisand subgraphs relatedness measures was conducted Twostandard document clustering techniques were chosen fortesting the effect of the document linkages on clustering [43](1) HAC and (2) 119870-means The MF-based weighting is themain factor that captures the importance of a concept in

a document Furthermore document linkages using the sub-graph relatedness measure are among the main factors thataffect clustering techniques Thus to study the effect of theMFs-based weighting and the subgraph relatedness measure(graph-based approach) on the clustering techniques thisexperiment is repeated for the different clustering techniquesas shown in Table 8 Basically the aim is to maximize theweighted average of the 119865-measure for each class (FC) ofclusters to achieve high-quality clustering

Table 8 shows that the graph-based approach achieveshigher performance than term-based and concept-basedapproachesThis comparison was conducted for two corporaof text documents that are the Reuters-21578 dataset and the20 newsgroups dataset (DS9 and DS10 in Table 3)The resultsshown in Table 8 illustrate improvement in the clusteringquality when more attributes are included in graph-basedrelatedness measure The improvements were achieved at thegraph-based approach up to 17 by HAC and 40 by 119870-means for Reuters-21578 dataset while up to 36 byHAC and39 by 119870-means for approximately 20 newsgroup datasetsfrom the base case (term-based) approach Furthermorethe improvements also were achieved at the graph-basedapproach up to 5 by HAC and 7 by119870-Means for Reuters-21578 dataset while up to 7 byHAC and 8 by119870-Means forapproximately 20 Newsgroup datasets from Concept-basedapproach

18 The Scientific World Journal

5 Conclusions

This paper proposed the SHGRS to improve the effectivenessof themining andmanaging operations in textual documentsSpecifically a three-stage approach was proposed for con-structing the scheme First the MFs with their attributesare extracted and hierarchy-based structure is built to rep-resent sentences by the MFs and their attributes Second anovel semantic relatedness computing method was proposedfor inferring relationships and dependencies of MFs andthe relationship between MFs is represented by a graph-based structure Third a subgraph matching algorithm wasconducted for document linkage process The documentsclustering is used as a case study and conducted extensiveexperiments on 10 different datasets and the results showedthat the scheme can result in a better mining performancecompared with traditional approach Future work will focuson conducting other case studies to processes such asgathering filtering retrieving classifying and summarizinginformation

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241–250, 2008.

[2] A. Hotho, A. Nürnberger, and G. Paaß, "A brief survey of text mining," Journal for Computational Linguistics and Language Technology, vol. 20, 2005.

[3] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkiene, "The peculiarities of the text document representation using ontology and tagging-based clustering technique," Journal of Information Technology and Control, vol. 36, pp. 217–220, 2007.

[4] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.

[5] K. Shaban, O. Basir, and M. Kamel, "Document mining based on semantic understanding of text," in Progress in Pattern Recognition, Image Analysis and Applications, vol. 4225 of Lecture Notes in Computer Science, pp. 834–843, Springer, Berlin, Germany, 2006.

[6] A. Culotta, A. McCallum, and J. Betz, "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," in Proceedings of the Human Language Technology Conference—North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL '06), pp. 296–303, June 2006.

[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, article 10, 2008.

[8] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006.

[9] R. Navigli, "Word sense disambiguation: a survey," ACM Computing Surveys, vol. 41, no. 2, article 10, 2009.

[10] W. Mao and W. W. Chu, "The phrase-based vector space model for automatic retrieval of free-text medical documents," Data and Knowledge Engineering, vol. 61, no. 1, pp. 76–92, 2007.

[11] C. Siefkes and P. Siniakov, "An overview and classification of adaptive approaches to information extraction," in Journal on Data Semantics IV, vol. 3730 of Lecture Notes in Computer Science, pp. 172–212, Springer, New York, NY, USA, 2005.

[12] WordNet, http://wordnet.princeton.edu/.

[13] OpenNLP, http://opennlp.apache.org/.

[14] AlchemyAPI, http://www.alchemyapi.com/.

[15] W. V. Siricharoen, "Ontologies and object models in object oriented software engineering," International Association of Engineers (IAENG) Journal of Computer Science (JICS), vol. 33, no. 1, pp. 1151–1162, 2006.

[16] S. Shehata, "A WordNet-based semantic model for enhancing text clustering," in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW '09), pp. 477–482, Miami, Fla, USA, December 2009.

[17] S. Shehata, F. Karray, and M. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1360–1371, 2010.

[18] E. Zemmouri, H. Behja, A. Marzak, and B. Trousse, "Ontology-based knowledge model for multi-view KDD process," International Journal of Mobile Computing and Multimedia Communications, vol. 4, no. 3, pp. 21–33, 2012.

[19] C. Marinica and F. Guillet, "Knowledge-based interactive postmining of association rules using ontologies," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, pp. 784–797, 2010.

[20] B. Choudhary and P. Bhattacharyya, "Text clustering using Universal Networking Language representation," in Proceedings of the 11th International World Wide Web Conference, 2003.

[21] G. Pirro and J. Euzenat, "A feature and information theoretic framework for semantic similarity and relatedness," in Proceedings of the 9th International Semantic Web Conference (ISWC '10), vol. 6496, pp. 615–630, 2010.

[22] Z. Lin, M. R. Lyu, and I. King, "MatchSim: a novel similarity measure based on maximum neighborhood matching," Knowledge and Information Systems, vol. 32, no. 1, pp. 141–166, 2012.

[23] D. Lin, "An information-theoretic definition of similarity," in Proceedings of the 15th International Conference on Machine Learning, Madison, Wis, USA, July 1998.

[24] D. Inkpen, "A statistical model for near-synonym choice," ACM Transactions on Speech and Language Processing, vol. 4, no. 1, article 2, 2007.

[25] L. Han, T. Finin, P. McNamee, A. Joshi, and Y. Yesha, "Improving word similarity by augmenting PMI with estimates of word polysemy," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1307–1322, 2013.

[26] J. O'Shea, Z. Bandar, K. Crockett, and D. McLean, "A comparative study of two short text semantic similarity measures," in Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953 of Lecture Notes in Artificial Intelligence, pp. 172–181, Springer, Berlin, Germany, 2008.

[27] J. Oliva, J. I. Serrano, M. D. Del Castillo, and A. Iglesias, "SyMSS: a syntax-based measure for short-text semantic similarity," Data and Knowledge Engineering, vol. 70, no. 4, pp. 390–405, 2011.


[28] I. Yoo and X. Hu, "Clustering large collection of biomedical literature based on ontology-enriched bipartite graph representation and mutual refinement strategy," in Proceedings of the 10th PAKDD, pp. 9–12, 2006.

[29] S.-T. Yuan and Y.-C. Chen, "Semantic ideation learning for agent-based e-brainstorming," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 261–275, 2008.

[30] D. Liu, Z. Liu, and Q. Dong, "A dependency grammar and WordNet based sentence similarity measure," Journal of Computational Information Systems, vol. 8, no. 3, pp. 1027–1035, 2012.

[31] A. Zouaq, M. Gagnon, and O. Benoit, "Semantic analysis using dependency-based grammars and upper-level ontologies," International Journal of Computational Linguistics and Applications, vol. 1, no. 1-2, pp. 85–101, 2010.

[32] C. Ramakrishnan, K. J. Kochut, and A. P. Sheth, "A framework for schema-driven relationship discovery from unstructured text," in The Semantic Web—ISWC 2006, vol. 4273 of Lecture Notes in Computer Science, pp. 583–596, Springer, Berlin, Germany, 2006.

[33] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[34] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 721–735, 2009.

[35] L. Zhang, C. Li, J. Liu, and H. Wang, "Graph-based text similarity measurement by exploiting Wikipedia as background knowledge," World Academy of Science, Engineering and Technology, vol. 59, pp. 1548–1553, 2011.

[36] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[37] L.-C. Yu, H.-M. Shih, Y.-L. Lai, J.-F. Yeh, and C.-H. Wu, "Discriminative training for near-synonym substitution," in Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 1254–1262, 2010.

[38] L.-C. Yu and W.-N. Chien, "Independent component analysis for near-synonym choice," Decision Support Systems, vol. 55, no. 1, pp. 146–155, 2013.

[39] L. Yao, S. Riedel, and A. McCallum, "Collective cross-document relation extraction without labelled data," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1013–1023, MIT, Cambridge, Mass, USA, October 2010.

[40] M. Shen, D.-R. Liu, and Y.-S. Huang, "Extracting semantic relations to enrich domain ontologies," Journal of Intelligent Information Systems, vol. 39, no. 3, pp. 749–761, 2012.

[41] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, 2008.

[42] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.

[43] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of the KDD Workshop on Text Mining, August 2000.
