

Rossi RG, Lopes AA, Faleiros TP et al. Inductive model generation for text classification using a bipartite heterogeneous network. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 29(3): 361–375 May 2014. DOI 10.1007/s11390-014-1436-7

Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network

Rafael Geraldeli Rossi, Alneu de Andrade Lopes, Thiago de Paulo Faleiros, and Solange Oliveira Rezende

Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil

E-mail: {ragero, alneu, thiagopf, solange}@icmc.usp.br

Received September 2, 2013; revised March 6, 2014.

Abstract Algorithms for numeric data classification have been applied for text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and allowing one to model relationships among the different objects that compose a text collection. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent textual collections by a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Heterogeneous bipartite networks do not require computation of similarities or relations among the objects and can be used to model any type of text collection. Due to the advantages of representing text collections through bipartite heterogeneous networks, in this article we present a text classifier which builds a classification model using the structure of a bipartite heterogeneous network. Such an algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.

Keywords heterogeneous network, text classification, inductive model generation

1 Introduction

A huge amount of data available in the digital world is in textual format as e-mails, reports, newsletters, articles and web pages. Manual analysis, organization, management, and knowledge extraction of such textual data is impractical. Text automatic classification (TAC), which has gained importance in the last decades[1-3], can be used to perform these tasks automatically.

TAC consists of automatically assigning predefined classes to text documents. Applications of TAC are: information retrieval[4], routing[5], filtering[6], file management[7], metadata generation[8], genre determination[9], hierarchical classification of web pages[10], etc.

The algorithms typically used for TAC aim to induce a classification model. The induction process usually considers that the collections are represented in a vector space model (VSM)[11], i.e., each document is represented by a vector and each dimension corresponds to a term (feature) of the document. The characteristics of this representation are high dimensionality and sparsity, which impair the quality of the results as well as degrade the performance of text classification algorithms. Algorithms that use this type of representation consider that the entities of a collection (such as documents and terms) are not related and can produce spurious results[12].

A graph or network is an alternative to represent relations among the entities of a problem. A simple way to represent text collections using networks is to use a bipartite heterogeneous network. A bipartite heterogeneous network consists of objects that represent the documents and objects that represent the terms of the collection. In this network, an object of type document is linked to objects of type term.

Regular Paper
The work is supported by São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9.
A preliminary version of the paper was published in the Proceedings of ICDM 2012.
©2014 Springer Science+Business Media, LLC & Science Press, China


The representation of a text document collection using a bipartite network has the following advantages: 1) it does not require hyperlinks[13] or citations[14]; 2) it does not require computing similarities among documents[15] or terms[16-17].

Considering just the terms linked to each document in a bipartite network, we avoid the high dimensionality and high sparsity, since we do not have to store information about all the terms in the collection for each document. These features allow the algorithms to be more efficient, since we do not have to consider the zeros present in the VSM representation. In addition, according to [18], this type of representation has been underexplored for textual data representation and there is potential for further investigation.
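To make the storage argument concrete, a minimal sketch (the toy collection and weights below are illustrative, not from the paper) that keeps only the terms actually linked to each document, next to the equivalent dense VSM row:

```python
# Each document stores only the terms it actually contains
# (the edges of the bipartite network), not one position per vocabulary term.
doc_term_weights = {
    "d1": {"network": 2, "graph": 1},
    "d2": {"svm": 3, "kernel": 1},
    "d3": {"graph": 1, "kernel": 2},
}

vocabulary = sorted({t for terms in doc_term_weights.values() for t in terms})

# Equivalent dense VSM row for d1: mostly zeros as the vocabulary grows.
dense_d1 = [doc_term_weights["d1"].get(t, 0) for t in vocabulary]
print(vocabulary)   # ['graph', 'kernel', 'network', 'svm']
print(dense_d1)     # [1, 0, 2, 0]
```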

In this article we present a textual document classification algorithm which uses the structure of a bipartite network to induce a classification model. The algorithm, named IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), assigns weights to terms related to each class of the document collection. In order to do so, the algorithm considers the labeled documents in the training data and the bipartite network structure. The weights of the terms are induced using the Least-Mean-Square method[19]. The weights reflect the relevance of the terms for discriminating each class present in the collection, see Fig.1. In the classification phase, the induced weights and the bipartite structure are considered to assign categories to new documents.

We conduct a rigorous comparative evaluation of the proposed classification algorithm with traditional and state-of-the-art algorithms based on VSM from different paradigms. The chosen algorithms for comparison include: 1) Naive Bayes (NB) and Multinomial Naive Bayes (MNB), probabilistic paradigm; 2) C4.5, symbolic paradigm; 3) Support Vector Machine (SVM), statistical paradigm; and 4) k-Nearest Neighbors (k-NN), instance-based paradigm. These algorithms are applied to 43 textual document collections of different domains. The results show that the proposed algorithm is superior with significant differences when compared with NB, C4.5, SVM and k-NN.

Some concepts of this work have been briefly introduced in a previous work[20] and, in this paper, we present a substantially improved version. The major improvements are: 1) a deeper analysis of the IMBHN algorithm on real datasets, considering more datasets, parameter analysis and different preprocessing techniques; 2) an extension of the concepts and a modification of the notations to allow a better understanding of IMBHN; and 3) a step-by-step explanation of the mechanics of IMBHN.

The remainder of this article is organized as follows: Section 2 presents the basic concepts employed in this paper and related work. Section 3 presents the details of the proposed classification algorithm that induces a classification model using the structure of a bipartite network. Section 4 presents the details of the experiments and the results. Finally, Section 5 presents the conclusions and future work.

2 Background and Related Work

For an automatic text classification task, let D = {d1, d2, ..., dn} be a collection of text documents composed of labeled (DL) and unlabeled (DU) documents, i.e., D = DL ∪ DU. Let T = {t1, t2, ..., tm} be the set of terms and C = {c1, c2, ..., cl} be the set of class labels. Let Y = {y1, y2, ..., y|D|} be the true labels of the labeled documents and the labels assigned to the unlabeled documents during the classification process, i.e., Y = YL ∪ YU. TAC can be inductive or transductive.

In the inductive classification, a function F : DL × C → {0, 1} is induced to approximate a real category assignment function R : DL → YL[2]. The value 1 is assigned to the documents that belong to a category cj ∈ C.

The inductive learning strategy for text classification has been widely used, mainly using NB, k-NN and SVM algorithms, in applications such as: sentiment analysis[21-23], categorization of medical documents[24-25], website categorization[24,26-29], news categorization[30-31], and e-mail categorization[32-33]. There is also recent interest in the use of the Centroid-Based Classifier (CBC)[34-36].

The traditional and state-of-the-art inductive algorithms used in this article (NB, MNB, C4.5, SVM, and k-NN) may not be effective in some scenarios. For example, the NB and MNB algorithms assume the independence of attributes, a fact that is usually not true and may lead to erroneous results. In the C4.5 algorithm, irrelevant attributes may affect the tree building, small variations in the training set can cause differences in the generated trees, and the computational cost can be high since all the remaining features are tested to select the best feature for a new split in each node of the tree. SVM may generate a new feature space to separate the data that may have a much higher dimension than the original space, and may have a high computational cost to find the best hyperplane to separate the categories. The k-NN algorithm can be very costly to classify objects in a dataset with a large number of training examples due to the need to compute the similarities between the documents to be classified and all the training documents. The CBC classifier is closely related to the MNB classifier[37]. However, MNB is much faster than CBC and is advisable in practical situations.


The objective of transductive classification is to find an admissible function F : DL+U → YL+U. The transductive learning does not create a model to classify new documents as the inductive model does. Instead, the transductive learning considers a dataset of both labeled and unlabeled examples to perform the categorization task, spreading the information from labeled to unlabeled data. Unlabeled data labeled during the process aids the categorization of the remaining unlabeled data. Usually the data is represented by a network to carry out the transductive learning.

A network can be defined by N = (O, R, W), in which O represents the set of objects, R represents the set of relations among the objects, and W represents the edge weights. When there is a single type of object in the network, the network is called a homogeneous network. When O is composed of m different types of objects (m ≥ 2), i.e., O = O1 ∪ ... ∪ Om, the network is called a heterogeneous network[38-39].

One way to classify objects in a network is by using a weight vector f = (f1, f2, ..., f|C|) which contains the weights of an object for each class of the collection. A weight vector will also be referred to as class information in this article. All the weight vectors of the objects in a network are stored in a matrix F = (f1, f2, ..., f|O|)^T. The matrix F for objects of the type Oi will be denoted by F(Oi).

The class information of a labeled object oi is stored in a vector yi = (y1, y2, ..., y|C|)^T, which has the value 1 in the position corresponding to the class and 0 in the other positions. All the class information of the labeled documents is represented by the matrix Y = (y1, y2, ..., y|OL|)^T.
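As an illustration of this notation, a minimal sketch (toy sizes and values chosen by us, not taken from the paper) of the structures D, T, C, W, F(T) and Y as numpy arrays; the same layout is reused in the sketches of Section 3:

```python
import numpy as np

documents = ["d1", "d2", "d3"]          # D
terms     = ["t1", "t2", "t3", "t4"]    # T
classes   = ["c1", "c2"]                # C

# W: weights of the document-term relations (e.g., term frequencies);
# a zero entry means the document and the term are not connected.
W = np.array([[2., 1., 0., 0.],
              [0., 0., 3., 1.],
              [1., 0., 0., 2.]])

# Y: class information of the labeled documents, one one-hot row y_i per document.
Y = np.array([[1., 0.],
              [0., 1.],
              [1., 0.]])

# F(T): weight vectors of the term objects, one row f_ti per term, one column per class.
F_T = np.zeros((len(terms), len(classes)))
```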

Studies have used the heterogeneous network representation for object classification. Some of them use bipartite networks to represent the data. In this case there are two types of objects: 1) target objects, i.e., the type of object that needs to be classified, and 2) bridge objects, i.e., the type of object that transmits the class information among the target objects. Examples of algorithms that use bipartite networks to perform the transductive classification are 1) Iterative Reinforcement Categorization (IRC)[40], 2) Tag-Based Classification Model (TM)[41] and 3) GNetMine[38].

IRC performs the classification of interrelated objects by iterative reinforcement using the individual classification of different types of objects. The class information of the labeled target objects and the class information assigned to unlabeled target objects are propagated to the bridge objects. These bridges propagate their class information to the unlabeled target objects. This process is repeated until convergence.

The TM algorithm attempts to minimize the differences of 1) the class information assigned to the labeled target objects and their real class information, 2) the class information assigned to auxiliary objects that can be inserted in the networks to aid the classification process and their real class information, and 3) the class information of target objects that are connected to the same bridge object.

GNetMine is a general framework for object categorization in heterogeneous networks, i.e., it can be applied to more than two types of objects and more than one type of relation. The class information propagation in GNetMine is similar to [42]; however, GNetMine considers the different semantics of each type of relationship. GNetMine aims to minimize the differences among the weight vectors of neighboring objects, and the differences between the assigned class information of an object and its real class information.

Text collections can also be represented by document networks or term networks. Both are homogeneous networks. In a document network, objects correspond to documents and relations are given by hyperlinks/citations[12-14] or the most similar objects[15]. As most of the collections do not have hyperlinks or citations, similarity-based networks can be used in any type of collection. Besides, similarity-based networks can produce better results[15].

Term networks consider the terms of a collection as objects, and the relations among them are given by some measure based on co-occurrence[16-17], word order[43-44], or syntactic/semantic relationship[45-46]. Document and term networks require: 1) thresholds for minimum similarity/co-occurrence or 2) the number of similar objects that will be connected. These parameters have a great influence on the classification results[47]. Document and term networks cannot be mapped back to VSM maintaining the terms and frequencies, since they just keep the connections among the elements and their weights. We do not focus on these representations since they generate a completely different representation than the VSM/bipartite networks.

The inductive classification has been successfully used in text mining as presented in this section; however, for certain scenarios, the traditional/state-of-the-art inductive algorithms are not efficient because of: 1) their own characteristics (presented previously) or 2) the high dimensionality and sparsity of the data. The representation in bipartite networks is an alternative text representation strategy but, to the best of our knowledge, all the algorithms for data classification based on heterogeneous networks use the process of transductive classification. However, in some scenarios the inductive classification is needed, as in information retrieval systems or spam filtering. The next section will present the proposed algorithm, which induces a classification model based on a bipartite network.


3 Inductive Model Based on Bipartite Heterogeneous Network — IMBHN

The Inductive Model Based on Bipartite Heterogeneous Network (IMBHN) algorithm is presented in this section. IMBHN aims to induce a classification model from the structure of a bipartite heterogeneous network used to model a textual document collection. In the next subsections we present: 1) the details of the proposed algorithm, 2) an example of the algorithm functioning, and 3) the time complexity analysis of the algorithm.

3.1 Algorithm

The rationale behind the proposed algorithm is to compute the influence of each term for each class by inducing the weight vectors of the terms. This type of approach is useful for text classification as we can see in [37, 48]. Our approach, however, is capable of setting negative weights, i.e., some terms decrease the weight of a document for a class.

The induction process is guided by the minimization of the cost function presented in the following equation:

$$Q(F(T)) = \frac{1}{2} \sum_{c_j \in C} \sum_{d_k \in D_L} \bigg( \mathrm{class}\Big( \sum_{t_i \in T} w_{d_k,t_i}\, f_{t_i,c_j} \Big) - y_{d_k,c_j} \bigg)^2 = \frac{1}{2} \sum_{c_j \in C} \sum_{d_k \in D_L} \mathrm{error}^2_{d_k,c_j}, \quad (1)$$

in which wdk,ti is the weight of the relation between the document dk and the term ti, and

$$\mathrm{class}\Big( \sum_{t_i \in T} w_{d_k,t_i}\, f_{t_i,c_j} \Big) = \begin{cases} 1, & \text{if } c_j = \arg\max_{c \in C} \Big( \sum_{t_i \in T} w_{d_k,t_i}\, f_{t_i,c} \Big), \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

The proposed algorithm induces the matrix F(T) which minimizes the quadratic error (error²_{dk,cj}), i.e., the squared sum of the differences between the predicted and real classes of the training documents. The way the problem was modeled allows the use of gradient descent methods to induce the weights of the elements of matrix F(T). We chose the Least-Mean-Square method[19] to induce the weights due to its simplicity, but any other method for adjusting the weights to minimize the error can also be used. The Least-Mean-Square method makes successive corrections to the weight vector in the direction of the negative gradient vector, leading to the minimum mean squared error. The weight vector equation using the steepest descent method is presented in (3):

$$\boldsymbol{f}^{(n+1)} = \boldsymbol{f}^{(n)} + \eta\,\big(-\nabla Q(F)\big). \quad (3)$$

The direction of the gradient can be estimated by the derivative of Q(F):

$$\nabla Q(F) = \frac{\partial Q(F)}{\partial F} = \sum_{c_j \in C} \sum_{d_k \in D} \bigg( \mathrm{class}\Big( \sum_{t_i \in T} w_{d_k,t_i}\, f_{t_i,c_j} \Big) - y_{d_k,c_j} \bigg) \sum_{t_i \in T} w_{d_k,t_i} = \sum_{c_j \in C} \sum_{d_k \in D} \mathrm{error}_{d_k,c_j} \sum_{t_i \in T} w_{d_k,t_i}. \quad (4)$$

Considering (3) and (4), the weight of a term ti for the class cj at time (n + 1) is given by the following equation:

$$f_{t_i,c_j}^{(n+1)} = f_{t_i,c_j}^{(n)} + \eta \sum_{d_k \in D} w_{d_k,t_i}\, \mathrm{error}^{(n)}_{d_k,c_j}, \quad (5)$$

where η is the correction rate, i.e., the rate at which the error is considered in the weight updating. We notice that the weight update is a function of the current weight, the obtained error, and the weight of the term in the document.
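As a compact illustration of (5) together with the error of (6), in the online, per-document form applied by Algorithm 1 below, a sketch under the numpy conventions of the earlier example (the helper name and array layout are ours, not the authors' code):

```python
import numpy as np

def update_for_document(F_T, w_dk, y_dk, eta):
    """One online LMS step for a single training document.

    F_T  : |T| x |C| term-class weight matrix (updated in place)
    w_dk : length-|T| vector of document-term weights (a row of W)
    y_dk : length-|C| one-hot vector with the true class of d_k
    eta  : correction rate
    """
    scores = w_dk @ F_T                  # sum_i w_{dk,ti} f_{ti,cj} for every class
    out = np.zeros_like(scores)
    out[np.argmax(scores)] = 1.0         # class(.) of (2): 1 for the winning class, 0 elsewhere
    error = y_dk - out                   # error of (6)
    F_T += eta * np.outer(w_dk, error)   # (5): only terms present in d_k change (w = 0 elsewhere)
    return error
```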

Algorithm 1 summarizes the IMBHN algorithm. The proposed algorithm has three main steps: weight vector initialization, error calculation, and weight adjustment.

In step 1, weight vector initialization, the initial weight vectors for the terms are defined (line 2 of Algorithm 1). The weight values can be 0, randomly chosen, or considered as the likelihood of each term to belong to each class.

In step 2, error calculation, an output vector for each document dk (out_dk) is computed (lines 6∼12). Each position of this vector is obtained by the sum of the weights of the document-term relations multiplied by the weight of each term for each class. The weight of a document dk for a class cj is given by (2). The error of a document dk for a class cj is calculated by subtracting the corresponding position of the output vector, out_{dk,cj}, from the real class vector, y_{dk,cj} (lines 16∼17), as presented in (6):

$$\mathrm{error}_{d_k,c_j} = y_{d_k,c_j} - \mathrm{out}_{d_k,c_j}. \quad (6)$$

In step 3, the weight adjustment step, the error of each document for each class is used to update the weight vectors of the terms connected to each document (lines 19∼21). To update the weight of a term ti for a class cj, (5) is applied.


Algorithm 1. IMBHN – Defining the Weight Vectors for the Terms

input:
  D – set of training documents
  T – set of terms
  C – set of classes
  W – weights of the connections among the documents and terms
  Y – classes of the training documents
  η – correction rate
  τ – maximum number of iterations
  ε – minimum mean squared error
output:
  F(T) – term weights induced during the learning process

 1: num_it = 0
 2: F(T) ← weight_initialization(W)
 3: while stopping criterion is not reached do
 4:   squared_error_acm = 0
 5:   foreach d_k ∈ D do
        /* The output for each training document is calculated in this loop */
 6:     induced_weights ← [ ]
 7:     foreach c_j ∈ C do
 8:       class_weight = 0
 9:       foreach t_i ∈ T do
10:         class_weight = class_weight + f_{t_i,c_j} × w_{d_k,t_i}
11:       end
12:       induced_weights_{c_j} = class_weight
13:     end
14:     out ← class(induced_weights)   /* set the value 1 for the highest value and 0 for the others */
15:
        /* Calculating the error */
16:     foreach c_j ∈ C do
17:       error = y_{d_k,c_j} − out_{c_j}
18:       squared_error_acm = squared_error_acm + error²/2
        /* Weight correction for each term connected to the document */
19:       foreach t_i ∈ T do
20:         current_weight = f_{t_i,c_j}
21:         new_weight = current_weight + (η × w_{d_k,t_i} × error)
22:         f_{t_i,c_j} = new_weight
23:       end
24:     end
25:   end
26:   mean_sqr_error = squared_error_acm / |D|
27:   num_it = num_it + 1; Stopping_Analysis(num_it, mean_sqr_error, τ, ε)
28: end


Steps 2 and 3 are repeated for every training document until a stopping criterion is reached, such as the maximum number of iterations① or a minimum mean squared error, i.e., when the mean squared error of an iteration is less than a threshold ε.
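For concreteness, a minimal runnable sketch of the whole training procedure of Algorithm 1 (the array layout, function name and default values are ours; zero initialization is used here, while (7) in Section 4 gives a frequency-based alternative):

```python
import numpy as np

def imbhn_train(W, Y, eta=0.1, max_iter=1000, eps=0.01):
    """Induce the term-class weight matrix F(T) from a bipartite network.

    W : |D| x |T| matrix of document-term connection weights (e.g., tf)
    Y : |D| x |C| one-hot matrix with the classes of the training documents
    Returns F_T, the |T| x |C| matrix of induced term weights.
    """
    n_docs, n_terms = W.shape
    n_classes = Y.shape[1]
    F_T = np.zeros((n_terms, n_classes))          # step 1: weight vector initialization
    for _ in range(max_iter):
        squared_error_acm = 0.0
        for k in range(n_docs):                   # one iteration = one pass over D
            scores = W[k] @ F_T                   # step 2: output vector of d_k
            out = np.zeros(n_classes)
            out[np.argmax(scores)] = 1.0          # class(.) of (2)
            error = Y[k] - out                    # (6)
            squared_error_acm += np.sum(error ** 2) / 2.0
            F_T += eta * np.outer(W[k], error)    # step 3: weight adjustment, (5)
        if squared_error_acm / n_docs < eps:      # minimum mean squared error criterion
            break
    return F_T
```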

In the classification phase, the induced term weights are employed for the classification of unseen documents. This is achieved through the maximum argument (arg-max) of the sum of the connection weights between the document and its terms multiplied by the weight of each term for each class ((2)).
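Continuing the sketch above, the classification phase reduces to an arg-max over the products of document-term weights and induced term weights:

```python
import numpy as np

def imbhn_classify(F_T, w_new):
    """Assign a class index to an unseen document represented by its
    term-weight vector w_new (zeros for terms it does not contain)."""
    return int(np.argmax(w_new @ F_T))
```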

3.2 Example

To exemplify the IMBHN functioning, consider the bipartite network presented in Fig.1(a). To simplify the example, let all the connections among documents and terms have weight value 1. We choose random values for the initial weight vectors of the terms and use η = 0.9. After two iterations the algorithm converges (error = 0) and the weight vectors for the terms are presented in Fig.1(b).

We observe that the IMBHN algorithm defines negative values for some classes, i.e., some terms inhibit or decrease the weight of a document for some class. For instance, Term 1 and Term 2 occur exclusively in documents that belong to Class 1. Thus, these terms have a high positive weight for Class 1 and a negative weight for Class 2. The same occurs for Term 6 and Term 7.

To classify unseen documents we link them to the network, i.e., we connect the documents to the terms contained in them. The values of the weight vectors of the terms are spread to the documents, as presented in Fig.1(c). The weight vector calculated for each unseen document is used to define the class (using the arg-max value), as presented in Fig.1(d).

3.3 Time Complexity Analysis

The complexity of the IMBHN algorithm is a function of 1) the average number of terms per document (|T̄|), since for each document of the collection only the terms connected to the document are updated; 2) the number of documents in the training set (|DL|), since every training document is used for updating the weight vectors of the terms; and 3) the number of iterations necessary to achieve the stopping criterion (r). The complexity of the IMBHN algorithm is O(r × |DL| × |T̄|).

①An iteration is one pass for all training documents.


Fig.1. Defining the weight vectors for the terms using the IMBHN algorithm. (a) Bipartite network with initial values for the weight vector of the term objects. (b) Defined vector weights of the terms using the IMBHN algorithm. (c) Spreading the weight vectors of the terms to unseen documents. (d) Defining the class of the unseen documents according to the max-value of the document weight vectors.


4 Experimental Evaluation

This section presents the text document collections used in the experiments, the algorithms and their parameters used for comparison, the experimental setup, the evaluation criteria, the results, and the discussion.

4.1 Document Collections

43 textual document collections were used in the evaluation of the IMBHN algorithm. For the 19MClassTextWc collections (Fbis, La1s, La2s, New3s, Oh0, Oh10, Oh15, Oh5, Ohscal, Tr11, Tr12, Tr21, Tr23,


Tr31, Tr41, Tr45, Re0, Re1, Wap)②, no preprocessing was performed since these collections are already in a structured format (document-term matrix). For the other collections, single words were considered as terms, stop words were removed, terms were stemmed using Porter's algorithm, HTML tags and e-mail headers were removed, and terms with document frequency df > 2 were considered. We used term frequency (tf) to weight terms in documents.
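A rough sketch of such a preprocessing pipeline (the stop list below is a tiny placeholder, NLTK's PorterStemmer stands in for Porter's algorithm, and HTML tag and e-mail header removal is omitted; the paper does not specify its exact tooling):

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer   # assumes NLTK is installed

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # placeholder stop list
stemmer = PorterStemmer()

def to_term_counts(raw_documents, min_df=3):
    """Tokenize, remove stop words, stem, keep terms with df > 2 (i.e., in at
    least 3 documents), and return tf-weighted document-term counts."""
    docs = []
    for text in raw_documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        stems = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
        docs.append(Counter(stems))
    df = Counter(t for d in docs for t in d)            # document frequency per term
    kept = {t for t, f in df.items() if f >= min_df}
    return [{t: c for t, c in d.items() if t in kept} for d in docs]
```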

The text collections were drawn from different domains: e-mails (EM), scientific documents (SD), abstracts (AB), web pages (WP), news articles (NA), sentiment analysis (SA), medical documents (MD), and TREC③ documents (TD). The collections have different characteristics. The number of documents ranges from 204 to 18 808, the number of terms from 1 726 to 100 464, the number of classes from 2 to 51, and the average number of terms per document from 6.65 to 720.30. Table 1 presents the number of documents (|D|), number of generated terms (|T|), average number of terms per document (|T̄|), number of classes (|C|), and standard deviation considering the percentage of classes (σ(C))④⑤.

4.2 Experiment Configuration and Evaluation Criteria

The results obtained by the proposed algorithm, IMBHN, were compared with six inductive classification algorithms from the Weka library[49]. The algorithms used for comparison were: Naive Bayes (NB), Multinomial Naive Bayes (MNB), J48 (implementation of the C4.5 algorithm), Sequential Minimal Optimization (SMO, which is an algorithm for solving optimization problems during the training of SVMs), and IBk (implementation of the k-NN algorithm).

For the SMO algorithm, we considered three types of kernel: linear, polynomial (exponent = 2) and RBF (radial basis function). The C values considered for each type of kernel were {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5}. These parameters are based on [50].

We used the IBk algorithm without and with a weighted vote, which gives each of the nearest neighbors a vote weight equal to 1/(1 − s), where s is a similarity measure among neighbors. The values of k were {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 25, 29, 35, 41, 49, 57, 73, 89}. k values in some studies range from 1 to |DL|. We did not use this range because, without the weighted vote, as the number of neighbors becomes closer to |DL|, the documents are classified as the majority class. The cosine distance was used as similarity measure.

For the proposed algorithm, IMBHN, we used the error correction rates η = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. The error correction impacts the definition of the weight vectors. If the correction rate is very small, the algorithm converges very slowly. If the correction rate is large, a fast convergence is achieved, but it can be unstable around the minimum error. We used small and high values to verify the behavior of the convergence and the accuracy of the algorithm. As stopping criteria we used the minimum mean squared errors {0.01, 0.005}. These values control how the weights adjust the model to the training data. We chose a commonly used value (0.01)[51] and a value that leads to a higher adjustment to the training data (0.005). In our experiments, the weights of the terms were initially set according to (7). The value given by this equation is close to 1 for a term that occurs almost exclusively in documents of a specific class.

$$f^{(0)}_{t_i,c_j} = \frac{\displaystyle\sum_{d_k \in D} w_{d_k,t_i}\, y_{d_k,c_j}}{\displaystyle\sum_{d_k \in D} w_{d_k,t_i}}. \quad (7)$$

The default parameters of the Weka tool were adopted for the NB, MNB, and J48 algorithms. The selected evaluation measure was classification accuracy, i.e., the percentage of test documents correctly classified, obtained by the 10-fold cross-validation process. All algorithms were subjected to the same folds of the cross-validation procedure. We carried out statistical significance tests using the Friedman test and the Nemenyi post-hoc test with 95% confidence level[52] to compare results.
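The Friedman test over per-collection accuracies can be run, for example, with SciPy (a sketch; the three accuracy lists below are the NB, MNB and IMBHN columns for the first five collections of Table 2, and the Nemenyi post-hoc test is not included in SciPy, so it would need a separate implementation):

```python
from scipy.stats import friedmanchisquare

# Accuracies on the same ordered set of collections (20ng, ACM, Classic4,
# CSTR, Dmoz-Business-500), taken from Table 2.
nb    = [64.61, 73.89, 88.48, 78.59, 58.61]
mnb   = [90.08, 76.04, 95.79, 84.64, 68.28]
imbhn = [91.41, 69.83, 95.19, 80.24, 63.99]

stat, p_value = friedmanchisquare(nb, mnb, imbhn)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```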

4.3 Results

In Table 2 we present the best accuracies obtained by each algorithm⑥. The highest accuracy for each collection is highlighted in bold.

②The 19MClassTextWc collections are available at http://sourceforge.net/projects/weka/files/datasets/text-datasets/19MclassTextWc.zip/download, Apr. 2014.
③Text Retrieval Conference, http://trec.nist.gov/, Apr. 2014.
④The representations of the document collections are in ARFF format[49] and are available at http://sites.labic.icmc.usp.br/text collections/, Apr. 2014.
⑤More details about the collections are found in http://www.icmc.usp.br/CMS/Arquivos/arquivos enviados/BIBLIOTECA 113 RT 395.pdf, Apr. 2014.
⑥All the results obtained by IMBHN are available at http://sites.labic.icmc.usp.br/ragero/docs/imbhn results.pdf, and those obtained by SMO and IBk are available at http://www.icmc.usp.br/CMS/Arquivos/arquivos enviados/BIBLIOTECA 113 RT 395.pdf, Apr. 2014.


Table 1. Characteristics of the Textual Document Collections Used in the Experimental Evaluation

Collection |D| |T| |T̄| |C| σ(C)
20ng (EM) 18 808 45 434 76.47 20 0.52

ACM (SD) 3 493 60 768 720.30 40 0.37

Classic4 (AB) 7 095 7 749 35.28 4 1.94

CSTR (SD) 299 1 726 54.27 4 18.89

Dmoz-Business-500 (WP) 18 500 8 303 11.93 37 0.00

Dmoz-Computers-500 (WP) 9 500 5 011 10.83 19 0.00

Dmoz-Health-500 (WP) 6 500 4 217 12.40 13 0.00

Dmoz-Science-500 (WP) 6 000 4 821 11.52 12 0.00

Dmoz-Sports-500 (WP) 13 500 5 682 11.87 27 0.00

Enron-Top-20 (EM) 13 199 18 194 50.69 20 2.37

Fbis (NA) 2 463 2 001 159.24 17 5.66

Hitech (NA) 2 301 12 942 141.93 6 8.25

Industry-Sector (WP) 8 817 21 490 88.48 12 8.25

Irish-Sentiment (SA) 1 660 8 659 112.65 3 6.83

La1 (NA) 3 204 13 196 144.64 6 8.22

La2 (NA) 3 075 12 433 144.83 6 8.59

Multi-Domain-Sentiment (SA) 8 000 13 360 42.36 2 0.00

New3s (NA) 9 558 26 833 234.53 44 1.32

NFS (SD) 10 524 3 888 6.65 16 3.82

Oh0 (MD) 1 003 3 183 52.50 10 5.33

Oh10 (MD) 1 050 3 239 55.64 10 4.25

Oh15 (MD) 913 3 101 59.30 10 4.27

Oh5 (MD) 918 3 013 54.43 10 3.72

Ohscal (MD) 11 162 11 466 60.39 10 2.66

Opinosis (SA) 6 457 2 693 7.56 51 1.42

Polarity (SA) 2 000 15 698 205.06 2 0.00

Re0 (NA) 1 504 2 887 51.73 13 11.56

Re1 (NA) 1 657 3 759 52.70 25 5.54

Re8 (NA) 7 674 8 901 35.31 8 18.24

Reviews (NA) 4 069 22 927 183.10 5 12.80

SpamAssassin (EM) 9 348 97 851 108.02 2 34.45

SyskillWebbert (WP) 334 4 340 93.16 4 10.75

Tr11 (TD) 414 6 430 281.66 9 9.80

Tr12 (TD) 313 5 805 273.60 8 7.98

Tr21 (TD) 336 7 903 469.86 6 25.88

Tr23 (TD) 204 5 833 385.29 6 15.58

Tr31 (TD) 927 10 129 268.50 7 13.37

Tr41 (TD) 878 7 455 195.33 10 9.13

Tr45 (TD) 690 8 262 280.58 10 6.69

Trec7-3000 (EM) 6 000 100 464 244.08 2 0.00

Wap (WP) 1 560 8 461 141.33 20 5.20

WebACE (WP) 3 900 8 881 43.15 21 8.44

WebKb (WP) 8 282 22 892 89.78 7 15.19

The penultimate line of Table 2 presents the average ranking and the last line presents the algorithm position in the ranking. The IMBHN algorithm obtained the highest accuracy in 14 of the 43 collections and was the best in the average ranking.

Fig.2 presents the critical difference diagram to illustrate the results of the statistical significance test. In this diagram, the algorithms connected by a line do not present statistically significant differences among them. According to Fig.2, IMBHN is superior with statistically significant differences to the IBk, SMO, J48 and NB algorithms.

Fig.2. Critical difference diagram considering the best accuracies for each algorithm.


Table 2. Best Accuracies and Average Ranking for the Algorithms Used in Experiments

Collections NB MNB J48 SMO IBk IMBHN

20ng 64.61 90.08 73.96 84.81 86.71 91.41

ACM 73.89 76.04 66.36 77.38 64.36 69.83

Classic4 88.48 95.79 90.35 94.53 94.24 95.19

CSTR 78.59 84.64 66.85 75.26 82.29 80.24

Dmoz-Business-500 58.61 68.28 53.10 65.84 61.49 63.99

Dmoz-Health-500 73.15 82.07 73.61 79.90 77.83 81.76

Dmoz-Computers-500 59.67 70.13 54.88 66.01 63.70 65.07

Dmoz-Science-500 62.11 73.81 57.20 67.38 64.38 70.94

Dmoz-Sports-500 75.85 83.76 83.88 85.48 80.09 88.95

Enron-Top-20 51.54 72.60 64.47 65.98 66.72 72.67

Fbis 61.79 77.18 71.49 78.92 80.99 81.76

Hitech 62.92 72.92 56.76 66.44 71.79 71.88

Industry-Sector 41.54 76.23 57.56 70.34 77.58 82.66

IrishEconomic 59.75 67.65 51.08 65.54 60.90 64.81

La1s 75.21 88.17 76.65 84.30 80.55 88.01

La2s 75.25 89.91 76.84 86.76 82.79 89.29

Multi-Domain-Sentiment 72.85 78.65 74.95 81.90 71.40 78.90

New3s 56.72 79.16 70.85 71.93 79.26 83.03

NFS 70.86 83.84 70.74 81.87 78.88 82.22

Oh0 79.66 89.83 80.95 81.55 81.85 88.14

Oh5 78.76 86.27 80.39 77.24 79.41 86.26

Oh10 72.38 80.66 72.09 76.00 73.61 77.52

Oh15 75.03 83.68 75.78 75.03 75.57 81.16

Ohscal 62.78 74.73 71.30 76.69 68.65 76.01

Opinosis 60.74 59.56 60.83 61.03 62.87 58.26

Polarity 66.80 80.10 68.25 83.65 70.50 82.70

Re0 57.05 79.92 75.26 77.79 83.51 84.70

Re1 66.73 83.34 79.60 72.72 81.89 85.09

Re8 81.27 95.33 90.73 93.95 94.14 96.92

Reviews 85.22 93.33 88.35 91.64 92.30 94.20

SpamAssassin 87.80 96.51 96.61 98.96 98.72 98.50

SyskillWebbert 72.51 90.75 95.81 77.85 95.81 94.93

Tr11 54.06 85.00 78.98 77.06 86.95 85.02

Tr12 57.82 80.15 79.23 69.61 81.74 79.87

Tr21 47.95 61.35 81.27 79.77 88.66 87.84

Tr23 56.85 70.61 93.19 72.54 84.33 88.26

Tr31 79.72 94.38 93.10 90.72 94.60 95.57

Tr41 84.95 94.52 90.77 87.80 93.04 93.28

Tr45 66.66 82.46 90.28 81.30 88.84 89.13

Trec7-3000 81.42 97.42 96.88 97.72 98.61 98.73

Wap 71.73 81.02 67.05 81.85 74.48 83.58

WebKb 41.39 60.38 69.08 57.20 67.95 68.47

WebACE 83.17 87.43 81.28 87.23 83.84 89.74

Average Ranking 5.56 2.39 4.45 3.43 3.17 1.97

Position 6th 2nd 5th 4th 3rd 1st

There are no significant differences with the MNB algorithm, but the IMBHN algorithm is the first in the ranking of the statistical test.

A single parameter for each algorithm was also considered for comparison, since it is difficult and sometimes impossible to carry out tests with a large number of parameters for different algorithms in practical situations. A statistical test was carried out considering the accuracies obtained by the parameters of each algorithm, and we considered the best parameter according to the ranking.

Fig.3 presents the critical difference diagrams considering the results obtained by the different parameters of IMBHN (η : minimum mean squared error). In general, the smallest values of η and of the minimum mean squared error lead to better classification accuracy.


The best parameters of the other algorithms are the linear kernel with C = 1 for SMO, and k = 7 with weighted vote for IBk. Table 3 presents the results obtained by the best parameters of each algorithm. In this evaluation IMBHN obtained the highest accuracy in 16 of the 43 collections and the first position in the average ranking.

Fig.4 presents the critical difference diagram considering a single parameter for the algorithms. IMBHN was superior again with statistically significant differences compared with IBk, SMO, J48 and NB.

Fig.3. Critical difference diagram considering the results obtained by the different parameters of IMBHN.

Table 3. Accuracies and Average Ranking Considering a Single Parameter for Each Algorithm Used in the Experiments

Collections NB MNB J48 SMO IBk IMBHN

20ng 64.61 90.08 73.96 88.41 84.84 91.33

ACM 73.89 76.04 66.36 77.35 62.55 69.11

Classic4 88.48 95.79 90.35 94.53 94.15 94.93

CSTR 78.59 84.64 66.85 74.91 78.94 76.92

Dmoz-Business-500 58.61 68.28 53.10 65.30 60.94 63.99

Dmoz-Computers-500 59.67 70.13 54.88 66.01 63.55 65.07

Dmoz-Health-500 73.15 82.07 73.61 79.49 77.83 81.52

Dmoz-Science-500 62.11 73.81 57.20 65.51 64.28 70.95

Dmoz-Sports-500 75.85 83.76 83.88 85.48 79.88 88.96

Enron-Top-20 51.54 72.60 64.47 65.98 66.33 72.67

Fbis 61.79 77.18 71.49 78.24 80.99 81.00

Hitech 62.92 72.92 56.76 66.14 71.79 70.40

Industry-Sector 41.51 76.23 57.56 66.34 76.42 82.66

IrishEconomic 59.75 67.65 51.08 61.80 60.90 64.70

La1s 75.21 88.17 76.65 84.30 79.65 87.98

La2s 75.25 89.91 76.84 86.76 81.39 88.52

Multi-Domain-Sentiment 72.85 78.65 74.94 80.17 66.71 78.90

New3s 56.72 79.16 70.85 71.93 79.85 82.89

NFS 70.86 83.84 70.74 81.87 78.82 82.14

Oh0 79.66 89.83 80.95 81.55 81.55 87.94

Oh10 72.38 80.66 72.09 75.04 72.38 75.62

Oh15 75.03 83.68 75.78 74.70 75.57 80.50

Oh5 78.76 86.27 80.39 76.80 78.43 84.75

Ohscal 62.78 74.73 71.30 74.35 67.34 75.57

Opinosis 60.74 59.56 60.83 61.03 58.16 56.64

Polarity 66.80 80.10 68.25 82.65 69.30 80.05

Re0 57.05 79.92 75.26 75.19 82.98 84.05

Re1 66.73 83.34 79.60 72.72 80.86 83.53

Re8 81.27 95.33 90.73 93.32 93.66 96.55

Reviews 85.22 93.33 88.35 91.64 92.06 94.08

SpamAssassin 87.80 96.51 96.61 74.35 97.70 98.50

SyskillWebbert 72.51 90.75 95.81 77.85 95.81 85.70

Tr11 54.06 85.00 78.98 77.06 84.28 84.30

Tr12 57.82 80.15 79.23 69.61 81.74 77.61

Tr21 47.95 61.35 81.27 79.77 88.06 87.25

Tr23 56.85 70.61 93.19 72.52 79.50 83.88

Tr31 79.72 94.38 93.10 90.61 92.33 95.58

Tr41 84.95 94.52 90.77 87.80 92.47 93.17

Tr45 66.66 82.46 90.28 81.15 87.39 88.99

Trec7-3000 81.42 97.42 96.88 97.72 98.08 98.52

Wap 71.73 81.02 67.05 81.85 73.58 82.88

WebACE 83.17 87.43 81.28 87.23 83.28 89.92

WebKb 41.39 60.38 69.08 64.22 67.30 68.23

Average Ranking 5.45 2.23 4.33 3.56 3.40 2.00

Position 6th 2nd 5th 4th 3rd 1st


Fig.4. Critical difference diagram considering the accuracies obtained by a single parameter for each algorithm.

We also analyzed IMBHN using a different term-weighting method, performing feature selection, and analyzing its robustness considering a poor preprocessing step. First we replaced term frequency (tf) with term frequency-inverse document frequency (tfidf) to weight the terms in the documents. The weight of a term ti in a document dj considering tfidf is[4]:

$$\mathit{tfidf}_{t_i,d_j} = \mathit{tf}_{t_i,d_j} \times \mathit{idf}_{t_i}, \quad (8)$$

in which

$$\mathit{idf}_{t_i} = \log\big(|D| / \mathit{df}_{t_i}\big). \quad (9)$$
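A sketch of (8) and (9) over a |D| × |T| term-frequency matrix (numpy layout as in the earlier sketches; the logarithm base is not specified in the paper, so the natural logarithm is used here):

```python
import numpy as np

def tfidf_matrix(W_tf):
    """Apply (8)-(9) to a |D| x |T| term-frequency matrix."""
    n_docs = W_tf.shape[0]
    df = np.count_nonzero(W_tf, axis=0)          # document frequency of each term
    idf = np.log(n_docs / np.maximum(df, 1))     # (9); guard against df = 0
    return W_tf * idf                            # (8): tf * idf, broadcast over rows
```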

We selected two collections of each domain to simplify the experiments. The results using tfidf are presented in Table 4. We performed the Friedman statistical significance test considering the results using tf and tfidf. Fig.5 presents the critical difference diagram. We notice that the use of tfidf decreased the positions of the methods in the ranking for almost all algorithms. The exception is IBk, which had a slight improvement in the ranking. We observe that IMBHN obtained a better position in the ranking than the other algorithms considering just tfidf.

We used tfidf to rank terms for feature selection. In this case, the score of a term ti in the ranking was the sum of the values of its tfidf over the collection, i.e.,

$$\mathit{ScoreTFIDF}(t_i) = \sum_{d_j \in D} \mathit{tfidf}_{t_i,d_j}. \quad (10)$$

Fig.5. Critical difference diagram considering tf and tfidf as term weights.

There are two approaches to selecting features: considering terms above a threshold and considering the top ranked terms. In this work we considered the second approach, selecting an equal number of terms per collection and simplifying the analysis of the results. We considered the 5 000 and 1 000 top ranked terms.
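A sketch of this second approach, scoring terms by (10) and keeping the k top-ranked columns of the document-term matrix (k = 5 000 or 1 000 in the experiments):

```python
import numpy as np

def select_top_terms(W_tfidf, k):
    """Rank terms by the sum of their tfidf values over the collection (10)
    and return the column indices of the k top-ranked terms."""
    scores = W_tfidf.sum(axis=0)                 # ScoreTFIDF(t_i) for every term
    return np.argsort(scores)[::-1][:k]

# Usage: keep only the selected columns of the document-term matrix.
# W_reduced = W_tfidf[:, select_top_terms(W_tfidf, 5000)]
```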

Table 5 shows the results considering the 5 000 top ranked terms according to tfidf. Fig.6 presents the critical difference diagram considering all terms and the 5 000 top ranked terms. We observe that the use of just 5 000 terms decreased the positions of most of the algorithms in the statistical test ranking. We highlight that the first two positions in the ranking belong to IMBHN and IMBHN with the 5 000 top ranked terms. Both obtained the same performance.

Fig.6. Critical difference diagram considering all terms and the 5 000 top ranked terms according to tfidf.

Table 4. Accuracies Considering a Single Parameter for Each Algorithm and tfidf as Term Weights

Collections NB MNB J48 SMO IBk IMBHN

ACM 62.24 71.23 54.42 74.38 60.32 67.01

Classic4 88.43 95.26 90.26 94.47 94.53 95.40

Enron-Top-20 51.28 73.86 64.16 65.97 68.73 69.64

Industry-Sector 41.36 83.04 57.47 66.36 86.96 83.74

Multi-Domain-Sentiment 72.84 73.50 75.03 80.17 64.49 78.08

Oh0 79.96 88.74 81.65 81.56 84.85 86.64

Ohscal 62.73 71.42 71.16 74.34 62.78 74.94

Polarity 66.60 76.85 67.85 82.65 67.40 78.05

Re8 81.41 93.16 90.57 93.33 89.56 95.80

Reviews 85.40 93.34 88.06 91.64 91.20 93.04

Tr11 51.88 83.82 79.22 76.10 84.74 83.58

Tr31 79.19 93.10 93.10 90.61 93.20 94.28

Trec7-3000 81.33 98.05 96.97 97.72 98.68 97.80

Wap 72.12 79.87 68.08 81.73 74.87 78.91


Table 5. Accuracies Considering a Single Parameter for Each Algorithm (tfidf as Term Weighting) and 5 000 Top Ranked Terms According to tfidf

Collections NB MNB J48 SMO IBk IMBHN

ACM 71.60 76.67 63.47 77.36 59.60 65.58

Classic4 88.43 95.18 90.26 94.66 94.38 95.58

Enron-Top-20 51.16 72.73 64.02 65.96 67.89 69.15

Industry-Sector 40.40 77.13 56.56 61.34 84.86 82.45

Multi-Domain-Sentiment 72.84 76.04 75.05 80.49 64.29 77.18

Oh0 79.96 88.74 81.65 81.56 84.85 86.64

Ohscal 62.72 72.07 71.08 74.91 62.61 74.33

Polarity 66.55 77.70 67.85 79.65 68.15 78.25

Re8 81.41 93.41 90.59 93.59 89.58 96.22

Reviews 85.25 91.94 88.06 91.67 91.72 93.41

Tr11 51.64 83.83 79.70 77.07 84.98 82.34

Tr31 77.14 89.21 92.88 91.91 94.06 94.60

Trec7-3000 80.22 96.05 96.85 95.28 98.15 97.21

Wap 72.31 80.00 67.18 82.05 75.90 79.23

Table 6 shows the results considering the 1 000 top ranked terms according to tfidf. Fig.7 presents the critical difference diagram considering all terms and the 1 000 top ranked terms according to tfidf. We notice that the use of just 1 000 terms decreased the positions of all the algorithms in the statistical test ranking. We highlight that IMBHN obtained the best position in the ranking considering just the 1 000 top ranked terms.

We also analyzed the behavior of the classification algorithms if the user performs a bad preprocessing. We disregarded stop word removal and word stemming. We still kept the terms with df > 2 to avoid the dimensionality explosion, which could make the experiments impracticable. For this analysis we disregarded Oh0, Ohscal, Reviews, Tr11, Tr31, and Wap, since we just have the structured representations for these collections. Table 7 presents the results obtained disregarding these preprocessing steps. Fig.8 presents the critical difference diagram considering the preprocessed representations and the representation without preprocessing (WP). We observe that the lack of preprocessing impaired the results of the algorithms. IBk in particular suffered the worst decline. The exceptions are SMO and IMBHN. SMO obtained better results without preprocessing while the results of IMBHN were not affected.

Fig.7. Critical difference diagram considering all terms and the 1 000 top ranked terms according to tfidf.

Table 6. Accuracies Considering a Single Parameter for Each Algorithm (tfidf as Term Weighting) and 1 000 Top Ranked Terms According to tfidf

Collections NB MNB J48 SMO IBk IMBHN

ACM 68.28 73.49 63.47 76.01 55.91 64.27

Classic4 88.16 94.14 90.16 94.91 92.77 93.58

Enron-Top-20 45.12 66.47 61.93 61.33 64.37 65.29

Industry-Sector 27.47 50.32 52.32 42.66 64.69 57.89

Multi-Domain-Sentiment 71.94 76.67 74.12 80.05 63.92 75.20

Oh0 77.76 87.94 82.85 78.87 82.85 84.84

Ohscal 60.50 71.77 70.47 76.06 62.36 69.08

Polarity 66.10 77.30 62.90 78.45 64.70 75.65

Re8 81.34 95.14 90.68 94.25 92.89 96.56

Reviews 79.72 88.72 86.73 92.16 90.34 93.19

Tr11 46.59 81.64 78.28 68.83 83.82 79.94

Tr31 63.64 74.19 92.23 86.84 93.85 94.49

Trec7-3000 71.43 83.35 96.67 90.12 97.43 95.16

Wap 69.55 80.00 64.36 81.86 76.22 78.20


Table 7. Accuracies Considering a Single Parameter for Each Algorithm and Without Stop Words Removal and Word Stemming

Collections NB MNB J48 SMO IBk IMBHN

ACM 70.20 76.90 68.99 82.05 58.09 74.66

Classic4 87.44 95.88 88.85 93.76 79.56 94.95

Enron-Top-20 52.50 71.60 64.79 66.63 59.05 69.16

Industry-Sector 40.26 75.42 58.10 67.39 63.41 72.41

Multi-Domain-Sentiment 75.10 80.74 74.89 82.34 65.95 82.45

Polarity 68.25 80.75 68.25 85.55 57.35 82.85

Re8 81.25 93.77 90.44 93.99 92.09 96.22

Trec7-3000 83.32 96.37 96.73 98.08 96.68 97.43

Fig.8. Critical difference diagram considering preprocessed representations and representations without preprocessing (WP).

4.4 Discussion

The IMBHN algorithm proved to be competitive when compared with other inductive algorithms. The results demonstrate that IMBHN obtained higher accuracy for a larger number of text document collections. IMBHN obtained better accuracy than the other algorithms even when using different term weighting methods, feature selection, or different preprocessing steps.

In general, IMBHN obtained a better result than the other algorithms for collections of e-mails and for news articles when the number of classes is larger than 7. It also obtained a better result for collections with a number of classes higher than 12, or for less than 12 classes when the average number of terms per document was higher than 50.

We can notice that the parameters impact the classification accuracy of the IBk and SMO algorithms, since their classification accuracy decreased when using just a single parameter. On the other hand, IMBHN obtained the first position in the statistical test ranking considering either the best case or a single parameter for each algorithm used in the experimental evaluation, which shows that IMBHN is useful for practical applications in which testing a large range of parameters is not feasible. IMBHN presented better results with statistically significant differences compared with the IBk, SMO, J48 and NB algorithms considering the best accuracies or one single parameter for each algorithm.

5 Conclusions and Future Work

In this article we presented the IMBHN (Inductive Model Based on Bipartite Heterogeneous Network) algorithm. IMBHN generates a classification model considering documents modeled as a bipartite heterogeneous network. The algorithm induces weights for the objects that represent the terms of the collection, which indicate the influence of these terms in the definition of the document classes. These weights are used as a model to classify unseen documents.

Experiments using 43 collections with different characteristics showed that the IMBHN algorithm outperformed, with statistically significant differences, traditional and state-of-the-art inductive algorithms. As future work, other types of relationships among objects will be addressed to improve the induction process. We also intend to identify spurious terms by analyzing their weight vectors and to reduce their impact on the propagation of the weights of these terms to the documents.

References

[1] Aggarwal C C, Zhai C. Mining Text Data. Springer, 2012.

[2] Feldman R, Sanger J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006.

[3] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 2002, 34(1): 1-47.

[4] Manning C D, Raghavan P, Schutze H. An Introduction to Information Retrieval. Cambridge University Press, 2008.

[5] Schutze H, Hull D A, Pedersen J O. A comparison of classifiers and document representations for the routing problem. In Proc. the 18th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, July 1995, pp.229-237.

[6] Blanzieri E, Bryl A. A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review, 2008, 29(1): 63-92.

[7] Kao A, Quach L, Poteet S, Woods S. User assisted text classification and knowledge management. In Proc. the 12th International Conference on Information and Knowledge Management, November 2003, pp.524-527.

[8] Han H, Giles C L, Manavoglu E, Zha H, Zhang Z, Fox E A. Automatic document metadata extraction using support vector machines. In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries, May 2003, pp.37-48.


[9] Kessler B, Numberg G, Schutze H. Automatic detection oftext genre. In Proc. the 35th Annual Meeting of the Associa-tion for Computational Linguistics and the 8th Conference ofthe European Chapter of the Association for ComputationalLinguistics, August 1997, pp.32-38.

[10] Dumais S, Chen H. Hierarchical classification of Web content.In Proc. the 23rd Annual International Conference on Re-search and Development in Information Retrieval, July 2000,pp.256-263.

[11] Salton G. Automatic Text Processing: The Transforma-tion, Analysis, and Retrieval of Information by Computer.Addison-Wesley Longman Publishing Co., Inc., 1989.

[12] Lu Q, Getoor L. Link-based classification. In Proc. In-ternational Conference on Machine Learning, August 2003,pp.496-503.

[13] Chakrabarti S. Mining the Web: Discovering Knowledge fromHypertext Data. Morgan-Kauffman, 2002.

[14] Oh H J, Myaeng S H, Lee M H. A practical hypertext catego-rization method using links and incrementally available classinformation. In Proc. the 23rd ACM Int. SIGIR Conf. Re-search and Development in Information Retrieval, July 2000,pp.264-271.

[15] Angelova R, Weikum G. Graph-based text classification:Learn from your neighbors. In Proc. the 29th Annual Int.SIGIR Conf. Research and Development in Information Re-trieval Conference, August 2006, pp.485-492.

[16] Tseng Y H, Ho Z P, Yang, K S, Chen C C. Mining term net-works from text collections for crime investigation. ExpertSystems with Applications, 2012, 39(11): 10082-10090.

[17] Wang W, Do D B, Lin X. Term graph model for text classifi-cation. In Proc. International Conference on Advanced DataMining and Applications, July 2005, pp.19-30.

[18] Newman M. Networks: An Introduction. Oxford UniversityPress, 2010.

[19] Widrow B, Hoff M E. Adaptive switching circuits. In Neu-rocomputing: Foundation of Research, Anderson J A (ed.),Cambridge.USA: MIT Press, 1998, pp.123-134.

[20] Rossi R G, Faleiros T P, Lopes A A, Rezende S O. Induc-tive model generation for text categorization using a bipar-tite heterogeneous network. In Proc. the 12th InternationalConference on Data Mining, December 2012, pp.1086-1091.

[21] Melville P, Gryc W, Lawrence R D. Sentiment analysis ofblogs by combining lexical knowledge with text classification.In Proc. the 15th International Conference on KnowledgeDiscovery and Data Mining, June 2009, pp.1275-1284.

[22] Boiy E, Hens P, Deschacht K, Moens M F. Automatic senti-ment analysis in on-line text. In Proc. the 11th InternationalConference on Electronic Publishing, June 2007, pp.349-360.

[23] Durant K T, Smith M D. Predicting the political sentiment ofweb log posts using supervised machine learning techniquescoupled with feature selection. In Proc. the 8th InternationalWorkshop on Knowledge Discovery on the Web, August 2006,pp.187-206.

[24] Chen R C, Hsieh C H. Web page classification based on a support vector machine using a weighted vote schema. Expert Systems with Applications, 2006, 31(2): 427-435.

[25] Wilcox A, Hripcsak G. Medical text representations for inductive learning. In Proc. American Medical Informatics Association Symposium, November 2000, pp.923-927.

[26] Sun A, Lim E P, Ng W K. Web classification using support vector machine. In Proc. the 4th International Workshop on Web Information and Data Management, November 2002, pp.96-99.

[27] Yu H, Han J, Chang K C C. PEBL: Positive example based learning for Web page classification using SVM. In Proc. the 8th International Conference on Knowledge Discovery and Data Mining, July 2002, pp.239-248.

[28] Yang Y, Slattery S, Ghani R. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 2002, 18(2/3): 219-241.

[29] Dumais S T, Chen H. Hierarchical classification of Web content. In Proc. the 23rd Int. ACM SIGIR Conf. Research and Development in Information Retrieval, July 2000, pp.256-263.

[30] Han E H, Karypis G, Kumar V. Text categorization using weight adjusted k-nearest neighbor classification. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, April 2001, pp.53-65.

[31] Yang Y. An evaluation of statistical approaches to text categorization. Information Retrieval, 1999, 1(1/2): 69-90.

[32] Androutsopoulos I, Koutsias J, Chandrinos K, Paliouras G, Spyropoulos C. An evaluation of naive Bayesian anti-spam filtering. In Proc. Workshop on Machine Learning in the New Information Age, May 2000, pp.9-17.

[33] Drucker H, Wu D, Vapnik V. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 1999, 10(5): 1048-1054.

[34] Han E, Karypis G. Centroid-based document classification: Analysis and experimental results. In Proc. the 4th European Conference on Principles of Data Mining and Knowledge Discovery, June 2000, pp.424-431.

[35] Nguyen T T, Chang K, Hui S C. Supervised term weighting centroid-based classifiers for text categorization. Knowledge and Information Systems, 2013, 35(1): 61-85.

[36] Marcacini R M, Cherman E A, Metz J, Rezende S O. A fast dendrogram refinement approach for unsupervised expansion of hierarchies. In Proc. ECML/PKDD Discovery Challenge: Third Challenge on Large Scale Hierarchical Text Classification, September 2012, pp.1-12.

[37] Frank E, Bouckaert R R. Naive Bayes for text classification with unbalanced classes. In Proc. the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, September 2006, pp.503-510.

[38] Ji M, Sun Y, Danilevsky M, Han J, Gao J. Graph regularized transductive classification on heterogeneous information networks. In Proc. European Conference on Machine Learning and Knowledge Discovery in Databases, September 2010, pp.570-586.

[39] Chiang M, Liou J, Wang J, Peng W, Shan M. Exploring heterogeneous information networks and random walk with restart for academic search. Knowledge and Information Systems, 2013, 36(1): 59-82.

[40] Xue G R, Shen D, Yang Q et al. IRC: An iterative reinforcement categorization algorithm for interrelated Web objects. In Proc. the 4th International Conference on Data Mining, November 2004, pp.273-280.

[41] Yin Z, Li R, Mei Q, Han J. Exploring social tagging graph for web object classification. In Proc. International Conference on Knowledge Discovery and Data Mining, June 2009, pp.957-966.

[42] Zhou D, Bousquet O, Lal T N, Weston J, Scholkopf B. Learning with local and global consistency. In Proc. Advances in Neural Information Processing Systems, December 2003.

[43] Aggarwal C C, Zhao P. Towards graphical models for text processing. Knowledge and Information Systems, 2013, 36(1): 1-21.

[44] Markov A, Last M, Kandel A. Model-based classification of Web documents represented by graphs. In Proc. WEBKDD, August 2006, pp.84-89.

[45] Mishra M, Huan J, Bleik S, Song M. Biomedical text categorization with concept graph representations using a controlled vocabulary. In Proc. the 11th International Workshop on Data Mining in Bioinformatics, August 2012, pp.26-32.

[46] Cancho R F, Sole R V, Kohler R. Patterns in syntactic dependency networks. Physical Review E, 2004, 69(5): 051915.

[47] Sousa C A R, Rezende S O, Batista G E A P A. Influence of graph construction on semi-supervised learning. In Proc. European Conference on Machine Learning and Knowledge Discovery in Databases, September 2013, pp.160-175.

[48] Tomas D, Vicedo J L. Minimally supervised question classification on fine-grained taxonomies. Knowledge and Information Systems, 2013, 36(2): 303-334.

[49] Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann, 2005.

[50] Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In Proc. the 23rd International Conference on Machine Learning, June 2006, pp.161-168.

[51] Kohonen T, Barna G, Chrisley R. Statistical pattern recognition with neural networks: Benchmarking studies. In Proc. International Conference on Neural Networks, July 1988, pp.61-68.

[52] Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006, 7: 1-30.

Rafael Geraldeli Rossi received the B.S. degree in information systems and the M.S. degree in computer science and computational mathematics from the University of Sao Paulo, Brazil, in 2009 and 2011, respectively. He is a Ph.D. candidate at the University of Sao Paulo. His research interests include machine learning, text mining, and graph-based methods.

Alneu de Andrade Lopes received the B.S. degree in civil engineering from the Federal University of Mato Grosso do Sul in 1985, the M.S. degree in computer science and computational mathematics from the University of Sao Paulo in 1995, and the Ph.D. degree in computer science from the University of Porto in 1995. Currently, he is an assistant professor in the Institute of Mathematics and Computer Science at the University of Sao Paulo. His research interests lie in data mining, visual data mining, and mining networked data.

Thiago de Paulo Faleiros received the B.S. degree in computer science from the Federal University of Goias in 2007, and the M.S. degree in computer science from the State University of Campinas in 2010. Currently, he is a Ph.D. candidate at the University of Sao Paulo. His research interests include data and text mining, optimization, and computational theory.

Solange Oliveira Rezende received her B.S. degree in mathematics from the Federal University of Uberlandia, her M.S. degree in computer science and computational mathematics from the University of Sao Paulo, and her Ph.D. degree in mechanical engineering from the University of Sao Paulo, Sao Carlos. She is currently an associate professor at the Institute of Mathematics and Computer Science at the University of Sao Paulo. She has experience in artificial intelligence, working on the following topics: machine learning, and data and text mining.