a method based on the chi-square test for document classification michael oakes¹ robert...

A Method Based on the Chi-Square Test for Document ClassificationMichael Oakes Robert Gaizauskas Helene Fowkes Anna Jonsson Vincent Wan & Micheline BeaulieuSCET, University of SunderlandDepartment of Computer Science, University of SheffieldDepartment of Information Studies, University of SheffieldUKIntroductionWe introduce a method for document classification based on using the chi-square test to identify characteristic vocabulary of document classes. The method was developed to assist in the classification of texts occurring in Scrip, an electronic news bulletin in the pharmaceutical industry. We describe the creation of training and test collections of Scrip texts and the results of training and evaluating our method against these texts. For purposes of comparison, we have also trained and tested a support vector machine (SVM) classifier on the same data. The accuracy of the two document classifiers was compared using the precision and recall metrics, and it was found that the SVM scored more highly. Despite this negative result, we believe the chi-square approach shows promise, and that further refinements, which we propose, will lead to better results.

Given these observed and expected frequencies, the chi-square value is calculated as the sum of the quantities (O-E) x (O E) / E for each position in the contingency table, i.e.CLASSIFYING DOCUMENTS USING CHARACTERISTIC VOCABULARY EvaluationTopics of interest

Product Launches refers to predicted or future product launches on the market

Licensing describes newly licensed products and licensing opportunities / agreements between companies Meetings tracks forthcoming meetings in the pharmaceutical world

People Tracking outlines recent appointments, promotions and retirements in the pharmaceutical industry

Regulatory refers to announcements and investigations about processes and practice in the industry and changes to regulatory guidelines

Company Relations covers collaborations, strategic partnerships, mergers, acquisitions and joint ventures Clinical Trials covers the whole process of clinical development of a drug and including drug approvalsSCRIP

THE CHI-SQUARE APPROACH

The chi-square procedure for identifyingcharacteristic vocabulary was first used by Rayson, Leech & Hodges [2] to find the vocabulary characteristic of male vs. femalespeech.We use the same technique to find vocabularycharacteristic of a topic vs. vocabulary typicalof Scrip documents in general.The approach includes two stages:1 Find the vocabulary characteristic ofa topic.2 Find which documents tend tocontain this vocabulary these should beabout the topic of interest.

a = number of times W occurs in the specific corpus, i.e. a = count(S,W)b = number of times W occurs in the residue corpus, i.e. b = count(R,W)c = total number of words (tokens) in the specific corpus which are not W, i.e. c = |S| - ad = total number of words (tokens) in the residue corpus which are not W, i.e. d = |R| - b

For example, for the observed frequency a, the total for the row in the contingency table containing a is a + b, the column total for a is a + c, and the grand total is the total of all four counts, a + b + c + d.If a / b > (a + c) / (a + d), the word is more typical of the general corpus a positive indicator, otherwise it is more typical of the residue corpus a negative indicator.For the chi-square calculation to be reliable, we must discard any words whose expected frequencies are less than 5.If chi-square > 3.84, we can be 95% confident that the word does occur more frequently in one of the two text types.

We next calculate the expected frequencies (E) for each table cell (the counts which would have been obtained given the size of the corpora and the rarity of the word in question, if the word were equally typical of each corpus) using the formula:

The values a d are called the observed frequencies (O), and may be arranged in a 2 x 2 contingency table, i.e.

Scrip is a daily pharmaceutical industry news bulletin, available electronically, and circulated widely.

The work described here is motivated by the practical aim of dividing an incoming stream of Scrip texts into topic-based substreams each of which is fed to a topic-specific information extraction (IE) system.

Together the document routing and IE work are part of the TRESTLE project, whose overall goal is to develop an advanced text access facility to support information workers at GlaxoSmithKline. Table 1: Words Most Typical of Clinical Trials as Opposed to Non-Clinical Trials, as Found by the Chi-Square Test Table 2: Words Least Typical of Clinical Trials as Opposed to Non-Clinical Trials Texts, as Found by the Chi-Square Test Table 3. Bigrams (Adjacent Word Pairs) Most Typical of Clinical Trials as Opposed to Non-Clinical Trials Texts Characteristic bigrams can also be found using the chi-square test, but we have not yet used them for document classification.

We use the ranked lists of characteristic vocabulary to classify unseen Scrip texts. This is a simple approach, just one of many possible methods, and, as the subsequent experiments show, it almost certainly needs refinement.

The classification algorithm takes two inputs:

1 A list L of all the keywords with significant chi-square scores (X > 3.84), noting which words are positive indicators and which are negative indicators.

2 A set D of documents to be classified.

The algorithm proceeds by reading each document d D in turn and assigning a score to it as follows:

Score (d) = 0For each word (token) w DIf w L and w is a positive indicatorScore (d) = Score (d) + 1Else if w L and w is a negative indicatorScore (d) = Score (d) 1

To draw the boundary between in-category and out-of-category documents, we have used the following approach:If one assumes that ones test set contains texts divided into categories in the same proportion as ones training data, then one can select the equivalent proportion of documents with the highest document scores from the test set. For example, given a training corpus of 525 documents of which 122 are classified as dealing with clinical trials, one can assume that 122 / 525 or ~ 23% of the documents in the test set will also be about clinical trials and select the 23% with the highest document scores.

Three months supply of Scrip, covering the period January to March 1999, was used to derive both training and test sets. The 525 documents covering the period the 1st to the 25th January were used as the training set, while the remainder (1549 documents) were used as the test set.

The output of this process is a ranked list of document scores for all the documents in D, with the highest total given to those documents containing typical vocabulary for the specific domain, ideally those actually about the domain. THE SUPPORT VECTOR MACHINE (SVM) APPROACH

A Kappa value of 1 indicates perfect agreement, while a value of 0 indicates no better than random agreement. The results for the two judges over both the training and the test sets are given in Table 4.

The judgements for the first 525 documents became the training set, while the judgements for the last 1579 documents became the test set.

The degree of consistency between the two judges (reflecting the relative ease with which the documents could be classified into each domain) was measured using the Kappa statistic [3].

We compared the chi-square method with an SVM approach to document classification. SVMs have been used for text classification by Dumais et al. [1].Our training documents were converted into feature vectors (inputs to the SVM) in two slightly different ways. In each case, the entire vocabulary of the document set (both training and test) was found, e.g.:apple banana carrot dill eggplantthen for a document containing the words apple carrot carrot dill, either a binary feature vector would be created: 1 0 1 1 0or a word-occurrence-count feature vector: 1 0 2 1 0.

In our experiments, all words were used (even those with a frequency of 1), without stemming.

The classifications for both the test set and the training set were produced manually by two independent judges, who had helped to develop the categories. A Scrip document could be assigned to any number of categories (or none). A third person acted as an adjudicator if the two judges disagreed whether a document should be placed in a given category. TRAINING

Columns a1 and a2 are the number of documents assigned to each domain category by each annotator; column match is the number of times both judges assigned the same document to that category. Table 4: Inter-annotator Agreement Measures

Recall was defined as the number of matches (where the machine placed a document in a category where the human judges also placed it) divided by the total number of documents places by the human judges in that category.

Table 5 shows results for the chi-square method. Note that the worst results are those obtained for the launches and licensing categories the two for which the kappa statistic showed the worst results for human agreement.Table 5: Results for the Chi-Square Method Table 7: Results for the SVM Method using Binary Vectors, Separated by Quadratic SVMTable 6: Results for the SVM Method using Binary Vectors, Separated by Linear SVM Table 8: Results for the SVM Method using Word Occurrence Feature Vectors, Separated by Linear SVMSearle's Celebrex approved in Brazil:Publication Date: 22 Jan 1999 Text Searle's selective COX-2 inhibitor, Celebrex (celecoxib), has been approved in Brazil, its first approval outside the US. The product is currently being detailed to select US physicians, with a major product roll-out expected shortly. Searle is co-promoting the product with Pfizer in all markets except Japan. Additional approvals are pending in more than 30 countries. Source: Scrip PJB Publications Ltd Conclusions

The chi-square procedure for identifying characteristic vocabulary is as follows. For each unique word (type) W in each corpus, count the following, where count(C,W) is the number of times word type W occurs in corpus C and |C| is the size of C in word tokens:

In our setting we can identify a specific corpus S a set of texts from Scrip in a given period about a specific topic such as clinical trials, a residue corpus R the set of Scrip texts from the same period which are not about the given topic. And finally a general corpus G, which is all of the Scrip texts from the given period; i.e., G = S R.

Other results are substantially better. Tables 6-8 show the results of the SVM method, in various configurations.

Precision was defined as the number of matches divided by the total number of documents placed by the machine in that category.

The SVM classifier performs considerably better than the chi-square classifier on the same data. Despite this, we believe that the chi-square approach does show sufficient promise to warrant further experimentation. In particular, we should examine the following:

1All positive indicators are given the same score of +1 at present. One could use differential scores reflecting the chi-square value for each word; or one could learn a function of these values which optimally categorised texts in the training set. 2 An absolute document score threshold for determining category membership should be considered instead of a proportional one (which relies on the assumption that the proportion of texts in various categories in Scrip remains constant over time). 3 Characteristic n-grams (possibly with variable n) should be considered as well as just single words.Acknowledgement:Project funded by GlaxoSmithKlineReferences

[1] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorisation. In Proceedings of the 7th Int. Conf. on Information and Knowledge Management, 1998.[2] P. Rayson, G. Leech, and M. Hodges. Social Differentiation in the use of English vocabulary. Int J. of Corpus Linguistics 2(1): 133-152, 1997.[3] S. Siegel and N. J. Castellan. Nonparametric Statistics for the Behavioural Sciences. McGraw-Hill, 1988.Scrip is the trademark of PJB Publications Ltd. See http://www.pjbpub.co.uk

The TRESTLE project: See http://nlp.shef.ac.uk/trestle/AND TEST DATA

Word # in Residue Corpus # in Specific Corpus patients 101282366.0phase 17154306.2study 25128219.4treatment 62141159.0trial 24100158.6trials 41118155.7III 773148.9with 497432134.8cancer 3698125.0II 1167121.2results 2179119.8studies 3185108.9product 17619496.6therapy 549687.8approval 609984.0failure 33879.8placebo 23576.4days 104471.3in 1893107471.1

Word # in Residue Corpus# in Specific Corpus X pharmaceutical 2471079.9PEOPLE 148061.2companies 2532358.4products 3224551.2makes 129346.41999 61213644.3industry 115242.9president 120440.6on 66915940.5appointment 96039.7

Bigram # in Residue Corpus# in SpecificCorpus X Phase III 1255111.0the product 578688.3patients with 165086.2a Phase 33484.5Phase II 164573.4approved in 124071.1in patients 123865.9the treatment 325461.6treatment of 415957.7Phase I 52757.4

a1a2matchkappasignific.launches6810351.579not signifi.licensing868643.479not signifi.meetings149149148.993p < .001people304306289.939p < .001regulatory173214105.496p < .01relations196288160.617p < .001trials602553476.758p < .001

matchmachinehumanPrReclaunches105859.172.169licensing127270.167.171meetings87138103.630.845people205260222.788.923regulatory105259154.405.682relations142406181.350.785trials288371430.776.670

Specific CorpusResidue CorpusWordab Wordcd


matchmachinehumanPrReclaunches102859.357.169licensing101370.769.143meetings101102103.990.981people206208222.990.928regulatory73137154.533.474relations77107181720.425trials179199430.899.416


a method based on the chi-square test for document classification michael oakes¹ robert...

Documents