

Proceedings of NLP-KE '05

Automatic Subject Indexing of Chinese Documents

Sulan ZHANG 1,2, Qing HE 1, Zheng ZHENG 2, Zhongzhi SHI 1

1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P. R. China
2. Graduate School of Chinese Academy of Sciences, Beijing 100039, P. R. China

{zhangsl, heq, zhengz, shizz}@ics.ict.ac.cn

Abstract-Automatic subject indexing is a process that automatically produces a set of attributes representing the content or topic of a document. In this paper, two approaches to automatic subject indexing, based respectively on the VSM (Vector Space Model) and on subject word segmentation, are presented. The experimental results show that the first approach, based on the VSM, is appropriate when the documents to be indexed are topically concentrated and few subject words are available. The second approach, based on subject word segmentation, greatly improves the efficiency of indexing and inter-indexer consistency.

I. INTRODUCTION

At present, the ordered organization and search of information on the Internet usually employ full-text search or subject search. The technique of full-text search is simple, but the organization of information lacks scientific ordering: many important subjects are often submerged in a mass of irrelevant terms, and both precision and recall are poor.

The premise of subject search is that subject indexes of documents exist on the Internet. Subject indexing of documents has traditionally been one of the most important research topics in information science. Indexes facilitate retrieval of information in both traditional manual systems and newer computerized systems; without proper indexing and indexes, search and retrieval are virtually impossible [1].

The subject vocabulary is the kernel of subject search. It is a structured set of concepts, used mainly in the division of information and in the automatic or auxiliary selection of indexing words. It is an important means of improving precision and recall and of realizing both the scientific ordering of information on the Internet and intelligent conceptual search. During the ordering and indexing of information, some concepts are picked out of the subject vocabularies, according to the topic and content of a document, to serve as the identification of that document, and are saved into a database. So long as search words are selected appropriately from the subject vocabulary, the corresponding document can be retrieved.

In this paper, we present two approaches to automatic subject indexing: one based on the VSM (Vector Space Model), the other based on subject word segmentation. The first approach builds a VSM for every subject word from training documents, then compares the VSM of each test document with that of each subject word to obtain the similarity between them, and finally indexes the test document with the top several most similar subject words. The second approach first cuts the subject words in the controlled vocabulary, then computes the frequency of every subject word in a document, and finally indexes the document with the top several most frequent subject words.

The paper is organized as follows. In Section 2, we review subject indexing. In Sections 3 and 4, we introduce the two approaches and their respective experimental results. In Section 5, we draw conclusions about the two approaches and give an outlook on automatic subject indexing.

II. RELATED WORKS

Subject indexing is a process that produces a set of attributes representing the content or topics of a document [2]. The goal of indexing is to bring out the characteristics of a document as exactly as possible and to provide for effective searching, so the indexing process should be user oriented. An efficient way to attain this is to employ the subject vocabularies provided by users in the process.

Indexing has been one of the most important research topics in information science; retrieval of information benefits from it in both traditional manual systems and newer computerized systems. Traditional human indexing first recognizes and selects the essence of a document by reading or scanning it, then represents that essence [3]. Because of severe time constraints, it is difficult for indexers to select a small set of "best" terms among all the possible terms that could represent a document. Moreover, there is low agreement in indexers' assignment of terms. In addition, it is commonly known that a single term can have different meanings in different contexts and that a single concept can be represented by several different terms. Therefore, term ambiguity and indexing inconsistency often lead to poor precision and poor recall in retrieval. To improve inter-indexer consistency, controlled vocabularies, which define conceptual terms and their relationships to each other in a particular domain of knowledge, are widely employed [1].

Manual indexing is expensive and time consuming, so techniques for automatic subject indexing have been developed. One of them is The Identification System (TIS), a content-based indexing system for news stories developed by Hayes and Weinstein [4]. It uses if-then rules that take into account concepts in the document and their respective "strengths" of occurrence. Another is NameFinder, which finds occurrences of names with specific features in online text [5]. NameFinder also attempted to solve the problem of name variation, in which different names may be used to mean the same thing; this was achieved by using built-in knowledge of the names as structured in a specific domain.

III. AUTOMATIC SUBJECT INDEXING SYSTEM BASED ON VSM

A. Relevant Description of VSM


Since the VSM (Vector Space Model) was introduced by Salton et al. [6], it and its related techniques, including term selection, weight-assignment strategies, and query optimization, have been used extensively in many areas, and it has become one of the simplest and most efficient text representation models.

According to the description by Salton et al., the content features of a document are often represented by a term set $D(t_1, t_2, \ldots, t_N)$, where $t_k$ is a term, $1 \le k \le N$, and every term $t_k$ is assigned a weight $w_k$ that indicates the importance of the term in the document $D$. That is,

$D = D(t_1, w_1; t_2, w_2; \ldots; t_N, w_N)$, $1 \le k \le N$,

or simply $D = D(w_1, w_2, \ldots, w_N)$.

Given a document $D(t_1, w_1; t_2, w_2; \ldots; t_N, w_N)$, if the terms in $D$ are considered unordered and pairwise distinct, then $(t_1, t_2, \ldots, t_N)$ can be viewed as an $N$-dimensional coordinate system and $(w_1, w_2, \ldots, w_N)$ as the corresponding coordinates. The document $D(w_1, w_2, \ldots, w_N)$ is then viewed as a vector in the $N$-dimensional space, and $D(w_1, w_2, \ldots, w_N)$ is the vector representation, or vector space model, of the document $D$.
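As a small illustration of this representation (a sketch of ours with hypothetical toy terms, not code from the paper), a document can be stored as a sparse mapping from terms to weights; here the raw term frequency stands in for the weight, which formula (1) below refines:

    from collections import Counter

    def document_vector(tokens):
        # D = D(t1, w1; ...; tN, wN) as a sparse dict; raw term
        # frequencies serve as placeholder weights here.
        return dict(Counter(tokens))

    d = document_vector(["subject", "indexing", "subject", "vocabulary"])
    # d == {"subject": 2, "indexing": 1, "vocabulary": 1}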

B. Build the VSM of a Subject Word

Given training documents and controlled vocabularies, how can we acquire the VSM of a subject word? This consists of two processes. First, build the two-level hierarchy of all subject words: the first level consists of all subject words from the controlled vocabularies; the second level consists of all training documents relevant to each subject word. Second, build the VSM of each subject word according to this hierarchy.

Fig. 1. The two-level hierarchy of subject words: the subject words of the controlled vocabulary (subject word 1, subject word 2, ..., subject word n) at the first level, their relevant training documents at the second.

When building the VSM of every subject word, the computation of term weights uses the widely accepted automatic weight-assignment strategy tf*idf (term frequency times inverse document frequency). To deal with the limitations of tf*idf, we improve it by incorporating the information gain from information theory [7]. The formula is:

$w_{ik} = tf_{ik} \times \log_2(N / n_k + 0.01) \times IG_k$ (1)

where $tf_{ik}$ is the frequency of term $k$ ($T_k$) in document $d_i$, $N$ is the number of all documents, and $n_k$ is the number of documents containing $T_k$. The inverse document frequency ($idf$) is $\log_2(N / n_k + 0.01)$. The $tf$ is a document-level statistic that measures the importance of a term within a document. The $idf$ is a global statistic over the whole document set that measures the distribution of the term in the set: the more documents a term appears in (that is, the smaller its $idf$), the smaller the term's contribution to differentiating among documents. $IG_k$ is the amount of information (information gain) of $T_k$; its formula is given in (2):

$IG_k = H(D) - H(D \mid T_k)$ (2)

$H(D)$ is the entropy of the document set $D$; its formula is:

$H(D) = -\sum_{d_i \in D} P(d_i) \log_2 P(d_i)$ (3)

The conditional entropy given the term $T_k$ is as follows:

$H(D \mid T_k) = -\sum_{d_i \in D} P(d_i \mid T_k) \log_2 P(d_i \mid T_k)$ (4)

The probability of the document $d_i$ is

$P(d_i) = \mathrm{wordfreq}(d_i) \big/ \sum_{d_j \in D} \mathrm{wordfreq}(d_j)$ (5)

where $\mathrm{wordfreq}(d_i)$ is the sum of all term frequencies in document $d_i$.
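The quantities in (2)-(5) can be computed directly from token counts. The sketch below is our own illustration, not the paper's code: the corpus is assumed to be a dict mapping document ids to token lists, and the renormalized reading of $P(d_i \mid T_k)$ is our assumption, since the paper does not spell it out.

    import math

    def p_doc(docs):
        # Formula (5): P(d_i) = wordfreq(d_i) / sum_j wordfreq(d_j),
        # where wordfreq(d_i) is the token count of document d_i.
        total = sum(len(toks) for toks in docs.values())
        return {d: len(toks) / total for d, toks in docs.items()}

    def entropy(dist):
        # Formulas (3) and (4): H = -sum_i p_i * log2(p_i).
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def info_gain(docs, term):
        # Formula (2): IG_k = H(D) - H(D | T_k). We read P(d_i | T_k)
        # as P(d_i) restricted to documents containing T_k, renormalized
        # (an assumption on our part).
        p = p_doc(docs)
        matched = {d: p[d] for d, toks in docs.items() if term in toks}
        z = sum(matched.values())
        cond = {d: v / z for d, v in matched.items()} if z else {}
        return entropy(p) - entropy(cond)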

The process of building the VSM of every subject word is as follows:

Step 1. For every document under a subject word, compute the frequency of every term in the document.

Step 2. For a term $T_k$ under the $i$th subject word, compute its total frequency $tf_{ik}$ over all documents under the subject word and the number of documents $n_k$ containing the term.

Step 3. For every subject word, compute the number of terms under the subject word and the total term frequency $\mathrm{wordfreq}(d_i)$.

Step 4. Compute $P(d_i)$, $H(D)$, $H(D \mid T_k)$, and $IG_k$.

Step 5. Build the VSM of the subject word according to the following steps:
1) Compute the weight of every term with formula (1).
2) Sort the terms in descending order of weight into $T_1, T_2, \ldots, T_n$.
3) Initialize the VSM as empty.
4) Examine every term $T_1, T_2, \ldots, T_n$ in turn. If the number of feature terms in the VSM exceeds the threshold $NUM_T$, the VSM is complete. Otherwise, if the weight of term $T_k$ exceeds the threshold $\alpha$ and its frequency exceeds the threshold $\beta$, the term becomes a feature term of the subject word and is added, together with its weight, into the VSM of the subject word.
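Putting Steps 1-5 together, a minimal sketch might look as follows (our illustration under stated assumptions: each subject word's training documents arrive as token lists, $n_k$ and $IG_k$ are precomputed over the whole collection, and the threshold values are placeholders):

    import math
    from collections import Counter

    def build_vsm(docs_under_word, N, n, ig, NUM_T=100, alpha=0.0, beta=1):
        # Steps 1-2: tf_ik, the frequency of each term over all
        # documents under the subject word.
        tf = Counter()
        for tokens in docs_under_word:
            tf.update(tokens)
        # Step 5.1: weight each term by formula (1);
        # n[t] is n_k, ig[t] is IG_k.
        w = {t: f * math.log2(N / n[t] + 0.01) * ig[t] for t, f in tf.items()}
        # Steps 5.2-5.4: sort by weight, keep at most NUM_T feature terms
        # whose weight exceeds alpha and whose frequency exceeds beta.
        vsm = {}
        for t in sorted(w, key=w.get, reverse=True):
            if len(vsm) >= NUM_T:
                break
            if w[t] > alpha and tf[t] > beta:
                vsm[t] = w[t]
        return vsm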

C. Automatically Index the Documents

The process of subject indexing is as follows:

Step 1. Vectorize a document into $d = d(T_1, W_1; T_2, W_2; \ldots; T_m, W_m)$ according to the frequency and position of each term in the document.

Step 2. Assuming that there are $t$ subject words $V_1, V_2, \ldots, V_t$, the VSM of the subject word $V_i$ is $C_i = C_i(t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, $i = 1, 2, \ldots, t$.

Step 3. Compute the pairwise similarity between the document and each subject word as the cosine of the angle between $C_i$ and $d$. The computation of $Sim(C_i, d)$ is as follows:
1) Set $Sim(C_i, d) = 0$, $WG = 0$, $wg = 0$;
2) If $m > n$, then for every term $t_u$ ($u = 1, 2, \ldots, n$) in $C_i$, if it exists in the document $d$ with corresponding term $T_v$: $Sim(C_i, d) = Sim(C_i, d) + w_u \times W_v$, $wg = wg + w_u^2$, $WG = WG + W_v^2$;
3) Otherwise, for every term $T_v$ ($v = 1, 2, \ldots, m$) in the document $d$, if it exists in $C_i$ with corresponding term $t_u$: $Sim(C_i, d) = Sim(C_i, d) + W_v \times w_u$, $wg = wg + w_u^2$, $WG = WG + W_v^2$;
4) $Sim(C_i, d) = Sim(C_i, d) \big/ \sqrt{WG \times wg}$;
5) Find the top $k$ subject words ($k$ is the number of indexing words) in descending order of $Sim(C_i, d)$;
6) Index the document $d$ with those $k$ subject words.
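Since branches 2) and 3) merely iterate over the shorter of the two vectors, the whole comparison reduces to a cosine over the terms the vectors share. A sketch of Steps 1-6 under that reading (our illustration; vectors are term-to-weight dicts, and the function names are ours):

    import math

    def sim(c, d):
        # Accumulate Sim, wg, and WG over shared terms, as in steps 1)-4);
        # iterating the intersection plays the role of the m > n branching.
        shared = c.keys() & d.keys()
        dot = sum(c[t] * d[t] for t in shared)
        wg = sum(c[t] ** 2 for t in shared)
        WG = sum(d[t] ** 2 for t in shared)
        return dot / math.sqrt(WG * wg) if wg and WG else 0.0

    def index_document(d, subject_vsms, k):
        # Steps 5)-6): rank the subject words by similarity, keep the top k.
        ranked = sorted(subject_vsms, key=lambda v: sim(subject_vsms[v], d),
                        reverse=True)
        return ranked[:k]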

D. Experimental Results

The advantages of this approach are that it considers the overall effect of the training documents and can index a document at a higher level with a subject word even if the subject word does not occur in the document; its disadvantage is that the efficiency of indexing becomes very low when the controlled vocabulary is too big.

Considering that the controlled vocabulary relevant to the documents to be indexed is sometimes very large, we present the other approach, described in detail in the following section.

Fig. 2. Results of subject indexing based on the VSM

IV. AUTOMATIC SUBJECT INDEXING SYSTEM BASED ON SUBJECT WORD SEGMENTATION

A. Subject Word Segmentation

It is universally agreed among researchers that word segmentation is a necessary first step in Chinese language processing, because in Chinese text sentences are represented as strings of hanzi, without natural delimiters such as the white spaces that delimit English sentences into sequences of words. Most existing word segmentation systems fall into three main categories, depending on whether they use statistical information and electronic dictionaries: statistics-based approaches, non-statistical dictionary-based approaches, and combined statistical and dictionary-based approaches [8].

Before the following discussion, we should answer the question:

why are the subject words in the controlled vocabulary segmented before subject indexing? In our knowledge and experience, indexer experts often mark a document with a subject word when its various cut words are contained in the document, even if the subject word itself is not.

Statistics-based word segmentation approaches cut words based on the joint probability between words in the documents. However, the segmentation of subject words in a controlled vocabulary is a segmentation without statistical information, so a dictionary-based approach is the only way to do it.

The main components of dictionary-based word segmentation are the segmentation dictionary, the order of scanning the text, and the principle of matching. There are three orders in which to scan text: positive order and inverse order, by which the sentences being segmented are read from the start or the end, respectively, and composite order, which combines the two. There are four principles of matching: maximum matching, minimum matching, matching word by word, and optimal matching. In this paper, we use the word-by-word maximum matching algorithm. Its principal routine is:

Step 1. From the start of a subject word, take a string of hanzi whose length does not exceed that of the longest entry in the dictionary as the matching string.

Step 2. Look up the matching string in the dictionary.

Step 3. If it is found, treat the matching string as a cut word; otherwise, remove the rearmost hanzi from the matching string to form a new matching string, and return to Step 2.

Repeat this process until all subject words are segmented.
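A sketch of this word-by-word maximum matching routine (our illustration: the dictionary is assumed to be a plain set of words, and an unmatched single hanzi is emitted as-is):

    def max_match(subject_word, dictionary):
        # Forward maximum matching over a subject word (Steps 1-3 above).
        max_len = max(len(w) for w in dictionary)
        cuts, i = [], 0
        while i < len(subject_word):
            # Step 1: take at most max_len hanzi from the current position.
            j = min(i + max_len, len(subject_word))
            # Step 3: while there is no dictionary match, drop the
            # rearmost hanzi and retry.
            while j > i + 1 and subject_word[i:j] not in dictionary:
                j -= 1
            # Step 2: record the matched string (or a single hanzi).
            cuts.append(subject_word[i:j])
            i = j
        return cuts

    # e.g. max_match("数据库管理系统", {"数据库", "管理", "系统"})
    # -> ["数据库", "管理", "系统"]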

B. The Process Description of Subject Indexing

Fig. 3 shows the process of subject indexing. In the preprocessing stage, every subject word in the controlled vocabularies is cut based on a general vocabulary; the cut results are independent, complete words that are as long as possible. For convenience and efficiency, these results, together with the subject word itself, are saved into a cut controlled vocabulary. The cutting process needs to be performed only when the controlled vocabulary changes. Fig. 4 shows the cut controlled vocabulary: the KEYWORD is the subject word of the controlled vocabulary, the CUTWORD holds the cutting results of the KEYWORD (different words in the CUTWORD are separated by white space), and the COUNT is the number of words in the CUTWORD.

In the subject indexing stage, first all documents are organized into a table named BEMARKTABLE; then the subject indexing system indexes every document based on the cut controlled vocabulary. Users can freely designate the number of subject indexing words.

Fig. 3. The process of subject indexing

Fig. 4. The cut controlled vocabulary
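As a data-structure sketch (ours, with field names mirroring the columns in Fig. 4), an entry of the cut controlled vocabulary can be precomputed once per vocabulary change, reusing the max_match routine sketched in Section IV.A:

    def cut_entry(keyword, dictionary):
        # One row of the cut controlled vocabulary (cf. Fig. 4):
        # KEYWORD, its space-separated CUTWORD, and the COUNT of cut words.
        cutword = max_match(keyword, dictionary)
        return {"KEYWORD": keyword,
                "CUTWORD": " ".join(cutword),
                "COUNT": len(cutword)}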

C. The Mathematical Description of Subject Indexing

Let $D = (D_1, D_2, \ldots, D_n)$ denote the space of documents, where $D_i$ ($1 \le i \le n$) is the $i$th document, and let $T = (t_1, t_2, \ldots, t_m)$ be the controlled vocabulary, where $t_i$ ($1 \le i \le m$) is the $i$th subject word, a candidate for indexing. The following weight matrix describes the pairwise indexing weights of documents and subject words; for example, the element $w_{ij}$ in the $i$th row and $j$th column designates the weight with which the $i$th document is indexed by the $j$th subject word.

$$W = \begin{array}{c|ccc} & t_1 & \cdots & t_m \\ \hline D_1 & w_{11} & \cdots & w_{1m} \\ \vdots & \vdots & & \vdots \\ D_n & w_{n1} & \cdots & w_{nm} \end{array}$$

The weight is computed as in (6):

$w_{ij} = w_{ij}^1 + w_{ij}^2$ (6)

where

$w_{ij}^1 = \big(\mathrm{num}(\mathrm{TITLE\ of\ } D_i, t_j) \times 3 + \mathrm{num}(\mathrm{CONTENT\ of\ } D_i, t_j)\big) \times \mathrm{COUNT}(t_j)$

$w_{ij}^2 = \min_{1 \le k \le \mathrm{COUNT}(t_j)} \big(\mathrm{num}(\mathrm{TITLE\ of\ } D_i, \mathrm{cut}_k(t_j)) \times 3 + \mathrm{num}(\mathrm{CONTENT\ of\ } D_i, \mathrm{cut}_k(t_j))\big)$

In the above equations, $\mathrm{num}(\mathrm{TITLE\ of\ } D_i, t_j)$ and $\mathrm{num}(\mathrm{CONTENT\ of\ } D_i, t_j)$ are the frequencies of the subject word $t_j$ in the title and content of the document $D_i$, respectively. $\mathrm{COUNT}(t_j)$ is the number of words cut from the subject word $t_j$ (see Fig. 4; $t_j$ corresponds to the $j$th value of the KEYWORD). $\mathrm{cut}_k(t_j)$ is the $k$th word in the CUTWORD corresponding to the subject word $t_j$. The term $w_{ij}^2$ is the minimum, over the various words cut from $t_j$, of their weighted frequencies in the title and content of $D_i$.

The title of a document should be more significant than its content in representing its subject, so the frequency of the subject word in the title is multiplied by 3 (the value 3 is the result of many experiments). A document can be indexed with multiple subject words, and analysis of a large number of documents shows that COUNT(a subject word) is a likely contributor to the degree of specialization of a document indexed by that subject word, so it is reasonable to multiply the frequency of the subject word in the title and content by COUNT(the subject word). If a subject word does not occur in a document, the last item of (6) must still be computed, because the co-occurrence in a document of all words cut from the subject word can also indicate that the subject word is relevant to the document.
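Under the reconstruction of formula (6) given above (the placement of the COUNT factor is our reading of a garbled source, and num() is a plain substring count here), the weight of one document for one subject word could be computed as follows; cut_words is assumed to be the non-empty CUTWORD list:

    def num(text, word):
        # Frequency of `word` in `text` (simple substring count).
        return text.count(word)

    def weight(title, content, subject_word, cut_words):
        # w1: direct match of the subject word; title hits weigh 3x, and
        # the whole term is scaled by COUNT (the number of cut words).
        count = len(cut_words)
        w1 = (num(title, subject_word) * 3 + num(content, subject_word)) * count
        # w2: the minimum score over the words cut from the subject word;
        # nonzero only when ALL cut words co-occur in the document.
        w2 = min(num(title, c) * 3 + num(content, c) for c in cut_words)
        return w1 + w2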

In fact, a document is indexed with several subject words rather than with all subject words. The number of subject indexing words is designated by the user as a parameter MARKNUM. Thus, only the top MARKNUM weights are saved, and the comparison of weights need only be applied to those.

D. Experimental Results

The test environment is as follows:

Hardware: Intel Pentium 4 CPU, 2.40 GHz, 512 MB RAM, 80 GB hard disk

Software: Windows 2000; Oracle 8.1.5

The experimental data are real, natural data from a user. The controlled vocabulary consists of 230,755 subject words, and the number of documents indexed is 199,945. The throughput of the cutting process and of subject indexing is 506/min and 125/min, respectively. The results of indexing are shown in Fig. 5.

In Fig. 5, the TITLES and CONTENTS list the titles and contents of the documents, respectively. The KEYWORD holds the subject words assigned by indexer experts, and the MARKWORD holds the results of automatic indexing.

Fig. 5. The results of subject indexing

From the results, it can be seen that the KEYWORD assigned by indexer experts and the MARKWORD produced by automatic indexing generally intersect but do not coincide completely. The reason lies in two aspects: one is the uncertainty of manual indexing; the other is the inexactness of the approach itself. In spite of this degree of inconsistency between KEYWORD and MARKWORD, indexers have confirmed the results of automatic subject indexing. Further improvement could be made by analyzing the part of speech and structure of subject words, or by considering the roles of general words and proper words in the documents, in addition to their frequency, during the computation of weights.

V. CONCLUSION

Two approaches to automatic subject indexing, based respectively on the VSM and on subject word segmentation, have been presented in this paper. The first approach builds a VSM for every subject word from training documents, then compares the VSM of each test document with that of each subject word to obtain the similarity between them, and finally indexes the test document with the top several most similar subject words. The second approach first cuts the subject words in the controlled vocabulary, then computes the frequency of every subject word in a document, and finally indexes the document with the top several most frequent subject words; the frequency computation counts not only matches of the subject word itself, but also matches of its cut words. Unlike the first approach, the second is an unsupervised process. Both approaches provide interactive indexing support: users can freely designate the number of subject indexing words for any document. In addition, both are assignment keyword indexing, i.e., the keywords come from controlled vocabularies supplied by experts in the relevant domains. The first approach, based on the VSM, considers the overall effect of the training documents and can index a document at a higher level with a subject word even if the subject word does not occur in the document; it is suitable when the documents to be indexed are topically concentrated and few subject words are available. The second approach, based on subject word segmentation, greatly improves the efficiency of indexing, and indexer experts have confirmed its accuracy; at present, it has been widely applied.

Further research on automatic subject indexing should be done to improve its performance. For example, we can improve the accuracy of word cutting in the preprocessing stage by analyzing the part of speech and structure of subject words. In the computation of weights, besides using the frequency of subject indexing words, we can also improve performance by analyzing the roles of general words and proper words in the Chinese documents.

ACKNOWLEDGMENTS

The research work in this paper is supported by the National Science Foundation of China (No. 60173017, 60435010), the Nature Science Foundation of Beijing (No. 4052025), the 863 Project (No. 2003AA115220), and the National Basic Research Priorities Programme (No. 2003CB317004).

REFERENCES

[1] Yi-Ming Chung, William M. Pottenger, and Bruce R. Schatz, "Automatic Subject Indexing Using an Associative Neural Network," in Proceedings of the Third ACM Conference on Digital Libraries, Pittsburgh, Pennsylvania, United States, pp. 59-68, 1998.

[2] M. J. Bates, "Subject Access in Online Catalogs: A Design Model," Journal of the American Society for Information Science, pp. 357-376, 1986.

[3] R. Fugmann, Subject Analysis and Indexing: Theoretical Foundation and Practical Advice, Frankfurt/Main, Germany: INDEKS VERLAG, 1993.

[4] P. Hayes and P. M. Weinstein, "CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories," in Innovative Applications of Artificial Intelligence 2, The AAAI Press/The MIT Press, Cambridge, pp. 49-64, 1991.

[5] P. Hayes and G. Koerner, "Intelligent Text Technologies and Their Successful Use by the Information Industry," in Proceedings of the 14th National Online Meeting, New York, pp. 189-196, 1993.

[6] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Series in Computer Science, Boston, MA: Addison-Wesley Longman Publishing Co., Inc., p. 530, 1989.

[7] Qing He, Ziyan Jia, Jiayou Li, Haijun Zhang, Qingyong Li, and Zhongzhi Shi, "GHUNT: A Semantic Indexing System Based on Concept Space," in 2003 International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2003), pp. 716-721, Oct. 2003.

[8] Nianwen Xue, "Chinese Word Segmentation as Character Tagging," Computational Linguistics and Chinese Language Processing, Vol. 8, No. 1, pp. 29-48, 2003.
