Joachims, ECML '98


Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Thorsten Joachims
Universität Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany

Abstract. This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning.

1 Introduction

With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Text categorization techniques are used to classify news stories, to find interesting information on the WWW, and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and time-consuming, it is advantageous to learn classifiers from examples.

In this paper I will explore and identify the benefits of Support Vector Machines (SVMs) for text categorization. SVMs are a new learning method introduced by V. Vapnik et al. [9] [1]. They are well-founded in terms of computational learning theory and very open to theoretical understanding and analysis.

After reviewing the standard feature vector representation of text, I will identify the particular properties of text in this representation in section 4. I will argue that SVMs are very well suited for learning in this setting. The empirical results in section 5 will support this claim. Compared to state-of-the-art methods, SVMs show substantial performance gains. Moreover, in contrast to conventional text classification methods, SVMs will prove to be very robust, eliminating the need for expensive parameter tuning.

2 Text Categorization

The goal of text categorization is the classification of documents into a fixed number of predefined categories. Each document can be in multiple, exactly one, or no category at all. Using machine learning, the objective is to learn classifiers


from examples which perform the category assignments automatically. This is a supervised learning problem. Since categories may overlap, each category is treated as a separate binary classification problem.

The first step in text categorization is to transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification task. Information Retrieval research suggests that word stems work well as representation units and that their ordering in a document is of minor importance for many tasks. This leads to an attribute-value representation of text. Each distinct word¹ wᵢ corresponds to a feature, with the number of times word wᵢ occurs in the document as its value. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least 3 times and if they are not stop-words (like "and", "or", etc.).

This representation scheme leads to very high-dimensional feature spaces containing 10000 dimensions and more. Many have noted the need for feature selection to make the use of conventional learning methods possible, to improve generalization accuracy, and to avoid overfitting. Following the recommendation of [11], the information gain criterion will be used in this paper to select a subset of features.

Finally, from IR it is known that scaling the dimensions of the feature vector with their inverse document frequency (IDF) [8] improves performance. Here the tfc variant is used. To abstract from different document lengths, each document feature vector is normalized to unit length.
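The representation pipeline just described (word counts, a minimum frequency of 3, stop-word removal, IDF weighting, unit-length normalization) can be sketched as follows. The corpus and stop-word list are toy examples, and this is the plain tf·idf form; the paper's exact tfc weighting may differ slightly:

```python
# Sketch of the section 2 bag-of-words pipeline: word counts, min frequency 3,
# stop-word removal, IDF weighting, and unit-length normalization.
import math
from collections import Counter

STOP_WORDS = {"and", "or", "the", "a", "of"}  # illustrative, not the paper's list

def build_vocab(corpus, min_count=3):
    """Keep words occurring >= min_count times in the training data, minus stop-words."""
    counts = Counter(w for doc in corpus for w in doc.split())
    return {w for w, c in counts.items() if c >= min_count and w not in STOP_WORDS}

def idf(corpus, vocab):
    """Inverse document frequency: log(N / df) per vocabulary word."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc.split()) if w in vocab)
    return {w: math.log(n / df[w]) for w in vocab}

def doc_vector(doc, vocab, idf_w):
    """tf * idf weights, normalized to unit length."""
    tf = Counter(w for w in doc.split() if w in vocab)
    vec = {w: c * idf_w[w] for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else vec

corpus = ["the merger bid and the stake", "merger merger bid of wheat",
          "wheat crop and wheat export", "a crop of wheat and a bid"]
vocab = build_vocab(corpus)
idf_w = idf(corpus, vocab)
vec = doc_vector(corpus[0], vocab, idf_w)
print(sorted(vocab), round(sum(v * v for v in vec.values()), 6))
# → ['bid', 'merger', 'wheat'] 1.0
```

Rare words ("stake", "crop") and stop-words ("and", "the") are filtered out, and every surviving document vector has length one, abstracting from document length.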

3 Support Vector Machines

Support vector machines are based on the Structural Risk Minimization principle [9] from computational learning theory. The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error. The true error of h is the probability that h will make an error on an unseen and randomly selected test example. An upper bound can be used to connect the true error of a hypothesis h with the error of h on the training set and the complexity of H (measured by VC-Dimension), the hypothesis space containing h [9]. Support vector machines find the hypothesis h which (approximately) minimizes this bound on the true error by effectively and efficiently controlling the VC-Dimension of H.
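One standard form of the bound alluded to above (following Vapnik's analysis [9]; the paper does not reproduce it explicitly) holds with probability 1 − η over the draw of n training examples, where d is the VC-Dimension of H:

```latex
\mathrm{Err}(h) \;\le\; \mathrm{Err}_{\mathrm{train}}(h)
  \;+\; \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

Minimizing this bound means trading off the training error against the capacity d of the hypothesis space, which is exactly what structural risk minimization does.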

SVMs are very universal learners. In their basic form, SVMs learn linear threshold functions. Nevertheless, by a simple plug-in of an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets.
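The kernel "plug-in" idea can be illustrated with a kernel perceptron rather than a full SVM (a deliberate simplification; the paper's experiments use SVM-light): the learning algorithm stays the same, and swapping the kernel changes the hypothesis space from linear to polynomial or RBF.

```python
# Kernel perceptron sketch: the same dual, mistake-driven learner works with
# any kernel; only the kernel function determines the hypothesis space.
import math

def linear_k(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_k(x, z, d=2):
    return (1 + linear_k(x, z)) ** d          # polynomial classifier

def rbf_k(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))  # RBF network

def train(X, y, kernel, epochs=20):
    alpha = [0.0] * len(X)                    # dual coefficient per example
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            f = sum(a * yj * kernel(xj, xi) for a, yj, xj in zip(alpha, y, X))
            if yi * f <= 0:                   # mistake-driven update
                alpha[i] += 1.0
    return lambda x: 1 if sum(a * yj * kernel(xj, x)
                              for a, yj, xj in zip(alpha, y, X)) > 0 else -1

# XOR-like data: not linearly separable, but separable with the poly kernel.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
f = train(X, y, poly_k)
print([f(x) for x in X])  # → [-1, 1, 1, -1]
```

With `linear_k` this data cannot be separated; with `poly_k` or `rbf_k` it can, showing how the kernel alone enlarges the hypothesis space.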

One remarkable property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space. SVMs measure the complexity of hypotheses based on the margin with which they separate the data, not the number of features. This means that we can generalize even in the presence of very many features, if our data is separable with a wide margin using functions from the hypothesis space.

¹ The terms word and word stem will be used synonymously in the following.

[Fig. 1. Learning without using the best features.]

The same margin argument also suggests a heuristic for selecting good parameter settings for the learner (like the kernel width in an RBF network) [9]. The best parameter setting is the one which produces the hypothesis with the lowest VC-Dimension. This allows fully automatic parameter tuning without expensive cross-validation.
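The margin-based capacity argument behind this heuristic can be stated as a bound (again in the form given by Vapnik [9]; the paper does not spell it out): for hyperplanes separating data contained in a ball of radius R with margin δ in an N-dimensional space,

```latex
\mathrm{VCdim} \;\le\; \min\left(\left\lceil \frac{R^2}{\delta^2} \right\rceil,\; N\right) + 1
```

so a wide margin keeps the effective capacity small even when N is huge, and the kernel or parameter setting with the lowest such estimate can be chosen automatically.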

4 Why Should SVMs Work Well for Text Categorization?

To find out what methods are promising for learning text classifiers, we should find out more about the properties of text.

High dimensional input space: When learning text classifiers, one has to deal with very many (more than 10000) features. Since SVMs use overfitting protection, which does not necessarily depend on the number of features, they have the potential to handle these large feature spaces.

Few irrelevant features: One way to avoid these high dimensional input spaces is to assume that most of the features are irrelevant. Feature selection tries to determine these irrelevant features. Unfortunately, in text categorization there are only very few irrelevant features. Figure 1 shows the results of an experiment on the Reuters "acq" category (see section 5). All features are ranked according to their (binary) information gain. Then a naive Bayes classifier [2] is trained using only those features ranked 1-200, 201-500, 501-1000, 1001-2000, 2001-4000, 4001-9962. The results in figure 1 show that even features ranked lowest still contain considerable information and are somewhat relevant. A classifier using only those worst features has a performance much better than random. Since it seems unlikely that all those features are completely redundant, this leads to the conjecture that a good classifier should combine many features (learn a dense concept) and that aggressive feature selection may result in a loss of information.
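The (binary) information gain used for this ranking measures how much knowing whether a word occurs reduces uncertainty about the class. A minimal sketch, with an invented toy corpus standing in for the Reuters data:

```python
# Binary information gain of a word for a binary class:
# IG(C; w) = H(C) - P(w) H(C|w) - P(not w) H(C|not w)
import math

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def info_gain(docs, labels, word):
    n = len(docs)
    base = entropy(sum(labels), n - sum(labels))           # H(C)
    with_w    = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    cond = 0.0
    for part in (with_w, without_w):                       # sum_v P(v) H(C|v)
        if part:
            cond += len(part) / n * entropy(sum(part), len(part) - sum(part))
    return base - cond

# Toy corpus: label 1 = "acquisition" story (hypothetical examples).
docs = [{"merger", "stake"}, {"merger", "bid"}, {"wheat", "crop"}, {"crop", "export"}]
labels = [1, 1, 0, 0]
print(info_gain(docs, labels, "merger"))  # → 1.0 (perfectly predictive)
print(info_gain(docs, labels, "stake"))   # partial gain, between 0 and 1
```

Ranking all words by this score and training on slices of the ranking reproduces the shape of the Figure 1 experiment.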


Document vectors are sparse: For each document, the corresponding document vector contains only few entries which are not zero. Kivinen et al. [4] give both theoretical and empirical evidence for the mistake bound model that additive algorithms, which have a similar inductive bias to SVMs, are well suited for problems with dense concepts and sparse instances.

Most text categorization problems are linearly separable: All Ohsumed categories are linearly separable and so are many of the Reuters (see section 5) tasks. The idea of SVMs is to find such linear (or polynomial, RBF, etc.) separators.

These arguments give theoretical evidence that SVMs should perform well for text categorization.
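The dense-concept/sparse-instance picture is easy to see in code: a document stores only its few non-zero term weights, so evaluating a linear classifier touches only those terms, even though the weight vector itself is dense. The weights and document below are illustrative, not from the paper:

```python
# Sparse document vectors: a linear score w·x + b is a sum over only the
# terms the document actually contains.

def score(weights, doc_vec, bias=0.0):
    """Linear classifier score over a sparse dict representation of a document."""
    return sum(weights.get(term, 0.0) * tfidf
               for term, tfidf in doc_vec.items()) + bias

weights = {"merger": 2.0, "stake": 1.5, "wheat": -1.8}  # dense concept: many non-zero weights
doc = {"merger": 0.7, "bid": 0.7}                       # sparse instance: few terms
print(score(weights, doc))  # → 1.4
```

Only the two terms present in the document contribute to the sum; every other dimension of the (high-dimensional) feature space is skipped.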

5 Experiments

The following experiments compare the performance of SVMs using polynomial and RBF kernels with four conventional learning methods commonly used for text categorization. Each method represents a different machine learning approach: density estimation using a naive Bayes classifier [2], the Rocchio algorithm [7] as the most popular learning method from information retrieval, a distance weighted k-nearest neighbor classifier [5][10], and the C4.5 decision tree/rule learner [6]. SVM training is carried out with the SVMlight² package. The SVMlight package will be described in a forthcoming paper.

Test Collections: The empirical evaluation is done on two test collections. The first one is the ModApte split of the Reuters-21578 dataset compiled by David Lewis. The ModApte split leads to a corpus of 9603 training documents and 3299 test documents. Of the 135 potential topic categories only those 90 are used for which there is at least one training and one test example. After preprocessing, the training corpus contains 9962 distinct terms.

The second test collection is taken from the Ohsumed corpus compiled by William Hersh. From the 50216 documents in 1991 which have abstracts, the first 10000 are used for training and the second 10000 are used for testing. The classification task considered here is to assign the documents to one or multiple categories of the 23 MeSH "diseases" categories. A document belongs to a category if it is indexed with at least one indexing term from that category. After preprocessing, the training corpus contains 15561 distinct terms.

Results: Figure 2 shows the results on the Reuters corpus. The Precision/Recall Breakeven Point (see e.g. [3]) is used as a measure of performance and microaveraging [10][3] is applied to get a single performance value over all binary classification tasks. To make sure that the results for the conventional methods are not biased by an inappropriate choice of parameters, all four methods were run after selecting the 500 best, 1000 best, 2000 best, 5000 best, (10000 best,) or all features using information gain. At each number of features the values β ∈ {0, 0.1, 0.25, 0.5, 1.0} for the Rocchio algorithm and k ∈ {1, 15, 30, 45, 60}
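Microaveraging pools the per-category contingency counts before computing precision and recall, so frequent categories dominate the single reported value. A sketch with invented counts (the breakeven point is then the threshold at which the two pooled values coincide):

```python
# Microaveraged precision/recall: sum TP/FP/FN over all binary categories
# first, then compute one precision and one recall from the pooled counts.

def micro_precision_recall(tables):
    """tables: list of (tp, fp, fn) tuples, one per binary category."""
    tp = sum(t for t, _, _ in tables)
    fp = sum(f for _, f, _ in tables)
    fn = sum(f for _, _, f in tables)
    return tp / (tp + fp), tp / (tp + fn)

# Invented per-category counts for illustration:
tables = [(90, 10, 10), (40, 20, 10), (5, 0, 15)]
p, r = micro_precision_recall(tables)
print(round(p, 3), round(r, 3))  # → 0.818 0.794
```

Sweeping the classifier's decision threshold moves precision and recall in opposite directions; the breakeven point reports their value where they are equal.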

² http://www-ai.informatik.uni-dortmund.de/thorsten/svm-light.html


[Fig. 2: table of precision/recall-breakeven points on the ten most frequent Reuters categories (earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, corn) plus the microaverage, for naive Bayes, Rocchio, C4.5, and k-NN, and for SVMs with polynomial kernels (degree d = 1-5; combined: 86.0) and RBF kernels (width γ = 0.6-1.2; combined: 86.4).]

Fig. 2. Precision/recall-breakeven point on the ten most frequent Reuters categories and microaveraged performance over all Reuters categories. k-NN, Rocchio, and C4.5 achieve highest performance at 1000 features (with k = 30 for k-NN and β = 1.0 for Rocchio). Naive Bayes performs best using all features.

for the k-NN classifier were tried. The results for the parameters with the best performance on the test set are reported.

On the Reuters data the k-NN classifier performs best among the conventional methods (see figure 2). This replicates the findings of [10]. Compared to the conventional methods all SVMs perform better independent of the choice of parameters. Even for complex hypothesis spaces, like polynomials of degree 5, no overfitting occurs despite using all 9962 features. The numbers printed in bold in figure 2 mark the parameter setting with the lowest VC-dim estimate as described in section 3. The results show that this strategy is well-suited to pick a good parameter setting automatically and achieves a microaverage of 86.0 for the polynomial SVM and 86.4 for the RBF SVM. With this parameter selection strategy, the RBF support vector machine is better than k-NN on 63 of the 90 categories (19 ties), which is a significant improvement according to the binomial sign test.
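The binomial sign test behind this claim drops the ties and asks how likely a split at least as lopsided as 63 wins vs. 8 losses would be under a fair coin. A minimal sketch:

```python
# Exact one-sided binomial sign test: ties are dropped, and the p-value is
# the tail probability of seeing at least max(wins, losses) successes in
# wins + losses fair coin flips.
from math import comb

def sign_test_p(wins, losses):
    n = wins + losses
    k = max(wins, losses)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# 90 categories: 63 RBF-SVM wins, 19 ties (dropped), hence 8 losses.
p = sign_test_p(63, 8)
print(p < 0.01)  # → True: far below any conventional significance level
```

With 63 of 71 informative comparisons favoring the RBF SVM, the p-value is vanishingly small, which is why the improvement is reported as significant.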

The results for the Ohsumed collection are similar. Again k-NN is the best conventional method with a microaveraged precision/recall-breakeven point of 59.1. C4.5 fails on this task (50.0) and heavy overfitting is observed when using more than 500 features. Naive Bayes achieves a performance of 57.0 and Rocchio reaches 56.6. Again, with 65.9 (polynomial SVM) and 66.0 (RBF SVM) the SVMs perform substantially better than all conventional methods. The RBF SVM outperforms k-NN on all 23 categories, which is again a significant improvement.

Comparing training time, SVMs are roughly comparable to C4.5, but they are more expensive than naive Bayes, Rocchio, and k-NN. Nevertheless, current research is likely to improve efficiency of SVM-type quadratic programming


problems. SVMs are faster than k-NN at classification time. More details can be found in [3].

6 Conclusions

This paper introduces support vector machines for text categorization. It provides both theoretical and empirical evidence that SVMs are very well suited for text categorization. The theoretical analysis concludes that SVMs acknowledge the particular properties of text: a) high dimensional feature spaces, b) few irrelevant features (dense concept vector), and c) sparse instance vectors.

The experimental results show that SVMs consistently achieve good performance on text categorization tasks, outperforming existing methods substantially and significantly. With their ability to generalize well in high dimensional feature spaces, SVMs eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVMs over the conventional methods is their robustness. SVMs show good performance in all experiments, avoiding catastrophic failure, as observed with the conventional methods on some tasks. Furthermore, SVMs do not require any parameter tuning, since they can find good parameter settings automatically. All this makes SVMs a very promising and easy-to-use method for learning text classifiers from examples.

References

1. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, November 1995.
2. T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In International Conference on Machine Learning (ICML), 1997.
3. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report 23, Universität Dortmund, LS VIII, 1997.
4. J. Kivinen, M. Warmuth, and P. Auer. The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. In Conference on Computational Learning Theory, 1995.
5. T. Mitchell. Machine Learning. McGraw-Hill, 1997.
6. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
7. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall Inc., 1971.
8. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
9. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
10. Y. Yang. An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Carnegie Mellon University, April 1997.
11. Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning (ICML), 1997.