
Dunja Mladenic, J. Stefan Institute

THERE SIMPLY AREN'T ENOUGH hours in the day anymore, are there? Everybody seems to have a stack of things waiting that just had to be done yesterday, and the problem only gets worse, not better. Information overload: it's the bane of our times. This impression seems so realistic that any help in handling at least some simple tasks is usually appreciated.

The Internet's recent ascendancy, not only in the research community but also in many areas of everyday life, is a major culprit. No longer confined to providing researchers access to data, the Internet is often the source of choice for information about everything from what's happening around the world, to where to get the best airline tickets, to how to cook a particular dish, to where to find the best hiking trails. It's wonderful, but it also causes headaches.

The impact on computer systems is particularly pronounced. When different people come together without centralized rules and guidance, many creative and pleasant, as well as some less pleasant, aspects emerge. So, when putting information on the World Wide Web, each of us can decide what to put there and how to organize it. The result is a distributed, worldwide-accessible information source that contains nonhomogeneous data organized according to different human association models and value schemes. While this personal touch and opportunity for creativity can be extremely useful for humans when providing and obtaining information, most computer systems have a hard time coping with the complexity, especially because people want information sooner rather than later.

The large amounts of information avail-

text from Web documents, and others relying on other information about document relevancy, such as user ratings.

Recent developments at the intersection of information retrieval and machine learning, as well as work in intelligent agents and intelligent user interfaces, offer novel solutions for helping users quickly select the informa-

IN SURVEYING CURRENT RESEARCH IN THE DEVELOPMENT OF TEXT-LEARNING INTELLIGENT AGENTS, THE AUTHOR FOCUSES ON THREE KEY CRITERIA: WHAT REPRESENTATION THE PARTICULAR APPLICATION USES FOR DOCUMENTS, HOW IT SELECTS FEATURES, AND WHAT LEARNING ALGORITHM IT USES. SHE THEN DESCRIBES PERSONAL WEBWATCHER, A CONTENT-BASED INTELLIGENT AGENT THAT USES TEXT-LEARNING FOR USER-CUSTOMIZED WEB BROWSING.

speech, image, and video in the same document; and the dynamic nature of the provided information. A number of systems have

niques on text databases, called text-learning, which combine research in machine learning with information retrieval. This arti-


Table 1. Content-based approaches that use machine-learning techniques.

AGENT                 WHERE DEVELOPED       GOAL
Antagonomy            NEC                   Personalized newspaper
Calendar Apprentice   CMU                   Meeting scheduling
CiteSeer              TX, NEC, UMIACS       Finding papers on WWW
ContactFinder         Andersen Consulting   Finding experts
FAQFinder             Chicago Univ.         Answering questions
Internet Fish         MIT                   Find info on Internet
Letizia               MIT                   Browsing WWW
Lira                  Stanford              Browsing WWW
Musag                 Hebrew Univ.          Browsing WWW
NewsWeeder            CMU                   Usenet news filtering
Personal WebWatcher   CMU, IJS              Browsing WWW
Syskill & Webert      UCI                   Browsing WWW
WAWA                  Wisconsin             Browsing WWW
WebWatcher            CMU                   Browsing WWW

classification, after which I discuss the Personal WebWatcher intelligent agent in more detail.

Machine learning for intelligent agents

Although there are various definitions of the term intelligent agent, I will focus on systems such as user assistants and recommendation systems that employ machine learning or data-mining techniques. These systems assist users by finding information or performing some simpler tasks on their behalf. For instance, such a system might help in Web

Table 1 publications, by agent:

Antagonomy: T. Kamba, H. Sakagami, and Y. Koseki, "Anatagonomy: A Personalized Newspaper on the World Wide Web," Int'l J. Human-Computer Studies, Vol. 46, No. 6, June 1997, pp. 789-803.

Calendar Apprentice: T. Mitchell et al., "Experience with a Learning Personal Assistant," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 81-91.

CiteSeer: K. Bollacker, S. Lawrence, and L. Giles, "CiteSeer: An Autonomous System for Processing and Organizing Scientific Literature on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

ContactFinder: B. Krulwich and C. Burkey, "The ContactFinder Agent: Answering Bulletin Board Questions with Referrals," Proc. 13th Nat'l Conf. AI (AAAI 96), AAAI Press, Menlo Park, Calif., 1996, pp. 10-15.

FAQFinder: R. Burke, K. Hammond, and J. Kozlovsky, "Knowledge-Based Information Retrieval for Semi-Structured Text," Working Notes from AAAI Fall Symp. AI Applications in Knowledge Navigation and Retrieval, AAAI Press, Menlo Park, Calif., 1995, pp. 19-24.

FAQFinder: R. Burke et al., "Question Answering from Frequently Asked Question Files," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 57-66.

Internet Fish: B.A. LaMacchia, "Internet Fish, A Revised Version of a Thesis Proposal," MIT, AI Lab and Dept. of Electrical Eng. and Computer Science, Cambridge, Mass., 1996.

Letizia: H. Lieberman, "Letizia: An Agent that Assists Web Browsing," Proc. 14th Int'l Joint Conf. AI (IJCAI 95), AAAI Press, Menlo Park, Calif., 1995, pp. 924-929.

Lira: M. Balabanovic and Y. Shoham, "Learning Information Retrieval Agents: Experiments with Automated Web Browsing," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

Musag: C.V. Goldman, A. Langer, and J.S. Rosenschein, "Musag: An Agent That Learns What You Mean," Applied AI, Vol. 11, No. 5, 1997, pp. 413-435.

NewsWeeder: K. Lang, "News Weeder: Learning to Filter Netnews," Proc. 12th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1995, pp. 331-339.

Personal WebWatcher: D. Mladenic, Personal WebWatcher: Implementation and Design, Tech. Report IJS-DP-7472, Dept. of Computer Science, J. Stefan Inst., 1996; http://cs.cmu.edu/~TextLearning/pww.

Syskill & Webert: M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites," Proc. 13th Nat'l Conf. AI (AAAI 96), AAAI Press, Menlo Park, Calif., 1996, pp. 54-61.

Syskill & Webert: M. Pazzani and D. Billsus, "Learning and Revising User Profiles: The Identification of Interesting Web Sites," Machine Learning 27, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997, pp. 313-331.

WAWA: J. Shavlik and T. Eliassi-Rad, "Building Intelligent Agents for Web-based Tasks: A Theory-Refinement Approach," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

WebWatcher: R. Armstrong et al., "WebWatcher: A Learning Apprentice for the World Wide Web," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

browsing by retrieving documents similar to already-requested documents.

Two frequently used methods for developing intelligent agents based on machine-learning techniques are content-based and collaborative approaches. Both can help users find and retrieve relevant information from the Web.

Table 1 summarizes the systems I discuss. For each system, the table lists the system's name, the organization that developed it, a summary of its functionality, and a reference to the paper describing it. Many other intelligent agents not mentioned here have been developed, but my intent is to indicate general current trends.

The content-based approach. In this approach to text classification, the system searches for items similar to those the user prefers based on a comparison of content. The approach has its roots in information retrieval. This approach has difficulty, however, in capturing different aspects of the content: music, movies, and images, for example. Even for text domains, most representations capture only certain aspects of the content, which results in poor system performance. In addition to the representational problems, content-based systems tend to learn in a way that they recommend items similar to the already-seen items.
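To make the idea concrete, the following is a minimal sketch of content-based recommendation, assuming documents are compared by the cosine similarity of their word-count vectors. The function names (`word_counts`, `cosine`, `recommend`) and the sample pages are illustrative assumptions, not part of any system surveyed here.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_texts, candidates, top_n=1):
    """Rank candidate (name, text) pairs by similarity to the liked pages."""
    profile = Counter()
    for text in liked_texts:
        profile.update(word_counts(text))
    scored = [(cosine(profile, word_counts(text)), name)
              for name, text in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


# Hypothetical pages, made up for illustration.
liked = ["machine learning agent that learns from text on the web"]
candidates = [
    ("text_learning_group", "research group working on learning from text"),
    ("hiking_trails", "the best hiking trails and mountain maps"),
]
print(recommend(liked, candidates))
```

The sketch also exhibits the limitation noted above: a candidate ranks highly only when it shares vocabulary with pages the user has already seen.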

Applying the content-based approach on text


Personal Webwatcher Project

data lets us use different text-learning methods. For instance, take the problem where we want to recommend Web pages to the user, based on their content. If the user is interested in the Web


act on a user's behalf. It searches the Web by taking a bounded amount of time, selecting the best pages and receiving an evaluation from the user. Lira uses the evaluation to

Overview

Personal WebWatcher is a "personal" agent that accompanies you from page to page as you browse the Web, highlighting hyperlinks that it believes will be of interest. Its strategy for giving advice is learned from feedback from earlier tours. Personal WebWatcher is mainly inspired by WebWatcher.

Unlike WebWatcher, Personal WebWatcher is structured to specialize for a particular user, modeling her/his interests. It "watches over the user's shoulder" in a similar way to WebWatcher, but it avoids involving the user in its learning process (it doesn't ask the user for any keywords or opinions about pages). It solely records the addresses of pages requested by the user and highlights interesting hyperlinks. In the learning phase (typically during the night), requested pages are analyzed and a model of user interests is generated/updated. This model is used to give advice for hyperlinks on retrieved HTML pages requested by and presented to the user via a Web browser.

More about Personal WebWatcher: Personal WebWatcher: Implementation and Design, Technical Report IJS-DP-7472, October 1996.

Install your own Personal WebWatcher (research version available on request).

Project Members: Dunja Mladenic, Tom Mitchell (faculty principal investigator)

from page to page and highlights interesting hyperlinks, generates a user profile based on the content analysis of the requested pages without requesting any keywords or ratings from the user. Syskill & Webert, a system that collects ratings of the explored Web pages from the user and learns a user profile from them, separates pages according to their topics, and learns a separate profile for each topic. The system uses the generated user profile to form queries for the existing search engines to get more potentially interesting documents. WAWA, an intelligent agent for Web-based tasks, lets users input personal interests and preferences, stores them in a neural network, and uses theory revision to refine the obtained knowledge.

based research publications by getting keywords from the user and calling search engines to find relevant papers (relevant PostScript files on the Web). It then extracts headers, abstracts, and citations from the papers. The system also finds similar papers based on the common citations in the papers. NewsWeeder, a system for electronic news filtering, uses text classification to generate a model of a user's interests. The system uses a Web interface to give the user access to the news in the usual way and to enable the system to collect the user's ratings as feedback. NewsWeeder also assigns predicted ratings to each article and generates a personalized list of the top articles (such as 50 articles predicted as the most interesting) found among all articles.

The proposed ContactFinder agent reads and responds to bulletin board messages, assists users by referring them to other people who can help them, and categorizes messages and extracts their topic areas. The system operates in two phases. It first searches the bulletin board looking for contact people and their topic areas, storing the found information in its database. Second, it searches new messages looking for questions, extracting the topic area of the found question and finding a contact person in its database for that topic area. To extract the topic area, ContactFinder uses some heuristics. For instance, it extracts semantically significant phrases (such as fully capitalized words, short phrases of one to five words, or words in a different format from the surrounding text).

FAQFinder uses a natural-language question-based interface to access distributed text information sources and helps users find answers to their questions in databases such as FAQ files. It matches questions from relevant FAQ files against user questions and

IEEE INTELLIGENT SYSTEMS


CMU Text Learning Group

Figure 2. Example of two Web pages having different content and thus recognized as not similar by the content-based approach: (a) this page would be recognized as similar to the page in Figure 1 and recommended by the content-based approach; (b) this page would be recognized as different from both previous Web pages.

returns the five best matching questions together with their answers. Antagonomy, a system that composes personalized newspapers on the Web, monitors user operations on the articles and reflects them in the user profile. The layout of the composed newspaper is based on the scores given to articles that reflect the degree to which articles match the user profile; articles with the higher scores appear at the top of the newspaper.

Internet Fish is a class of resource-discovery tools designed to help users extract useful information from the Internet. The system includes a natural-language interface that currently permits only limited, structured interaction. The proposed system also includes help in browsing the Web by using existing search engines and user ratings of documents. Calendar Apprentice, a system that helps users schedule meetings, connects to the user's electronic calendar and generates sets of rules that capture the user's scheduling preferences and other information about individual meeting attendees. It uses these rules to provide advice to the user for new, unscheduled meetings.

The collaborative approach. In contrast to the content-based approach to text classification, which can be successfully applied to

Music Chopin Bach Matheny Balasevic Prodigy Presley ABBA Enya

Userl 1 6 7 5 7 1 2 3 7

User2 7 6 6 7 1 5 G

User3 2 1 7 G 6 3 I 1 ~~

Figure 3. Example of music ratings given by imaginary Web users, with 7 being the highest rating and "-" meaning no rating collected for that item. The collaborative approach would find User2 similar to User1, while User3 would be found different from User1.

a single user, the collaborative approach assumes that there is a set of users using the system. In the collaborative approach (sometimes referred to as social learning), advice for the user is based on the reaction of other users. The system searches for users with similar interests and recommends the items

In the collaborative approach, there is no analysis of the item content, so items of any content can be handled with equal success. Each item is assigned a unique identifier and a user-derived rating. The similarity rating between users is based on the comparison of the ratings they assigned to the same items.

With the collaborative approach, however, the small number of users relative to the number of items usually results in a sparse coverage of ratings. For any new item in the database, the system must collect information from different users to be able to recommend it, and similar users are not matched unless they have rated a sufficient number of similar items. Also, if a user has unusual tastes compared to the rest of the users, there will be no other similar user and system performance will be poor.

For instance, consider making a recommendation for User1, whose ratings for music are listed in Figure 3. The recommendation is based on the ratings of other users

JULY/AUGUST 1999


Table 2. Some collaborative approaches that use machine-learning-related techniques.

AGENT            WHERE DEVELOPED   GOAL
Firefly, Ringo   MIT               Finding music, movie, book
GroupLense       Minnesota         Usenet news filtering
Phoaks           AT&T Labs         Browsing WWW
Referral Web     AT&T Labs         Finding experts
Siteseer         Imana             Browsing WWW

Table 2 publications, by agent:

Firefly, Ringo: P. Maes, "Agents that Reduce Work and Information Overload," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 30-40.

GroupLense: J.A. Konstan et al., "GroupLense: Applying Filtering to Usenet News," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 71-87.

Phoaks: T. Terveen et al., "PHOAKS: A System for Sharing Recommendations," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 59-62.

Referral Web: H. Kautz, B. Selman, and M. Shah, "Referral Web: Combining Social Networks and Collaborative Filtering," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 63-65.

Referral Web: H. Kautz, B. Selman, and M. Shah, "The Hidden Web," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 27-36.

Siteseer: J. Rucker and J.P. Marcos, "Siteseer: Personalized Navigation for the Web," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 73-75.

Table 3. Systems that use both content-based and collaborative approaches.

AGENT              WHERE DEVELOPED    GOAL
Fab                Stanford           Browsing WWW
Lifestyle Finder   AgentSoft          Browsing WWW
WebCobra           James Cook Univ.   Browsing WWW

Table 3 publications, by agent:

Fab: M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 66-70.

Lifestyle Finder: B. Krulwich, "Lifestyle Finder," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 37-46.

WebCobra: O. de Vel and S.A. Nesbitt, "Collaborative Filtering Agent System for Dynamic Virtual Communities on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

User1). Because most of the ratings User1 and User2 gave are similar (they both like Chopin and Bach, for example), the system will recognize User2 as having similar musical tastes and will recommend the music he or she likes to User1 as probably interesting. By contrast, it will recognize User3 as having different musical tastes from User1 and will not recommend the music she or he likes.

The collaborative approach is usually used for nontext data (movies or music, for example), but there are also systems that use it on text data (such as for news filtering). Table 2 lists some collaborative systems. Firefly and Ringo are two interface agents that learn from the user as well as from other agents. Such agents serve for electronic mail handling, meeting scheduling, electronic news filtering, and entertainment recommendation. Some use the content-based approach and adopt information-retrieval methods (for example, news filtering). Others rely on the correlation between different users performing collaborative filtering (such as entertainment recommendations). For instance, the Ringo music-recommendation system recommends music that was highly scored by users with similar music tastes. Ringo tries to overcome the problem of sparse ratings coverage by building models of virtual users interested in a very narrow range of music. A developed system for music, movie, and book recommendations, Firefly requires users to start by rating several predefined items, to ensure the possibility of comparing any two users (two users can be compared only if they rated the same items).
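A minimal sketch of the collaborative idea follows, assuming user similarity is measured with the Pearson correlation over co-rated items. That measure is one common choice and an assumption here; the text does not state which measure Ringo or Firefly use. The ratings are hypothetical, on a 1-7 scale, and a missing entry means that no rating was collected.

```python
import math

# Hypothetical ratings on a 1-7 scale; a missing key means "not rated".
ratings = {
    "User1": {"Chopin": 6, "Bach": 7, "Prodigy": 1, "ABBA": 3},
    "User2": {"Chopin": 7, "Bach": 6, "Prodigy": 1, "ABBA": 2, "Enya": 6},
    "User3": {"Chopin": 2, "Bach": 1, "Prodigy": 6, "ABBA": 5, "Presley": 7},
}


def pearson(a, b, min_common=2):
    """Pearson correlation over items both users rated, or None if too few."""
    common = sorted(set(a) & set(b))
    if len(common) < min_common:
        return None  # the two users cannot be compared
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def recommend(user, ratings, min_rating=5):
    """Suggest unseen items rated highly by the most similar other user."""
    scored = []
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim is not None:
            scored.append((sim, other))
    if not scored:
        return []
    _, neighbour = max(scored)
    return sorted(item for item, r in ratings[neighbour].items()
                  if item not in ratings[user] and r >= min_rating)


print(recommend("User1", ratings))
```

The `min_common` check mirrors the constraint that two users can be compared only if they rated the same items; with sparse ratings it fails often, which is the coverage problem discussed for the collaborative approach.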

Siteseer, a Web-page recommendation system, uses an individual's bookmarks and the organization of bookmarks within folders for predicting and recommending relevant pages. The system measures the degree of overlap (such as common URLs) between the bookmark files of different users and then groups users according to that similarity. When making recommendations, Siteseer gives priority to URLs obtained from similar folders and URLs that appear in bookmark files of similar users.
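The following sketch shows overlap-based user similarity in the spirit of this description, assuming the overlap between two bookmark files is measured with the Jaccard coefficient of their URL sets. The measure, the function names, the threshold, and the example URLs are assumptions for illustration; the text does not say how Siteseer computes overlap, and folder structure is ignored here.

```python
def jaccard(urls_a, urls_b):
    """Share of URLs common to both bookmark sets (0.0 when both are empty)."""
    a, b = set(urls_a), set(urls_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_urls(user, bookmarks, min_overlap=0.2):
    """Suggest URLs bookmarked by sufficiently similar users, best match first."""
    own = set(bookmarks[user])
    neighbours = sorted(
        ((jaccard(own, urls), other)
         for other, urls in bookmarks.items() if other != user),
        reverse=True,
    )
    suggestions = []
    for overlap, other in neighbours:
        if overlap < min_overlap:
            break  # remaining users are even less similar
        for url in sorted(bookmarks[other]):
            if url not in own and url not in suggestions:
                suggestions.append(url)
    return suggestions


# Hypothetical bookmark files.
bookmarks = {
    "ann": ["a.example/ml", "b.example/ir", "c.example/agents"],
    "bob": ["a.example/ml", "b.example/ir", "d.example/text"],
    "eve": ["e.example/cooking", "f.example/hiking"],
}
print(recommend_urls("ann", bookmarks))
```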

Phoaks, a proposed system that automatically recognizes and redistributes recommendations of the Web resources mined from Usenet news messages, assumes that the roles of the provider and the recommendation recipient are specialized and different. It reuses recommendations from the existing online conversations. The system pays special attention to the problem of distinguishing recommended Web pages (URLs) from advertised or announced pages. It includes categorization rules that implement a strategy to distinguish different purposes for which the Web resources

GroupLense, a proposed collaborative filtering system for Usenet news, has a two-part database to store ratings that the users have given to messages and correlations between pairs of users based on their ratings. Because each user reads a small percentage of the total number of news messages, finding other users with whom to correlate is difficult, so an enormous number of ratings are needed to cover all the messages. This problem of ratings scarcity is common for collaborative approaches, and different systems address it in different ways. GroupLense partitions the set of news messages into clusters that are commonly read together, improving the local density of ratings.

Referral Web, an interactive system for reconstructing, visualizing, and searching the social networks on the Web, has a motivation similar to ContactFinder's, which is to help search for an expert on a given topic. This system models a social network by a graph, where nodes represent individuals and edges between nodes indicate that a direct relationship between the individuals has been detected. Referral Web constructs the network model incrementally with new users, searching for the co-occurrence of names in close proximity in any documents publicly available on the Web.
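The graph model described above can be sketched as follows, assuming each document has already been reduced to the list of person names it mentions (name extraction and the close-proximity test are omitted). The function names and sample data are hypothetical; a breadth-first search then yields a shortest referral chain between two people.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(documents):
    """Link every pair of names that co-occur in the same document."""
    graph = defaultdict(set)
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def referral_chain(graph, start, target):
    """Shortest chain of acquaintances from start to target, or None."""
    if start == target:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for neighbour in sorted(graph[person]):
            if neighbour in previous:
                continue
            previous[neighbour] = person
            if neighbour == target:
                chain = [target]
                while previous[chain[-1]] is not None:
                    chain.append(previous[chain[-1]])
                return chain[::-1]
            queue.append(neighbour)
    return None


# Hypothetical co-occurrences of names in public documents.
docs = [["Ana", "Boris"], ["Boris", "Cene"], ["Cene", "Dana"], ["Eva"]]
network = build_network(docs)
print(referral_chain(network, "Ana", "Dana"))
```

Because the graph is built from a plain list of documents, new documents can be added incrementally by calling `build_network` on them and merging the edge sets.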

The Fab system for Web document recommendation combines content-based and collaborative approaches (Table 3 lists some systems that use both content-based and collaborative agents). It uses the content-based approach to generate a profile that represents a single user's interests and uses the collaborative approach to find similar users. The user's ratings are used to update that person's personal profile. The two approaches are combined, and the pages matching the user's profile as well as the pages highly rated by similar users are recommended. Fab measures the similarity between users by the similarity of their profiles.
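The combination rule in this paragraph can be sketched as a union of two candidate sets: pages whose content matches the user's own profile, and pages rated highly by users with similar profiles. Profiles here are plain keyword sets compared by overlap, and all names, thresholds, and data are assumptions made for illustration rather than Fab's actual design.

```python
def overlap(a, b):
    """Jaccard overlap between two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_recommend(user, profiles, pages, ratings,
                     content_threshold=0.3, user_threshold=0.3, high=6):
    """Pages matching the user's profile plus pages liked by similar users."""
    profile = profiles[user]
    # Content-based part: page keywords resemble the user's profile.
    by_content = {name for name, words in pages.items()
                  if overlap(profile, words) >= content_threshold}
    # Collaborative part: users are similar when their profiles are similar.
    similar_users = {other for other, p in profiles.items()
                     if other != user and overlap(profile, p) >= user_threshold}
    by_users = {page for other in similar_users
                for page, score in ratings.get(other, {}).items()
                if score >= high}
    return sorted(by_content | by_users)


# Hypothetical profiles, pages, and ratings (1-7 scale).
profiles = {
    "u1": {"learning", "text", "agent"},
    "u2": {"learning", "text", "retrieval"},
    "u3": {"cooking", "hiking"},
}
pages = {
    "p_text": {"text", "learning", "classification"},
    "p_music": {"music", "chopin"},
    "p_trails": {"hiking", "trails"},
}
ratings = {"u2": {"p_music": 7, "p_trails": 2}, "u3": {"p_trails": 7}}
print(hybrid_recommend("u1", profiles, pages, ratings))
```

In this toy run the music page reaches the first user only through the collaborative part, which is the benefit of combining the two approaches: content that the user's own profile cannot describe can still be recommended.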

WebCobra uses a similar idea in combining the content-based and the collaborative approaches to text classification. It generates a user profile from relevant documents using a part-of-speech tagger. It groups users into collaborative clusters based on their profile similarity. Lifestyle Finder, a system for user-profile generation based on the usage of demographic data, also combines the content-based and the collaborative approaches. It uses a questionnaire to get a user's characteristics, and it groups similar users based on their demographic data. Lifestyle Finder suggests a set of the 15 most highly scored Web documents to each user. It uses user evaluations of suggested documents to evaluate its performance.


Related systems. Robert Holte and Chris Drummond designed a system that assists browsing of software libraries, taking keywords from the user and using a rule-based system with a forward-chaining inference. The system assumes that the library consists of one type of items and the user's goal is to find a single item. Oren Etzioni and Daniel Weld developed an integrated interface to the Internet combining a Unix shell and the World Wide Web to interact with Internet resources. Their agent accepts high-level user goals and dynamically synthesizes the appropriate sequence of Internet commands to satisfy those goals. Matjaz Gams and Marko Grobelnik developed an employment agent available through the Internet that lets users browse data and order e-mails when interesting information appears. Filippo Menczer developed adaptive intelligent methods to automate online information search and discovery in the Web (such as Web robots

Machine learning on text data

Most intelligent agents that use machine-learning techniques actually use the content-based approach and learn from the content of text documents. Text learning is the application of machine-learning techniques

There is considerable work underway involving learning on text documents that is not necessarily related to the Web. Here I take a look at some of this work through the prism of three questions I find important for using machine learning for text classification: what representation is used for documents, how is the high number of features dealt with, and which learning algorithm is used?

Table A summarizes these questions over some related publications to give an idea about the state of the art in using machine learning on text data. This table includes systems given in Tables 1 to 3 in the main text with a more detailed description, where there was enough information available describing the insides of the system. I also include descriptions of some research work that does not necessarily include the development of a working intelligent agent and was therefore not included in Tables 1 to 3.

Representation. The so-called vector representation is the most frequently used document representation in information retrieval and text learning. It is a bag-of-words representation, meaning that all words from the document are taken and no ordering of words or any structure of the text is used. In a set of documents, each document is represented as a bag of words

or crawlers) based on a population of intelligent agents. He used genetic algorithms on the population of agents, rewarding an agent with energy for each relevant document found on the Web and charging them energy for the usage of network resources incurred by transferring documents.

See the "Machine learning on text data" sidebar for a discussion of the use of machine-learning techniques on text databases.

Text-learning for user-customized Web browsing

The Web is a rapidly growing information source, currently attracting many users with different interests. Because the interaction with the Web is through a computer, we can use computers to observe and record user actions, to communicate with users, and to use all the collected information to help users

including all the words that occur in the set of documents (see Figure A). Additional information in text documents could be used, for example, sentence structure, word position, or neighboring words. The question is how much can we gain in considering additional information in learning (and what information to consider), and what is the price we have to pay for it? I am not aware of any current well-studied comparison or directions for text-document representation. Some information-retrieval research suggests that for long documents, considering information additional to the bag of words is not worth the effort. Work on document classification that extends the bag-of-words representation by using word sequences (n-grams) instead of single words suggests that using single words and word pairs as features in the bag-of-words representation improves performance of classifiers generated from short documents.

Journal of Artificial Intelligence Research

JAIR is a refereed journal covering all areas of Artificial Intelligence which is distributed free of charge over the internet. Each volume of the journal is also published by Morgan Kaufmann.

Figure A. Illustration of the bag-of-words document representation using a frequency vector. Each word occurring in a set of documents is mapped into a feature. A document is represented as a vector of features where each feature is assigned a frequency (the number of times it occurs in the document).
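As an illustration of the representation in Figure A, here is a short sketch that maps every word occurring in a set of documents to a feature and represents each document as a vector of word frequencies. The optional `use_pairs` flag adds adjacent word pairs as extra features, in the spirit of the n-gram extension discussed in this sidebar. The tokenizer, function names, and sample documents are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def features(text, use_pairs=False):
    """Count single words and, optionally, adjacent word pairs."""
    words = tokenize(text)
    counts = Counter(words)
    if use_pairs:
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


def vectorize(documents, use_pairs=False):
    """Return the sorted vocabulary and one frequency vector per document."""
    counted = [features(doc, use_pairs) for doc in documents]
    vocabulary = sorted(set().union(*counted))
    vectors = [[counts[term] for term in vocabulary] for counts in counted]
    return vocabulary, vectors


docs = ["Learning from text", "Text learning agent learns from text"]
vocab, vecs = vectorize(docs)
print(vocab)
print(vecs)
```

Replacing each frequency with a 0/1 indicator would give the Boolean variant of the same representation; word order is discarded in both cases unless pairs are enabled.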



Table A. Document representotion, feotureseledion, rind leorning algorithms used in some text-learning opprooches. (The bag-of-words representation is used on Bwleon feotures unless notified thot word frequency i s used-frq.)

AUTHORS    DOCUMENT REPRESENTATION    FEATURE SELECTION    CLASSIFICATION

C. Apte, F. Damerau, and S.M. Weiss, "Toward Language Independent Automated Learning of Text Categorization Models," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994, pp. 23-30.

C. Apte, F. Damerau, and S.M. Weiss, "Text Mining with Decision Rules and Decision Trees," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

R. Armstrong et al., "WebWatcher: A Learning Apprentice for the World Wide Web," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

M. Balabanovic and Y. Shoham, "Learning Information Retrieval Agents: Experiments with Automated Web Browsing," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

B.T. Bartell, G.W. Cottrell, and R.K. Belew, "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling," Proc. ACM SIG Information Retrieval, ACM Press, New York, 1992, pp. 161-167.

M.W. Berry, S.T. Dumais, and G.W. O'Brien, "Using Linear Algebra for Intelligent Information Retrieval," SIAM Review, Vol. 37, No. 4, Dec. 1995, pp. 573-595.

P.W. Foltz and S.T. Dumais, "Personalized Information Delivery: An Analysis of Information-Filtering Methods," Comm. ACM, Vol. 35, No. 12, 1992, pp. 51-60.

R.M. Creecy et al., "Trading MIPS and Memory for Knowledge Eng.," Comm. ACM, Vol. 35, No. 8, Aug. 1992, pp. 48-64.

W.W. Cohen, "Learning to Classify English Text with ILP Methods," Workshop on Inductive Logic Programming, CS Dept., K.U. Leuven, 1995, pp. 3-24.

W.W. Cohen and Y. Singer, "Context-Sensitive Learning Methods for Text Categorization," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '96), ACM Press, New York, 1996, pp. 307-315.

B. Gelfand, M. Wulfekuhler, and W.F. Punch III, "Automated Concept Extraction from Plain Text," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

T.A. Joachims, "Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML 97), Morgan Kaufmann, San Francisco, 1997, pp. 143-151.

T.A. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML 98), Springer-Verlag, Berlin, 1998, pp. 137-142.

W. Lam, K.F. Low, and C.Y. Ho, "Using Bayesian Network Induction Approach for Text Categorization," 15th Int'l Joint Conf. Artificial Intelligence (IJCAI 97), AAAI Press, Menlo Park, Calif., 1997, pp. 745-750.

W. Lam and C.Y. Ho, "Using a Generalized Instance Set for Automatic Text Categorization," Proc. 21st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '98), ACM Press, New York, 1998, pp. 81-89.

D.D. Lewis and M. Ringuette, "Comparison of Two Learning Algorithms for Text Categorization," Proc. Third Ann. Symp. Document Analysis and Information Retrieval, Information Sciences Research Inst., Las Vegas, 1994, pp. 81-93.

D.D. Lewis and W.A. Gale, "A Sequential Algorithm for Training Text Classifiers," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994.

D.D. Lewis et al., "Training Algorithms for Linear Text Classifiers," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '96), ACM Press, New York, 1996, pp. 298-306.

R. Liere and P. Tadepalli, "Active Learning with Committees: Preliminary Results in Comparing Winnow and Perceptron in Text Categorization," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

Bag of words (frq)    Stop list + frequency weight    Decision rules
Bag of words (frq)    Stemming + min. frq    Boosted decision trees
Bag of words    Informativity    TFIDF, Winnow, WordStat
Bag of words (frq)    Stop-list + stemming    TFIDF
Bag of words (frq)    LSI (latent semantic indexing using SVD)    -
Bag of words (frq)    LSI
Bag of words    -
Bag of words + word position    Ordered word list    -
Minimum frq
Bag of words + WordNet    Minimum connectivity
Bag of words (frq)    Minimum frq + informativity
Minimum frq    Bag of words (frq)
Bag of words (frq)    Mutual info
Bag of words (frq)    Stop list
Bag of words    Stop list + informativity
Bag of words    Log likelihood ratio
Bag of words (frq)    -
Bag of words    -
TFIDF
Memory-based reasoning
Decision rules, ILP
Decision rules, sleeping expert
Semantic relationship graph
TFIDF, PrTFIDF, naive Bayes
Support Vector Machines
Bayesian network
Generalized instance set, k-nearest neighbor
Naive Bayes, decision trees
Logistic regression with naive Bayes
Widrow-Hoff, EG
Winnow (in query by committee)

IEEE INTELLIGENT SYSTEMS



AUTHORS    DOCUMENT REPRESENTATION    FEATURE SELECTION    CLASSIFICATION

P. Maes, "Agents that Reduce Work and Information Overload," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 30-40.

D. Mladenic, Personal WebWatcher: Implementation and Design, Tech. Report IJS-DP-7472, Carnegie Mellon Univ., Pittsburgh, 1996; http://www.cs.cmu.edu/~TextLearning/pww.

D. Mladenic and M. Grobelnik, "Feature Selection for Classification Based on Text Hierarchy," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998.

D. Mladenic and M. Grobelnik, "Word Sequences as Features in Text-Learning," Proc. Seventh Electrotechnical and Computer Science Conf. (ERK '98), IEEE Region 8, Slovenia Section IEEE, Ljubljana, Slovenia, 1998, pp. 145-148.

I. Moulinier and J.-G. Ganascia, "Applying an Existing Machine Learning Algorithm to Text Categorization," Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, S. Wermter, E. Riloff, and G. Scheler, eds., Springer-Verlag, Berlin, 1996, pp. 343-354.

K. Nigam and A. McCallum, "Pool-Based Active Learning for Text Classification," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites," Proc. 13th Nat'l Conf. Artificial Intelligence AAAI 96, AAAI Press, Menlo Park, Calif., 1996, pp. 54-61.

M. Pazzani and D. Billsus, "Learning and Revising User Profiles: The Identification of Interesting Web Sites," Machine Learning 27, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997, pp. 313-331.

J. Shavlik and T. Eliassi-Rad, "Building Intelligent Agents for Web-Based Tasks: A Theory-Refinement Approach," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

S. Slattery and M. Craven, "Learning to Exploit Document Relationships and Structure: The Case for Relational Learning on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

M. McElligott and H. Sorensen, "An Emergent Approach to Information Filtering," Abakus. U.C.C. Computer Science J., Vol. 1, No. 4, Dec. 1993, pp. 1-19.

H. Sorensen and M. McElligott, "PSUN: A Profiling System for Usenet News," CIKM 95 Intelligent Information Agents Workshop, 1995.

E. Wiener, J.O. Pedersen, and A.S. Weigend, "A Neural Network Approach to Topic Spotting," Proc. Fourth Ann. Symp. Document Analysis and Information Retrieval (SDAIR '95), Information Science Research Inst., Las Vegas, 1995; http://www.stern.nyu.edu/~aweigend/Research/Papers/TextCategorization.

Y. Yang, "Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994, pp. 13-22.

Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," Information Retrieval J., May 1999.

Bag of words + header information
Bag of words (frq)
Bag of words using n-grams (frq)
Bag of words
Bag of words

Bag of words
Localized bag of words
Bag of words + hypertext graph
n-gram graph (only bigrams)
Bag of words
Bag of words
Selecting keywords    Memory-based reasoning
Informativity    Naive Bayes, nearest neighbor
Stop list + minimum frq + odds ratio    Naive Bayes
Informativity
Minimum frq
Stop list + informativity
Stop-list + stemming
Informativity
Decision rules
EM with QBC
TFIDF, naive Bayes, nearest neighbor, neural networks, decision trees
Theory refinement on neural networks
Naive Bayes, ILP
Weighting graph edges    Connectionist combined with genetic algorithms
Stop-list + minimum frq + stemming + relevancy or LSI    Neural networks, logistic regression
Informativity, χ²-stat.    k-nearest neighbor, LLSF

As Table A shows, many current systems that learn from text use the bag-of-words representation with Boolean features, which indicate if a specific word occurred in a document, or with the frequency of a word in a given document. Some work uses additional information such as word position or word tuples called n-grams (for example, "machine learning" is a 2-gram and "World Wide Web" is a 3-gram). Some recent work indicates that the use of hypertext structure and graph organization of Web pages improves classification results. There is currently no study that compares different document representations over several domains to show clear advantages of some representation.
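The representations described above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed systems: the function names `tokenize`, `ngram_features`, and `boolean_features` and the sample sentence are hypothetical, and the tokenizer is a deliberately simple assumption (lowercasing and splitting on non-alphanumeric characters).

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    words, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def ngram_features(text, max_n=2):
    """Frequency features: count every word sequence of length 1..max_n."""
    words = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def boolean_features(counts):
    """Boolean features: record only which features occurred at all."""
    return set(counts)


doc = "Machine learning on the World Wide Web: machine learning for text."
freq = ngram_features(doc, max_n=3)      # frequency vector (frq)
present = boolean_features(freq)         # Boolean variant
```

With `max_n=1` the sketch reduces to the plain bag of words; larger values add the word pairs and triples that the n-gram work in Table A uses as extra features.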

Number of features. One of the frequently used approaches to reduce the number of different words is to remove words that occur in the "stop list" containing common English words like "a," "the," or "with," or to prune the infrequent words (word frequency < min. frequency); again,

(continued)



navigate the Web. Text learning can be applied on collected information to help users browsing the Web. Our work on Personal WebWatcher [3] is mainly inspired by

WebWatcher, a Learning Apprentice for the World Wide Web, [12,13] and other work related to learning apprentice and learning from text. A learning apprentice lets us auto-

see Table A for examples of work being done in the various areas I describe. Connected to the particular language is also word stemming, which reduces the number of different words using a language-specific stemming algorithm (for example, "work" replaces "working" and "worked").
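As a rough illustration of stop-list removal and minimum-frequency pruning, the following Python sketch builds a reduced vocabulary from a small document set. The stop list, the threshold, the function name `select_vocabulary`, and the sample documents are all assumptions made for the example; stemming is left out because it depends on a language-specific algorithm.

```python
from collections import Counter

# Illustrative stop list only; real stop lists are much longer.
STOP_LIST = {"a", "the", "with", "of", "and", "to", "in"}


def select_vocabulary(documents, stop_list=STOP_LIST, min_frequency=2):
    """Keep words that are not in the stop list and that occur at least
    min_frequency times over the whole document set."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return {
        word
        for word, count in totals.items()
        if word not in stop_list and count >= min_frequency
    }


docs = [
    "the agent learns with a model",
    "the agent highlights a hyperlink",
    "a model of the user",
]
vocab = select_vocabulary(docs)
```

Here only "agent" and "model" survive: the frequent words "a" and "the" are removed by the stop list, and every other word occurs just once.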

Many approaches use a language-independent approach and introduce some sort of word scoring to select only the best words or reduce the dimensionality using latent semantic indexing (LSI) with singular value decomposition.

Experiments with different numbers of selected features used in text classification indicate that the best results come from either using only a small percentage of carefully selected features (up to 10% of all features) or, in some cases, using all the features (see the papers by Lewis and colleagues, Yang and Pedersen, and Wiener and colleagues in Table A). A comparison of different word-scoring measures used in feature-subset selection for text data shows that the most promising measures take into account the nature of the problem domain and the characteristics of the classification algorithm used. [1] Surprisingly good results are obtained using a simple frequency measure in a combination with a stop list.

Algorithms. One well-established technique for text classification in information retrieval is to represent each document with a bag of words as a TFIDF vector in the space of words that appear in training documents, sum all interesting document vectors, and use the resulting vector as a model for classification (based on the relevance feedback method; see papers by Salton and Buckley and by Rocchio in Table A). Each component of a document vector, $d^{(i)} = \mathrm{TF}(w_i, d) \cdot \mathrm{IDF}(w_i)$, is calculated as the product of TF (term frequency, the number of times word $w_i$ occurred in a document) and $\mathrm{IDF} = \log[D / \mathrm{DF}(w_i)]$ (inverse document frequency), where $D$ is the number of documents and document frequency $\mathrm{DF}(w_i)$ is the number of documents where word $w_i$ occurred at least once. The exact formulas used in different approaches might vary slightly (some factors are added and normalization is performed, but the idea remains the same). The approach then represents a new document as a vector in the same vector space as the generated model and measures the distance between them (usually using the cosine similarity measure) to classify the document. This technique is commonly used to get baseline results when testing a machine-learning algorithm on text data (see Mitchell in Table A). TFIDF classification has already been used in machine-learning experiments on Web data and, in most cases, it proved inferior to tested machine-learning methods.
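The TFIDF procedure just described can be sketched in a few lines of Python. This is only one plain reading of the formulas above (no extra weighting factors or normalization beyond the cosine measure); the function names and the tiny training set are hypothetical, and words unseen in training are simply ignored, which is an assumption of the sketch.

```python
import math
from collections import Counter


def idf_table(tokenized_docs):
    """IDF(w) = log(D / DF(w)), with D documents and DF(w) the number of
    documents in which w occurs at least once."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for words in tokenized_docs:
        df.update(set(words))
    return {w: math.log(n_docs / df[w]) for w in df}


def tfidf_vector(words, idf):
    """Component for word w is TF(w, d) * IDF(w); unseen words are skipped."""
    tf = Counter(words)
    return {w: tf[w] * idf[w] for w in tf if w in idf}


def sum_vectors(vectors):
    """Sum the vectors of the interesting documents into one model vector."""
    total = Counter()
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    return dict(total)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


training = [
    "machine learning for text".split(),
    "text learning on the web".split(),
    "cheap airline tickets".split(),
    "cooking a pasta dish".split(),
]
interesting = training[:2]               # documents marked as interesting
idf = idf_table(training)
model = sum_vectors(tfidf_vector(d, idf) for d in interesting)

score_ml = cosine(tfidf_vector("learning from text".split(), idf), model)
score_travel = cosine(tfidf_vector("airline tickets".split(), idf), model)
```

A new document is then classified by comparing its similarity to the model against a threshold, or against the similarity to a second model built from the uninteresting documents.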

An extension of TFIDF called probabilistic TFIDF takes into account document representation and outperforms TFIDF, while proving comparable to the naive Bayesian classifier (Joachims in Table A). The naive Bayesian classifier and the k-nearest neighbor are two classifiers commonly used in text learning and reported to be among the best performing classifiers for text data. In addition to using the naive Bayesian classifier and nearest neighbor, several experimenters performed experiments on text data with symbolic learning using decision trees, and one group experimented using decision rules. Another group compared the performance of linear least-square fit (LLSF) and a variant of k-nearest neighbor, reporting that both classifiers achieved similar results.

As Table A shows, Creecy et al. and Maes used memory-based reasoning. Apte and colleagues used decision rules and later boosted decision trees. Cohen used decision rules, the sleeping experts algorithm, and two inductive logic programming (ILP) algorithms, FOIL and Flipper. Slattery and Craven used the naive Bayesian classifier and two ILP algorithms, FOIL and FOIL-PILFS (FOIL with Predicate Invention for Large Feature Spaces). Lewis et al. used a combination of the naive Bayesian classifier and logistic regression, the Widrow-Hoff algorithm, and exponential gradient (EG). Using neural networks, Wiener et al. achieved slightly better results than logistic regression. McElligott and Sorensen used a connectionist approach combined with genetic algorithms, and Lam et al. used Bayesian network induction.

matically customize to individual users, using each user interaction as a training example. Personal WebWatcher can be installed locally by the user and connected to

Gelfand used semantic relationship graphs (SRG) to represent documents based on the WordNet lexical database. This approach performs classification similar to TFIDF, defining each class by a group of training documents and representing it as a union of their SRG representation. Armstrong et al. used TFIDF, the Winnow algorithm, and a statistical approach called WordStat that assumes mutual independence of words. Shavlik and Eliassi used theory refinement on neural networks, where the user provides initial advice that is compiled into neural networks and refined during the interaction with the user, based on the user's page ratings and additionally provided advisories. Liere and Tadepalli used active learning involving a committee of Winnow learners. Nigam and McCallum also used active learning in a combination of query by committee and the expectation-maximization algorithm.

There is currently no strong evidence pointing to the superiority of any of these text-learning algorithms over different domains. Most experiments show the superiority of the tested algorithm over the TFIDF classification. In comparing learning algorithms, Pazzani and Billsus indicate that a document representation including feature selection is a more promising approach to classification-accuracy improvement than finding a better learning algorithm. In their experiments, the naive Bayesian classifier, nearest neighbor, and neural networks performed best on tested data. Yang found similar good performance for k-nearest neighbor, neural networks, and linear least-square fit, while also showing poor performance for the naive Bayesian classifier. However, Yang and Joachims reported that on their domains feature-subset selection was not crucial for the classifier performance. Joachims reported that Support Vector Machines outperform the naive Bayesian classifier, while Apte, Damerau, and Weiss obtained even better results using boosted decision trees. Lam and Ho observed that the generalized-instance-set algorithm achieved better results than either k-nearest neighbor or linear classifiers (Rocchio, Widrow-Hoff).

1. D. Mladenic and M. Grobelnik, "Feature Selection for Classification Based on Text Hierarchy," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~TextLearning/pww.

2. D. Mladenic and M. Grobelnik, "Word Sequences as Features in Text-Learning," Proc. Seventh Electrotechnical and Computer Science Conf. (ERK '98), IEEE Region 8, Slovenia Section IEEE, Ljubljana, Slovenia, 1998, pp. 145-148.

3. D. Mladenic, Machine Learning on Nonhomogeneous, Distributed Text Data, PhD thesis, Faculty for Computer and Information Science, Univ. of Ljubljana, Slovenia, 1998; http://www-ai.ijs.si/DunjaMladenic/PhD.html.



the Web browser as a proxy server. A prototype version of the system is available and used for research purposes (http://cs.cmu.edu/~TextLearning/pww).

Personal WebWatcher is a content-based personal agent that helps users browse the Web. Imagine a user using a Web browser for requesting documents, most often by clicking the hyperlinks on already requested documents that are presented to that user. Predicting and highlighting the clicked hyperlinks is one way to help users navigate the Web. In the machine-learning setting we used, we considered all the hyperlinks shown to the user as training examples with a Boolean class value (clicked or unclicked). We build descriptions of hyperlinks and treat these as documents for machine-learning examples. Our approach collects training examples online from each interaction with the user. Data collected from a single user represents a domain to be handled by machine-learning techniques. The system's first version uses a bag-of-words document representation, feature selection using informativity, and a naive Bayesian classifier.

Personal WebWatcher's structure. The idea is to help users browse the Web without putting any additional workload on them. The only work involved is to install the system and simply connect it to the Web browser as a proxy server. Each request goes from the user to our system, which retrieves the requested Web page's original document from the Web. The original is stored on disk for learning and processed for adding the suggestions. Personal WebWatcher sends the modified Web page to the user instead of the original one. The modification we introduce does not remove any information from the Web page; it adds small icons highlighting potentially interesting hyperlinks on the page.

As Figure 4 shows, Personal WebWatcher consists of two main parts: a proxy server that interacts with the user through the Web browser and a learner that provides the user model to the server. The communication between them is through a disk; the proxy saves addresses of visited documents (URLs), and the learner uses them to retrieve documents used to generate a model of user interests.

Proxy waits in an infinite loop for a Web page request from the browser. On request, it fetches the requested document and, if it is an HTML document, adds advice and forwards the document to the user. To add suggestions,

JULY/AUGUST 1999

proxy forwards the document to the adviser, which extracts descriptions of hyperlinks from the document and calls classification that uses the generated user model. Each hyperlink is classified (assigned a degree of relevance based on its similarity to the user model) and the most relevant hyperlinks are highlighted.

Figure 4. Structure of Personal WebWatcher, an assistant for Web browsing.

Figure 5 gives an example page presented via the Netscape browser by Personal WebWatcher. The document is actually the WebWatcher project page. Once run, the system processes the requested pages by adding a banner to the top of the page showing that Personal WebWatcher is watching over the user's shoulder and highlighting interesting hyperlinks. A limited number of hyperlinks that are scored above a given threshold are recommended to the user, indicating their scores using graphical symbols placed around highly scored hyperlinks. For example, in Figure 5, three hyperlinks are suggested by Personal WebWatcher ("Machine Learning Information Services" and two project members, Dayne Freitag and Thorsten Joachims) based on the model of interests built from about 500 documents I visited in 1996.

Figure 5. Example of HTML page presented to the user by Personal WebWatcher. Notice that three hyperlinks are highlighted as interesting ("Machine Learning Information Services" and two project members: Dayne Freitag, Thorsten Joachims).
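The adviser step can be pictured with the following Python sketch. It is an illustration under stated assumptions, not Personal WebWatcher's code: the class and function names (`LinkExtractor`, `NaiveBayes`, `advise`) are hypothetical, a hyperlink is described only by its anchor text, the classifier is a textbook two-class multinomial naive Bayes with Laplace smoothing, and the informativity-based feature selection mentioned in the article is omitted.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


class NaiveBayes:
    """Two-class (clicked / unclicked) naive Bayes with Laplace smoothing.
    Assumes the training data contains at least one example of each class."""

    def fit(self, examples):
        self.word_counts = {True: Counter(), False: Counter()}
        self.class_counts = Counter()
        for text, clicked in examples:
            self.class_counts[clicked] += 1
            self.word_counts[clicked].update(text.lower().split())
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_joint(self, words, label):
        total = sum(self.word_counts[label].values())
        logp = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for w in words:
            if w in self.vocab:  # words never seen in training are ignored
                logp += math.log(
                    (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                )
        return logp

    def p_clicked(self, text):
        words = text.lower().split()
        a = self._log_joint(words, True)
        b = self._log_joint(words, False)
        m = max(a, b)
        return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


def advise(html, model, threshold=0.5, max_links=3):
    """Return the hrefs of at most max_links hyperlinks scored above threshold."""
    parser = LinkExtractor()
    parser.feed(html)
    scored = [(model.p_clicked(text), href) for href, text in parser.links]
    best = sorted((s for s in scored if s[0] > threshold), reverse=True)
    return [href for _, href in best[:max_links]]


history = [
    ("machine learning papers", True),
    ("text learning group", True),
    ("campus parking map", False),
    ("cafeteria menu", False),
]
model = NaiveBayes().fit(history)
page = (
    '<p><a href="/ml">Machine Learning Information Services</a> '
    '<a href="/parking">Parking map</a></p>'
)
suggested = advise(page, model)
```

In a proxy setting, the returned hrefs would be the hyperlinks around which the highlighting icons are inserted before the page is forwarded to the browser.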

OUR PERSONAL WEBWATCHER project is involved in ongoing research on different aspects of using text learning for better Web browsing, including richer document representations using dynamically constructed background knowledge and making word-occurrence prediction based on very short documents. [18] An important direction of text learning is using information extraction specialized for different domains, such as finding publication citations in the research papers available on the Web (CiteSeer [19]) or building a knowledge database from the Web (WebKB [20]). The current trend of using machine learning for intelligent agents includes a combination of text processing with speech recognition and content processing of image or video. [21]

On a broader scale, interesting research questions for future work include the study of scalability to domains having very short or very long documents. My colleagues and I have described experiments on very short documents (containing hyperlink descriptions). [18] Other interesting research questions include the influence of sparse-word statistics, the incorporation of temporal information in Web-browsing behavior, a combination of natural language and statistical methods, learning from structure in hypertext, the development of statistical models that represent hypertext structure, and combining evidence from multiple sources. [22]

Acknowledgments

This work was financially supported by the Slovenian Ministry for Science and Technology. Part of this work was performed during my stay at Carnegie Mellon University in Tom Mitchell's group. Many thanks to the anonymous reviewers for valuable comments and suggestions. I am grateful to Nada Lavrac and the magazine's editors for their valuable comments and suggestions on the latest version of this article.

References

1. T.M. Mitchell, Machine Learning, McGraw Hill, New York, 1997.

2. U. Fayyad et al., Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, Cambridge, Mass., 1996.

3. D. Mladenic, Personal WebWatcher: Implementation and Design, Tech. Report IJS-DP-7472, Carnegie Mellon Univ., Pittsburgh, 1996; http://cs.cmu.edu/~TextLearning/pww.

4. M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 66-70.

5. P. Maes, "Agents That Reduce Work and Information Overload," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 30-40.

6. S. Hedberg, "Agents for Sale: First Wave of Intelligent Agents Go Commercial," IEEE Expert, Vol. 11, No. 6, Dec. 1996.

7. R.C. Holte and C. Drummond, "A Learning Apprentice for Browsing," AAAI Spring Symp. Software Agents, AAAI Press, Menlo Park, Calif., 1994.

8. C. Drummond, D. Ionescu, and R. Holte, A ... of Software Libraries, Tech. Report TR-95-12, Computer Science Dept., Univ. of Ottawa, Ottawa, Canada, 1995.

9. O. Etzioni and D. Weld, "A Softbot-Based Interface to the Internet," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 72-79.

10. M. Gams and M. Grobelnik, "Intelligent Agents in Information Society," ... N.J., 1997, pp. 125-128.

11. F. Menczer, "Arachnid: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery," Proc. 14th Int'l Conf. Machine Learning, 1997, pp. 227-235.

12. R. Armstrong et al., "WebWatcher: A Learning Apprentice for the World Wide Web," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

13. T. Joachims et al., "Machine Learning and Hypertext," Fachgruppentreffen Maschinelles Lernen, Dortmund, Aug. 1995.

14. T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML 97), 1997, pp. 143-151.

15. H.C.M. de Kroon, T. Mitchell, and E.J.H. Kerckhoffs, "Improving Learning Accuracy in Information Filtering," ICML-96 Workshop: Machine Learning Meets Human-Computer Interaction, 1996.

16. K. Lang, "News Weeder: Learning to Filter Netnews," Proc. 12th Int'l Conf. Machine Learning (ICML 95), Morgan Kaufmann, San Francisco, 1995, pp. 313-339.

17. T. Mitchell, "Experience with a Learning Personal Assistant," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 81-91.

18. D. Mladenic, Machine Learning on Non-homogeneous, Distributed Text Data, PhD thesis, Faculty for Computer and Information Science, Univ. of Ljubljana, Slovenia, 1998; http://cs.cmu.edu/~TextLearning/pww/PhD.html.

19. K. Bollacker, S. Lawrence, and L. Giles, "CiteSeer: An Autonomous System for Processing and Organizing Scientific Literature on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://cs.cmu.edu/~conald/conald.shtml.

20. T. Mitchell et al., "The World Wide Knowledge Base Project," 1998; http://cs.cmu.edu/~WebKB.

21. "Working Notes of Workshop on Mixed ...," Carnegie Mellon Univ., Pittsburgh, 1998; http://cs.cmu.edu/~conald/conald.shtml.

22. J. Carbonell et al., Report on the CONALD Workshop on Learning from Text and the Web, Carnegie Mellon Univ., Pittsburgh, 1998.

Dunja Mladenic is a researcher in computer science at the J. Stefan Institute and a teaching assistant at Ljubljana University. Most of her work is connected with the study and development of machine-learning techniques and their applications on real-world problems from different areas such as medicine, pharmacology, manufacturing, and economics. She is currently working on using machine learning in data analysis, with particular interest in learning from text and the Web. She received her PhD (http://www-ai.ijs.si/DunjaMladenic/PhD.html) from the Faculty for Computer and Information Science, University of Ljubljana. Contact her at the Dept. of Intelligent Systems, J.

