Open Source Text Mining. Text Mining 2003 @ SDM03, Cathedral Hill Hotel, San Francisco. Hinrich Schütze, Enkata. May 3, 2003


Open Source Text Mining

Text Mining 2003 @ SDM03, Cathedral Hill Hotel, San Francisco

Hinrich Schütze, Enkata

May 3, 2003


Motivation

- Open source used to be a crackpot idea.
  - Bill Gates on Linux (1999.03.24): “I really don't think in the commercial market, we'll see it in any significant way.”
  - MS 10-Q quarterly filing (2003.01.31): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.”
- Open source is an enabler for radical new things.
  - Google
  - Ultra-cheap web servers
    - Free news
    - Free email
    - Free …
  - Class projects
  - Walmart PC for $200


GNU-Linux


Web Servers: Open Source Dominates

Source: Netcraft


Motivation (cont.)

- Text mining has not had much impact.
  - Many small companies & small projects
  - No large-scale adoption
  - Exception: text-mining-enhanced search
- Text mining could transform the world.
  - Unstructured → structured
  - Information explosion
    - Amount of information has exploded
    - Amount of accessible information has not
- Can open source text mining make this happen?


Unstructured vs Structured Data

[Figure: bar chart comparing Unstructured and Structured data on two measures, Data volume and Market Cap; vertical axis from 0 to 100. Source: Prabhakar Raghavan, Verity]


Business Motivation

- High cost of deploying text mining solutions
- How can we lower this cost?
- 100% proprietary solutions
  - Require re-invention of core infrastructure
  - Leave fewer resources for high-value applications built on top of core infrastructure


Definitions

- Open source
  - Public domain, BSD, GPL (GNU General Public License)
- Text mining
  - Like data mining but for text
  - NLP (Natural Language Processing) subdiscipline
  - Has interesting applications now
  - More than just information retrieval / keyword search
  - Usually: some statistical, probabilistic or frequentist component


Text Mining vs. NLP (Natural Language Processing)

What is not text mining: speech, language models, parsing, machine translation

Typical text mining: clustering, information extraction, question answering

Statistical and high volume


Text Mining: History

80s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP).

90s: DARPA sponsors Message Understanding Conferences (MUC) and Information Extraction (IE) community.

Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.


Text Mining: Hearst’s Definition

- Finding nuggets
  - Information extraction
  - Question answering
- Finding patterns
  - Clustering
  - Knowledge discovery
  - Text visualization


Information Extraction

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1


Knowledge Discovery: Arrowsmith

- Goal: Connect two disconnected subfields of medicine.
- Technique
  - Start with 1st subfield
  - Identify key concepts
  - Search for 2nd subfield with same concepts
- Implemented in Arrowsmith system
- Discovery: magnesium is potential treatment for migraine
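The three technique steps can be illustrated with a minimal sketch. This is not the Arrowsmith implementation: the function names, the frequency-based notion of a "key concept", the stopword list, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

def key_concepts(docs, stopwords, top_k=5):
    """Return the top_k most frequent non-stopword terms in a list of documents."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stopwords
    )
    return {term for term, _ in counts.most_common(top_k)}

def rank_by_shared_concepts(field_a_docs, candidate_fields, stopwords, top_k=5):
    """Rank candidate subfields by how many key concepts they share with field A."""
    concepts_a = key_concepts(field_a_docs, stopwords, top_k)
    overlaps = {
        name: concepts_a & key_concepts(docs, stopwords, top_k)
        for name, docs in candidate_fields.items()
    }
    return sorted(overlaps.items(), key=lambda item: len(item[1]), reverse=True)

# Toy, invented sentences standing in for two literatures.
stopwords = {"to", "and", "in", "linked", "reduces", "inhibits", "attacks", "supports"}
migraine_docs = [
    "migraine linked to vascular spasm",
    "migraine and platelet aggregation",
    "vascular spasm in migraine attacks",
]
candidates = {
    "magnesium": ["magnesium reduces vascular spasm", "magnesium inhibits platelet aggregation"],
    "zinc": ["zinc supports immune function", "zinc and wound healing"],
}
ranked = rank_by_shared_concepts(migraine_docs, candidates, stopwords)
```

In this toy setup the magnesium literature ranks first because it shares intermediate concepts with the migraine literature even though no document mentions both.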


Knowledge Discovery: Arrowsmith


When is Open Source Successful?

- “Important” problem
  - Many users (operating system)
  - Fun to work on (games)
  - Public funding available (OpenBSD, security)
  - Open source author gains fame/satisfaction/immortality/community
- Adaptation
  - A little adaptation is easy
  - Most users do not need any adaptation (out of the box use)
- Incremental releases are useful
- Cost sharing without administrative/legal overhead
  - Dozens of companies with significant interest in Linux (IBM …)
  - Many of these companies contribute to open source
  - This is in effect an informal consortium
  - A formal effort probably would have killed Linux.
  - Same applies to text mining?
- Also: bugs, security, high-availability, ideal for consulting & hardware companies like IBM


When is Open Source Not Successful?

- Boring & rare problem
  - Print driver for 10 year old printer
- Complex integrated solutions
  - QuarkXPress
  - ERP systems
- Good UI experience for non-geeks
  - Apple
  - Microsoft Windows (at least for now)


Text Mining and Open Source

- Pro
  - Important problem: fame, satisfaction, immortality, community can be gained
  - Pooling of resources / critical mass
- Con
  - Non-incremental?
  - Most text mining requires significant adaptation.
  - Most text mining requires data resources as well as source code.
  - The need for data resources does not fit well into the open source paradigm.


Text Mining Open Source Today

- Lucene
  - Excellent for information retrieval, but not much text mining.
- Rain/bow, Weka, GTP, TDMAPI
  - Text mining algorithms / infrastructure, no data resources
- NLTK
  - NLP toolkit, some data resources
- WordNet, DMOZ
  - Excellent data resources, but not enough breadth/depth.


Open Source with Open Data

- Spell checkers (e.g., Emacs)
- Antispam software (e.g., SpamAssassin)
- Named entity recognition (GATE/ANNIE)
  - Free version less powerful than in-house


SpamAssassin: Code + Data


Open Data Resources: Examples

- SpamAssassin
  - Classification model for spam
- Named entity recognition
  - Word lists, dictionaries
- Information extraction
  - Domain model, taxonomies, regular expressions
- Shallow parsing
  - Grammars


Code vs Data

[Figure: two-axis chart. Code axis: Proprietary to Open Source. Data axis: No Resources Needed to Significant Resources Needed. Groups placed on the chart: Text Classification, Named Entity Recognition, Information Extraction; Complex & Integrated Software, Good UI Design; Linux, Web Servers; Spam Filtering, Spell Checkers. One region is marked "?".]


Open Source with Data: Key Issues

- Can data resources be recycled?
  - Problems have to be similar.
  - More difficult than one would expect: my first attempt failed (Medline/Reuters).
  - Next: case study
- Assume there is a large library of data resources available.
  - How do we identify the data resources that can be recycled?
  - How do we adapt them?
- How do we get from here to there?
  - Need incremental approach that is sustained by successes along the way.


Text Mining without Data Resources

- Premise: “Knowledge-poor” text mining taps small part of potential of text mining.
- Knowledge-poor text mining examples
  - Clustering
  - Phrase extraction
  - First story detection
- Many success stories


Case Study: ODP -> Reuters

Train on ODP

Apply to Reuters


Case Study: Text Classification

- Key issues for text classification
  - Show that text classifiers can be recycled
  - How can we select reusable classifiers for a particular task?
  - How do we adapt them?
- Case study
  - Train classifiers on Open Directory (ODP)
    - 165,000 docs (nodes), crawled in 2000, 505 classes
  - Apply classifiers to Reuters RCV1
    - 780,000 docs, >1000 classes
- Hypothesis: A library of classifiers based on ODP can be recycled for RCV1.
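As a rough illustration of what "train on one collection, apply to another" means, here is a minimal sketch using a two-class multinomial Naive Bayes model. The talk does not say which learner was used for the case study, so the model choice, the function names, and the toy documents below are all assumptions.

```python
import math
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Train a two-class multinomial Naive Bayes model with add-one smoothing."""
    pos = Counter(w for d in pos_docs for w in d.lower().split())
    neg = Counter(w for d in neg_docs for w in d.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    # Per-word log-likelihood ratio between the two classes.
    weights = {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }
    prior = math.log(len(pos_docs) / len(neg_docs))
    return weights, prior

def score(model, doc):
    """Log-odds that doc belongs to the positive class; unseen words contribute nothing."""
    weights, prior = model
    return prior + sum(weights.get(w, 0.0) for w in doc.lower().split())

# Source collection (directory-style pages, invented): "Japan" vs. everything else.
source_pos = ["tokyo travel guide japan", "japan culture kyoto temples"]
source_neg = ["paris travel guide france", "basketball league scores"]
model = train_nb(source_pos, source_neg)

# Target collection (newswire-style text, invented): the classifier is applied unchanged.
target_docs = ["tokyo stocks close higher", "paris bourse ends lower"]
target_scores = [score(model, d) for d in target_docs]
```

Most target words are unknown to the source-trained model, so the decision rests on the few terms the two collections share. That is one reason recycling is harder than it looks.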


Experimental Setup

- Train 505 classifiers on ODP
- Apply them to Reuters
- Compute chi2 for all ODP x Reuters pairs
- Evaluate n pairs with the best chi2
- Evaluation measures
  - Area under ROC curve
    - Plot false positive rate vs true positive rate
    - Compute area under the curve
  - Average precision
    - Rank documents, compute precision for each rank
    - Average for all positive documents
  - Estimated based on 25% sample
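The pair-selection step and the average-precision measure can be sketched as follows. The talk does not spell out how chi2 was computed, so this assumes the standard statistic on a 2x2 table of classifier decision against Reuters label. The counts in `tables` are invented; only the class names come from the talk.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def best_pairs(tables, n):
    """tables maps (odp_class, reuters_class) to (a, b, c, d); return the n pairs with the highest chi2."""
    return sorted(tables, key=lambda pair: chi2_2x2(*tables[pair]), reverse=True)[:n]

def average_precision(scores, labels):
    """Mean of precision at each rank where a positive document appears."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Invented counts: a = classifier yes & label yes, b = yes & no, c = no & yes, d = no & no.
tables = {
    ("RegAsiJap0", "JAP"): (40, 10, 20, 930),
    ("RegAsiJap0", "GSPO"): (2, 48, 58, 892),
}
top = best_pairs(tables, 1)
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

With these toy numbers the Japan/JAP pair is selected, and the ranking with positives at ranks 1 and 3 has average precision (1/1 + 2/3) / 2.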


Japan: ODP -> Reuters ROC Curve

[Figure: ROC curve for the Japan classifier trained on ODP and applied to Reuters. Horizontal axis: False Positive Rate; vertical axis: True Positive Rate.]
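For readers unfamiliar with the measure, a ROC curve and its area can be computed from ranked scores roughly like this. It is a minimal sketch that assumes no tied scores and at least one positive and one negative document; the scores and labels are invented.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc = area_under_curve(points)
```

The area equals the fraction of (positive, negative) document pairs that the classifier ranks in the right order, here 3 out of 4.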


Some Results


BusIndTraMar0 / I76300: Ports


Discussion

- Promising results
- These are results without any adaptation.
- Performance expected to be much better after adaptation.


Discussion (cont)

- Class relationships are m:n, not 1:1
- Reuters: GSPO
  - SpoBasCol0
  - SpoBasMinLea0
  - SpoBasReg0
  - SpoHocIceLeaNatPla0
  - SpoHocIceLeaPro0
- ODP: RegEurUniBusInd0 (UK industries)
  - I13000 (petroleum & natural gas)
  - I17000 (water supply)
  - I32000 (mechanical engineering)
  - I66100 (restaurants, cafes, fast food)
  - I79020 (telecommunications)
  - I9741105 (radio broadcasting)
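One simple way to represent an m:n relationship is a pair of set-valued indexes, one per direction. The sketch below uses only the class pairs named on this slide; the variable names are illustrative.

```python
from collections import defaultdict

# (ODP class, Reuters class) pairs as listed on the slide.
pairs = [
    ("SpoBasCol0", "GSPO"),
    ("SpoBasMinLea0", "GSPO"),
    ("SpoBasReg0", "GSPO"),
    ("SpoHocIceLeaNatPla0", "GSPO"),
    ("SpoHocIceLeaPro0", "GSPO"),
    ("RegEurUniBusInd0", "I13000"),
    ("RegEurUniBusInd0", "I17000"),
    ("RegEurUniBusInd0", "I32000"),
    ("RegEurUniBusInd0", "I66100"),
    ("RegEurUniBusInd0", "I79020"),
    ("RegEurUniBusInd0", "I9741105"),
]

odp_to_reuters = defaultdict(set)
reuters_to_odp = defaultdict(set)
for odp_class, reuters_class in pairs:
    odp_to_reuters[odp_class].add(reuters_class)
    reuters_to_odp[reuters_class].add(odp_class)
```

Here one Reuters class (GSPO) is covered by five ODP classifiers, while one ODP classifier (RegEurUniBusInd0) matches six Reuters industry codes.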


Why Recycling Classifiers is Difficult

- Autonomous vs relative decisions
  - ODP Japan classifier w/o modifications has high precision, but only 1% recall on RCV1!
- Most classifiers are tuned for optimal performance in embedded system.
  - Tuning decreases robustness in recycling.
- Tokenization, document length, numbers
  - Numbers throw off Medline vs. non-Medline categorizer (financial classified as medical)
  - Length-sensitive multinomial Naïve Bayes: nonsensical results
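One plausible reading of the length-sensitivity point: the log-odds of a multinomial Naive Bayes model is a prior plus a sum over tokens, so it grows with document length, and a threshold tuned on one collection can misfire on another with different lengths. The sketch below illustrates this with invented weights and an invented prior; it is not a reconstruction of the failed experiment.

```python
# Hypothetical per-word log-likelihood ratios, log P(w | class) - log P(w | other).
weights = {"yen": 1.2, "tokyo": 0.9, "the": -0.05, "market": -0.1}

def evidence(tokens):
    """Sum of per-token log-likelihood ratios; unknown tokens contribute nothing."""
    return sum(weights.get(t, 0.0) for t in tokens)

def log_odds(tokens, prior=-2.0):
    """Multinomial Naive Bayes log-odds: fixed prior plus evidence that scales with length."""
    return prior + evidence(tokens)

def mean_evidence(tokens):
    """Length-normalized evidence: the average log-likelihood ratio per token."""
    return evidence(tokens) / max(len(tokens), 1)

short_doc = ["yen", "tokyo", "the", "market"]
long_doc = short_doc * 10  # same word mix, ten times longer

short_score, long_score = log_odds(short_doc), log_odds(long_doc)
short_mean, long_mean = mean_evidence(short_doc), mean_evidence(long_doc)
```

With a decision threshold of 0, the short document is rejected and the long one is accepted with high confidence, although the per-token evidence is identical.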


Specifics

- What would an open source text classification package look like?
- Code
  - Text mining algorithms
  - Customization component
    - To adapt recycled data resources
  - Creation component
    - To create new data resources
- Data
  - Recycled data resources
  - Newly created data resources
- Pick a good area
  - Bioinformatics: genes / proteins
  - Product catalogs


Other Text Mining Areas

- Named entity recognition
- Information extraction
- Shallow parsing


Data vs Code

- What about just sharing training sets?
  - Often proprietary
- What about just sharing models?
  - Small preprocessing changes can throw you off completely
- Share (simple?) classifier cum preprocessor and models
  - Still proprietary issues
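The preprocessing point can be made concrete with a small sketch: the same model sees very different inputs depending on the tokenizer, which is an argument for shipping tokenizer and model as one unit. The tokenizers, the weights, and the `SharedClassifier` class are hypothetical illustrations, not part of any existing package.

```python
import re

def tokenize_v1(text):
    """Lowercase, then split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def tokenize_v2(text):
    """Whitespace split with case and punctuation preserved: a seemingly small change."""
    return text.split()

def coverage(tokens, vocab):
    """Fraction of tokens for which the model has a weight."""
    return sum(t in vocab for t in tokens) / len(tokens) if tokens else 0.0

class SharedClassifier:
    """Bundle the tokenizer with the weights so both are shared together."""

    def __init__(self, tokenizer, weights):
        self.tokenizer = tokenizer
        self.weights = weights

    def score(self, text):
        return sum(self.weights.get(t, 0.0) for t in self.tokenizer(text))

# Hypothetical weights of a shared model that was trained with tokenize_v1.
weights = {"tokyo": 0.9, "yen": 1.2, "nikkei": 1.0}
text = "Tokyo: Nikkei up, Yen steady."

matched = coverage(tokenize_v1(text), weights)
mismatched = coverage(tokenize_v2(text), weights)
bundled_score = SharedClassifier(tokenize_v1, weights).score(text)
```

With the original tokenizer three of five tokens hit the model; with the alternative one none do, so the model silently scores every document as if it were empty.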


Open Source & Data

[Figure: release cycle spanning a Public side and a Proprietary side. Boxes: Code+Data V1.0, Enhanced Code+Data, Sanitized & Enhanced Code+Data, Code+Data V1.1. Arrow labels: adapt, sanitize, publish, new release.]


Free Riders?

- Open source is successful because it makes free riding hard.
  - Viral nature of GPL.
- Harder to achieve for some data resources
  - Download models
  - Apply to your data
  - Retrain
  - You own 100% of the result
- Less of a problem for dictionaries and grammars


Data Licenses

- Open Directory License
  - http://rdf.dmoz.org/license.html
  - BSD flavor
- WordNet
  - http://www.cogsci.princeton.edu/~wn/license.shtml
- Copyright
  - No license to sell derivative works?
  - Some criteria for derivative works
    - Substantially similar (Seinfeld trivia)
    - Potential damage to future marketing of derivative works


Code vs Data Licenses

- Some similarity
  - If I open-source my code, then I will benefit from bug fixes & enhancements written by others.
  - If I open-source my data resource, then my classification model may become more robust due to improvements made by others.
- Some dissimilarity
  - Code is very abstract: few issues with proprietary information creeping in.
  - Text mining resources are not very abstract: there is a potential of sensitive information leaking out.


Areas in Need of Research

- How to identify reusable text mining components
  - ODP/Reuters case study does not address this.
  - Need (small) labeled sample to be able to do this?
- How to adapt reusable text mining components
  - Active learning
  - Interactive parameter tweaking?
  - Combination of recycled classifier and new training information
- Estimate performance
  - Most estimation techniques require large labeled samples.
  - The point is to avoid construction of a large labeled sample.
- Create viral license for data resources.
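As one example of how active learning could keep the labeled sample small, uncertainty sampling asks a human to label only the target documents the recycled classifier is least sure about. The sketch below assumes the classifier outputs log-odds; the document ids and scores are invented.

```python
import math

def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def most_uncertain(doc_scores, k):
    """Return the k document ids whose predicted probability is closest to 0.5."""
    return sorted(doc_scores, key=lambda doc: abs(probability(doc_scores[doc]) - 0.5))[:k]

# Hypothetical log-odds from a recycled classifier on unlabeled target documents.
doc_scores = {"d1": 4.0, "d2": -0.2, "d3": -3.5, "d4": 0.6}
to_label = most_uncertain(doc_scores, 2)
```

The two documents near the decision boundary are selected for labeling; the confidently scored ones are left alone.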


Summary

- Many interesting research issues
- Need institution/individual to take the lead
- Need motivated network of contributors
  - data resource contributors
  - source code contributors
- Start with small & simple project that proves idea
- If it works … text mining could become an enabler on a par with Linux.


More Slides


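Columns (inferred from the evaluation measures listed under Experimental Setup): ODP class, Reuters class, area under ROC curve, average precision. Rows are sorted by the last column.

RegAsiJap0 JAP 0.86 0.62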

RegAsiPhi0 PHLNS 0.91 0.56

RegAsiIndSta0 INDIA 0.85 0.53

SpoSocPla0 CCAT 0.60 0.53

RegEurRus0 CCAT 0.58 0.51

RegEurRus0 RUSS 0.85 0.51

SpoSocPla0 GSPO 0.78 0.42

SpoBasReg0 GSPO 0.75 0.33

RegAsiIndSta0 MCAT 0.56 0.32

SpoBasPla1 GSPO 0.80 0.31

SpoBasCol0 GSPO 0.78 0.31

SpoBasCol1 GSPO 0.74 0.26

RegEurSlo0 SLVAK 0.86 0.25

SpoBasPla0 GSPO 0.77 0.24

RegEurRus0 MCAT 0.49 0.23

BusIndTraMar0 I76300 0.81 0.23

SpoHocIceLeaPro0 GSPO 0.71 0.20

SpoBasMinLea0 GSPO 0.71 0.20

RegMidLeb0 LEBAN 0.83 0.19

RecAvi0 I36400 0.74 0.18

RegSou0 BRAZ 0.84 0.18

RegAsiHonBus0 HKONG 0.66 0.18

SpoMotAut0 GSPO 0.67 0.18

SpoHocIceLeaNatPla0 GSPO 0.72 0.17

SocPol0 EEC 0.85 0.17

RegAsiIndSta0 M14 0.59 0.17

RegAsiChiPro0 CHINA 0.67 0.17

RecAvi0 I3640010 0.77 0.17

SpoFooAmeColNca1 GSPO 0.72 0.17

SocPol0 G15 0.86 0.16

RegEurBul0 BUL 0.72 0.15

RegAsiIndPro0 INDON 0.72 0.13

SpoSocPla0 UK 0.49 0.12

RegEurUkr0 UKRN 0.73 0.11

RegEurRus0 GPOL 0.48 0.11

RegEurPolVoi0 POL 0.67 0.11

RegAsiIndSta0 M141 0.61 0.10

SpoFooAmeNflPla0 GSPO 0.65 0.09

RegEurGerSta0 GFR 0.56 0.09

RegEurFra0 FRA 0.54 0.09

RegCar0 CUBA 0.76 0.09

RegEurUniBusInd0 C18 0.59 0.08

RegEurUniEngEss0 I66200 0.72 0.08

RegSou0 PERU 0.88 0.08

ComHar0 C22 0.61 0.08

RegMidTur0 TURK 0.69 0.08

RegAsiIndSta0 M13 0.56 0.08

RegEurUniBusInd0 C181 0.59 0.07

RegNorUniCalLocPxx0 LATV 0.64 0.07

RegEurRus0 GVIO 0.52 0.07

SpoSocPla0 ITALY 0.58 0.07

RegEurUniSco0 GSPO 0.54 0.07

RegEurNet0 NETH 0.65 0.07

RegEurRus0 GDIP 0.46 0.07

ArtMusStyCouBan0 GENT 0.52 0.07

RegEurRus0 BYELRS 0.92 0.06

BusIndTraMar0 C24 0.54 0.06

BusIndTraMar0 I74000 0.72 0.06

RegNorMexSta0 I76300 0.58 0.06

SpoHocIceLeaNatPla0 CANA 0.54 0.06

RegSou0 MRCSL 1.00 0.06

SocRelBud0 GREL 0.57 0.05

RegEurBel0 FRA 0.49 0.05

SpoSocPla0 FRA 0.50 0.05

RegEurUniBusInd0 I6540005 0.69 0.05

RegNorCanQueLoc0 FRA 0.46 0.05

RegEurGerSta0 GSPO 0.45 0.05

RegAsiIndSta0 M131 0.61 0.05

RegAsiPak0 SHAJH 0.76 0.05

SpoSocPla0 GFR 0.48 0.05

RegSou0 PARA 0.90 0.04

RegEurUniBusInd0 I9741109 0.59 0.04

RegSou0 BOL 0.90 0.04

RegEurRus0 UKRN 0.83 0.04

SpoSocPla0 SPAIN 0.61 0.04

NewOnlCnn0 BAH 0.56 0.04

ArtAniVoi0 I97100 0.70 0.03

RegEurRus0 NATO 0.75 0.03

RegEurRus0 GDEF 0.55 0.03

SpoSocPla0 MONAC 0.87 0.03

SciEarPal0 GSCI 0.42 0.03

RegEurRom0 ROM 0.57 0.03

RegAsiPhi0 I85000 0.66 0.03

SpoBasReg0 SPAIN 0.59 0.03

BusIndTraMar0 USSR 0.47 0.03

SpoSocPla0 NETH 0.54 0.03

SpoFooAmeNflPla0 CANA 0.48 0.03

RegEurRus0 AZERB 0.94 0.03

SciBioTaxTaxPlaMagMag0 ECU 0.54 0.03

RegNorUniCalLocPxx0 I41500 0.65 0.02

RegEurRus0 TADZK 0.95 0.02

RegEurUniBusInd0 I8150206 0.71 0.02

RegEurUniBusInd0 I81502 0.58 0.02

RegSou0 URU 0.88 0.02

RegEurUniBusInd0 I50300 0.74 0.02

RegEurUniBusInd0 I37100 0.79 0.02

RefFlaReg0 GUREP 0.69 0.02

SciBioTaxTaxPlaMagMag0 I0100144 0.58 0.02

NewOnlCnn0 GWEA 0.66 0.02

RegEurUniBusInd0 I85000 0.57 0.02

ArtCelMxx0 I97100 0.66 0.02

SpoMotAut0 SMARNO 0.88 0.02

RegEurUniBusInd0 I5020022 0.79 0.02

NewOnlCnn0 DOMR 0.55 0.02

ArtMusStyCouBan0 GPRO 0.45 0.02

RegEurUniEngEss0 I83954 0.66 0.02

SpoBasReg0 GREECE 0.51 0.02

RegEurRus0 GRGIA 0.84 0.02

RegEurRus0 KAZK 0.82 0.02

RegEurNet0 M142 0.45 0.02

RegEurUniBusInd0 I83200 0.67 0.01

NewOnlCnn0 BELZ 0.50 0.01

RegEurUniBusInd0 C34 0.49 0.01

RegEurUniEngEss0 I82002 0.56 0.01

SpoBasReg0 ISRAEL 0.38 0.01

RegEurUniBusInd0 I83400 0.73 0.01

RegEurUniBusInd0 I83954 0.67 0.01

RegEurPolVoi0 FIN 0.58 0.01

RegEurRus0 USSR 0.82 0.01

RegEurUniBusInd0 I9741105 0.58 0.01

RegEurUniBusInd0 I32852 0.80 0.01

RegEurUniBusInd0 I83940 0.63 0.01

BusIndTraMar0 BUL 0.37 0.01

RegEurUniBusInd0 I61000 0.68 0.01

BusIndTraMar0 ESTNIA 0.60 0.01

NewOnlCnn0 GABON 0.46 0.01

NewOnlCnn0 CVI 0.70 0.01

SciBioTaxTaxAniChoAve0 GENV 0.45 0.01

SpoMotAut0 MONAC 0.71 0.01

ArtCelBxx0 I97100 0.64 0.01

SpoBasReg0 TURK 0.46 0.01

BusIndTraMar0 PORL 0.57 0.01

SpoBasReg0 CRTIA 0.48 0.01

RegEurUniBusInd0 I95100 0.65 0.01

BusIndTraMar0 CRTIA 0.41 0.01

BusIndTraMar0 UKRN 0.43 0.01

ArtCelLxx0 I97100 0.60 0.01

RegEurRus0 MOLDV 0.78 0.01

RegSou0 SURM 0.80 0.01

BusIndTraMar0 LATV 0.60 0.01

BusIndTraMar0 ALB 0.24 0.01

BusIndTraMar0 LITH 0.58 0.01

ArtCelSxx0 I97100 0.63 0.01

RegEurUniBusInd0 I16000 0.59 0.01

SpoBasCol0 E71 0.42 0.01

SciBioTaxTaxPlaMagMag0 BELZ 0.53 0.01

ArtMusStyCouBan0 GOBIT 0.53 0.01

BusFinBanBanReg0 C173 0.68 0.01

RegEurRus0 ARMEN 0.85 0.01

RegEurRus0 I22471 0.66 0.01

RegEurRus0 TURKM 0.86 0.01

BusIndTraMar0 ROM 0.40 0.01

BusIndTraMar0 TUNIS 0.67 0.00

RegAsiChiPro0 I5020006 0.76 0.00

ArtTelNet0 I9741105 0.67 0.00

BusIndTraMar0 YEMAR 0.49 0.00

BusIndTraMar0 CYPR 0.40 0.00

RefFlaReg0 SLVNIA 0.57 0.00

RegEurUniEngEss0 I9741105 0.57 0.00

RegEurRus0 KIRGH 0.83 0.00

RegCar0 GTOUR 0.55 0.00

BusIndTraMar0 UAE 0.48 0.00

NewOnlCnn0 BERM 0.52 0.00

BusIndTraMar0 NAMIB 0.48 0.00

BusIndTraMar0 JORDAN 0.36 0.00

RecAvi0 C313 0.42 0.00

BusIndTraMar0 MOZAM 0.51 0.00

RegEurUniBusInd0 I66200 0.66 0.00

BusIndTraMar0 SILEN 0.34 0.00

RegMidLeb0 I9741105 0.54 0.00

RegAsiHonBus0 I81400 0.61 0.00

RefFlaReg0 WORLD 0.43 0.00

RegNorUniCalLocVxx0 C313 0.39 0.00

RegAsiHonBus0 I64700 0.72 0.00

RefFlaReg0 UPVOLA 0.58 0.00

SciBioTaxTaxPlaMagMag0 I0100216 0.66 0.00

RegAsiHonBus0 I3640048 0.70 0.00

SciBioTaxTaxAniChoAve0 AARCT 0.53 0.00

RegSou0 I5020051 0.84 0.00

NewOnlCnn0 TCAI 0.00 0.00


Resources

- http://www-csli.stanford.edu/~schuetze (this talk, some additional material)
- Source of Gates quote: http://www.techweb.com/wire/story/TWB19990324S0014
- Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages 1474-79, June 1997. (proposes measure for selecting classifiers for reuse)
- W. Cohen and D. Kudenko. Transferring and Retraining Learned Information Filters. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI 97. (transfer within the same dataset)
- Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pp. 64-72, July 1998. (transfer within the same dataset)
- Motivation of open source contributors: http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11, http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0