1 open source text mining text mining 2003 @ sdm03 cathedral hill hotel, san francisco hinrich...

1

Open Source Text Mining

Text Mining 2003 @ SDM03, Cathedral Hill Hotel, San Francisco

Hinrich Schütze, Enkata

May 3, 2003

2

Motivation

Open source used to be a crackpot idea.
  Bill Gates on Linux (1999.03.24): “I really don't think in the commercial market, we'll see it in any significant way.”
  MS 10-Q quarterly filing (2003.01.31): “The popularization of the open source movement continues to pose a significant challenge to the company's business model.”
Open source is an enabler for radical new things.
  Google
  Ultra-cheap web servers
  Free news
  Free email
  Free …
  Class projects
  Walmart PC for $200

3

GNU-Linux

4

Web Servers: Open Source Dominates

Source: Netcraft

5

Motivation (cont.)

Text mining has not had much impact.
  Many small companies & small projects
  No large-scale adoption
  Exception: text-mining-enhanced search
Text mining could transform the world.
  Unstructured → structured
  Information explosion
    Amount of information has exploded
    Amount of accessible information has not
Can open source text mining make this happen?

6

Unstructured vs Structured Data

[Figure: bar chart comparing unstructured and structured data by data volume and by market cap, vertical axis 0 to 100. Source: Prabhakar Raghavan, Verity]

7

Business Motivation

High cost of deploying text mining solutions
How can we lower this cost?
100% proprietary solutions
  Require re-invention of core infrastructure
  Leave fewer resources for high-value applications built on top of core infrastructure

8

Definitions

Open source
  Public domain, BSD, GPL (GNU General Public License)
Text mining
  Like data mining, but for text
  NLP (Natural Language Processing) subdiscipline
  Has interesting applications now
  More than just information retrieval / keyword search
  Usually: some statistical, probabilistic, or frequentist component

9

Text Mining vs. NLP (Natural Language Processing)

What is not text mining: speech, language models, parsing, machine translation

Typical text mining: clustering, information extraction, question answering

Statistical and high volume

10

Text Mining: History

80s: Electronic text gives birth to Statistical Natural Language Processing (StatNLP).

90s: DARPA sponsors Message Understanding Conferences (MUC) and Information Extraction (IE) community.

Mid-90s: Data Mining becomes a discipline and usurps much of IE and StatNLP as “text mining”.

11

Text Mining: Hearst’s Definition

Finding nuggets
  Information extraction
  Question answering
Finding patterns
  Clustering
  Knowledge discovery
Text visualization

12

Information Extraction

foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.html
  OtherCompanyJobs: foodscience.com-Job1

13

Knowledge Discovery: Arrowsmith

Goal: Connect two disconnected subfields of medicine.

Technique
  Start with 1st subfield
  Identify key concepts
  Search for 2nd subfield with same concepts
Implemented in Arrowsmith system
Discovery: magnesium is a potential treatment for migraine
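
The technique above can be sketched in a few lines. This is a minimal illustration, not the Arrowsmith implementation: "key concepts" are approximated here as words that occur in at least `min_df` documents of a subfield, and the function names, the stopword list, and the toy titles are all assumptions made up for this example.

```python
from collections import Counter

def key_concepts(docs, stopwords, min_df=2):
    """Words that occur in at least min_df documents of one subfield."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()) - stopwords)
    return {word for word, count in df.items() if count >= min_df}

def linked_subfields(start_docs, candidates, stopwords, min_df=2):
    """Map each candidate subfield to the key concepts it shares with the start subfield."""
    start = key_concepts(start_docs, stopwords, min_df)
    links = {}
    for name, docs in candidates.items():
        shared = start & key_concepts(docs, stopwords, min_df)
        if shared:
            links[name] = shared
    return links

# Invented toy titles, for illustration only (not real Medline records).
STOPWORDS = {"in", "and", "during", "of"}
migraine = [
    "migraine attacks involve platelet aggregation",
    "vasospasm reported in migraine patients",
    "platelet activity and vasospasm during migraine",
]
candidates = {
    "magnesium": [
        "magnesium inhibits platelet aggregation",
        "magnesium deficiency and vasospasm",
        "dietary magnesium platelet vasospasm study",
    ],
    "toy_other": [
        "bridge design load testing",
        "bridge cable load fatigue",
    ],
}

links = linked_subfields(migraine, candidates, STOPWORDS)
print(links)  # only subfields that share key concepts with the start subfield
```

A real system would need proper concept extraction (phrases, thesaurus terms) and a filter for subfields that are already known to be connected; the sketch only shows the "same concepts" step.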

14

Knowledge Discovery: Arrowsmith

15

When is Open Source Successful?

“Important” problem
  Many users (operating system)
  Fun to work on (games)
  Public funding available (OpenBSD, security)
  Open source author gains fame/satisfaction/immortality/community
Adaptation
  A little adaptation is easy
  Most users do not need any adaptation (out-of-the-box use)
Incremental releases are useful
Cost sharing without administrative/legal overhead
  Dozens of companies with significant interest in Linux (IBM …)
  Many of these companies contribute to open source
  This is in effect an informal consortium
  A formal effort probably would have killed Linux.
  Same applies to text mining?
Also: bugs, security, high availability; ideal for consulting & hardware companies like IBM

16

When is Open Source Not Successful?

Boring & rare problem
  Printer driver for a 10-year-old printer
Complex integrated solutions
  QuarkXPress
  ERP systems
Good UI experience for non-geeks
  Apple
  Microsoft Windows (at least for now)

17

Text Mining and Open Source

Pro
  Important problem: fame, satisfaction, immortality, community can be gained
  Pooling of resources / critical mass
Con
  Non-incremental?
  Most text mining requires significant adaptation.
  Most text mining requires data resources as well as source code.
  The need for data resources does not fit well into the open source paradigm.

18

Text Mining Open Source Today

Lucene
  Excellent for information retrieval, but not much text mining.
Rain/bow, Weka, GTP, TDMAPI
  Text mining algorithms / infrastructure, no data resources
NLTK
  NLP toolkit, some data resources
WordNet, DMOZ
  Excellent data resources, but not enough breadth/depth.

19

Open Source with Open Data

Spell checkers (e.g., Emacs)
Antispam software (e.g., SpamAssassin)
Named entity recognition (GATE/ANNIE)
Free version less powerful than in-house

20

SpamAssassin: Code + Data

21

Open Data Resources: Examples

SpamAssassin
  Classification model for spam
Named entity recognition
  Word lists, dictionaries
Information extraction
  Domain model, taxonomies, regular expressions
Shallow parsing
  Grammars

22

Code vs Data

[Figure: chart with a code axis (Proprietary vs. Open Source) and a data axis (No Resources Needed vs. Significant Resources Needed). Labels placed on the chart: Text Classification, N. Entity Recognition, Information Extraction, Complex & Integrated SW, Good UI Design, Linux, Web Servers, Spam Filtering, Spell Checkers, and "?".]

23

Open Source with Data: Key Issues

Can data resources be recycled?
  Problems have to be similar.
  More difficult than one would expect: my first attempt failed (Medline/Reuters).
  Next: case study
Assume there is a large library of data resources available.
  How do we identify the data resources that can be recycled?
  How do we adapt them?
How do we get from here to there?
  Need incremental approach that is sustained by successes along the way.

24

Text Mining without Data Resources

Premise: “Knowledge-poor” text mining taps only a small part of the potential of text mining.
Knowledge-poor text mining examples
  Clustering
  Phrase extraction
  First story detection
Many success stories

25

Case Study: ODP -> Reuters

Train on ODP

Apply to Reuters

26

Case Study: Text Classification

Key issues for text classification
  Show that text classifiers can be recycled
  How can we select reusable classifiers for a particular task?
  How do we adapt them?
Case study
  Train classifiers on Open Directory (ODP)
    165,000 docs (nodes), crawled in 2000, 505 classes
  Apply classifiers to Reuters RCV1
    780,000 docs, >1000 classes
Hypothesis: A library of classifiers based on ODP can be recycled for RCV1.

27

Experimental Setup

Train 505 classifiers on ODP
Apply them to Reuters
Compute chi2 for all ODP x Reuters pairs
Evaluate n pairs with the best chi2
Evaluation measures
  Area under ROC curve
    Plot false positive rate vs true positive rate
    Compute area under the curve
  Average precision
    Rank documents, compute precision for each rank
    Average for all positive documents
  Estimated based on 25% sample
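
The pair-selection step can be illustrated as follows. This is a sketch under assumptions, not the code used in the case study: it treats each ODP classifier's output as a set of accepted Reuters documents, builds a 2x2 contingency table against each Reuters class, and ranks pairs by the Pearson chi-square statistic. The function names and the toy document ids are hypothetical.

```python
from itertools import product

def chi2_2x2(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 contingency table.

    n11: accepted by the ODP classifier and labeled with the Reuters class
    n10: accepted, not labeled
    n01: not accepted, labeled
    n00: not accepted, not labeled
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def rank_pairs(accepted, labeled, all_docs, top_n):
    """Return the top_n (chi2, odp_class, reuters_class) triples."""
    scored = []
    for (odp, acc), (reu, lab) in product(accepted.items(), labeled.items()):
        n11 = len(acc & lab)
        n10 = len(acc - lab)
        n01 = len(lab - acc)
        n00 = len(all_docs) - n11 - n10 - n01
        scored.append((chi2_2x2(n11, n10, n01, n00), odp, reu))
    scored.sort(reverse=True)
    return scored[:top_n]

# Toy data: ten Reuters documents, identified by integers (made up).
all_docs = set(range(10))
accepted = {"RegAsiJap0": {0, 1, 2}, "SpoSocPla0": {5, 6}}
labeled = {"JAP": {0, 1, 2, 3}, "GSPO": {5, 6, 7}}

top = rank_pairs(accepted, labeled, all_docs, top_n=2)
for score, odp, reu in top:
    print(f"{odp} x {reu}: chi2 = {score:.2f}")
```

Note that chi-square measures strength of association in either direction, so a strongly negative association also scores high; a real selection step would probably check the sign as well.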

28

Japan: ODP -> Reuters ROC Curve

[Figure: ROC curve for the Japan classifier, trained on ODP and applied to Reuters. Horizontal axis: False Positive Rate; vertical axis: True Positive Rate (0.00 to 1.00).]
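
For readers who want to reproduce the two evaluation measures, here is a minimal sketch. It assumes each document has a real-valued classifier score and a binary relevance label; the function names and the toy scores are illustrative. The area under the ROC curve is computed through its standard equivalent, the fraction of (positive, negative) pairs in which the positive document is ranked higher, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative document")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of precision-at-rank over the ranks of all positive documents."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank  # precision at this rank
    if hits == 0:
        raise ValueError("need at least one positive document")
    return total / hits

# Toy example (made-up scores): positives at ranks 1 and 3 of 5.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
print(roc_auc(scores, labels))            # 5 of 6 pairs ordered correctly
print(average_precision(scores, labels))  # (1/1 + 2/3) / 2
```

The pairwise AUC is quadratic in the number of documents; for a collection the size of RCV1 a sort-based computation would be the practical choice.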

29

Some Results

30

BusIndTraMar0 / I76300: Ports

31

Discussion

Promising results
These are results without any adaptation.
Performance expected to be much better after adaptation.

32

Discussion (cont)

Class relationships are m:n, not 1:1
Reuters: GSPO
  SpoBasCol0
  SpoBasMinLea0
  SpoBasReg0
  SpoHocIceLeaNatPla0
  SpoHocIceLeaPro0
ODP: RegEurUniBusInd0 (UK industries)
  I13000 (petroleum & natural gas)
  I17000 (water supply)
  I32000 (mechanical engineering)
  I66100 (restaurants, cafes, fast food)
  I79020 (telecommunications)
  I9741105 (radio broadcasting)

33

Why Recycling Classifiers is Difficult

Autonomous vs relative decisions
  ODP Japan classifier w/o modifications has high precision, but only 1% recall on RCV1!
  Most classifiers are tuned for optimal performance within the system they are embedded in.
  Tuning decreases robustness in recycling.
Tokenization, document length, numbers
  Numbers throw off Medline vs. non-Medline categorizer (financial classified as medical)
  Length-sensitive multinomial Naïve Bayes: nonsensical results
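
The length sensitivity of multinomial Naïve Bayes is easy to see in a toy calculation. The sketch below is an illustration with made-up word likelihood ratios, not the classifier from the case study: the log-odds of a document is a sum over its tokens, so a document with the same word mix but ten times the length gets ten times the log-odds, and a decision threshold tuned on short documents no longer means the same thing.

```python
import math

def nb_log_odds(tokens, log_ratio, prior_log_odds=0.0):
    """Multinomial Naive Bayes log-odds: prior plus one term per token.

    log_ratio[w] stands for log(P(w | class) / P(w | not class));
    words missing from the model contribute nothing.
    """
    return prior_log_odds + sum(log_ratio.get(t, 0.0) for t in tokens)

def posterior(log_odds):
    """Convert log-odds to a class probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Made-up likelihood ratios for a hypothetical "Japan" classifier.
LOG_RATIO = {"tokyo": math.log(4.0), "market": math.log(0.5)}

short_doc = ["tokyo", "market"]   # short, ODP-style description
long_doc = short_doc * 10         # same word proportions, ten times longer

short_score = nb_log_odds(short_doc, LOG_RATIO)
long_score = nb_log_odds(long_doc, LOG_RATIO)

print(f"short: log-odds {short_score:.2f}, posterior {posterior(short_score):.3f}")
print(f"long:  log-odds {long_score:.2f}, posterior {posterior(long_score):.3f}")
```

Dividing the log-odds by document length, or ranking documents instead of thresholding them, are two possible ways to reduce this effect; whether either is sufficient for recycling is an open question in the talk's terms.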

34

Specifics

What would an open source text classification package look like?

Code
  Text mining algorithms
  Customization component
    To adapt recycled data resources
  Creation component
    To create new data resources
Data
  Recycled data resources
  Newly created data resources
Pick a good area
  Bioinformatics: genes / proteins
  Product catalogs

35

Other Text Mining Areas

Named entity recognition
Information extraction
Shallow parsing

36

Data vs Code

What about just sharing training sets?
  Often proprietary
What about just sharing models?
  Small preprocessing changes can throw you off completely
Share (simple?) classifier cum preprocessor and models
  Still proprietary issues
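
A small example of why a shared model is fragile without its preprocessor. The two tokenizers, the model weights, and the sample sentence below are hypothetical; the point is only that a model keyed on one tokenizer's output sees far fewer of its features when the text is tokenized differently.

```python
import re

def tokenize_as_trained(text):
    """Tokenizer assumed at training time: lowercase, letters only."""
    return re.findall(r"[a-z]+", text.lower())

def tokenize_differently(text):
    """A plausible alternative: whitespace split, case and punctuation kept."""
    return text.split()

def coverage(tokens, vocabulary):
    """Fraction of tokens the model has a weight for."""
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

# Made-up weights of a shared "Japan" model.
weights = {"japan": 2.0, "tokyo": 1.5, "yen": 1.0}
text = "Tokyo: Japan's yen rose."

matched = coverage(tokenize_as_trained(text), weights)
mismatched = coverage(tokenize_differently(text), weights)
print(f"same preprocessing:      {matched:.2f}")
print(f"different preprocessing: {mismatched:.2f}")
```

Shipping the classifier together with its preprocessor, as the slide suggests, removes this particular source of mismatch.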

37

Open Source & Data

[Figure: release cycle across a Public / Proprietary boundary. Boxes: Code+Data V1.0, Enhanced Code+Data, Sanitized & Enhanced Code+Data, Code+Data V1.1. Arrows: adapt, sanitize, publish, new release.]

38

Free Riders?

Open source is successful because it makes free riding hard.
  Viral nature of GPL.
Harder to achieve for some data resources
  Download models
  Apply to your data
  Retrain
  You own 100% of the result
Less of a problem for dictionaries and grammars

39

Data Licenses

Open Directory License
  http://rdf.dmoz.org/license.html
  BSD flavor
WordNet
  http://www.cogsci.princeton.edu/~wn/license.shtml
Copyright
  No license to sell derivative works?
  Some criteria for derivative works
    Substantially similar (Seinfeld trivia)
    Potential damage to future marketing of derivative works

40

Code vs Data Licenses

Some similarity
  If I open-source my code, then I will benefit from bug fixes & enhancements written by others.
  If I open-source my data resource, then my classification model may become more robust due to improvements made by others.
Some dissimilarity
  Code is very abstract: few issues with proprietary information creeping in.
  Text mining resources are not very abstract: there is a potential of sensitive information leaking out.

41

Areas in Need of Research

How to identify reusable text mining components
  ODP/Reuters case study does not address this.
  Need (small) labeled sample to be able to do this?
How to adapt reusable text mining components
  Active learning
  Interactive parameter tweaking?
  Combination of recycled classifier and new training information
Estimate performance
  Most estimation techniques require large labeled samples.
  The point is to avoid construction of a large labeled sample.
Create viral license for data resources.
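
One very simple form of adaptation, shown only as a sketch: keep the recycled classifier's scores and re-pick its decision threshold on a small labeled sample from the target collection. This is an assumption about what "combination of recycled classifier and new training information" could look like, not a method proposed in the talk; the function names and the six toy scores are made up.

```python
def f1_at(threshold, scores, labels):
    """F1 of the decision rule 'accept if score >= threshold' on a labeled sample."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def recalibrate_threshold(scores, labels):
    """Pick the observed score that maximizes F1; ties go to the higher threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at(t, scores, labels), t))

# Small labeled sample from the target collection (made-up numbers).
scores = [0.95, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0]

original = 0.9  # threshold tuned on the source collection: precise, low recall
adapted = recalibrate_threshold(scores, labels)
print(f"original threshold {original}: F1 = {f1_at(original, scores, labels):.2f}")
print(f"adapted threshold  {adapted}: F1 = {f1_at(adapted, scores, labels):.2f}")
```

With a sample this small the estimate is noisy, which is exactly the estimation problem the slide raises.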

42

Summary

Many interesting research issues
Need institution/individual to take the lead
Need motivated network of contributors
  Data resource contributors
  Source code contributors
Start with small & simple project that proves the idea
If it works … text mining could become an enabler on a par with Linux.

43

More Slides

44

Columns: ODP class, Reuters class, and two evaluation scores (presumably area under the ROC curve and average precision, the two measures named in the experimental setup).

RegAsiJap0 JAP 0.86 0.62

RegAsiPhi0 PHLNS 0.91 0.56

RegAsiIndSta0 INDIA 0.85 0.53

SpoSocPla0 CCAT 0.60 0.53

RegEurRus0 CCAT 0.58 0.51

RegEurRus0 RUSS 0.85 0.51

SpoSocPla0 GSPO 0.78 0.42

SpoBasReg0 GSPO 0.75 0.33

RegAsiIndSta0 MCAT 0.56 0.32

SpoBasPla1 GSPO 0.80 0.31

SpoBasCol0 GSPO 0.78 0.31

SpoBasCol1 GSPO 0.74 0.26

RegEurSlo0 SLVAK 0.86 0.25

SpoBasPla0 GSPO 0.77 0.24

RegEurRus0 MCAT 0.49 0.23

BusIndTraMar0 I76300 0.81 0.23

SpoHocIceLeaPro0 GSPO 0.71 0.20

SpoBasMinLea0 GSPO 0.71 0.20

RegMidLeb0 LEBAN 0.83 0.19

RecAvi0 I36400 0.74 0.18

RegSou0 BRAZ 0.84 0.18

RegAsiHonBus0 HKONG 0.66 0.18

SpoMotAut0 GSPO 0.67 0.18

SpoHocIceLeaNatPla0 GSPO 0.72 0.17

SocPol0 EEC 0.85 0.17

RegAsiIndSta0 M14 0.59 0.17

RegAsiChiPro0 CHINA 0.67 0.17

RecAvi0 I3640010 0.77 0.17

SpoFooAmeColNca1 GSPO 0.72 0.17

SocPol0 G15 0.86 0.16

RegEurBul0 BUL 0.72 0.15

RegAsiIndPro0 INDON 0.72 0.13

SpoSocPla0 UK 0.49 0.12

RegEurUkr0 UKRN 0.73 0.11

RegEurRus0 GPOL 0.48 0.11

RegEurPolVoi0 POL 0.67 0.11


SpoFooAmeNflPla0 GSPO 0.65 0.09

RegEurGerSta0 GFR 0.56 0.09

RegEurFra0 FRA 0.54 0.09

RegCar0 CUBA 0.76 0.09

RegEurUniBusInd0 C18 0.59 0.08

RegEurUniEngEss0 I66200 0.72 0.08

RegSou0 PERU 0.88 0.08

ComHar0 C22 0.61 0.08

RegMidTur0 TURK 0.69 0.08



RegNorUniCalLocPxx0 LATV 0.64 0.07

RegEurRus0 GVIO 0.52 0.07

SpoSocPla0 ITALY 0.58 0.07

RegEurUniSco0 GSPO 0.54 0.07

RegEurNet0 NETH 0.65 0.07

RegEurRus0 GDIP 0.46 0.07

ArtMusStyCouBan0 GENT 0.52 0.07

RegEurRus0 BYELRS 0.92 0.06

BusIndTraMar0 C24 0.54 0.06

BusIndTraMar0 I74000 0.72 0.06

RegNorMexSta0 I76300 0.58 0.06

SpoHocIceLeaNatPla0 CANA 0.54 0.06

RegSou0 MRCSL 1.00 0.06

SocRelBud0 GREL 0.57 0.05

RegEurBel0 FRA 0.49 0.05

SpoSocPla0 FRA 0.50 0.05

RegEurUniBusInd0 I6540005 0.69 0.05

RegNorCanQueLoc0 FRA 0.46 0.05

RegEurGerSta0 GSPO 0.45 0.05


RegAsiPak0 SHAJH 0.76 0.05

SpoSocPla0 GFR 0.48 0.05

RegSou0 PARA 0.90 0.04


RegSou0 BOL 0.90 0.04

RegEurRus0 UKRN 0.83 0.04

SpoSocPla0 SPAIN 0.61 0.04

NewOnlCnn0 BAH 0.56 0.04

ArtAniVoi0 I97100 0.70 0.03

RegEurRus0 NATO 0.75 0.03

RegEurRus0 GDEF 0.55 0.03

SpoSocPla0 MONAC 0.87 0.03

SciEarPal0 GSCI 0.42 0.03

RegEurRom0 ROM 0.57 0.03

RegAsiPhi0 I85000 0.66 0.03

SpoBasReg0 SPAIN 0.59 0.03

BusIndTraMar0 USSR 0.47 0.03

SpoSocPla0 NETH 0.54 0.03

SpoFooAmeNflPla0 CANA 0.48 0.03

RegEurRus0 AZERB 0.94 0.03

SciBioTaxTaxPlaMagMag0 ECU 0.54 0.03

RegNorUniCalLocPxx0 I41500 0.65 0.02

RegEurRus0 TADZK 0.95 0.02



RegSou0 URU 0.88 0.02



RefFlaReg0 GUREP 0.69 0.02

SciBioTaxTaxPlaMagMag0 I0100144 0.58 0.02

NewOnlCnn0 GWEA 0.66 0.02


ArtCelMxx0 I97100 0.66 0.02

SpoMotAut0 SMARNO 0.88 0.02


NewOnlCnn0 DOMR 0.55 0.02

ArtMusStyCouBan0 GPRO 0.45 0.02


SpoBasReg0 GREECE 0.51 0.02

RegEurRus0 GRGIA 0.84 0.02

RegEurRus0 KAZK 0.82 0.02

RegEurNet0 M142 0.45 0.02


NewOnlCnn0 BELZ 0.50 0.01



SpoBasReg0 ISRAEL 0.38 0.01



RegEurPolVoi0 FIN 0.58 0.01

RegEurRus0 USSR 0.82 0.01




BusIndTraMar0 BUL 0.37 0.01


BusIndTraMar0 ESTNIA 0.60 0.01

NewOnlCnn0 GABON 0.46 0.01

NewOnlCnn0 CVI 0.70 0.01

SciBioTaxTaxAniChoAve0 GENV 0.45 0.01

SpoMotAut0 MONAC 0.71 0.01

ArtCelBxx0 I97100 0.64 0.01

SpoBasReg0 TURK 0.46 0.01

BusIndTraMar0 PORL 0.57 0.01

SpoBasReg0 CRTIA 0.48 0.01


BusIndTraMar0 CRTIA 0.41 0.01

BusIndTraMar0 UKRN 0.43 0.01

ArtCelLxx0 I97100 0.60 0.01

RegEurRus0 MOLDV 0.78 0.01

RegSou0 SURM 0.80 0.01

BusIndTraMar0 LATV 0.60 0.01

BusIndTraMar0 ALB 0.24 0.01

BusIndTraMar0 LITH 0.58 0.01

ArtCelSxx0 I97100 0.63 0.01


SpoBasCol0 E71 0.42 0.01

SciBioTaxTaxPlaMagMag0 BELZ 0.53 0.01

ArtMusStyCouBan0 GOBIT 0.53 0.01

BusFinBanBanReg0 C173 0.68 0.01

RegEurRus0 ARMEN 0.85 0.01

RegEurRus0 I22471 0.66 0.01

RegEurRus0 TURKM 0.86 0.01

BusIndTraMar0 ROM 0.40 0.01

BusIndTraMar0 TUNIS 0.67 0.00

RegAsiChiPro0 I5020006 0.76 0.00

ArtTelNet0 I9741105 0.67 0.00

BusIndTraMar0 YEMAR 0.49 0.00

BusIndTraMar0 CYPR 0.40 0.00

RefFlaReg0 SLVNIA 0.57 0.00


RegEurRus0 KIRGH 0.83 0.00

RegCar0 GTOUR 0.55 0.00

BusIndTraMar0 UAE 0.48 0.00

NewOnlCnn0 BERM 0.52 0.00

BusIndTraMar0 NAMIB 0.48 0.00

BusIndTraMar0 JORDAN 0.36 0.00

RecAvi0 C313 0.42 0.00

BusIndTraMar0 MOZAM 0.51 0.00


BusIndTraMar0 SILEN 0.34 0.00

RegMidLeb0 I9741105 0.54 0.00

RegAsiHonBus0 I81400 0.61 0.00

RefFlaReg0 WORLD 0.43 0.00

RegNorUniCalLocVxx0 C313 0.39 0.00


RefFlaReg0 UPVOLA 0.58 0.00

SciBioTaxTaxPlaMagMag0 I0100216 0.66 0.00


SciBioTaxTaxAniChoAve0 AARCT 0.53 0.00

RegSou0 I5020051 0.84 0.00

NewOnlCnn0 TCAI 0.00 0.00

45

Resources

http://www-csli.stanford.edu/~schuetze (this talk, some additional material)

Source of Gates quote: http://www.techweb.com/wire/story/TWB19990324S0014

Kurt D. Bollacker and Joydeep Ghosh. A scalable method for classifier knowledge reuse. In Proceedings of the 1997 International Conference on Neural Networks, pages 1474-79, June 1997. (proposes measure for selecting classifiers for reuse)

W. Cohen and D. Kudenko. Transferring and Retraining Learned Information Filters. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI 97). (transfer within the same dataset)

Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for scalable knowledge reuse. In The 1998 International Conference on Machine Learning, pp. 64-72, July 1998. (transfer within the same dataset)

Motivation of open source contributors: http://newsforge.com/newsforge/03/04/19/2128256.shtml?tid=11, http://cybernaut.com/modules.php?op=modload&name=News&file=article&sid=8&mode=thread&order=0&thold=0
