
Page 1: Text Classification with Limited Labeled Data

Text Classification with Limited Labeled Data

Andrew McCallum
[email protected]

Just Research (formerly JPRC)

Center for Automated Learning and Discovery, Carnegie Mellon University

Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore, and Jason Rennie

Page 2: Text Classification with Limited Labeled Data

Page 3: Text Classification with Limited Labeled Data

Page 4: Text Classification with Limited Labeled Data

Page 5: Text Classification with Limited Labeled Data

The Task: Document Classification (also "Document Categorization", "Routing", or "Tagging")

Automatically placing documents in their correct categories.

Categories: Magnetism, Relativity, Evolution, Botany, Irrigation, Crops, ...

Training data, e.g.:
Crops: corn, wheat, silo, farm, grow, ...
Botany: corn, tulips, splicing, grow, ...
Irrigation: water, grating, ditch, farm, tractor, ...
Evolution: selection, mutation, Darwin, Galapagos, DNA, ...

Testing data: "grow corn tractor..." → (Crops)

Page 6: Text Classification with Limited Labeled Data


A Probabilistic Approach to Document Classification

Pick the most probable class, given the evidence:

c = \arg\max_{c_j} \Pr(c_j \mid d)

where c_j is a class (like "Crops") and d is a document (like "grow corn tractor...").

Bayes Rule:

\Pr(c_j \mid d) = \frac{\Pr(c_j) \Pr(d \mid c_j)}{\Pr(d)}

"Naïve Bayes" makes two assumptions: (1) one mixture component per class, and (2) words occur independently given the class. Then

\Pr(c_j \mid d) = \frac{\Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\sum_k \Pr(c_k) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_k)}

where w_{d_i} is the i-th word in d (like "corn").

Page 7: Text Classification with Limited Labeled Data


Parameter Estimation in Naïve Bayes

Naïve Bayes classifies with

c = \arg\max_{c_j} \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)

The word probabilities are maximum a posteriori estimates of \Pr(w \mid c) with a Dirichlet prior (AKA "Laplace smoothing"):

\hat{\Pr}(w_i \mid c_j) = \frac{1 + \sum_{d_k \in c_j} N(w_i, d_k)}{|V| + \sum_{t=1}^{|V|} \sum_{d_k \in c_j} N(w_t, d_k)}

where N(w, d) is the number of times word w occurs in document d, and V is the vocabulary.

Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. \Pr(w \mid c).
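A minimal Python sketch of the classifier these two slides describe: multinomial naïve Bayes trained with the Laplace-smoothed estimates above. The function names and toy documents are illustrative, not from the talk.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    prior = Counter(labels)                    # Pr(c) ~ class frequency
    counts = {c: Counter() for c in classes}   # N(w, c), summed over docs
    vocab = set()
    for tokens, c in zip(docs, labels):
        counts[c].update(tokens)
        vocab.update(tokens)
    total = {c: sum(counts[c].values()) for c in classes}
    V = len(vocab)

    def log_pr_word(w, c):
        # (1 + N(w, c)) / (|V| + N(c)) -- the Laplace-smoothed estimate
        return math.log((1 + counts[c][w]) / (V + total[c]))

    def classify(tokens):
        # argmax_c  log Pr(c) + sum_i log Pr(w_i | c)
        def score(c):
            return math.log(prior[c] / len(labels)) + sum(
                log_pr_word(w, c) for w in tokens)
        return max(classes, key=score)

    return classify

docs = [["grow", "corn", "farm"], ["Darwin", "mutation", "DNA"]]
labels = ["Crops", "Evolution"]
classify = train_nb(docs, labels)
print(classify(["grow", "corn", "tractor"]))   # -> Crops
```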

Page 8: Text Classification with Limited Labeled Data


The Rest of the Talk

Two methods for improving parameter estimation when labeled data is sparse:

(1) Borrow data from related classes in a hierarchy.

(2) Use unlabeled data.

Page 9: Text Classification with Limited Labeled Data

Improving Document Classification by Shrinkage in a Hierarchy

Andrew McCallum

Roni Rosenfeld

Tom Mitchell

Andrew Ng (Berkeley)

Larry Wasserman (CMU Statistics)

Page 10: Text Classification with Limited Labeled Data


The Idea: “Shrinkage” / “Deleted Interpolation”

We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.

Categories, now arranged in a hierarchy:

Science
  Physics: Magnetism, Relativity
  Biology: Evolution, Botany
  Agriculture: Irrigation, Crops

Training data, e.g.:
Crops: corn, wheat, silo, farm, grow, ...
Botany: corn, tulips, splicing, grow, ...
Irrigation: water, grating, ditch, farm, tractor, ...
Evolution: selection, mutation, Darwin, Galapagos, DNA, ...

Testing data: "corn grow tractor..." → (Crops)

Page 11: Text Classification with Limited Labeled Data


“Shrinkage” / “Deleted Interpolation”

The smoothed estimate for a leaf is a λ-weighted average of the estimates along the path from the leaf to the root:

\hat{\Pr}_{\text{SHRINKAGE}}(\text{"tractor"} \mid \text{Crops}) = \sum_{j=0}^{\#\text{ancestors of Crops}} \lambda_j \, \hat{\Pr}(\text{"tractor"} \mid \text{Crops}_j)

[James and Stein, 1961] / [Jelinek and Mercer, 1980]

Here Crops_0 is the Crops leaf itself, so in the hierarchy Science → {Physics, Biology, Agriculture}, Agriculture → {Irrigation, Crops}, the combined estimates are \hat{\Pr}(\text{"tractor"} \mid \text{Crops}), \hat{\Pr}(\text{"tractor"} \mid \text{Agriculture}), and \hat{\Pr}(\text{"tractor"} \mid \text{Science}), plus a final uniform estimate

\Pr_{\text{UNIFORM}}(\text{"tractor"}) = \frac{1}{|V|}
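A minimal Python sketch of this interpolation, assuming each node already carries its own word estimate pr(w, node) and that a leaf's λ's (one per path node plus one for the uniform distribution) sum to one. All names here are illustrative, not from the paper.

```python
def shrunk_pr(w, path, lambdas, pr, vocab_size):
    """Shrinkage estimate of Pr(w | leaf).

    path: [leaf, parent, ..., root]; len(lambdas) == len(path) + 1,
    with the last lambda weighting the uniform distribution.
    """
    # Weighted average of the (unreliable but specific) leaf estimate
    # and the (reliable but general) ancestor estimates.
    estimate = sum(lam * pr(w, node) for lam, node in zip(lambdas, path))
    estimate += lambdas[-1] / vocab_size   # uniform component: 1/|V|
    return estimate

# e.g. shrunk_pr("tractor", ["Crops", "Agriculture", "Science"],
#                [0.4, 0.3, 0.2, 0.1], pr, vocab_size=50000)
```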

Page 12: Text Classification with Limited Labeled Data


Learning Mixture Weights

For the Crops leaf, the mixture runs up the path Crops → Agriculture → Science, plus a uniform component, with one weight per node:

λ_Crops^child, λ_Crops^parent, λ_Crops^grandparent, λ_Crops^uniform

(Training words at the Crops leaf: corn, wheat, silo, farm, grow, ...)

Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.

E-step: Use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents.

M-step: Use the estimates to recalculate new values for the λ's.

Page 13: Text Classification with Limited Labeled Data


Learning Mixture Weights

E-step: for each word w_t in the held-out set H, compute the degree β_m to which the m-th node in the path was likely to have generated it:

\beta_m(w_t) = \frac{\lambda_m \, \hat{\Pr}_m(w_t \mid c)}{\sum_j \lambda_j \, \hat{\Pr}_j(w_t \mid c)}

M-step: normalize these into new mixture weights:

\lambda_m = \frac{\sum_{w_t \in H} \beta_m(w_t)}{\sum_j \sum_{w_t \in H} \beta_j(w_t)}

where \hat{\Pr}_m(\cdot \mid c) is the word estimate at the m-th ancestor of class c, with the uniform distribution as the final component.
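A minimal Python sketch of these E/M updates for one leaf's mixture weights, assuming held_out is that leaf's held-out word list, path is [leaf, parent, ..., root], and pr(w, node) returns each node's word estimate. The leave-one-out refinement on the previous slide is omitted for brevity; all names are illustrative.

```python
def em_lambdas(held_out, path, pr, vocab_size, iters=20):
    k = len(path) + 1                  # path nodes plus the uniform node
    lam = [1.0 / k] * k                # start from equal weights
    for _ in range(iters):
        beta_sum = [0.0] * k
        for w in held_out:
            # E-step: degree to which each node generated word w
            comps = [lam[m] * pr(w, path[m]) for m in range(len(path))]
            comps.append(lam[-1] / vocab_size)    # uniform component
            z = sum(comps)
            for m in range(k):
                beta_sum[m] += comps[m] / z
        # M-step: normalized expected counts become the new lambdas
        lam = [b / len(held_out) for b in beta_sum]
    return lam
```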

Page 14: Text Classification with Limited Labeled Data


Newsgroups Data Set

computers: mac, ibm, graphics, windows, X
religion: atheism, christian, misc
sport: baseball, hockey
politics: guns, mideast, misc
motor: auto, motorcycle

15 classes, 15k documents, 1.7 million words, 52k vocabulary

(Subset of Ken Lang's 20 Newsgroups set)

Page 15: Text Classification with Limited Labeled Data


Newsgroups Hierarchy: Mixture Weights

# training
documents  Class                            child   parent  g'parent  uniform
235        /politics/talk.politics.guns     0.368   0.092   0.017     0.522
235        /politics/talk.politics.mideast  0.256   0.132   0.001     0.611
235        /politics/talk.politics.misc     0.197   0.213   0.026     0.564
7497       /politics/talk.politics.guns     0.801   0.089   0.048     0.061
7497       /politics/talk.politics.mideast  0.859   0.061   0.010     0.071
7497       /politics/talk.politics.misc     0.762   0.126   0.043     0.068

Page 16: Text Classification with Limited Labeled Data


Industry Sector Data Set

transportation: air, water, railroad, trucking, misc
utilities: electric, water, gas
consumer: appliance, furniture
energy: coal, oil&gas, integrated
services: film, communication
... (11)

71 classes, 6.5k documents, 1.2 million words, 30k vocabulary

www.marketguide.com

Page 17: Text Classification with Limited Labeled Data


Industry Sector Classification Accuracy

[gnuplot figure not preserved in the transcript]

Page 18: Text Classification with Limited Labeled Data


Newsgroups Classification Accuracy

[gnuplot figure not preserved in the transcript]

Page 19: Text Classification with Limited Labeled Data


Yahoo Science Data Set

agriculture: dairy, crops, agronomy, forestry
biology: botany, evolution, cell
physics: magnetism, relativity
CS: AI, HCI
space: craft, missions, courses
... (30)

264 classes, 14k documents, 3 million words, 76k vocabulary

www.yahoo.com/Science

Page 20: Text Classification with Limited Labeled Data


Yahoo Science Classification Accuracy

[gnuplot figure not preserved in the transcript]

Page 21: Text Classification with Limited Labeled Data


Related Work

• Shrinkage in statistics: [Stein 1955], [James & Stein 1961]

• Deleted interpolation in language modeling: [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]

• Bayesian hierarchical modeling for n-grams: [MacKay & Peto 1994]

• Class hierarchies for text classification: [Koller & Sahami 1997]

• Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning: [Hofmann & Puzicha 1998]

Page 22: Text Classification with Limited Labeled Data


Future Work

• Learning hierarchies that aid classification.

• Using more complex generative models.
– Capturing word dependencies
– Clustering words in each ancestor

Page 23: Text Classification with Limited Labeled Data


Shrinkage Conclusions

• Shrinkage in a hierarchy of classes can dramatically improve classification accuracy.

• Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.

• [The hierarchy can be pruned for an exponential reduction in the computation needed for classification, with only minimal loss in accuracy.]

Page 24: Text Classification with Limited Labeled Data


The Rest of the Talk

Two methods for improving parameter estimation when labeled data is sparse:

(1) Borrow data from related classes in a hierarchy.

(2) Use unlabeled data.

Page 25: Text Classification with Limited Labeled Data

Text Classification with Labeled and Unlabeled Documents

Kamal Nigam

Andrew McCallum

Sebastian Thrun

Tom Mitchell

Page 26: Text Classification with Limited Labeled Data


The Scenario

Training data with class labels: web pages the user says are interesting, and web pages the user says are uninteresting.

Data available at training time, but without class labels: web pages the user hasn't seen or said anything about.

Can we use the unlabeled documents to increase accuracy?

Page 27: Text Classification with Limited Labeled Data


Using the Unlabeled Data

1. Build a classification model using the limited labeled data.

2. Use the model to estimate the labels of the unlabeled documents.

3. Use all the documents to build a new classification model, which is often more accurate because it is trained using more data.

Page 28: Text Classification with Limited Labeled Data


An Example

Labeled Data

Baseball: "The new hitter struck out...", "Pete Rose is not as good an athlete as Tara Lipinski...", "Struck out in last inning...", "Homerun in the first inning..."

Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."

Unlabeled Data

"Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal."

"Tara Lipinski bought a new house for her parents."

Before EM: Pr(Lipinski | Baseball) = 0.01, Pr(Lipinski | Ice Skating) = 0.001

After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski | Baseball) = 0.003

Page 29: Text Classification with Limited Labeled Data


Filling in Missing Labels with EM

Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data. [Dempster et al. '77], [Ghahramani & Jordan '95], [McLachlan & Krishnan '97]

• E-step: Use current estimates of the model parameters to "guess" the values of the missing labels.

• M-step: Use the current "guesses" for the missing labels to calculate new estimates of the model parameters.

• Repeat E- and M-steps until convergence.

EM finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.

Page 30: Text Classification with Limited Labeled Data


EM for Text Classification

Expectation-step (estimate the class labels):

\Pr(c_j \mid d) \propto \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)

Maximization-step (new parameters using the estimates):

\hat{\Pr}(w_i \mid c_j) = \frac{1 + \sum_k N(w_i, d_k) \Pr(c_j \mid d_k)}{|V| + \sum_{t=1}^{|V|} \sum_k N(w_t, d_k) \Pr(c_j \mid d_k)}
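A minimal Python sketch of this loop: naïve Bayes retrained by EM over labeled and unlabeled documents, with a weight parameter for down-weighting the unlabeled data (a caution that comes up later in the talk). All names are illustrative, not from the paper.

```python
import math
from collections import Counter

def em_nb(labeled, unlabeled, classes, iters=10, weight=1.0):
    """labeled: list of (token-list, class); unlabeled: list of token-lists."""
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    V = len(vocab)
    # Zero responsibilities: the first M-step trains on labeled data alone.
    u_resp = [dict.fromkeys(classes, 0.0) for _ in unlabeled]
    for _ in range(iters):
        # M-step: new Pr(c) and Pr(w|c) from labels and current guesses.
        prior = dict.fromkeys(classes, 1.0)       # Laplace prior counts
        wcount = {c: Counter() for c in classes}
        for d, y in labeled:
            prior[y] += 1.0
            wcount[y].update(d)
        for d, r in zip(unlabeled, u_resp):
            for c in classes:
                prior[c] += weight * r[c]
                for w in d:
                    wcount[c][w] += weight * r[c]
        z = sum(prior.values())
        total = {c: sum(wcount[c].values()) for c in classes}

        def log_pr(d, c):
            # log Pr(c) + sum_i log Pr(w_i | c), Laplace-smoothed
            return math.log(prior[c] / z) + sum(
                math.log((1 + wcount[c][w]) / (V + total[c])) for w in d)

        # E-step: probabilistically relabel the unlabeled documents.
        for i, d in enumerate(unlabeled):
            logs = {c: log_pr(d, c) for c in classes}
            m = max(logs.values())
            exps = {c: math.exp(v - m) for c, v in logs.items()}
            s = sum(exps.values())
            u_resp[i] = {c: e / s for c, e in exps.items()}
    return u_resp
```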

Page 31: Text Classification with Limited Labeled Data


WebKB Data Set

Classes: student, faculty, course, project

4 classes, 4199 documents from CS academic departments

Page 32: Text Classification with Limited Labeled Data


Word Vector Evolution with EM

Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog

Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec

Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript

(D is a digit)

Page 33: Text Classification with Limited Labeled Data


EM as Clustering

[Diagram: a few labeled documents (X) among many unlabeled documents; figure not preserved in the transcript]

Page 34: Text Classification with Limited Labeled Data


EM as Clustering, Gone Wrong


Page 35: Text Classification with Limited Labeled Data


20 Newsgroups Data Set

20 class labels, 20,000 documents, 62k unique words

alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc, ...

Page 36: Text Classification with Limited Labeled Data


Newsgroups Classification Accuracy, varying # labeled documents

[gnuplot figure not preserved in the transcript]

Page 37: Text Classification with Limited Labeled Data


Newsgroups Classification Accuracy, varying # unlabeled documents

[gnuplot figure not preserved in the transcript]

Page 38: Text Classification with Limited Labeled Data


WebKB Classification Accuracy, varying # labeled documents

[gnuplot figure not preserved in the transcript]

Page 39: Text Classification with Limited Labeled Data


WebKB Classification Accuracy, varying weight of unlabeled data

[gnuplot figure not preserved in the transcript]

Page 40: Text Classification with Limited Labeled Data


WebKB Classification Accuracy, varying # labeled documents and selecting unlabeled weight by CV

[gnuplot figure not preserved in the transcript]

Page 41: Text Classification with Limited Labeled Data


Reuters 21578 Data Set

135 class labels, 12902 documents

Labels include: acq, earn, interest, ship, crude, grain (corn, wheat), ...

Page 42: Text Classification with Limited Labeled Data


EM as Clustering, Salvageable


Page 43: Text Classification with Limited Labeled Data


Reuters 21578 Precision-Recall Breakeven

Category    NB 1   EM 1   EM 20  EM 40   Diff
acq         75.9   39.5   88.4   88.9   +13.0
corn        40.5   21.1   39.8   39.1    -0.7
crude       60.7   27.8   63.9   66.6    +5.9
earn        92.6   90.2   95.3   95.2    +2.7
grain       51.7   21.0   54.6   55.8    +4.1
interest    52.0   25.9   48.6   50.3    -1.7
money-fx    57.7   28.8   54.7   59.7    +2.0
ship        58.1    9.3   46.5   55.0    -3.1
trade       56.8   34.7   54.3   57.0    +0.2
wheat       48.9   13.0   42.1   44.2    -4.7

(1, 20, 40 = # mixture components for the negative class)

Page 44: Text Classification with Limited Labeled Data


Related Work

• Using EM to reduce the need for training examples: [Miller & Uyar 1997], [Shahshahani & Landgrebe 1994]

• Using EM to fill in missing values: [Ghahramani & Jordan 1995]

• AutoClass, unsupervised EM with naïve Bayes: [Cheeseman et al. 1988]

• Co-Training: [Blum & Mitchell COLT '98]

• Relevance feedback for information retrieval: [Salton & Buckley 1990]

Page 45: Text Classification with Limited Labeled Data


Unlabeled Data Conclusions & Future Work

• Combining labeled and unlabeled data with EM can greatly reduce the need for labeled training data.

• Exercise caution: EM can sometimes hurt.
– Weight the unlabeled data.
– Choose the parametric model carefully.

• Vary the EM likelihood surface for different tasks.

• Use similar techniques for other text tasks, e.g. information extraction.

Page 46: Text Classification with Limited Labeled Data


Cora Demo

Page 47: Text Classification with Limited Labeled Data

Page 48: Text Classification with Limited Labeled Data

Page 49: Text Classification with Limited Labeled Data

Page 50: Text Classification with Limited Labeled Data

Populating a hierarchy

• Naïve Bayes
+ Simple, robust document classification.
+ Many principled enhancements (e.g. shrinkage).
– Requires a lot of labeled training data.

• Keyword matching
+ Requires no labeled training data.
– Human effort to select keywords (trading accuracy against coverage).
– Brittle; breaks easily.

Page 51: Text Classification with Limited Labeled Data


Combine Naïve Bayes and Keywords for Best of Both

• Classify unlabeled documents with keyword matching.

• Pretend these category labels are correct, and use this data to train naïve Bayes.

• Naïve Bayes acts to temper and “round out” the keyword class definitions.

• Brings in new probabilistically-weighted keywords that are correlated with the few original keywords.
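A minimal Python sketch of this bootstrap, reusing the train_nb sketch from the parameter-estimation slide. The keyword lists and class names here are invented for illustration, not from the talk.

```python
KEYWORDS = {
    "Machine Learning": ["learning", "classifier"],
    "Garbage Collection": ["garbage", "collection"],
}

def keyword_label(tokens):
    """Return the first class whose keyword list hits the document."""
    for cls, words in KEYWORDS.items():
        if any(w in tokens for w in words):
            return cls
    return None   # abstain: keyword matching is high-precision, low-coverage

def bootstrap(unlabeled_docs):
    # Step 1: label unlabeled documents by keyword matching.
    matched = [(d, keyword_label(d)) for d in unlabeled_docs]
    matched = [(d, c) for d, c in matched if c is not None]
    docs, labels = zip(*matched)
    # Step 2: pretend those labels are correct and train naive Bayes on
    # them; the learned Pr(w|c) picks up probabilistically weighted words
    # correlated with (but beyond) the original keywords.
    return train_nb(list(docs), list(labels))
```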

Page 52: Text Classification with Limited Labeled Data


Cora Topic Hierarchy: Classification Accuracy

[Bar chart, accuracy on a 0 to 70% scale, by algorithm: Keyword Matching, Naïve Bayes, Naïve Bayes with Shrinkage]

Page 53: Text Classification with Limited Labeled Data


Top words found by naïve Bayes and Shrinkage

ROOT: computer, university, science, system, paper
  HCI: computer, system, multimedia, university, paper
    GUI: interface, design, user, sketch, interfaces
    Cooperative: collaborative, CSCW, work, provide, group
    Multimedia: multimedia, real, time, data, media
  IR: information, text, documents, classification, retrieval
  Hardware: circuits, designs, computer, university, performance
  AI: learning, university, computer, based, intelligence
    Planning: planning, temporal, reasoning, plan, problems
    Machine Learning: learning, algorithm, university, networks
    NLP: language, natural, processing, information, text
  Programming: programming, language, logic, university, programs
    Semantics: semantics, denotational, language, construction, types
    Garbage Collection: garbage, collection, memory, optimization, region

Page 54: Text Classification with Limited Labeled Data


Less Labeled Data, but with Unlabeled Data

[Bar chart, accuracy on a 0 to 70% scale, by algorithm: Naïve Bayes, Naïve Bayes with Shrinkage, Naïve Bayes with Shrinkage and Unlabeled Data]

Page 55: Text Classification with Limited Labeled Data


Next Cora Projects

• Improving existing components with further machine learning research.

• Building a topic hierarchy automatically by clustering.

• Reference matching by machine learning.

• Active learning for improving performance interactively.

• Seminal-paper detection (à la Kleinberg).

• Topic detection and tracking (TDT): "What's new in research this month?"

Page 56: Text Classification with Limited Labeled Data


Bibliography

For more details, see http://www.cs.cmu.edu/~mccallum

McCallum, Rosenfeld, Mitchell, and Ng. "Improving Text Classification by Shrinkage in a Hierarchy of Classes." ICML-98.

Nigam, McCallum, Thrun, and Mitchell. "Learning to Classify Text from Labeled and Unlabeled Documents." AAAI-98.

McCallum, Nigam, Rennie, and Seymore. "Building Domain-Specific Search Engines with Machine Learning." AAAI Spring Symposium 1999 (submitted).