Improving Text Classification by Shrinkage in a
Hierarchy of Classes
Andrew McCallum
Just Research & CMU
Tom Mitchell
CMU
Roni Rosenfeld
CMU
Andrew Y. Ng
MIT AI Lab
2
The Task: Document Classification (also “Document Categorization”, “Routing” or “Tagging”)
Automatically placing documents in their correct categories.
Categories: Magnetism, Relativity, Evolution, Botany, Irrigation, Crops
Training Data:
  Magnetism: ...
  Relativity: ...
  Evolution: selection mutation Darwin Galapagos DNA ...
  Botany: corn tulips splicing grow ...
  Irrigation: water grating ditch farm tractor ...
  Crops: corn wheat silo farm grow ...
Testing Data: “grow corn tractor…” → (Crops)
3
The Idea: “Shrinkage” / “Deleted Interpolation”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors.
Categories:
  Science
    Physics: Magnetism, Relativity
    Biology: Evolution, Botany
    Agriculture: Irrigation, Crops
Training Data:
  Evolution: selection mutation Darwin Galapagos DNA ...
  Botany: corn tulips splicing grow ...
  Irrigation: water grating ditch farm tractor ...
  Crops: corn wheat silo farm grow ...
Testing Data: “corn grow tractor…” → (Crops)
4
A Probabilistic Approach to Document Classification
Naïve Bayes
$$\arg\max_{c_j} \; \Pr(c_j) \prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)$$
where $c_j$ is a class, $d$ is a document, and $w_{d_i}$ is the $i$th word of document $d$.
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior, α = 1 (AKA Laplace smoothing)
$$\hat{\Pr}(w_i \mid c_j) = \frac{1 + \sum_{d_k \in c_j} N(w_i, d_k)}{|V| + \sum_{t=1}^{|V|} \sum_{d_k \in c_j} N(w_t, d_k)}$$
where $N(w, d)$ is the number of times word $w$ occurs in document $d$.
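The naïve Bayes classifier with Laplace smoothing on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`train_naive_bayes`, `classify`) and the toy training set (built from the sample words on the earlier task slide) are assumptions made for the example, and words outside the training vocabulary are simply skipped at classification time.

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_class):
    """Estimate class priors and Laplace-smoothed word probabilities.

    docs_by_class maps a class name to a list of documents,
    where each document is a list of words.
    """
    vocab = sorted({w for docs in docs_by_class.values() for d in docs for w in d})
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        # (1 + count of w in class c) / (|V| + total word count in class c)
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs

def classify(doc, priors, word_probs):
    """Return the class maximizing log Pr(c) + sum_i log Pr(w_i | c)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in word_probs[c]:  # skip out-of-vocabulary words (an assumption)
                score += math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training set (hypothetical: one tiny document per class).
TRAIN = {
    "Crops": [["corn", "wheat", "silo", "farm", "grow"]],
    "Botany": [["corn", "tulips", "splicing", "grow"]],
    "Irrigation": [["water", "grating", "ditch", "farm", "tractor"]],
    "Evolution": [["selection", "mutation", "Darwin", "Galapagos", "DNA"]],
}

priors, word_probs = train_naive_bayes(TRAIN)
print(classify(["wheat", "silo"], priors, word_probs))
```

Scores are summed in log space to avoid numerical underflow on long documents; the arg max is unchanged because the logarithm is monotonic.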
5
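“Shrinkage” / “Deleted Interpolation”
[James and Stein, 1961] / [Jelinek and Mercer, 1980]
$$\Pr_{\text{SHRINKAGE}}(\text{"tractor"} \mid \text{Crops}) = \sum_{\text{ancestor}=0}^{\#\,\text{ancestors of Crops}} \lambda_{j}^{\text{ancestor}} \, \hat{\Pr}^{\text{ancestor}}(\text{"tractor"} \mid \text{Crops})$$
Estimates combined along the path from the leaf to the root:
  Crops: $\hat{\Pr}(\text{"tractor"} \mid \text{Crops})$
  Agriculture: $\hat{\Pr}(\text{"tractor"} \mid \text{Agriculture})$
  Science: $\hat{\Pr}(\text{"tractor"} \mid \text{Science})$
  (Uniform): $\Pr_{\text{UNIFORM}}(\text{"tractor"}) = \frac{1}{|V|}$
Hierarchy:
  Science
    Physics: Magnetism, Relativity
    Biology: Evolution, Botany
    Agriculture: Irrigation, Crops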
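As a rough illustration of the shrinkage estimate for “tractor” in Crops, the sketch below mixes the estimates at Crops, Agriculture, Science, and the uniform distribution. Several things here are assumptions made for the example rather than details from the slides: the helper names, the use of plain relative frequencies at each node, the toy word lists, and the mixture weights (0.4, 0.3, 0.2, 0.1), which are made-up values rather than learned ones.

```python
from collections import Counter

def relative_frequencies(words):
    """Word -> relative frequency within the given list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def shrinkage_estimate(word, path_estimates, lambdas, vocab_size):
    """Weighted average of the estimates for `word` along leaf -> root -> uniform.

    path_estimates: one dict per node, ordered leaf, parent, ..., root.
    lambdas: one weight per node plus a final weight for the uniform
             distribution; the weights are expected to sum to 1.
    """
    if len(lambdas) != len(path_estimates) + 1:
        raise ValueError("need one weight per node plus one for uniform")
    probs = [est.get(word, 0.0) for est in path_estimates]
    probs.append(1.0 / vocab_size)  # uniform estimate: 1 / |V|
    return sum(lam * p for lam, p in zip(lambdas, probs))

# Toy word lists for the four leaves that have sample words on the slides.
crops = "corn wheat silo farm grow".split()
irrigation = "water grating ditch farm tractor".split()
botany = "corn tulips splicing grow".split()
evolution = "selection mutation Darwin Galapagos DNA".split()

# Each ancestor pools the words of all leaves beneath it.
agriculture = crops + irrigation
science = agriculture + botany + evolution

path = [relative_frequencies(n) for n in (crops, agriculture, science)]
vocab_size = len(set(science))

lambdas = [0.4, 0.3, 0.2, 0.1]  # hypothetical weights, not learned
p_tractor = shrinkage_estimate("tractor", path, lambdas, vocab_size)
print(p_tractor)
```

In this toy data “tractor” never occurs in Crops, so the leaf estimate alone is zero; the ancestors and the uniform term are what give the word a nonzero probability under Crops.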
6
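Learning Mixture Weights
Learn the λ’s via EM, performing the E-step with leave-one-out cross-validation.
Mixture weights for Crops, one per node on the path to the root:
  Crops: $\lambda_{\text{Crops}}^{\text{child}}$
  Agriculture: $\lambda_{\text{Crops}}^{\text{parent}}$
  Science: $\lambda_{\text{Crops}}^{\text{grandparent}}$
  Uniform: $\lambda_{\text{Crops}}^{\text{uniform}}$
Held-out words: corn wheat silo farm grow ...
E-step: Use the current λ’s to estimate the degree to which each node was likely to have generated the words in held-out documents.
M-step: Use the estimates to recalculate new values for the λ’s.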
7
Learning Mixture Weights
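E-step
$$\beta_j^{a} = \sum_{w_t \in H_j} \frac{\lambda_j^{a} \, \hat{\Pr}^{a}(w_t \mid c_j)}{\sum_{m} \lambda_j^{m} \, \hat{\Pr}^{m}(w_t \mid c_j)}$$
M-step
$$\lambda_j^{a} = \frac{\beta_j^{a}}{\sum_{m} \beta_j^{m}}$$
where $H_j$ is the set of held-out words for class $c_j$, and $a$ and $m$ range over the nodes on the path from the leaf to the root, plus the uniform distribution.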
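The E-step and M-step can be sketched as a short loop. This is a simplified illustration under stated assumptions: the function name `learn_mixture_weights` is hypothetical, the held-out words are passed in as a separate list rather than produced by leave-one-out cross-validation, and the two-node example (a leaf that only knows the word "a", plus a uniform distribution over "a" and "b") is made up to keep the arithmetic easy to follow.

```python
def learn_mixture_weights(held_out_words, node_estimates, iterations=100):
    """Fit mixture weights for one class with EM.

    held_out_words: list of words held out from training for this class.
    node_estimates: one dict (word -> probability) per mixture component,
                    for example leaf, parent, ..., root, uniform.
    Returns one weight per component; the weights sum to 1.
    """
    k = len(node_estimates)
    lambdas = [1.0 / k] * k  # start from equal weights
    for _ in range(iterations):
        # E-step: credit each held-out word to the components in
        # proportion to lambda * estimated probability of the word.
        betas = [0.0] * k
        for w in held_out_words:
            weighted = [lambdas[a] * node_estimates[a].get(w, 0.0) for a in range(k)]
            total = sum(weighted)
            if total == 0.0:
                continue  # no component can explain this word
            for a in range(k):
                betas[a] += weighted[a] / total
        # M-step: normalize the accumulated credit into new weights.
        beta_sum = sum(betas)
        if beta_sum == 0.0:
            break
        lambdas = [b / beta_sum for b in betas]
    return lambdas

# Hypothetical two-component example: a leaf and a uniform distribution.
leaf = {"a": 1.0}
uniform = {"a": 0.5, "b": 0.5}
held_out = ["a"] * 7 + ["b"]

weights = learn_mixture_weights(held_out, [leaf, uniform], iterations=200)
print(weights)
```

In this example the leaf cannot explain the held-out word "b", so some weight has to stay on the uniform component. The same effect is what pushes weight toward ancestors and the uniform distribution when a leaf has little training data.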
8
Newsgroups Data Set
Hierarchy:
  computers: mac, ibm, graphics, windows, X
  religion: atheism, christian, misc
  sport: baseball, hockey
  politics: guns, mideast, misc
  motor: auto, motorcycle
15 classes, 15k documents, 1.7 million words, 52k vocabulary
(Subset of Ken Lang’s 20 Newsgroups set)
9
Newsgroups Hierarchy Mixture Weights
# training documents   Class                               child   parent   g’parent   uniform
235                    /politics/talk.politics.guns        0.368   0.092    0.017      0.522
235                    /politics/talk.politics.mideast     0.256   0.132    0.001      0.611
235                    /politics/talk.politics.misc        0.197   0.213    0.026      0.564
7497                   /politics/talk.politics.guns        0.801   0.089    0.048      0.061
7497                   /politics/talk.politics.mideast     0.859   0.061    0.010      0.071
7497                   /politics/talk.politics.misc        0.762   0.126    0.043      0.068
10
Newsgroups Hierarchy Mixture Weights
[Figure: bar chart of mixture weights (leaf, parent, root, uniform; scale 0 to 1) for the guns, mideast, and misc classes, in two panels: 235 training documents (15/class) and 7497 training documents (~500/class).]
11
Industry Sector Data Set
Hierarchy:
  transportation: water, air, railroad, trucking, misc, ...
  utilities: electric, water, gas
  consumer: appliance, furniture, ...
  energy: coal, oil&gas, integrated
  services: film, communication, ...
  … (11)
71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
www.marketguide.com
12
Industry Sector Classification Accuracy
13
Newsgroups Classification Accuracy
14
Yahoo Science Data Set
Top-level categories: agriculture, biology, physics, CS, space, … (30)
Sample classes:
  agriculture: dairy, crops, agronomy, forestry, ...
  biology: botany, evolution, cell, ...
  physics: magnetism, relativity, ...
  CS: AI, HCI, ...
  space: craft, missions, ...
  courses, ...
264 classes, 14k documents, 3 million words, 76k vocabulary
www.yahoo.com/Science
15
Yahoo Science Classification Accuracy
17
Related Work
• Shrinkage in Statistics:
  – [Stein 1955], [James & Stein 1961]
• Deleted Interpolation in Language Modeling:
  – [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]
• Bayesian Hierarchical Modeling for n-grams:
  – [MacKay & Peto 1994]
• Class hierarchies for text classification:
  – [Koller & Sahami 1997]
• Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning:
  – [Hofmann & Puzicha 1998]
18
Conclusions
• Shrinkage in a hierarchy of classes can dramatically improve classification accuracy (29%)
• Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.
• [The hierarchy can be pruned for exponential reduction in computation necessary for classification; only minimal loss of accuracy.]
19
Future Work
• Learning hierarchies that aid classification.
• Using more complex generative models.
  – Capturing word dependencies
  – Clustering words in each ancestor