![Page 1: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/1.jpg)
Concept and Theme Discovery through Probabilistic Models and Clustering
Qiaozhu Mei
Oct. 12, 2005
![Page 2: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/2.jpg)
Concepts and Themes
Language units in biology literature mining:
- Terms
- Phrases
- Entities
- Concepts (tight groups of terms/entities representing semantics, e.g. gene synonyms)
- Themes (loose groups of terms representing topics/subtopics)
![Page 3: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/3.jpg)
Theme Discovery
What we’ve got now:
- A generative model to extract k themes from a collection
- Each theme is a language model, represented by the top-probability words in that theme language model
- KL divergence to model the distance/similarity between themes; retrieve the themes most similar to a term group

$$P(w : d) = b \cdot P(w \mid B) + (1 - b) \cdot \sum_{i=1}^{k} \pi_{d,i} \, P(w \mid \theta_i)$$
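The mixture and the KL-based theme lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: language models are plain dictionaries from word to probability, `b` is read as the weight of the background model `B`, and the function names, the `epsilon` floor for unseen words, and the toy numbers are all hypothetical rather than taken from the slides.

```python
import math


def mixture_prob(word, background, themes, doc_weights, b):
    """P(w:d) = b * P(w|B) + (1 - b) * sum_i pi_{d,i} * P(w|theta_i)."""
    theme_part = sum(pi * theme.get(word, 0.0)
                     for pi, theme in zip(doc_weights, themes))
    return b * background.get(word, 0.0) + (1.0 - b) * theme_part


def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q); words missing from q are floored at epsilon (an assumption)."""
    total = 0.0
    for w, pw in p.items():
        if pw > 0.0:
            total += pw * math.log(pw / max(q.get(w, 0.0), epsilon))
    return total


def most_similar_themes(query_model, themes, top_n=1):
    """Indices of the themes with the smallest KL divergence from the query."""
    ranked = sorted(range(len(themes)),
                    key=lambda i: kl_divergence(query_model, themes[i]))
    return ranked[:top_n]


# Toy models, invented for illustration only.
background = {"bee": 0.5, "the": 0.5}
themes = [{"queen": 0.6, "worker": 0.4}, {"pollen": 0.7, "flower": 0.3}]
term_group = {"queen": 0.5, "worker": 0.5}
closest = most_similar_themes(term_group, themes)
```

A smaller divergence means a closer theme, so ranking in ascending order returns the most similar themes first.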
![Page 4: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/4.jpg)
Theme Discovery (cont.)
What we’ve got now (cont.):
- Use an HMM to segment the whole collection with the extracted themes
- Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)
- Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
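A minimal sketch of MMR phrase selection with edit distance as the redundancy measure. The relevance scores stand in for the n-gram probabilities, and the `trade_off` parameter, the normalisation of edit distance into a similarity, and the example phrases are assumptions made for illustration, not the tuned setup.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def mmr_select(relevance, n, trade_off=0.7):
    """Greedily pick n phrases, trading relevance against redundancy."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < n:
        def score(phrase):
            redundancy = max((similarity(phrase, s) for s in selected),
                             default=0.0)
            return trade_off * relevance[phrase] - (1 - trade_off) * redundancy
        best = max(sorted(candidates), key=score)  # sorted for determinism
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical relevance scores for three candidate phrases.
scores = {"juvenile hormone": 0.9,
          "juvenile hormones": 0.85,
          "vibration signal": 0.6}
picked = mmr_select(scores, n=2, trade_off=0.5)
```

With these toy scores the near-duplicate phrase is skipped in favour of a less relevant but non-redundant one, which is the behaviour MMR is meant to provide.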
![Page 5: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/5.jpg)
Some justifications
Fly collection:
- Cluster 0: circadian
- Cluster 1: adh, evolution
- Cluster 2: a mixture of two topics, apoptosis and promoters
- Cluster 6: brain development
- Cluster 8: cell division
- Cluster 12: drosophila immunity
- Cluster 13: nervous systems
- Cluster 14: hedgehog segment polarity gene
- Cluster 16: Histone, Polycomb
- Cluster 17: visual system
![Page 6: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/6.jpg)
Theme Discovery (cont.)
Problems:
- How to select k? (How many themes do we believe there are in the collection? The bee collection should have a smaller k than the fly collection.)
- Can we find themes in a hierarchical manner? This can solve the former problem; however, when do we cut off?
- How to represent a theme? With top words alone it is sometimes difficult to tell the semantics. Phrases? Sentences?
- Other possible approaches to extract themes? (LDA, clustering methods)
![Page 7: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/7.jpg)
Hierarchical Theme Discovery
A straightforward approach (top-down splitting):
1. Discover k themes from the initial collection
2. Segment the collection by the k themes
3. For each theme, build a sub-collection from the segments of the previous step
4. For each sub-collection, extract k’ themes
5. Repeat these steps iteratively
Problem: when to stop the splitting iteration?
[Figure: tree with Collection at the root, split into Theme1, Theme2 and Theme3; Theme2 is split further into Theme2.1, Theme2.2, Theme2.3, and so on]
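The top-down splitting loop can be sketched as a short recursive function. This is a skeleton under stated assumptions, not the actual pipeline: `extract_themes` and `assign` are hypothetical callables standing in for the generative model and the HMM segmentation, whole documents are assigned to a single theme rather than segmented, and the `depth` and `min_docs` stopping rules are placeholders for the open question of when to stop.

```python
from collections import Counter


def split_top_down(docs, extract_themes, assign, k, depth, min_docs=2):
    """Recursively split a collection; returns a nested dict tree."""
    node = {"size": len(docs), "children": []}
    if depth == 0 or len(docs) < min_docs:
        return node  # placeholder stopping rule (an assumption)
    themes = extract_themes(docs, k)
    if not themes:
        return node
    buckets = [[] for _ in themes]
    for doc in docs:
        buckets[assign(doc, themes)].append(doc)  # build sub-collections
    for theme, bucket in zip(themes, buckets):
        child = split_top_down(bucket, extract_themes, assign,
                               k, depth - 1, min_docs)
        child["theme"] = theme
        node["children"].append(child)
    return node


def toy_extract_themes(docs, k):
    """Stand-in only: the k most frequent words act as 'themes'."""
    counts = Counter(w for d in docs for w in d.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


def toy_assign(doc, themes):
    """Stand-in only: index of the theme word that occurs most often."""
    words = doc.split()
    return max(range(len(themes)), key=lambda i: words.count(themes[i]))


docs = ["queen worker queen", "queen egg",
        "pollen flower pollen", "pollen seed"]
tree = split_top_down(docs, toy_extract_themes, toy_assign, k=2, depth=1)
```

Swapping in a real theme extractor and segmenter only requires matching the two callable signatures; the recursion itself does not change.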
![Page 8: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/8.jpg)
Hierarchical Theme Discovery (results)
A bee collection with 929 documents
Level1: 5 themes
Level2: 3 sub-themes for each higher level theme
… … …
![Page 9: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/9.jpg)
Hierarchical Theme Discovery (results)
Level 1 themes (top 20 words each):
- african jelly royal european venom population africanized sting kda feral m reward subspecies proteins patients discrimination naja cue characters areas
- queen workers worker signal jh vibration pheromone gland eggs signals hormone juvenile anarchistic queens egg iridaceae policing ixia behavioral age
- pollinator plants pollination flowers plantae spermatophyta angiospermae dicotyledones pollen seed fruit angiosperms spermatophytes vascular dicots crop plant flower pollinators species
- learning brain conditioning olfactory neural neurons mushroom memory sucrose nervous coordination dopamine extension antennal odor system proboscis bodies lobe kenyon
- varroa mite mites jacobsoni acarina brood parasite colonies host control chelicerata chelicerates hygienic viruses infestation destructor pest infested parasitology mortality
![Page 10: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/10.jpg)
Hierarchical Theme Discovery (results)
Level 2 sub-themes of the level 1 theme beginning "african jelly royal":
- african european population populations patterns pattern genetic discrimination mitochondrial studies information are contrast green two bees have derived africa subspecies
- larvae microorganisms gram bacteria 0 colonies royal queen jelly eubacteria non workers queens production 2 nest italian 5 fraction nestmates
- venom reward patients naja kda proteins wasp protein diptera pla2 vespula primates hominidae chordata vertebrata mug sting sperm dose quality
![Page 11: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/11.jpg)
Hierarchical Theme Discovery (results)
Level 2 sub-themes of the level 1 theme beginning "queen workers worker":
- queen worker workers colonies pollen vibration eggs foraging development brood signal queens bees anarchistic behavioral iridaceae larvae egg pheromone may
- food foragers dance transfer enzyme biosynthesis receivers contrast nectar flight source flow water information rates ddt rj caucasian visual green
- mammals vertebrates venom nonhuman l ml models model chordates beeswax mug omega embryo mammalia vertebrata has chordata nurse coloured vg
![Page 12: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/12.jpg)
Hierarchical Theme Discovery (results)
Level 2 sub-themes of the level 1 theme beginning "pollinator plants pollination":
- seed per crop sunflower number cruciferae fruit hybrid agriculture seeds quality cultivar weight helianthus oilseed compositae annuus yield pollination set
- ecology is species environmental sciences flowering floral terrestrial pollinator visiting reproduction plants c cashew self animalia food insects faba size
- pollen eep honeybees mating bumblebees sp hive bacteria scent mimosa brazil undertakers chromatography marks recently gram eubacteria caraway microorganisms propolis
![Page 13: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/13.jpg)
Hierarchical Theme Discovery (results)
Level 2 sub-themes of the level 1 theme beginning "learning brain conditioning":
- dopamine levels development age binding pupal brain octopamine division adult colonies labor glass treated colony ryr pigmentation chromosomes arolium da
- bees sucrose conditioning response learning extension proboscis pollen foragers performance between thresholds honeybees solution discrimination strain rate foraging concentration low
- imidacloprid current memory mushroom neurons 1 expressed 4 cells antennal mb bodies currents nervous brain mv kinase receptors term protein
![Page 14: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/14.jpg)
Hierarchical Theme Discovery (results)
Level 2 sub-themes of the level 1 theme beginning "varroa mite mites":
- mite varroa mites brood jacobsoni acarina colonies parasite for worker control a drone formic population acid host 0 cells treatment
- pollen bees foragers their or ta heat at hygienic foraging protein activity behaviour increased response blood flight strips metabolic removal
- viruses larvae microorganisms virus bacteria animal paenibacillus infection molecular pathogen eubacteria gram forming endospore positive sp apv entomopathogen
![Page 15: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/15.jpg)
Phrase Representations:
biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour s gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml
![Page 16: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/16.jpg)
Hierarchical Theme Discovery (cont.)
A bottom-up agglomerative approach:
- Find many micro-themes
- Group similar micro-themes into larger ones
- Borrow a strategy from data mining:
  - BIRCH: incrementally form many micro-clusters, organized in a tree structure
  - Macro-clustering based on the micro-clusters
Problem: again, when to stop?
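The macro-clustering step can be illustrated with a plain greedy agglomeration over micro-themes. This is not BIRCH itself (there is no clustering-feature tree); it only shows the merge loop, and the cosine similarity, the size-weighted averaging, and the `min_similarity` stopping threshold are assumptions chosen for the sketch.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def merge(p, q, wp, wq):
    """Size-weighted average of two word distributions."""
    total = wp + wq
    return {w: (wp * p.get(w, 0.0) + wq * q.get(w, 0.0)) / total
            for w in set(p) | set(q)}


def agglomerate(micro_themes, min_similarity=0.5):
    """Merge the most similar pair until no pair reaches the threshold."""
    clusters = [(dict(t), 1) for t in micro_themes]  # (model, size)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][0], clusters[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < min_similarity:
            break  # hypothetical stopping rule
        (p, wp), (q, wq) = clusters[i], clusters[j]
        merged = (merge(p, q, wp, wq), wp + wq)
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters


# Toy micro-themes, invented for illustration.
micro = [{"queen": 0.5, "worker": 0.5}, {"queen": 0.6, "egg": 0.4},
         {"pollen": 0.7, "flower": 0.3}, {"pollen": 0.5, "seed": 0.5}]
macro = agglomerate(micro, min_similarity=0.5)
```

The threshold makes the stopping problem explicit: a higher value leaves more, tighter themes, a lower value merges everything into a few broad ones.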
![Page 17: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/17.jpg)
Hierarchical Theme Discovery (cont.)
Model-based approach:
- Hofmann, IJCAI 99.
- Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of the GO hierarchies).
Problem: in most cases we don’t know the hierarchies.
![Page 18: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/18.jpg)
Other Research Problems
Representing a theme:
- Using top words: where to cut?
- Using phrases: the MMR has to be tuned (many possible strategies and parameters)
- Using sentences? Similar to summarization
Themes are interesting… but how do we make use of them?
How to evaluate themes?
![Page 19: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/19.jpg)
Concept Extraction
What we have now:
- N-gram algorithm (actually 2-gram): iteratively group the pair of terms that are most likely to be replaceable, considering the context of one term before/after each.
- Time complexity: O(N^3); space complexity: now O(N^2). The Beespace server can deal with <= 9000 terms now (2.4 GB memory). Performance has not been evaluated, since only small data sizes can be handled.
- Problem: based on mutual information, which prefers 2-grams with low frequency. Doesn’t make use of farther context.
- Will removing stop words help or hurt performance?
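The low-frequency preference can be illustrated with pointwise mutual information over bigrams. This is a sketch of the bias only, assuming the merge criterion behaves like PMI; it is not the grouping algorithm itself, and the token sequence is invented.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """PMI (in bits) of every adjacent word pair in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# "honey bee" occurs five times, "dufour gland" only once.
tokens = ["honey", "bee"] * 5 + ["dufour", "gland"]
pmi = bigram_pmi(tokens)
rare_beats_frequent = pmi[("dufour", "gland")] > pmi[("honey", "bee")]
```

The single occurrence of "dufour gland" scores higher than the well-attested "honey bee" because both of its words are rare, which is why frequency thresholds or discounting are commonly added on top of raw PMI.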
![Page 20: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/20.jpg)
Some findings:
- A small dataset (200+ abstracts containing gene synonyms)
- Only 600 iterations (600 merges)
- Most of the groups are reasonable, but not really useful
  - E.g. head-to-head, tail-to-tail
  - E.g. within-locus, between-locus
  - FBgn0000017: Dsrc, Dabl
  - FBgn0000078: amylase-null, AMY-null
- Problem: the document set is too small and the n-grams too sparse to find useful concepts.
![Page 21: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/21.jpg)
Concept Extraction (cont.)
Other possible strategy:
- Lin et al., KDD 02: use a feature vector to represent each term, where the weights are the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are considered as context features, this will be similar to what we have.)
- Use a committee to represent a cluster, which ensures that the clusters are tight and robust.
- Problem: not sure how to select features
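A minimal sketch of the feature-vector idea: each term gets a vector of context features weighted by mutual information, and terms are compared by cosine similarity. The choice of neighbouring words as features, the `window` size, keeping only positive weights, and the toy sentences are all assumptions for illustration; the committee step is not shown.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each term to {context feature: positive PMI weight}."""
    pair_counts = defaultdict(Counter)
    term_counts = Counter()
    feat_counts = Counter()
    total = 0
    for sent in sentences:
        for i, term in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # Feature = (side of the neighbour, neighbouring word).
                feat = ("L" if j < i else "R", sent[j])
                pair_counts[term][feat] += 1
                term_counts[term] += 1
                feat_counts[feat] += 1
                total += 1
    vectors = {}
    for term, feats in pair_counts.items():
        vec = {}
        for feat, c in feats.items():
            pmi = math.log((c * total)
                           / (term_counts[term] * feat_counts[feat]))
            if pmi > 0:  # keep only positive weights (a simplification)
                vec[feat] = pmi
        vectors[term] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Invented sentences; Dsrc and Dabl share the same contexts.
sentences = [["the", "Dsrc", "gene", "is", "expressed"],
             ["the", "Dabl", "gene", "is", "expressed"],
             ["flies", "were", "kept", "at", "room", "temperature"]]
vectors = context_vectors(sentences)
same_context = cosine(vectors["Dsrc"], vectors["Dabl"])
different_context = cosine(vectors["Dsrc"], vectors["kept"])
```

Because features are arbitrary keys, richer contexts (wider windows, syntactic relations) can be added without changing the similarity code, which is the flexibility the slide points to.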
![Page 22: Concept and Theme Discovery through Probabilistic Models and Clustering Qiaozhu Mei Oct. 12, 2005](https://reader035.vdocument.in/reader035/viewer/2022070307/551a5253550346cb358b5c79/html5/thumbnails/22.jpg)
Summary
Theme extraction:
- Generally performs well, if we can find a good k.
- Hierarchical clustering can solve this problem, but we still need to find a reasonable stopping criterion.
- Representation is an interesting problem: MMR phrase extraction should be further tuned.
- Difficult to evaluate other than by expert justification.
Concept extraction:
- The n-gram approach has space constraints, so we haven’t really tested the performance…
- Generally, the performance should be better on large data sets.
- Other clustering algorithms can be explored.