Author-Topic Models for Large Text Corpora
Padhraic Smyth, Department of Computer Science
University of California, Irvine
In collaboration with: Mark Steyvers (UCI)
Michal Rosen-Zvi (UCI)
Tom Griffiths (Stanford)
Outline
• Problem motivation: modeling large sets of documents
• Probabilistic approaches: topic models -> author-topic models
• Results: author-topic results from CiteSeer, NIPS, Enron data; applications of the model; (demo of author-topic query tool)
• Future directions
Data Sets of Interest
• Data = set of documents
• Large collection of documents: 10k, 100k, etc.
• Know authors of the documents
• Know years/dates of the documents
• …
• (will typically assume a bag-of-words representation; see the sketch below)
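As a quick illustration of the bag-of-words assumption, here is a minimal sketch (a toy two-document corpus invented for illustration) showing how documents reduce to unordered word counts:

from collections import Counter

# Toy corpus: each document becomes an unordered multiset of word counts;
# word order is discarded entirely.
docs = ["learning learning bayesian probabilistic",
        "retrieval information retrieval information"]
bags = [Counter(doc.split()) for doc in docs]
print(bags[0])   # Counter({'learning': 2, 'bayesian': 1, 'probabilistic': 1})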
Examples of Data Sets
• CiteSeer: 160k abstracts, 80k authors, 1986-2002
• NIPS papers: 2k papers, 1k authors, 1987-1999
• Reuters: 20k newspaper articles, 114 authors
Pennsylvania Gazette
1728-1800
80,000 articles
25 million words
www.accessible.com
Enron email data
500,000 emails
5000 authors
1999-2002
Problems of Interest
• What topics do these documents “span”?
• Which documents are about a particular topic?
• How have topics changed over time?
• What does author X write about?
• Who is likely to write about topic Y?
• Who wrote this specific document?
• and so on…
A topic is represented as a (multinomial) distribution over words
P(w | z)
TOPIC 209
WORD PROB.
PROBABILISTIC 0.0778
BAYESIAN 0.0671
PROBABILITY 0.0532
CARLO 0.0309
MONTE 0.0308
DISTRIBUTION 0.0257
INFERENCE 0.0253
PROBABILITIES 0.0253
CONDITIONAL 0.0229
PRIOR 0.0219
... ...

TOPIC 289
WORD PROB.
RETRIEVAL 0.1179
TEXT 0.0853
DOCUMENTS 0.0527
INFORMATION 0.0504
DOCUMENT 0.0441
CONTENT 0.0242
INDEXING 0.0205
RELEVANCE 0.0159
COLLECTION 0.0146
RELEVANT 0.0136
... ...
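A topic of this kind is just a probability vector over the vocabulary. As a minimal sketch (using the truncated TOPIC 209 list above, renormalized; a real topic spans the full vocabulary):

import numpy as np

# Truncated version of TOPIC 209 above; renormalize so probabilities sum to 1.
words = ["PROBABILISTIC", "BAYESIAN", "PROBABILITY", "CARLO", "MONTE"]
probs = np.array([0.0778, 0.0671, 0.0532, 0.0309, 0.0308])
probs = probs / probs.sum()

rng = np.random.default_rng(0)
print(rng.choice(words, size=8, p=probs))   # i.i.d. draws w ~ P(w | z = 209)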
Cluster Models
DOCUMENT 1: Learning Learning Bayesian Probabilistic
DOCUMENT 2: Retrieval Information Retrieval Information
Cluster for DOCUMENT 1:
P(probabilistic | topic) = 0.25
P(learning | topic) = 0.50
P(Bayesian | topic) = 0.25
P(other words | topic) = 0.00

Cluster for DOCUMENT 2:
P(information | topic) = 0.5
P(retrieval | topic) = 0.5
P(other words | topic) = 0.0
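These cluster-word probabilities are just maximum-likelihood word frequencies within a cluster's documents. A minimal sketch reproducing the DOCUMENT 1 numbers:

from collections import Counter

# MLE of P(word | cluster) from the single toy document above.
tokens = "learning learning bayesian probabilistic".split()
counts = Counter(tokens)
total = sum(counts.values())
for word, c in counts.items():
    print(f"P({word} | topic) = {c / total:.2f}")   # learning 0.50, others 0.25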
Graphical Model
[Plate diagram for the cluster model: cluster variable z generates word w; plates over n words and D documents; parameters are the cluster weights (a distribution over z) and the cluster-word distributions P(w | z)]
Cluster Models
DOCUMENT 1: Learning Learning Bayesian Probabilistic
DOCUMENT 2: Retrieval Information Retrieval Information
DOCUMENT 3: Learning Information Retrieval Probabilistic
Topic Models
DOCUMENT 1: Learning Learning Bayesian Probabilistic
DOCUMENT 2: Retrieval Information Retrieval Information
DOCUMENT 3: Learning Information Retrieval Probabilistic
History of topic models
• Latent class models in statistics (late 60s)
• Hofmann (1999): original application to documents
• Blei, Ng, and Jordan (2001, 2003): variational methods
• Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient)
Word/Document Counts for 16 Artificial Documents
[Matrix of word counts: 16 documents (rows) by the words River, Stream, Bank, Money, Loan (columns)]
Can we recover the original topics and topic mixtures from this data?
Example of Gibbs Sampling
• Assign word tokens randomly to topics (●=topic 1; ●=topic 2)
• Apply the sampling equation to each word token
[Figures: topic assignments on the 16-document count matrix after 1, 4, and 32 iterations]
After 32 iterations the two topics are recovered:
topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29
Author-Topic Models
DOCUMENT 1: Learning Learning Bayesian Probabilistic
DOCUMENT 2: Retrieval Information Retrieval Information
DOCUMENT 3: Learning Information Retrieval Probabilistic
Approach
• The author-topic model: a probabilistic model linking authors and topics
• authors -> topics -> words
• learned from data: completely unsupervised, no labels
• generative model
• Different questions or queries can be answered by appropriate probability calculus
- E.g., p(author | words in document)
- E.g., p(topic | author)
Graphical Model
[Plate diagram for the author-topic model: the document's author set a determines author x, which generates topic z, which generates word w; plates over n words and D documents; parameters are the author-topic distributions and the topic-word distributions]
Generative Process
• Let's assume authors A1 and A2 collaborate and produce a paper
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2
• For each word in the paper (see the sketch below):
1. Sample an author x (uniformly) from A1, A2
2. Sample a topic z from θx
3. Sample a word w from the multinomial topic distribution φz
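A minimal sketch of this generative process (toy numbers of topics and vocabulary size, random Dirichlet parameters rather than fitted ones):

import numpy as np

rng = np.random.default_rng(1)
T, V = 4, 10                                 # toy: number of topics, vocabulary size
theta = rng.dirichlet(np.ones(T), size=2)    # topic distribution for authors A1, A2
phi = rng.dirichlet(np.ones(V), size=T)      # word distribution for each topic

def generate_paper(n_words, authors=(0, 1)):
    """Generate one paper by the three-step process above."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)              # 1. sample an author uniformly
        z = rng.choice(T, p=theta[x])        # 2. sample a topic from theta_x
        w = rng.choice(V, p=phi[z])          # 3. sample a word from phi_z
        words.append(w)
    return words

print(generate_paper(12))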
Learning
• Observed: W = observed words, A = sets of known authors
• Unknown: x, z: hidden variables; Θ, Φ: unknown parameters
• Interested in: p(x, z | W, A) and p(Θ, Φ | W, A)
• But exact inference is not tractable
Step 1: Gibbs sampling of x and z
[Same plate diagram as above; marginalize over the unknown parameters Θ and Φ]
Step 2: MAP estimates of Θ and Φ
[Same plate diagram as above; condition on particular samples of x and z to obtain point estimates of the unknown parameters]
More Details on Learning
• Gibbs sampling for x and z
- Typically run 2000 Gibbs iterations
- 1 iteration = full pass through all documents
• Estimating Θ and Φ
- x and z samples -> point estimates (see the sketch below)
- non-informative Dirichlet priors for Θ and Φ
• Computational efficiency
- Learning is linear in the number of word tokens
• Predictions on new documents
- can average over Θ and Φ (from different samples, different runs)
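The point estimates are smoothed, normalized count matrices from a single sample of x and z. A sketch, assuming the count matrices defined with the sampling equation below (the α and β defaults here are placeholders, not the values used in the experiments):

import numpy as np

def point_estimates(C_AT, C_WT, alpha=0.5, beta=0.01):
    """Point estimates from one Gibbs sample.

    C_AT[k, j] = number of times topic j is assigned to author k
    C_WT[m, j] = number of times word m is assigned to topic j
    """
    theta = (C_AT + alpha) / (C_AT + alpha).sum(axis=1, keepdims=True)  # Theta: p(topic | author)
    phi = (C_WT + beta) / (C_WT + beta).sum(axis=0, keepdims=True)      # Phi: p(word | topic)
    return theta, phi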
Gibbs Sampling
• Need full conditional distributions for variables
• The probability of assigning the current word i to topic j and author k, given everything else:
P(z_i = j, x_i = k | w_i = m, z_-i, x_-i, w_-i, a_d) ∝
  (C_WT[m, j] + β) / (Σ_m' C_WT[m', j] + Vβ) · (C_AT[k, j] + α) / (Σ_j' C_AT[k, j'] + Tα)

where C_WT[m, j] = number of times word w = m is assigned to topic j,
C_AT[k, j] = number of times topic j is assigned to author k,
V = vocabulary size, T = number of topics (counts exclude the current token i).
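One way this update might look in code (a sketch; it assumes the current token's assignment has already been decremented from C_WT and C_AT):

import numpy as np

def sample_token(m, authors_d, C_WT, C_AT, alpha, beta, rng):
    """Resample (author k, topic j) for one word token w = m in document d."""
    V, T = C_WT.shape
    left = (C_WT[m, :] + beta) / (C_WT.sum(axis=0) + V * beta)        # word-topic term
    right = (C_AT[authors_d, :] + alpha) / (
        C_AT[authors_d, :].sum(axis=1, keepdims=True) + T * alpha)    # author-topic term
    p = right * left                       # unnormalized joint over (author, topic)
    p = p / p.sum()
    flat = rng.choice(p.size, p=p.ravel())
    k, j = np.unravel_index(flat, p.shape)
    return authors_d[k], j                 # sampled author and topic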
Experiments on Real Data
• Corpora
- CiteSeer: 160K abstracts, 85K authors
- NIPS: 1.7K papers, 2K authors
- Enron: 115K emails, 5K authors (senders)
- PubMed: 27K abstracts, 50K authors
• Removed stop words; no stemming
• Ignore word order, just use word counts
• Processing time:
NIPS: 2000 Gibbs iterations ~ 8 hours
CiteSeer: 2000 Gibbs iterations ~ 4 days
Four example topics from CiteSeer (T=300)
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
DATA 0.1563 PROBABILISTIC 0.0778 RETRIEVAL 0.1179 QUERY 0.1848
MINING 0.0674 BAYESIAN 0.0671 TEXT 0.0853 QUERIES 0.1367
ATTRIBUTES 0.0462 PROBABILITY 0.0532 DOCUMENTS 0.0527 INDEX 0.0488
DISCOVERY 0.0401 CARLO 0.0309 INFORMATION 0.0504 DATA 0.0368
ASSOCIATION 0.0335 MONTE 0.0308 DOCUMENT 0.0441 JOIN 0.0260
LARGE 0.0280 DISTRIBUTION 0.0257 CONTENT 0.0242 INDEXING 0.0180
KNOWLEDGE 0.0260 INFERENCE 0.0253 INDEXING 0.0205 PROCESSING 0.0113
DATABASES 0.0210 PROBABILITIES 0.0253 RELEVANCE 0.0159 AGGREGATE 0.0110
ATTRIBUTE 0.0188 CONDITIONAL 0.0229 COLLECTION 0.0146 ACCESS 0.0102
DATASETS 0.0165 PRIOR 0.0219 RELEVANT 0.0136 PRESENT 0.0095
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Han_J 0.0196 Friedman_N 0.0094 Oard_D 0.0110 Suciu_D 0.0102
Rastogi_R 0.0094 Heckerman_D 0.0067 Croft_W 0.0056 Naughton_J 0.0095
Zaki_M 0.0084 Ghahramani_Z 0.0062 Jones_K 0.0053 Levy_A 0.0071
Shim_K 0.0077 Koller_D 0.0062 Schauble_P 0.0051 DeWitt_D 0.0068
Ng_R 0.0060 Jordan_M 0.0059 Voorhees_E 0.0050 Wong_L 0.0067
Liu_B 0.0058 Neal_R 0.0055 Singhal_A 0.0048 Chakrabarti_K 0.0064
Mannila_H 0.0056 Raftery_A 0.0054 Hawking_D 0.0048 Ross_K 0.0061
Brin_S 0.0054 Lukasiewicz_T 0.0053 Merkl_D 0.0042 Hellerstein_J 0.0059
Liu_H 0.0047 Halpern_J 0.0052 Allan_J 0.0040 Lenzerini_M 0.0054
Holder_L 0.0044 Muller_P 0.0048 Doermann_D 0.0039 Moerkotte_G 0.0053
TOPIC 205 TOPIC 209 TOPIC 289 TOPIC 10
More CiteSeer Topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
SPEECH 0.1134 PROBABILISTIC 0.0778 USER 0.2541 STARS 0.0164
RECOGNITION 0.0349 BAYESIAN 0.0671 INTERFACE 0.1080 OBSERVATIONS 0.0150
WORD 0.0295 PROBABILITY 0.0532 USERS 0.0788 SOLAR 0.0150
SPEAKER 0.0227 CARLO 0.0309 INTERFACES 0.0433 MAGNETIC 0.0145
ACOUSTIC 0.0205 MONTE 0.0308 GRAPHICAL 0.0392 RAY 0.0144
RATE 0.0134 DISTRIBUTION 0.0257 INTERACTIVE 0.0354 EMISSION 0.0134
SPOKEN 0.0132 INFERENCE 0.0253 INTERACTION 0.0261 GALAXIES 0.0124
SOUND 0.0127 PROBABILITIES 0.0253 VISUAL 0.0203 OBSERVED 0.0108
TRAINING 0.0104 CONDITIONAL 0.0229 DISPLAY 0.0128 SUBJECT 0.0101
MUSIC 0.0102 PRIOR 0.0219 MANIPULATION 0.0099 STAR 0.0087
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Waibel_A 0.0156 Friedman_N 0.0094 Shneiderman_B 0.0060 Linsky_J 0.0143
Gauvain_J 0.0133 Heckerman_D 0.0067 Rauterberg_M 0.0031 Falcke_H 0.0131
Lamel_L 0.0128 Ghahramani_Z 0.0062 Lavana_H 0.0024 Mursula_K 0.0089
Woodland_P 0.0124 Koller_D 0.0062 Pentland_A 0.0021 Butler_R 0.0083
Ney_H 0.0080 Jordan_M 0.0059 Myers_B 0.0021 Bjorkman_K 0.0078
Hansen_J 0.0078 Neal_R 0.0055 Minas_M 0.0021 Knapp_G 0.0067
Renals_S 0.0072 Raftery_A 0.0054 Burnett_M 0.0021 Kundu_M 0.0063
Noth_E 0.0071 Lukasiewicz_T 0.0053 Winiwarter_W 0.0020 Christensen-J 0.0059
Boves_L 0.0070 Halpern_J 0.0052 Chang_S 0.0019 Cranmer_S 0.0055
Young_S 0.0069 Muller_P 0.0048 Korvemaker_B 0.0019 Nagar_N 0.0050
TOPIC 10 TOPIC 209 TOPIC 87 TOPIC 20
Some topics relate to generic word usage
WORD PROB.
METHOD 0.5851
METHODS 0.3321
APPLIED 0.0268
APPLYING 0.0056
ORIGINAL 0.0054
DEVELOPED 0.0051
PROPOSE 0.0046
COMBINES 0.0034
PRACTICAL 0.0031
APPLY 0.0029
AUTHOR PROB.
Yang_T 0.0014
Zhang_J 0.0014
Loncaric_S 0.0014
Liu_Y 0.0013
Benner_P 0.0013
Faloutsos_C 0.0013
Cortadella_J 0.0012
Paige_R 0.0011
Tai_X 0.0011
Lee_J 0.0011
TOPIC 273
What can the Model be used for?
• We can analyze our document set through the “topic lens”
• Applications
- Queries
-- Who writes on this topic? e.g., finding experts or reviewers in a particular area
-- What topics does this person do research on?
- Discovering trends over time
- Detecting unusual papers and authors
- Interactive browsing of a digital library via topics
- Parsing documents (and parts of documents) by topic
- and more…
Some likely topics per author (CiteSeer)
• Author = Andrew McCallum, U Mass:
- Topic 1: classification, training, generalization, decision, data, …
- Topic 2: learning, machine, examples, reinforcement, inductive, …
- Topic 3: retrieval, text, document, information, content, …
• Author = Hector Garcia-Molina, Stanford:
- Topic 1: query, index, data, join, processing, aggregate, …
- Topic 2: transaction, concurrency, copy, permission, distributed, …
- Topic 3: source, separation, paper, heterogeneous, merging, …
• Author = Paul Cohen, USC/ISI:
- Topic 1: agent, multi, coordination, autonomous, intelligent, …
- Topic 2: planning, action, goal, world, execution, situation, …
- Topic 3: human, interaction, people, cognitive, social, natural, …
Temporal patterns in topics: hot and cold topics
• We have CiteSeer papers from 1986-2002
• For each year, calculate the fraction of words assigned to each topic (see the sketch below)
• -> a time-series for topics
- Hot topics become more prevalent
- Cold topics become less prevalent
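A sketch of that per-year calculation from a single Gibbs sample of the topic assignments (hypothetical variable names for the corpus bookkeeping):

import numpy as np

def topic_trends(z, doc_of_token, year_of_doc, T, years):
    """Fraction of word tokens assigned to each topic, per year.

    z[i] = topic assigned to token i; doc_of_token[i] = its document.
    """
    trends = np.zeros((len(years), T))
    year_index = {y: r for r, y in enumerate(years)}
    for i, j in enumerate(z):
        trends[year_index[year_of_doc[doc_of_token[i]]], j] += 1
    return trends / trends.sum(axis=1, keepdims=True)   # each row sums to 1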
[Figure: document and word distribution by year in the UCI CiteSeer data, 1986-2002; number of documents and number of words per year]
[Figures: "Changing trends in computer science"; topic probability by year (1990-2002) for the INFORMATION RETRIEVAL and WWW topics, then additionally for the OPERATING SYSTEMS and PROGRAMMING LANGUAGES topics]
[Figure: "Hot topics: machine learning/data mining"; topic probability by year for the REGRESSION, DATA MINING, and CLASSIFICATION topics]
[Figure: "Bayes marches on"; topic probability by year for the BAYESIAN, PROBABILITY, and STATISTICAL PREDICTION topics]
[Figure: interesting "topics"; topic probability by year for FRENCH WORDS (la, les, une, nous, est), MATH SYMBOLS (gamma, delta, omega), and DARPA]
Four example topics from NIPS (T=100)
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
LIKELIHOOD 0.0539 RECOGNITION 0.0400 REINFORCEMENT 0.0411 KERNEL 0.0683
MIXTURE 0.0509 CHARACTER 0.0336 POLICY 0.0371 SUPPORT 0.0377
EM 0.0470 CHARACTERS 0.0250 ACTION 0.0332 VECTOR 0.0257
DENSITY 0.0398 TANGENT 0.0241 OPTIMAL 0.0208 KERNELS 0.0217
GAUSSIAN 0.0349 HANDWRITTEN 0.0169 ACTIONS 0.0208 SET 0.0205
ESTIMATION 0.0314 DIGITS 0.0159 FUNCTION 0.0178 SVM 0.0204
LOG 0.0263 IMAGE 0.0157 REWARD 0.0165 SPACE 0.0188
MAXIMUM 0.0254 DISTANCE 0.0153 SUTTON 0.0164 MACHINES 0.0168
PARAMETERS 0.0209 DIGIT 0.0149 AGENT 0.0136 REGRESSION 0.0155
ESTIMATE 0.0204 HAND 0.0126 DECISION 0.0118 MARGIN 0.0151
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Tresp_V 0.0333 Simard_P 0.0694 Singh_S 0.1412 Smola_A 0.1033
Singer_Y 0.0281 Martin_G 0.0394 Barto_A 0.0471 Scholkopf_B 0.0730
Jebara_T 0.0207 LeCun_Y 0.0359 Sutton_R 0.0430 Burges_C 0.0489
Ghahramani_Z 0.0196 Denker_J 0.0278 Dayan_P 0.0324 Vapnik_V 0.0431
Ueda_N 0.0170 Henderson_D 0.0256 Parr_R 0.0314 Chapelle_O 0.0210
Jordan_M 0.0150 Revow_M 0.0229 Dietterich_T 0.0231 Cristianini_N 0.0185
Roweis_S 0.0123 Platt_J 0.0226 Tsitsiklis_J 0.0194 Ratsch_G 0.0172
Schuster_M 0.0104 Keeler_J 0.0192 Randlov_J 0.0167 Laskov_P 0.0169
Xu_L 0.0098 Rashid_M 0.0182 Bradtke_S 0.0161 Tipping_M 0.0153
Saul_L 0.0094 Sackinger_E 0.0132 Schwartz_A 0.0142 Sollich_P 0.0141
TOPIC 19 TOPIC 24 TOPIC 29 TOPIC 87
NIPS: support vector topic [figure]
NIPS: neural network topic [figure]
Pennsylvania Gazette Data (courtesy of David Newman, UC Irvine)
Enron email data
500,000 emails
5000 authors
1999-2002
Enron email topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
FEEDBACK 0.0781 PROJECT 0.0514 FERC 0.0554 ENVIRONMENTAL 0.0291
PERFORMANCE 0.0462 PLANT 0.028 MARKET 0.0328 AIR 0.0232
PROCESS 0.0455 COST 0.0182 ISO 0.0226 MTBE 0.019
PEP 0.0446 CONSTRUCTION 0.0169 COMMISSION 0.0215 EMISSIONS 0.017
MANAGEMENT 0.03 UNIT 0.0166 ORDER 0.0212 CLEAN 0.0143
COMPLETE 0.0205 FACILITY 0.0165 FILING 0.0149 EPA 0.0133
QUESTIONS 0.0203 SITE 0.0136 COMMENTS 0.0116 PENDING 0.0129
SELECTED 0.0187 PROJECTS 0.0117 PRICE 0.0116 SAFETY 0.0104
COMPLETED 0.0146 CONTRACT 0.011 CALIFORNIA 0.0110 WATER 0.0092
SYSTEM 0.0146 UNITS 0.0106 FILED 0.0110 GASOLINE 0.0086
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
perfmgmt 0.2195 *** 0.0288 *** 0.0532 *** 0.1339
perf eval process 0.0784 *** 0.022 *** 0.0454 *** 0.0275
enron announcements 0.0489 *** 0.0123 *** 0.0384 *** 0.0205
*** 0.0089 *** 0.0111 *** 0.0334 *** 0.0166
*** 0.0048 *** 0.0108 *** 0.0317 *** 0.0129
TOPIC 23 TOPIC 36 TOPIC 72 TOPIC 54
Non-work Topics…
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
HOLIDAY 0.0857 TEXANS 0.0145 GOD 0.0357 AMAZON 0.0312
PARTY 0.0368 WIN 0.0143 LIFE 0.0272 GIFT 0.0226
YEAR 0.0316 FOOTBALL 0.0137 MAN 0.0116 CLICK 0.0193
SEASON 0.0305 FANTASY 0.0129 PEOPLE 0.0103 SAVE 0.0147
COMPANY 0.0255 SPORTSLINE 0.0129 CHRIST 0.0092 SHOPPING 0.0140
CELEBRATION 0.0199 PLAY 0.0123 FAITH 0.0083 OFFER 0.0124
ENRON 0.0198 TEAM 0.0114 LORD 0.0079 HOLIDAY 0.0122
TIME 0.0194 GAME 0.0112 JESUS 0.0075 RECEIVE 0.0102
RECOGNIZE 0.019 SPORTS 0.011 SPIRITUAL 0.0066 SHIPPING 0.0100
MONTH 0.018 GAMES 0.0109 VISIT 0.0065 FLOWERS 0.0099
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
chairman & ceo 0.131 cbs sportsline com 0.0866 crosswalk com 0.2358 amazon com 0.1344
*** 0.0102 houston texans 0.0267 wordsmith 0.0208 jos a bank 0.0266
*** 0.0046 houstontexans 0.0203 *** 0.0107 sharperimageoffers 0.0136
*** 0.0022 sportsline rewards 0.0175 doctor dictionary 0.0101 travelocity com 0.0094
general announcement 0.0017 pro football 0.0136 *** 0.0061 barnes & noble com 0.0089
TOPIC 109 TOPIC 66 TOPIC 182 TOPIC 113
Topical Topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
POWER 0.0915 STATE 0.0253 COMMITTEE 0.0197 LAW 0.0380
CALIFORNIA 0.0756 PLAN 0.0245 BILL 0.0189 TESTIMONY 0.0201
ELECTRICITY 0.0331 CALIFORNIA 0.0137 HOUSE 0.0169 ATTORNEY 0.0164
UTILITIES 0.0253 POLITICIAN Y 0.0137 WASHINGTON 0.0140 SETTLEMENT 0.0131
PRICES 0.0249 RATE 0.0131 SENATE 0.0135 LEGAL 0.0100
MARKET 0.0244 BANKRUPTCY 0.0126 POLITICIAN X 0.0114 EXHIBIT 0.0098
PRICE 0.0207 SOCAL 0.0119 CONGRESS 0.0112 CLE 0.0093
UTILITY 0.0140 POWER 0.0114 PRESIDENT 0.0105 SOCALGAS 0.0093
CUSTOMERS 0.0134 BONDS 0.0109 LEGISLATION 0.0099 METALS 0.0091
ELECTRIC 0.0120 MOU 0.0107 DC 0.0093 PERSON Z 0.0083
SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.
*** 0.1160 *** 0.0395 *** 0.0696 *** 0.0696
*** 0.0518 *** 0.0337 *** 0.0453 *** 0.0453
*** 0.0284 *** 0.0295 *** 0.0255 *** 0.0255
*** 0.0272 *** 0.0251 *** 0.0173 *** 0.0173
*** 0.0266 *** 0.0202 *** 0.0317 *** 0.0317
TOPIC 194 TOPIC 18 TOPIC 22 TOPIC 114
Enron email: California Energy Crisis
Message-ID: <21993848.1075843452041.JavaMail.evans@thyme>
Date: Fri, 27 Apr 2001 09:25:00 -0700 (PDT)
Subject: California Update 4/27/01
……
FERC price cap decision reflects Bush political and economic objectives. Politically, Bush is determined to let the crisis blame fall on Davis; from an economic perspective, he is unwilling to create disincentives for new power generation.
The FERC decision is a holding move by the Bush administration that looks like action, but is not. Rather, it allows the situation in California to continue to develop virtually unabated. The political strategy appears to allow the situation to deteriorate to the point where Davis cannot escape shouldering the blame. Once they are politically inoculated, the Administration can begin to look at regional solutions. Moreover, the Administration has already made explicit (and will certainly restate in the forthcoming Cheney commission report) its opposition to stronger price caps …
Enron email: US Senate Bill
Message-ID: <23926374.1075846156491.JavaMail.evans@thyme>
Date: Thu, 15 Jun 2000 08:59:00 -0700 (PDT)
From: ***************
To: ***************
Subject: Senate Commerce Committee Pipeline Safety Markup
The Senate Commerce Committee held a markup today where Senator John McCain's (R-AZ) pipeline safety legislation, S. 2438, was approved. The overall outcome was not unexpected -- the final legislation contained several provisions that went a little bit further than Enron and INGAA would have liked, ……
2) McCain amendment to Section 13 (b) (on operator assistance investigations) -- Approved by voice vote. ……
3) Sen. John Kerry (D-MA) Amendment on Enforcement -- Approved by voice vote. Another confusing vote, in which many members did not understand the changes being made, but agreed to it on the condition that clarifications be made before Senate floor action. Late last night, Enron led a group including companies from INGAA and AGA in providing comments to Senator Kerry which caused him to make substantial changes to his amendment before it was voted on at markup, including dropping provisions allowing citizen suits and other troubling issues. In the end, the amendment that passed was acceptable to industry.
Enron email: political donations
10/16/2000 04:41 PM
Subject: Ashcroft Senate Campaign Request
We have received a request from the Ashcroft Senate campaign for $10,000 in soft money. This is the race where Governor Carnahan is the challenger. Enron PAC has contributed $10,000 and Enron has also contributed $15,000 soft money in this campaign to Senator Ashcroft. Ken Lay has been personally interested in the Ashcroft campaign. Our polling information is that Ashcroft is currently leading 43 to 38 with an undecided of 19 percent.
……
Message-ID: <2546687.1075846182883.JavaMail.evans@thyme>
Date: Mon, 16 Oct 2000 14:13:00 -0700 (PDT)
From: *****
To: *****
Subject: Re: Ashcroft Senate Campaign Request
If you can cover it I would say yes. It's a key race and we have been close to Ashcroft for years. Let's make sure he knows we gave it.... we need to follow up with him. Last time I talked to him he basically recited the utilities' position on electric restructuring. Let's make it clear that we want to talk right after the election.
PubMed-Query Topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
BIOLOGICAL 0.1002 PLAGUE 0.0296 BOTULISM 0.1014 HIV 0.0916
AGENTS 0.0889 MEDICAL 0.0287 BOTULINUM 0.0888 PROTEASE 0.0563
THREAT 0.0396 CENTURY 0.0280 TOXIN 0.0877 AMPRENAVIR 0.0527
BIOTERRORISM 0.0348 MEDICINE 0.0266 TYPE 0.0669 INHIBITORS 0.0366
WEAPONS 0.0328 HISTORY 0.0203 CLOSTRIDIUM 0.0340 INHIBITOR 0.0220
POTENTIAL 0.0305 EPIDEMIC 0.0106 INFANT 0.0245 PLASMA 0.0204
ATTACK 0.0290 GREAT 0.0091 NEUROTOXIN 0.0184 APV 0.0169
CHEMICAL 0.0288 EPIDEMICS 0.0090 BONT 0.0167 DRUG 0.0169
WARFARE 0.0219 CHINESE 0.0083 FOOD 0.0134 RITONAVIR 0.0164
ANTHRAX 0.0146 FRENCH 0.0082 PARALYSIS 0.0124 IMMUNODEFICIENCY0.0150
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Atlas_RM 0.0044 Károly_L 0.0089 Hatheway_CL 0.0254 Sadler_BM 0.0129
Tegnell_A 0.0036 Jian-ping_Z 0.0085 Schiavo_G 0.0141 Tisdale_M 0.0118
Aas_P 0.0036 Sabbatani_S 0.0080 Sugiyama_H 0.0111 Lou_Y 0.0069
Greenfield_RA 0.0032 Theodorides_J 0.0045 Arnon_SS 0.0108 Stein_DS 0.0069
Bricaire_F 0.0032 Bowers_JZ 0.0045 Simpson_LL 0.0093 Haubrich_R 0.0061
TOPIC 32 TOPIC 188 TOPIC 63 TOPIC 85
PubMed-Query Topics
WORD PROB. WORD PROB. WORD PROB. WORD PROB.
ANTHRACIS 0.1627 CHEMICAL 0.0578 HD 0.0657 ENZYME 0.0938
ANTHRAX 0.1402 SARIN 0.0454 MUSTARD 0.0639 ACTIVE 0.0429
BACILLUS 0.1219 AGENT 0.0332 EXPOSURE 0.0444 SUBSTRATE 0.0399
SPORES 0.0614 GAS 0.0312 SM 0.0353 SITE 0.0361
CEREUS 0.0382 AGENTS 0.0268 SULFUR 0.0343 ENZYMES 0.0308
SPORE 0.0274 VX 0.0264 SKIN 0.0208 REACTION 0.0225
THURINGIENSIS 0.0177 NERVE 0.0232 EXPOSED 0.0185 SUBSTRATES 0.0201
SUBTILIS 0.0152 ACID 0.0220 AGENT 0.0140 FOLD 0.0176
STERNE 0.0124 TOXIC 0.0197 EPIDERMAL 0.0129 CATALYTIC 0.0154
INHALATIONAL 0.0104 PRODUCTS 0.0170 DAMAGE 0.0116 RATE 0.0148
AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB.
Mock_M 0.0203 Minami_M 0.0093 Monteiro-Riviere_NA 0.0284 Masson_P 0.0166
Phillips_AP 0.0125 Hoskin_FC 0.0092 Smith_WJ 0.0219 Kovach_IM 0.0137
Welkos_SL 0.0083 Benschop_HP 0.0090 Lindsay_CD 0.0214 Schramm_VL 0.0094
Turnbull_PC 0.0071 Raushel_FM 0.0084 Sawyer_TW 0.0146 Barak_D 0.0076
Fouet_A 0.0067 Wild_JR 0.0075 Meier_HL 0.0139 Broomfield_CA 0.0072
TOPIC 178 TOPIC 40 TOPIC 89 TOPIC 104
PubMed-Query Author Model
• P. M. Lindeque, South Africa
TOPICS
- Topic 1: water, natural, foci, environmental, source (prob = 0.33)
- Topic 2: anthracis, anthrax, bacillus, spores, cereus (prob = 0.13)
- Topic 3: species, sp, isolated, populations, tested (prob = 0.06)
- Topic 4: epidemic, occurred, outbreak, persons (prob = 0.06)
- Topic 5: positive, samples, negative, tested (prob = 0.05)
PAPERS
- Vaccine-induced protections against anthrax in cheetah
- Airborne movement of anthrax spores from carcass sites in the Etosha National Park
- Ecology and epidemiology of anthrax in the Etosha National Park
- Serology and anthrax in humans, livestock, and wildlife
PubMed-Query: Topics by Country
ISRAEL, n = 196 authors
- TOPIC 188 (p = 0.049): BIOLOGICAL, AGENTS, THREAT, BIOTERRORISM, WEAPONS, POTENTIAL, ATTACK, CHEMICAL, WARFARE, ANTHRAX
- TOPIC 6 (p = 0.045): INJURY, INJURIES, WAR, TERRORIST, MILITARY, MEDICAL, VICTIMS, TRAUMA, BLAST, VETERANS
- TOPIC 133 (p = 0.043): HEALTH, PUBLIC, CARE, SERVICES, EDUCATION, NATIONAL, COMMUNITY, INFORMATION, PREVENTION, LOCAL
- TOPIC 104 (p = 0.027): HD, MUSTARD, EXPOSURE, SM, SULFUR, SKIN, EXPOSED, AGENT, EPIDERMAL, DAMAGE
- TOPIC 159 (p = 0.025): EMERGENCY, RESPONSE, MEDICAL, PREPAREDNESS, DISASTER, MANAGEMENT, TRAINING, EVENTS, BIOTERRORISM, LOCAL

CHINA, n = 1775 authors
- TOPIC 177 (p = 0.045): SARS, RESPIRATORY, SEVERE, COV, SYNDROME, ACUTE, CORONAVIRUS, CHINA, KONG, PROBABLE
- TOPIC 7 (p = 0.026): RENAL, HFRS, VIRUS, SYNDROME, FEVER, HEMORRHAGIC, HANTAVIRUS, HANTAAN, PUUMALA, HANTAVIRUSES
- TOPIC 79 (p = 0.024): FINDINGS, CHEST, CT, LUNG, CLINICAL, PULMONARY, ABNORMAL, INVOLVEMENT, COMMON, RADIOGRAPHIC
- TOPIC 49 (p = 0.024): METHODS, RESULTS, CONCLUSION, OBJECTIVE, CONCLUSIONS, BACKGROUND, STUDY, OBJECTIVES, INVESTIGATE, DESIGN
- TOPIC 197 (p = 0.023): PATIENTS, HOSPITAL, PATIENT, ADMITTED, TWENTY, HOSPITALIZED, CONSECUTIVE, PROSPECTIVELY, DIAGNOSED, PROGNOSIS
3 of 300 example topics (TASA)
WORD PROB. WORD PROB. WORD PROB.
PLAY 0.0601 MUSIC 0.0903 PLAY 0.1358
PLAYS 0.0362 DANCE 0.0345 BALL 0.1288
STAGE 0.0305 SONG 0.0329 GAME 0.0654
MOVIE 0.0288 PLAY 0.0301 PLAYING 0.0418
SCENE 0.0253 SING 0.0265 HIT 0.0324
ROLE 0.0245 SINGING 0.0264 PLAYED 0.0312
AUDIENCE 0.0197 BAND 0.0260 BASEBALL 0.0274
THEATER 0.0186 PLAYED 0.0229 GAMES 0.0250
PART 0.0178 SANG 0.0224 BAT 0.0193
FILM 0.0148 SONGS 0.0208 RUN 0.0186
ACTORS 0.0145 DANCING 0.0198 THROW 0.0158
DRAMA 0.0136 PIANO 0.0169 BALLS 0.0154
REAL 0.0128 PLAYING 0.0159 TENNIS 0.0107
CHARACTER 0.0122 RHYTHM 0.0145 HOME 0.0099
ACTOR 0.0116 ALBERT 0.0134 CATCH 0.0098
ACT 0.0114 MUSICAL 0.0134 FIELD 0.0097
MOVIES 0.0114 DRUM 0.0129 PLAYER 0.0096
ACTION 0.0101 GUITAR 0.0098 FUN 0.0092
SET 0.0097 BEAT 0.0097 THROWING 0.0083
SCENES 0.0094 BALLET 0.0096 PITCHER 0.0080
TOPIC 82 TOPIC 166 TOPIC 77
Word sense disambiguation (numbers show topic assignments)
A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 (for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ... He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077... Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 Jim296. Don180 comes040 into the house038. Don180 and Jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....
Finding unusual papers for an author
Perplexity = exp[entropy(words | model)]
= measure of surprise for model on data
Can calculate perplexity of unseen documents, conditioned on the model for a particular author (see the sketch below)
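In code, this per-author perplexity might look like the following sketch (using single point estimates Θ and Φ; averaging over multiple samples is omitted):

import numpy as np

def author_perplexity(doc_word_ids, theta_a, phi):
    """Perplexity of one document under author a's topic mixture.

    theta_a[j] = p(topic j | author a), phi[w, j] = p(word w | topic j)
    """
    p_w = phi[doc_word_ids, :] @ theta_a     # p(w_i | author a) for each token
    entropy = -np.mean(np.log(p_w))          # average negative log-likelihood
    return np.exp(entropy)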
Papers and Perplexities: M_Jordan
Factorial Hidden Markov Models: 687
Learning from Incomplete Data: 702
MEDIAN PERPLEXITY: 2567
Defining and Handling Transient Fields in Pjama: 14555
An Orthogonally Persistent JAVA: 16021

Papers and Perplexities: T_Mitchell
Explanation-based Learning for Mobile Robot Perception: 1093
Learning to Extract Symbolic Knowledge from the Web: 1196
MEDIAN PERPLEXITY: 2837
Text Classification from Labeled and Unlabeled Documents using EM: 3802
A Method for Estimating Occupational Radiation Dose…: 8814
Author prediction with CiteSeer
• Task: predict the (single) author of new CiteSeer abstracts (see the sketch below)
• Results:
- For 33% of documents, author guessed correctly
- Median rank of true author = 26 (out of 85,000)
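A simple version of this prediction scores every candidate author by the likelihood of the document's words (a sketch; the actual experiments may also average over Gibbs samples):

import numpy as np

def rank_authors(doc_word_ids, theta, phi):
    """Rank authors by log p(words | author), assuming one author per document.

    theta[a, j] = p(topic j | author a), phi[w, j] = p(word w | topic j)
    """
    p_w_given_a = phi[doc_word_ids, :] @ theta.T    # (n_tokens, n_authors)
    log_lik = np.log(p_w_given_a).sum(axis=0)       # sum over tokens
    return np.argsort(-log_lik)                     # most likely author first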
Who wrote what?
A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms
This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions
Written by (1): Scholkopf_B
Written by (2): Darwiche_A
Test of model: 1) artificially combine abstracts from different authors; 2) check whether assignment is to the correct original author
The Author-Topic Browser
[Screenshots: querying on author Pazzani_M; querying on a topic relevant to the author; querying on a document written by the author]
Stability of Topics
• Content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs)
• However,
- Majority of topics are stable over processing time
- Majority of topics can be aligned across runs (see the sketch below)
• Topics appear to represent genuine structure in data
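One way to do such an alignment is a greedy match on symmetric KL distance between topic-word distributions (a sketch of a hypothetical helper, not the code used to produce the figures below):

import numpy as np

def align_topics(phi1, phi2, eps=1e-12):
    """Greedy topic alignment across two runs via symmetric KL distance.

    phi1, phi2: (V, T) topic-word matrices; returns match[i] = run-2 topic
    aligned with run-1 topic i.
    """
    p, q = phi1.T + eps, phi2.T + eps                  # rows are topics
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    T = p.shape[0]
    dist = np.array([kl(p[i], q) + kl(q, p[i]) for i in range(T)])
    match, used = np.full(T, -1), set()
    for i in np.argsort(dist.min(axis=1)):             # most confident rows first
        j = next(c for c in np.argsort(dist[i]) if c not in used)
        match[i] = j
        used.add(j)
    return match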
Comparing NIPS topics from the same Markov chain
[Figure: matrix of KL distances between topics at t1 = 1000 iterations and re-ordered topics at t2 = 2000 iterations; best KL = 0.54, worst KL = 4.78]
Best-matched topic pair (t1 vs. t2):
t1: ANALOG .043, CIRCUIT .040, CHIP .034, CURRENT .025, VOLTAGE .023, VLSI .022, INPUT .018, OUTPUT .018, CIRCUITS .015, FIGURE .014, PULSE .012, SYNAPSE .011, SILICON .011, CMOS .009, MEAD .008
t2: ANALOG .044, CIRCUIT .040, CHIP .037, VOLTAGE .024, CURRENT .023, VLSI .023, OUTPUT .022, INPUT .019, CIRCUITS .015, PULSE .012, SYNAPSE .012, SILICON .011, FIGURE .010, CMOS .009, GATE .009

Worst-matched topic pair (t1 vs. t2):
t1: FEEDBACK .040, ADAPTATION .034, CORTEX .025, REGION .016, FIGURE .015, FUNCTION .014, BRAIN .013, COMPUTATIONAL .013, FIBER .012, FIBERS .011, ELECTRIC .011, BOWER .010, FISH .010, SIMULATIONS .009, CEREBELLAR .009
t2: ADAPTATION .051, FIGURE .033, SIMULATION .026, GAIN .025, EFFECTS .016, FIBERS .014, COMPUTATIONAL .014, EXPERIMENT .014, FIBER .013, SITES .012, RESULTS .012, EXPERIMENTS .012, ELECTRIC .011, SITE .009, NEURO .009
Gibbs Sampler Stability (NIPS data)
New Applications / Future Work
• Reviewer recommendation
- “Find reviewers for this set of grant proposals who are active in relevant topics and have no conflicts of interest”
• Change detection/monitoring
- Which authors are on the leading edge of new topics?
- Characterize the “topic trajectory” of this author over time
• Author identification
- Who wrote this document? Incorporation of stylistic information (stylometry)
• Additions to the model
- Modeling citations
- Modeling topic persistence in a document
- …
Summary
• Topic models are a versatile probabilistic model for text data
• Author-topic models are a very useful generalization
- Equivalent to the topic model with 1 different author per document
- Learning has linear time complexity
• Gibbs sampling is practical on very large data sets
• Experimental results
- On multiple large complex data sets, the resulting topic-word and author-topic models are quite interpretable
- Results appear stable relative to sampling
• Numerous possible applications…
• Current model is quite simple… many extensions possible
Further Information
• www.datalab.uci.edu
- Steyvers et al., ACM SIGKDD 2004
- Rosen-Zvi et al., UAI 2004
• www.datalab.uci.edu/author-topic
- JAVA demo of online browser
- additional tables and results
BACKUP SLIDES
Author-Topics Model
[Plate diagram: author set a -> author x -> topic z -> word w; author-topic distributions and topic-word distributions; plates over n words and D documents]
Topics Model: Topics, no Authors
[Plate diagram: topic z -> word w; document-topic distributions and topic-word distributions; plates over n words and D documents]
Author Model: Authors, no Topics
[Plate diagram: author set a -> author x -> word w; author-word distributions; plates over n words and D documents]
Comparison Results
• Train models on part of a new document and predict remaining words
• Without having seen any words from the new document, author-topic information helps in predicting words from that document
• Topics model is more flexible in adapting to the new document after observing a number of words
[Figure: perplexity of new words vs. number of observed words in the document, comparing the Author model, the Topics model, and the Author-Topics model]
Latent Semantic Analysis (Landauer & Dumais, 1997)
Words with similar co-occurrence patterns across documents end up with similar vector representations.
word/document counts -> SVD -> high-dimensional space

        Doc1  Doc2  Doc3  …
RIVER     34     0     0  …
STREAM    12     0     0  …
BANK       5    19     6  …
MONEY      0    16     1  …
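In code, LSA reduces to a truncated SVD of this count matrix (a sketch on the tiny example above; real LSA usually applies a log/entropy weighting first):

import numpy as np

# Word/document count matrix from the table above.
X = np.array([[34, 0, 0],     # RIVER
              [12, 0, 0],     # STREAM
              [ 5, 19, 6],    # BANK
              [ 0, 16, 1]],   # MONEY
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vecs = U[:, :2] * s[:2]              # 2-D LSA word vectors

for name, v in zip(["RIVER", "STREAM", "BANK", "MONEY"], word_vecs):
    print(f"{name:7s} {v.round(2)}")      # similar usage -> similar vectors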
LSA
• Geometric
• Partially generative
• Dimensions are not interpretable
• Little flexibility to expand model (e.g., syntax)

Topics
• Probabilistic
• Fully generative
• Topic dimensions are often interpretable
• Modular language of Bayes nets / graphical models
Modeling syntax and semantics (Steyvers, Griffiths, Blei, and Tenenbaum)
[Composite model: probabilistic topics capture semantics (long-range, document-specific dependencies); a 3rd-order HMM captures syntax (short-range dependencies constant across all documents)]