
Efficient Topic-based Unsupervised Name Disambiguation. Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles. JCDL 2007. Advisor: Hsin-Hsi Chen. Reporter: Y.H Chang. 2008-03-21




Page 2:


Outline

Introduction

Related Work

Method: Topic-based PLSA (Probabilistic Latent Semantic Analysis), Topic-based LDA (Latent Dirichlet Allocation), Clustering

Experiment

Conclusion

Page 3:


Name ambiguity: people sharing the same name, misspellings, and name abbreviations.

Searching Google for “Yang Song”: the first page shows five different people’s home pages.

In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents.

Page 4:


Method

Learning topic-name matrix by PLSA and LDA (feature set)

Topic disambiguation with agglomerative clustering method

Within similar topics: generate name-name matrix

People disambiguation with another agglomerative clustering method
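
The two clustering stages listed above can be illustrated with a minimal sketch of agglomerative clustering. The slides do not state the linkage criterion, the distance measure, or the stopping rule, so the average linkage, the Euclidean distance, and the `threshold` parameter below are assumptions chosen only for illustration, and `agglomerative` is a hypothetical helper name rather than the authors' code.

```python
import math

def euclidean(u, v):
    # Assumed distance between two rows (e.g. topic vectors of two names).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def agglomerative(vectors, threshold, dist=euclidean):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per row and repeatedly merges the closest
    pair of clusters until the closest pair is farther apart than
    `threshold`. Returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(dist(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Toy rows: two names dominated by topic 0, two dominated by topic 1.
rows = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = agglomerative(rows, threshold=0.5)
```

The same routine could be reused for the second stage by passing a distance derived from the name-name matrix as `dist`.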


Page 6:

Related Work

[19] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. 2003 (transitivity problem)

[9] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. 2005 (complexity O(N^2))

[12] J. Huang, S. Ertekin, and C. L. Giles. Efficient name disambiguation for large-scale databases. 2006

[2] I. Bhattacharya and L. Getoor. A latent dirichlet model for unsupervised entity resolution. 2006

The aforementioned work mainly tackled the name disambiguation problem using the metadata records of the authors. This paper solves the name disambiguation problem in a novel way, by accounting for the topic distribution of the authors and adopting unsupervised methods.

Page 7:


Learning topic-name matrix by PLSA and LDA (feature set)

Topic disambiguation with agglomerative clustering method

Page 8:


From a statistical point of view, Hofmann (1999) presented an alternative to LSA, Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), which discovers sets of latent variables.

The model is described as an aspect model, assuming the existence of hidden factors underlying the co-occurrences among two sets of objects.

Page 9:


The goal of model fitting for PLSA is to estimate the parameters P(z), P(a|z), P(z|d), P(w|z), given a set of observations (d, a, w). The standard way to estimate the probability values is the Expectation-Maximization (EM) algorithm.

z: topic of document

a: person's name
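
The EM fitting described above can be sketched as follows. The slide's equations are not preserved in this transcript, so the factorization P(a, w | d) = Σ_z P(z|d) P(a|z) P(w|z) used here is an assumption (P(z) does not appear explicitly in this parameterization), and `plsa_em` and the toy observations are hypothetical illustrations rather than the authors' implementation.

```python
import random

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def plsa_em(obs, n_docs, n_names, n_words, n_topics, n_iter=30, seed=0):
    """Fit P(z|d), P(a|z), P(w|z) to (d, a, w) index triples with EM."""
    rng = random.Random(seed)

    def rand_dist(n):
        # Random strictly positive starting distribution.
        return normalize([rng.random() + 0.1 for _ in range(n)])

    p_z_d = [rand_dist(n_topics) for _ in range(n_docs)]
    p_a_z = [rand_dist(n_names) for _ in range(n_topics)]
    p_w_z = [rand_dist(n_words) for _ in range(n_topics)]
    eps = 1e-12  # keeps every normalization well defined

    for _ in range(n_iter):
        acc_zd = [[eps] * n_topics for _ in range(n_docs)]
        acc_az = [[eps] * n_names for _ in range(n_topics)]
        acc_wz = [[eps] * n_words for _ in range(n_topics)]
        for d, a, w in obs:
            # E-step: posterior over topics for this observation.
            post = normalize([
                p_z_d[d][z] * p_a_z[z][a] * p_w_z[z][w] + eps
                for z in range(n_topics)
            ])
            # Accumulate expected counts for the M-step.
            for z in range(n_topics):
                acc_zd[d][z] += post[z]
                acc_az[z][a] += post[z]
                acc_wz[z][w] += post[z]
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(r) for r in acc_zd]
        p_a_z = [normalize(r) for r in acc_az]
        p_w_z = [normalize(r) for r in acc_wz]
    return p_z_d, p_a_z, p_w_z

# Toy data: document 0 mentions name 0 with words 0-1,
# document 1 mentions name 1 with words 2-3.
obs = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (1, 1, 3)]
p_z_d, p_a_z, p_w_z = plsa_em(obs, n_docs=2, n_names=2, n_words=4, n_topics=2)
```

The rows of `p_a_z` form the topic-name matrix that the later clustering steps operate on.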



Page 11:

PLSA: Predicting New Name Appearances

Additionally, PLSA has no natural way to assign probability to new documents.

Therefore, to predict the topics of new documents (with potentially new names) after training, the estimated P(w|z) parameters are used to estimate P(a|z) for new names a in a test document d_new through a “folding-in” process.

Specifically, the E-step is the same as equation (4); however, the M-step maintains the original P(w|z) and only updates P(a|z) as well as P(z|d).
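
A minimal sketch of the folding-in step, under the same assumed factorization P(a, w | d) = Σ_z P(z|d) P(a|z) P(w|z): the trained P(w|z) is held fixed while P(a|z) and P(z|d) are re-estimated for the new documents. Equation (4) is not reproduced in this transcript, so the E-step below is an assumed form, and `fold_in` and the toy values are hypothetical.

```python
import random

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def fold_in(obs, p_w_z, n_docs, n_names, n_iter=30, seed=0):
    """Estimate P(z|d) and P(a|z) for new documents, keeping P(w|z) fixed.

    obs: (d, a, w) index triples for the new documents only.
    p_w_z: trained word distributions, one row per topic (never modified).
    """
    rng = random.Random(seed)
    n_topics = len(p_w_z)
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_a_z = [normalize([rng.random() + 0.1 for _ in range(n_names)])
             for _ in range(n_topics)]
    eps = 1e-12

    for _ in range(n_iter):
        acc_zd = [[eps] * n_topics for _ in range(n_docs)]
        acc_az = [[eps] * n_names for _ in range(n_topics)]
        for d, a, w in obs:
            # E-step: same posterior as in training.
            post = normalize([
                p_z_d[d][z] * p_a_z[z][a] * p_w_z[z][w] + eps
                for z in range(n_topics)
            ])
            for z in range(n_topics):
                acc_zd[d][z] += post[z]
                acc_az[z][a] += post[z]
        # M-step: update only P(z|d) and P(a|z); P(w|z) stays as trained.
        p_z_d = [normalize(r) for r in acc_zd]
        p_a_z = [normalize(r) for r in acc_az]
    return p_z_d, p_a_z

# Hypothetical trained word distributions for 2 topics over 3 words.
trained_p_w_z = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
new_obs = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 1, 2), (1, 1, 2), (1, 1, 1)]
new_p_z_d, new_p_a_z = fold_in(new_obs, trained_p_w_z, n_docs=2, n_names=2)
```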

Page 12:


Blei et al. (2003) introduced a Bayesian hierarchical model, Latent Dirichlet Allocation (LDA), in which each document has its own topic distribution, drawn from a conjugate Dirichlet prior that remains the same for all documents in a collection.

Page 13:


In our model, names (authors) and words are not directly related, i.e., each topic can generate a set of names and a set of words simultaneously with different probabilities, allowing more freedom to the model in parameter estimation.

The generative process draws:

a multinomial distribution φ_z for each topic z

a multinomial distribution θ_d for each document d

a topic z_di from the multinomial distribution θ_d

a name a_di from the multinomial distribution λ_{z_di}

a word w_di from the multinomial distribution φ_{z_di}
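
The generative process above can be simulated with a short sketch. It assumes symmetric Dirichlet priors on θ_d, φ_z and λ_z with the values the deck quotes for Gibbs sampling (50/K, 0.01 and 0.1); treating 0.1 as the prior on the name distributions is an interpretation, and `sample_dirichlet` and `generate_corpus` are hypothetical helper names.

```python
import random

def sample_dirichlet(rng, concentration, size):
    # Symmetric Dirichlet draw via normalized Gamma variates.
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Guard against underflow with very small concentrations:
        # fall back to a point mass on a random component.
        draws = [0.0] * size
        draws[rng.randrange(size)] = 1.0
        return draws
    return [x / total for x in draws]

def generate_corpus(n_docs, doc_len, n_topics, n_names, n_words,
                    beta=0.01, name_prior=0.1, seed=0):
    rng = random.Random(seed)
    alpha = 50.0 / n_topics
    # One word distribution phi_z and one name distribution lambda_z per topic.
    phi = [sample_dirichlet(rng, beta, n_words) for _ in range(n_topics)]
    lam = [sample_dirichlet(rng, name_prior, n_names) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Per-document topic distribution theta_d.
        theta = sample_dirichlet(rng, alpha, n_topics)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]
            a = rng.choices(range(n_names), weights=lam[z])[0]  # name from lambda_z
            w = rng.choices(range(n_words), weights=phi[z])[0]  # word from phi_z
            doc.append((z, a, w))
        corpus.append(doc)
    return corpus

corpus = generate_corpus(n_docs=3, doc_len=5, n_topics=2, n_names=4, n_words=6)
```

Note that the name and the word at each position depend on each other only through the shared topic, which is the independence the slide describes.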

Page 14:


In the following section, we apply the Gibbs sampling framework to get around the intractability problem of parameter estimation.

Page 15:

Gibbs sampling for the LDA model

Note that in our case, we do not estimate the parameters α, β and λ. For simplicity and performance, they are fixed at 50/K, 0.01 and 0.1 respectively.
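
A collapsed Gibbs sampler for this kind of model can be sketched as below. The sampling equation on the slide is not preserved in this transcript, so the conditional used here (document-topic count times smoothed name and word probabilities) is an assumed form derived from the generative process, not necessarily the authors' exact update. The hyperparameters follow the fixed values quoted above; `gibbs_sample` is a hypothetical name.

```python
import random

def gibbs_sample(docs, n_topics, n_names, n_words, n_iter=50,
                 beta=0.01, name_prior=0.1, seed=0):
    """docs: list of documents, each a list of (name_id, word_id) pairs.

    Returns the final topic assignments and the smoothed topic-name matrix.
    """
    rng = random.Random(seed)
    alpha = 50.0 / n_topics
    n_dk = [[0] * n_topics for _ in docs]             # topic counts per document
    n_ka = [[0] * n_names for _ in range(n_topics)]   # name counts per topic
    n_kw = [[0] * n_words for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                              # positions per topic

    def update(d, a, w, k, delta):
        n_dk[d][k] += delta
        n_ka[k][a] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial assignment of a topic to every position.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for a, w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            update(d, a, w, k, +1)
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, (a, w) in enumerate(doc):
                # Remove the current assignment, then resample it.
                update(d, a, w, assign[d][i], -1)
                weights = [
                    (n_dk[d][k] + alpha)
                    * (n_ka[k][a] + name_prior) / (n_k[k] + n_names * name_prior)
                    * (n_kw[k][w] + beta) / (n_k[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                update(d, a, w, k, +1)

    topic_name = [
        [(n_ka[k][a] + name_prior) / (n_k[k] + n_names * name_prior)
         for a in range(n_names)]
        for k in range(n_topics)
    ]
    return assign, topic_name

toy_docs = [[(0, 0), (0, 1), (0, 0)], [(1, 2), (1, 3), (1, 2)]]
assign, topic_name = gibbs_sample(toy_docs, n_topics=2, n_names=2, n_words=4)
```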

Page 16:

Clustering

Learning topic-name matrix by PLSA and LDA (feature set)

Topic disambiguation with agglomerative clustering method

Within similar topics: generate name-name matrix

People disambiguation with another agglomerative clustering method

Levenshtein distance (defined as Le(x, y)) is used as the measurement of the similarity between two names x and y.
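
For reference, the Levenshtein distance itself can be computed with the standard dynamic-programming recurrence, sketched below. The formula that turns Le(x, y) into a similarity score is not preserved in this transcript, so only the distance is shown; the example names are illustrative.

```python
def levenshtein(x, y):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string x into string y."""
    # previous[j] holds the distance between the processed prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            current.append(min(
                previous[j] + 1,         # delete cx
                current[j - 1] + 1,      # insert cy
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

distance = levenshtein("Y. Song", "Yang Song")
```

Abbreviated and full forms of the same name tend to be a few edits apart, which is why an edit distance is a natural basis for comparing name strings.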


Page 18:


Web Appearances of Person Names: 12 person names (187 different people), including SRI employees and professors, are submitted as queries to the Google search engine; the first 100 pages are then retrieved for each query. Furthermore, to eliminate the bias towards longer documents, only the first 200 words are used in each example.

Author Appearances in Scientific Docs: We obtained the 9 most ambiguous author names from the entire data set, each of which has at least 20 name variations. In the worst case (C. Chen), 103 authors share the same name.

Page 19:


Evaluation: pair-level pairwise F1 score F1P and cluster-level pairwise F1 score F1C.

F1P is defined as the harmonic mean of the pairwise precision pp and pairwise recall pr.

Likewise, F1C is the harmonic mean of cluster precision cp and cluster recall cr.
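
The pair-level score can be sketched as follows. The slides do not spell out how pp and pr are counted, so this sketch assumes the usual convention: a pair of items is "predicted" if both fall in the same predicted cluster and "correct" if they also share a true cluster. The function names are hypothetical, and the cluster-level counterparts cp and cr are not defined in the transcript, so they are left out.

```python
from itertools import combinations

def same_cluster_pairs(labels):
    # All index pairs (i, j), i < j, that carry the same cluster label.
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }

def harmonic_mean(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pairwise_f1(true_labels, pred_labels):
    """Return (pp, pr, F1P) for a predicted clustering against the truth."""
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)
    pp = correct / len(pred_pairs) if pred_pairs else 0.0
    pr = correct / len(true_pairs) if true_pairs else 0.0
    return pp, pr, harmonic_mean(pp, pr)

# Four mentions: truth is {0,1} and {2,3}; the prediction merges 0, 1 and 2.
pp, pr, f1p = pairwise_f1([0, 0, 1, 1], [0, 0, 0, 1])
```

In this toy case one of the three predicted pairs is correct and one of the two true pairs is recovered, so pp = 1/3, pr = 1/2 and F1P = 0.4.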

Page 20:

Figure: author-topic relationships in the CiteSeer data set extracted by the topic-based PLSA model.

Page 23:


As a result, we empirically tested our models for the entire CiteSeer data set with more than 750,000 documents.

PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors (roughly 1.8 and 3.0 days, respectively).


Page 25:

Conclusion

We have proposed a novel framework for unsupervised name disambiguation by leveraging graphical Bayesian models and a hierarchical clustering method.

Although our primary focus in this paper is on person name disambiguation, our general approach should be equally applicable to other entity disambiguation domains.

Potential applications include noun phrase disambiguation, e.g., “tiger” as an animal, “tiger” as a golf player, “tiger” the baseball team, “tiger” the operating system or “tiger” for the new Java version.