text-based measures of document diversity date : 2014/02/12 source : kdd’13 authors : kevin...
Text-Based Measures of Document Diversity
Date: 2014/02/12
Source: KDD’13
Authors: Kevin Bache, David Newman, and Padhraic Smyth
Advisor: Dr. Jia-Ling, Koh
Speaker: Shun-Chen, Cheng
Outline
Introduction
Method
Experiment
Conclusions
Introduction
Interdisciplinary vs. single disciplines
The hypothesis: interdisciplinary research can lead to new discoveries at a faster rate than traditional research projects conducted within single disciplines.
Introduction
Goal: quantify how diverse a document is in terms of its content.
Task: assign a diversity score to each document.
Framework
corpus → LDA (learn T topics for D documents) → D x T matrix → topic co-occurrence similarity measures → Rao’s diversity measure → diversity score of each document
T: topics, D: documents
Topic-based Diversity (1)
LDA: collapsed Gibbs sampler.
Use the topic-word assignments from the final iteration of the Gibbs sampler.
$n_{dj}$ is the number of word tokens in document d that are assigned to topic j.
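Example of creating the D x T matrix:

|    | t1 | t2 | t3 |
|----|----|----|----|
| d1 | 9  | 0  | 1  |
| d2 | 0  | 10 | 6  |
| d3 | 2  | 15 | 8  |
| d4 | 1  | 2  | 16 |

$n_{13}$ is the entry in row d1, column t3.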
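The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `build_doc_topic_counts` and the `assignments` input (one list of 0-based topic indices per document, taken from the final Gibbs sample) are hypothetical names chosen for the example.

```python
from collections import Counter


def build_doc_topic_counts(assignments, num_topics):
    """Return a D x T list of lists whose entry [d][j] is n_dj.

    `assignments[d]` is assumed to hold the topic index (0-based) assigned
    to each word token of document d in the final Gibbs sample.
    """
    matrix = []
    for doc_topics in assignments:
        counts = Counter(doc_topics)
        # Topics that never occur in the document get a count of 0.
        matrix.append([counts.get(j, 0) for j in range(num_topics)])
    return matrix


# Toy input: document 0 has 9 tokens in topic 0 and 1 token in topic 2;
# document 1 has 10 tokens in topic 1 and 6 tokens in topic 2.
toy_assignments = [
    [0] * 9 + [2],
    [1] * 10 + [2] * 6,
]
doc_topic = build_doc_topic_counts(toy_assignments, num_topics=3)
print(doc_topic)
```

Each row sums to $n_d$, the number of word tokens in that document.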
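Topic-based Diversity (2)
Rao’s diversity for a document d:

$$\mathrm{div}(d) = \sum_{i=1}^{T} \sum_{j=1}^{T} P(i \mid d)\, P(j \mid d)\, \delta(i, j)$$

$$P(j \mid d) = \frac{n_{dj}}{n_d}$$

$n_{dj}$: the value of entry (d, j) in the D x T matrix
$n_d$: the number of word tokens in d
$\delta(i, j)$: measure of the distance between topic i and topic j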
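Topic-based Diversity (3)
Example of Rao’s diversity:

|    | t1 | t2 | t3 |
|----|----|----|----|
| d1 | 9  | 0  | 1  |
| d2 | 0  | 10 | 6  |
| d3 | 2  | 15 | 8  |
| d4 | 1  | 2  | 16 |

Assume distances based on cosine similarity, with $\delta(i, j) = \delta(j, i)$:
$\delta(1, 2) = 0.2$
$\delta(1, 3) = 0.7$
$\delta(2, 3) = 0.1$

div(1) = 0.126
div(2) = 0.04688
div(3) = 0.09344
div(4) = 0.08199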
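A minimal Python sketch of Rao's diversity for one row of the D x T matrix follows. It uses the pairwise distances 0.2, 0.7 and 0.1 from this example and assumes $\delta(i, i) = 0$, which the slide does not state. The function name `rao_diversity` and the two count rows are illustrative choices.

```python
def rao_diversity(topic_counts, distance):
    """Rao's diversity: sum over i, j of P(i|d) * P(j|d) * delta(i, j).

    `topic_counts` is one row of the D x T matrix (the n_dj values);
    `distance` is a T x T matrix of topic-to-topic distances.
    """
    n_d = sum(topic_counts)
    if n_d == 0:
        return 0.0  # an empty document has no topical spread
    probs = [count / n_d for count in topic_counts]  # P(j|d) = n_dj / n_d
    num_topics = len(probs)
    return sum(
        probs[i] * probs[j] * distance[i][j]
        for i in range(num_topics)
        for j in range(num_topics)
    )


# Symmetric distances from the example; the zero diagonal is an assumption.
delta = [
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.1],
    [0.7, 0.1, 0.0],
]

for row in ([0, 10, 6], [2, 15, 8]):
    print(row, rao_diversity(row, delta))
```

A document whose tokens all fall in one topic scores 0 under this assumption, and the score grows as mass spreads over topics that are far apart.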
Topic Co-occurrence Similarity
$s(i, j)$: similarity between topic i and topic j
Cosine similarity:
Probabilistic-based:

$$P(d) = \frac{n_d}{N}$$

N: number of word tokens in the corpus.
$n_{dj}$: the value of entry (d, j) in the D x T matrix
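The cosine formula itself did not survive in this transcript. The sketch below therefore assumes the usual reading: each topic is represented by its column of the D x T matrix, and $s(i, j)$ is the cosine of the angle between columns i and j. The function name `topic_cosine_similarity` is hypothetical.

```python
import math


def topic_cosine_similarity(doc_topic, i, j):
    """Cosine similarity between columns i and j of a D x T count matrix.

    Assumption: topics are compared through their per-document counts n_di
    and n_dj, so topics that co-occur in the same documents score high.
    """
    dot = sum(row[i] * row[j] for row in doc_topic)
    norm_i = math.sqrt(sum(row[i] ** 2 for row in doc_topic))
    norm_j = math.sqrt(sum(row[j] ** 2 for row in doc_topic))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a topic with no tokens co-occurs with nothing
    return dot / (norm_i * norm_j)


# Two topics that always appear together in the same proportion.
together = [[1, 2], [2, 4]]
# Two topics that never appear in the same document.
apart = [[3, 0], [0, 4]]
print(topic_cosine_similarity(together, 0, 1))
print(topic_cosine_similarity(apart, 0, 1))
```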
Similarity to Distance
Similarity measures (cosine similarity, probability-based) → similarity-to-distance transformation → $\delta(i, j)$
$\delta(i, j)$ means distance.
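The transcript does not say which transformation maps a similarity to a distance. As one illustrative possibility only, the sketch below uses $\delta(i, j) = 1 - s(i, j)$, which suits similarities in [0, 1] such as cosine similarity on count vectors. Both the choice of transformation and the name `similarity_to_distance` are assumptions.

```python
def similarity_to_distance(similarity):
    """Turn a T x T similarity matrix into a distance matrix.

    Illustrative assumption: delta(i, j) = 1 - s(i, j). Other
    transformations are possible; this is not taken from the slides.
    """
    num_topics = len(similarity)
    return [
        [1.0 - similarity[i][j] for j in range(num_topics)]
        for i in range(num_topics)
    ]


toy_similarity = [
    [1.0, 0.8],
    [0.8, 1.0],
]
toy_distance = similarity_to_distance(toy_similarity)
print(toy_distance)
```

With this choice a topic is at distance 0 from itself, and a symmetric similarity matrix yields a symmetric distance matrix.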
Experiment
Dataset
PubMed Central Open Access dataset (PubMed)
NSF Awards from 2007 to 2012 (NSF)
Association for Computational Linguistics Anthology Network (ACL)
Topic Modeling (LDA)
MALLET
$\alpha = 0.05 \cdot N / (D \cdot T)$, $\beta = 0.01$
5,000 iterations. Keep only the final sample in the chain.
T = 10, 30, 100 and 300 topics.
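To make the hyperparameter setting concrete, the sketch below computes α for each value of T. It assumes the α expression is read as 0.05 · N / (D · T), with N the number of word tokens in the corpus and D the number of documents. The corpus sizes in the loop are made-up illustration values, not figures from the datasets above.

```python
def lda_hyperparameters(num_tokens, num_docs, num_topics):
    """Return (alpha, beta) under the reading alpha = 0.05 * N / (D * T).

    N / (D * T) is the average number of tokens per document per topic,
    so alpha scales with document length and shrinks as T grows.
    """
    alpha = 0.05 * num_tokens / (num_docs * num_topics)
    beta = 0.01
    return alpha, beta


# Hypothetical corpus: 1,000 documents with 1,000 tokens each.
for num_topics in (10, 30, 100, 300):
    alpha, beta = lda_hyperparameters(
        num_tokens=1_000_000, num_docs=1_000, num_topics=num_topics
    )
    print(num_topics, alpha, beta)
```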
Pseudo-Documents
Reason: there is no ground-truth measure for a document's diversity.
Pseudo-documents were created, half designed to have high diversity and half designed to have low diversity.
High-diversity pseudo-document: manually select two relatively unrelated journals, A and B, then randomly select an article from A and one from B and combine them into one pseudo-document.
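The high-diversity construction can be sketched as follows. This is an illustration under stated assumptions: each journal is a plain list of article texts, the two selected articles are combined by simple concatenation, and the function name `make_high_diversity_pseudo_document` is hypothetical.

```python
import random


def make_high_diversity_pseudo_document(journal_a, journal_b, rng):
    """Combine one random article from each of two unrelated journals.

    `journal_a` and `journal_b` are assumed to be lists of article texts
    from a manually chosen pair of relatively unrelated journals.
    """
    article_a = rng.choice(journal_a)
    article_b = rng.choice(journal_b)
    # Assumption: the two articles are joined by concatenating their text.
    return article_a + "\n" + article_b


# Toy journals with placeholder article texts.
journal_a = ["protein folding kinetics", "gene expression in yeast"]
journal_b = ["parsing with dependency grammars"]
rng = random.Random(0)  # seeded so the selection is repeatable
print(make_high_diversity_pseudo_document(journal_a, journal_b, rng))
```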
Experiment
ROC Curve
AUC: Area under the ROC curve
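AUC can be computed directly from diversity scores and labels without drawing the curve: it equals the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below assumes label 1 marks a high-diversity pseudo-document and label 0 a low-diversity one. The function name `auc` and the toy scores are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    item has the higher score; a tie contributes 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy diversity scores: both high-diversity items outrank both low ones.
toy_scores = [0.9, 0.8, 0.1, 0.2]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a thousand or so pseudo-documents; a rank-based formulation would scale better.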
Experiment
AUC scores for different diversity measures based on 1000 pseudo-documents from PubMed
Experiment
Evaluating transformations
Experiment
Most diverse NSF grant proposals
Conclusions
Presented an approach for quantifying the diversity of individual documents in a corpus based on their text content.
More data-driven, performing the equivalent of learning journal categories by learning topics from text.
Can be run on any collection of text documents, even without a prior categorization scheme.
A possible direction for future work is temporal document diversity.