Cluster stability
Nees Jan van Eck and Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University
Workshop “Comparison of Algorithms”, Amsterdam
April 20, 2015
Problem statement
• A clustering technique can be used to obtain highly
detailed clustering results (i.e., a large number of
clusters)
• A clustering technique can be used to force each
publication to be assigned to a cluster
• However, in a highly detailed clustering, is the
assignment of publications to clusters still
meaningful?
• The assignment of a publication to a cluster may be
based on very little information (e.g., a single
citation relation)
Example: Waltman and Van Eck (2012)
Cluster stability
• To ensure that publications are assigned to clusters
in a meaningful way, we introduce the notion of
stable clusters
• Essentially, a cluster is stable if it is insensitive to
small changes in the underlying data
• Bootstrapping is used to make small changes in the
data
• There is no formal statistical framework
• To some extent, this resembles the stability
intervals in the CWTS Leiden Ranking
Identification of stable clusters:
Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap
citation networks
• In each bootstrap citation network, perform
clustering:
– Clustering technique of Waltman and Van Eck (2012)
– User-defined resolution parameter
– Smart local moving algorithm of Waltman and Van Eck (2013)
• For each pair of publications, calculate the
proportion of the bootstrap clustering results in
which the publications are in the same cluster
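The last bullet of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bootstrap clustering results are already available as one dictionary per bootstrap network (publication id to cluster label), and the function name `co_clustering_proportions` and the toy ids `p1` to `p3` are hypothetical. The clustering itself is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Return {(a, b): proportion} for publication pairs with a < b.

    clusterings: list of dicts mapping publication id -> cluster label,
    one dict per bootstrap clustering result (assumed input format).
    Pairs that never share a cluster are omitted (proportion 0).
    """
    counts = defaultdict(int)
    for assignment in clusterings:
        # Group publications by cluster label within this bootstrap run.
        members = defaultdict(list)
        for pub, label in assignment.items():
            members[label].append(pub)
        # Count every pair of publications that shares a cluster.
        for pubs in members.values():
            for a, b in combinations(sorted(pubs), 2):
                counts[(a, b)] += 1
    n_runs = len(clusterings)
    return {pair: count / n_runs for pair, count in counts.items()}


# Toy example with four bootstrap clustering results.
runs = [
    {"p1": 0, "p2": 0, "p3": 1},
    {"p1": 0, "p2": 0, "p3": 0},
    {"p1": 5, "p2": 5, "p3": 7},
    {"p1": 2, "p2": 3, "p3": 3},
]
proportions = co_clustering_proportions(runs)
```

Only pairs that co-occur at least once are stored. Enumerating pairs is quadratic in cluster size, so this sketch suits small clusters, which is the setting of a highly detailed clustering.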
Identification of stable clusters:
Step 2
• Create a network of publications with an edge
between two publications if the publications are in
the same cluster in at least a certain proportion
(e.g., 0.9) of the bootstrap clustering results
• Identify connected components in the newly
created network
• Each connected component represents a stable
cluster
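Step 2 can be sketched with a union-find structure. This is an illustrative sketch under assumptions: the pairwise proportions are given as a dictionary keyed by publication pairs, and the function name `stable_clusters` and the toy data are hypothetical. The threshold 0.9 is the example value mentioned above.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Return stable clusters as sorted lists, largest first.

    publications: iterable of publication ids.
    proportions: {(a, b): proportion of bootstrap clustering results
                  in which a and b are in the same cluster}.
    """
    parent = {p: p for p in publications}

    def find(p):
        # Follow parent links to the root, halving paths on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # An edge exists when the proportion reaches the threshold;
    # merging the endpoints builds the connected components.
    for (a, b), proportion in proportions.items():
        if proportion >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    components = {}
    for p in publications:
        components.setdefault(find(p), []).append(p)
    return sorted(
        (sorted(c) for c in components.values()),
        key=lambda c: (-len(c), c),
    )


pubs = ["p1", "p2", "p3", "p4"]
props = {("p1", "p2"): 0.95, ("p2", "p3"): 0.92, ("p3", "p4"): 0.40}
clusters = stable_clusters(pubs, props)
```

In this toy example `p1` and `p3` end up in the same stable cluster through `p2`, even though they have no direct edge, while `p4` remains a component of size one.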
Non-parametric bootstrapping
• Sample with replacement from the set of all citation
relations between publications
• Make sure to obtain a sample that is of the same
size as the original set of citation relations
• Some citation relations will occur multiple times in
the sample, others won’t occur in it at all
• Based on the sampled citation relations, create a
bootstrap citation network
• Edges have integer weights in this network
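The sampling procedure above can be sketched as follows, assuming the citation relations are given as a list of (citing, cited) pairs. The function name `nonparametric_bootstrap` and the toy edges are hypothetical.

```python
import random
from collections import Counter


def nonparametric_bootstrap(edges, rng):
    """Return {edge: integer weight} for one bootstrap citation network.

    Draws len(edges) citation relations with replacement; the weight of
    an edge is the number of times it was drawn. Edges that were never
    drawn are absent from the result.
    """
    sample = rng.choices(edges, k=len(edges))
    return Counter(sample)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
network = nonparametric_bootstrap(edges, random.Random(42))
```

The total edge weight always equals the number of citation relations in the original network, because the sample has the same size as the original set.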
Parametric bootstrapping
• A bootstrap citation network is a weighted variant
of the original citation network, with each edge
having an integer weight drawn from a Poisson
distribution with mean 1 (cf. Rosvall & Bergstrom,
2009)
• Total edge weight in the bootstrap citation network
will be approximately equal to the number of edges
in the original network
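• For large networks, parametric and non-parametric
bootstrapping give approximately the same results,
because the number of times a citation relation is
sampled is then approximately Poisson distributed
with mean 1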
• We use parametric bootstrapping
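A minimal sketch of parametric bootstrapping, using only the standard library. The Poisson sampler below uses Knuth's multiplication method, which is an implementation choice for this illustration and not something stated in the slides. The names `poisson` and `parametric_bootstrap` and the toy chain network are hypothetical.

```python
import math
import random


def poisson(rng, mean=1.0):
    """Draw one Poisson variate (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def parametric_bootstrap(edges, rng):
    """Return {edge: integer weight} with weights drawn from Poisson(1).

    Edges that draw weight 0 are dropped from the bootstrap network.
    """
    weights = {}
    for edge in edges:
        weight = poisson(rng)
        if weight > 0:
            weights[edge] = weight
    return weights


# Toy network: a chain of 10,000 citation relations.
edges = [(i, i + 1) for i in range(10_000)]
network = parametric_bootstrap(edges, random.Random(0))
total_weight = sum(network.values())
```

Unlike the non-parametric variant, the total edge weight is only approximately equal to the number of original edges, and each edge can be processed independently of the others.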
Data
• Library & Information Sciences (LIS):
– Time period: 1996-2013
– Publications: 31,534
– Citation links: 131,266
• Astrophysics (Berlin dataset):
– Time period: 2003-2010
– Publications: 101,828
– Citation links: 924,171
Cluster stability LIS
Stable clusters LIS (resolution 2)
Cluster stability Berlin
Cluster stability (LIS and Berlin)
Conclusions
• What is a good clustering of publications?
– High accuracy: Publications in the same cluster are topically
related
– High level of detail: It is possible to have a large number of
clusters
– Comprehensiveness: The clustering includes all publications
– Uniformity in cluster size: Clusters are of roughly the same size
• It seems impossible to obtain a clustering that has
all properties listed above
• At least one property needs to be given up
Conclusions
• Why can we not have an accurate and detailed
clustering that includes all publications?
– Consider the field of scientometrics
– We would expect an accurate and detailed clustering to have
clusters dealing with topics such as indicators, science mapping,
collaboration, patents, etc.
– However, many publications in scientometrics (e.g., case studies)
do not neatly belong to one of these topics and therefore cannot
be accurately assigned to a cluster
• If we want to have an accurate and detailed
clustering, we need to be satisfied with a clustering
that doesn’t comprehensively cover all publications
• The clustering covers only publications related to
the main topics in the fields
Conclusions
• Analysis of cluster stability offers an approach to
distinguish between meaningful and non-meaningful
assignments of publications to clusters
• Clustering based on direct citations is
computationally attractive but ignores relevant
information (e.g., bibliographic coupling)
• A post-processing procedure can be developed to
try to assign ‘isolated publications’ to stable
clusters based on additional information
• Cluster stability is a general idea that can also be
applied to other clustering approaches
References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change
in large networks. PLoS ONE, 5(1), e8694.
http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology
for constructing a publication-level classification
system of science. JASIST, 63(12), 2378-2392.
http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving
algorithm for large-scale modularity-based community
detection. European Physical Journal B, 86(11), 471.
http://dx.doi.org/10.1140/epjb/e2013-40829-0