On cluster stability
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
15th International Conference on Scientometrics & Informetrics
Istanbul, Turkey, June 30, 2015
2
Introduction
• A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)
• A clustering technique can be used to force each publication to be assigned to a cluster
• However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?
3
Example: Waltman and Van Eck (2012)
4
Cluster stability
• To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters
• Essentially, a cluster is stable if it is insensitive to small changes in the underlying data
• Bootstrapping is used to make small changes in the data
5
Identification of stable clusters: Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap citation networks:
– A bootstrap citation network is a weighted variant of the original citation network in which each edge has an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
• In each bootstrap citation network, perform clustering
• For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
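The four bullets above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the talk: the function names are hypothetical, the citation network is treated as undirected, the toy network is made up, and the clustering step is a deliberately trivial stand-in (connected components over edges with non-zero weight) that should be replaced by whichever clustering technique is actually used.

```python
import math
import random
from itertools import combinations


def poisson1(rng):
    """Draw an integer from a Poisson distribution with mean 1 (Knuth's method)."""
    limit = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def bootstrap_network(edges, rng):
    """Give every edge of the original network a Poisson(1) integer weight."""
    return {edge: poisson1(rng) for edge in edges}


def components_clustering(nodes, weighted_edges):
    """ASSUMPTION: stand-in clustering, not the method used in the talk.
    Clusters are the connected components over edges with weight > 0."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (u, v), weight in weighted_edges.items():
        if weight > 0:
            parent[find(u)] = find(v)
    return {n: find(n) for n in nodes}


def co_clustering_proportions(nodes, edges, n_bootstraps=100, seed=0,
                              cluster_fn=components_clustering):
    """For each pair of publications, the proportion of bootstrap
    clustering results in which both are in the same cluster."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(nodes), 2))
    counts = dict.fromkeys(pairs, 0)
    for _ in range(n_bootstraps):
        labels = cluster_fn(nodes, bootstrap_network(edges, rng))
        for a, b in pairs:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: count / n_bootstraps for pair, count in counts.items()}


# Made-up toy network: two triangles joined by a single citation link.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
proportions = co_clustering_proportions(nodes, edges)
```

Enumerating all pairs is quadratic in the number of publications, so this form only suits toy examples. For networks of the size discussed in this talk, one would presumably count only pairs that share a cluster in at least one bootstrap result.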
6
[Figure. Panels: Original network; Bootstrap networks; Clustering; Clustered bootstrap networks; Weighted network (edge weights between 0.1 and 1.0)]
7
Identification of stable clusters: Step 2
• Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results
• Identify connected components in the newly created network
• Each connected component represents a stable cluster
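Step 2 can be sketched in Python as thresholding followed by a connected-components search. This is an illustration only: the function name, the `min_size` parameter, and the example proportions are assumptions made for this sketch and do not come from the talk.

```python
from collections import defaultdict


def stable_clusters(proportions, threshold=0.9, min_size=1):
    """Connected components of the network that keeps an edge between two
    publications when their co-clustering proportion is >= threshold.

    `proportions` maps (publication, publication) pairs to values in [0, 1].
    `min_size` is a hypothetical convenience for dropping small components.
    """
    adjacency = defaultdict(set)
    nodes = set()
    for (a, b), p in proportions.items():
        nodes.update((a, b))
        if p >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first search collecting everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            clusters.append(component)
    return clusters


# Made-up co-clustering proportions for six publications.
example = {
    ("a", "b"): 1.0, ("b", "c"): 0.9, ("a", "c"): 0.9,
    ("c", "d"): 0.4, ("d", "e"): 0.95, ("e", "f"): 0.5,
}
clusters = stable_clusters(example, threshold=0.9)
# clusters == [{"a", "b", "c"}, {"d", "e"}, {"f"}]
```

In this sketch a publication with no sufficiently strong link ends up as a singleton component, like "f" above. Whether such singletons count as stable clusters or as unassigned publications is a choice the sketch leaves open; passing `min_size=2` drops them.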
8
[Figure. Panels: Weighted network; Binary network; Connected components; Stable clusters]
9
Data
• Library & Information Sciences (LIS):
– Time period: 1996-2013
– Publications: 31,534
– Citation links: 131,266
• Astrophysics (Berlin dataset):
– Time period: 2003-2010
– Publications: 101,828
– Citation links: 924,171
10
Cluster stability LIS
11
Stable clusters LIS (resolution 2)
12
Stable clusters LIS (resolution 2)
13
Cluster stability Berlin
14
Cluster stability
LIS Berlin
15
Conclusions
• If we want an accurate and detailed clustering, we need to accept a clustering that does not cover all publications
• Publications that do not clearly belong to one of the main topics in a field cannot be assigned to a cluster
• Cluster stability analysis can be used to distinguish between meaningful and non-meaningful assignments of publications to clusters
16
Thank you for your attention!
17
References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0