On cluster stability

Nees Jan van Eck

Centre for Science and Technology Studies (CWTS), Leiden University

15th International Conference on Scientometrics & Informetrics

Istanbul, Turkey, June 30, 2015

Page 2: On cluster stability

Introduction

• A clustering technique can be used to obtain highly detailed clustering results (i.e., a large number of clusters)

• A clustering technique can be used to force each publication to be assigned to a cluster

• However, in a highly detailed clustering, is the assignment of publications to clusters still meaningful?

Page 3: On cluster stability

Example: Waltman and Van Eck (2012)

Page 4: On cluster stability

Cluster stability

• To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters

• Essentially, a cluster is stable if it is insensitive to small changes in the underlying data

• Bootstrapping is used to make small changes in the data

Page 5: On cluster stability

Identification of stable clusters: Step 1

• Collect the citation network of publications

• Create a large number (e.g., 100) of bootstrap citation networks:

– A bootstrap citation network is a weighted variant of the original citation network in which each edge has an integer weight drawn from a Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)

• In each bootstrap citation network, perform clustering

• For each pair of publications, calculate the proportion of the bootstrap clustering results in which the publications are in the same cluster
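
The Step 1 procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study. The function names (`poisson1`, `toy_clustering`, `co_clustering_proportions`) are hypothetical, and `toy_clustering` (connected components over edges with positive weight) is only a stand-in for a real clustering technique, which would use the bootstrap edge weights in a modularity-style quality function.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def poisson1(rng):
    """Draw an integer from a Poisson distribution with mean 1 (Knuth's method)."""
    limit = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def toy_clustering(nodes, weighted_edges):
    """ASSUMPTION: placeholder clustering, not the method from the slides.

    Returns a node -> cluster label mapping, where clusters are the connected
    components formed by edges whose bootstrap weight is greater than zero.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (u, v), w in weighted_edges.items():
        if w > 0:
            parent[find(u)] = find(v)
    return {n: find(n) for n in nodes}


def co_clustering_proportions(nodes, edges, n_bootstrap=100,
                              cluster_fn=toy_clustering, seed=0):
    """Proportion of bootstrap clusterings in which each pair shares a cluster.

    Pairs that never share a cluster are omitted (their proportion is 0).
    """
    rng = random.Random(seed)
    together = defaultdict(int)
    for _ in range(n_bootstrap):
        # Bootstrap network: every original edge gets a Poisson(1) weight.
        weights = {edge: poisson1(rng) for edge in edges}
        labels = cluster_fn(nodes, weights)
        for u, v in combinations(sorted(nodes), 2):
            if labels[u] == labels[v]:
                together[(u, v)] += 1
    return {pair: count / n_bootstrap for pair, count in together.items()}


# Hypothetical toy network, for illustration only.
example_nodes = ["a", "b", "c", "d"]
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
example_proportions = co_clustering_proportions(example_nodes, example_edges)
for pair, proportion in sorted(example_proportions.items()):
    print(pair, proportion)
```

The loop over all pairs is quadratic in the number of publications. For a large citation network one would instead count only pairs that actually share a cluster in a given bootstrap result.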

Page 6: On cluster stability

[Figure: schematic of Step 1. Panels: Original network; Bootstrap networks; Clustered bootstrap networks (after clustering); Weighted network.]

Page 7: On cluster stability

Identification of stable clusters: Step 2

• Create a network of publications with an edge between two publications if the publications are in the same cluster in at least a certain proportion (e.g., 0.9) of the bootstrap clustering results

• Identify connected components in the newly created network

• Each connected component represents a stable cluster
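
Step 2 can be sketched in Python as follows, assuming the co-clustering proportions from Step 1 are available as a dictionary keyed by publication pairs. The function name `stable_clusters`, the union-find implementation, and the example proportions are illustrative assumptions and are not taken from the slides.

```python
def stable_clusters(nodes, proportions, threshold=0.9):
    """Threshold the co-clustering proportions and return connected components.

    `proportions` maps (u, v) pairs to the share of bootstrap clusterings in
    which u and v are in the same cluster. Each returned component is one
    stable cluster; components are sorted by size, largest first.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Binary network: keep an edge only if the proportion reaches the threshold.
    for (u, v), proportion in proportions.items():
        if proportion >= threshold:
            parent[find(u)] = find(v)

    components = {}
    for n in nodes:
        components.setdefault(find(n), []).append(n)
    return sorted((sorted(c) for c in components.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical proportions, for illustration only.
example_nodes = [1, 2, 3, 4, 5, 6]
example_proportions = {(1, 2): 1.0, (2, 3): 0.9, (3, 4): 0.4,
                       (4, 5): 0.9, (5, 6): 0.1}
print(stable_clusters(example_nodes, example_proportions))
# prints [[1, 2, 3], [4, 5], [6]]
```

In this sketch a publication with no edge at or above the threshold ends up as a singleton component. One possible reading, stated here as an assumption, is to treat such singletons as publications that are not assigned to any stable cluster.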

Page 8: On cluster stability

[Figure: schematic of Step 2. Panels: Weighted network; Binary network; Connected components; Stable clusters.]

Page 9: On cluster stability

Data

• Library & Information Sciences (LIS):

– Time period: 1996-2013

– Publications: 31,534

– Citation links: 131,266

• Astrophysics (Berlin dataset):

– Time period: 2003-2010

– Publications: 101,828

– Citation links: 924,171

Page 10: On cluster stability

Cluster stability LIS

Page 11: On cluster stability

Stable clusters LIS (resolution 2)

Page 12: On cluster stability

Stable clusters LIS (resolution 2)

Page 13: On cluster stability

Cluster stability Berlin

Page 14: On cluster stability

Cluster stability

[Figure panels: LIS; Berlin]

Page 15: On cluster stability

Conclusions

• If we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications

• Publications that do not clearly belong to one of the main topics in a field cannot be assigned to a cluster

• Cluster stability analysis can be used to distinguish between meaningful and non-meaningful assignments of publications to clusters

Page 16: On cluster stability

Thank you for your attention!

Page 17: On cluster stability

References

Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large networks. PLoS ONE, 5(1), e8694. http://dx.doi.org/10.1371/journal.pone.0008694

Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. JASIST, 63(12), 2378-2392. http://dx.doi.org/10.1002/asi.22748

Waltman, L., & Van Eck, N.J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471. http://dx.doi.org/10.1140/epjb/e2013-40829-0