LINK PREDICTION IN CO-AUTHORSHIP NETWORK

Le Nhat Minh (A0074403N)

Supervisor: Dongyuan Lu


Introduction

• Link prediction

• Predict future connections within the network

• Co-authorship network

• A network of collaborations among researchers, scientists, and academic writers


Introduction

• Potential applications

• Recommend experts or groups of researchers to an individual researcher.


Outline

• Problem Background

• Related Work

• Workflow

• Conclusion

• Result Analysis

• Research plan


Problem Background

• What connects researchers together?

• Given an instance of a co-authorship network:

• A researcher is connected to another if they have collaborated on at least one paper.


Problem Background

• How to predict the link?

• Based on criteria:

• Co-authorship network topology

• Researcher’s personal information

• Researcher’s papers

• Improve link prediction performance

• Recommended links should be truly relevant to the authors' interests, or at least be feasible collaborations for the researchers.


Related Work

• Link prediction problems in social networks

• Liben‐Nowell, D., & Kleinberg, J., 2007

• Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013

• In social networks, interactions among users are very dynamic, with:

• Creation of new links within a few days

• Deletion or replacement of existing links

• The two kinds of network present different features:

• Co-authorship network, characteristics of the individual researcher: citations, affiliations, institutions, ...

• Social network, characteristics of a person: marital status, age, workplace, …


• Three mainstream approaches for link prediction:

• Similarity based estimation

• Liben‐Nowell, D., & Kleinberg, J., 2007

• Maximum likelihood estimation

• Murata, T., & Moriyasu, S., 2008

• Guimerà, R., & Sales-Pardo, M., 2009

• Supervised Learning model

• Pavlov, M., & Ichise, R., 2007

• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006


Similarity Based Estimation

• Use metrics to estimate the proximity of each pair of researchers

• Rank the pairs of researchers by those proximities

• The top-ranked pairs are the likely recommendations.


Similarity Based Estimation

• Network structure based measurement

Some conventions:

$S_{XY}$: similarity between nodes X and Y

$\Gamma(X)$: set of neighbours of X

$\Gamma(Y)$: set of neighbours of Y

$k(X) = |\Gamma(X)|$: degree of node X

$k(Y) = |\Gamma(Y)|$: degree of node Y


Similarity Based Estimation

• Common Neighbor:

$S_{XY} = |\Gamma(X) \cap \Gamma(Y)|$


Similarity Based Estimation

• Jaccard’s coefficient:

$S_{XY} = \frac{|\Gamma(X) \cap \Gamma(Y)|}{|\Gamma(X) \cup \Gamma(Y)|}$


Similarity Based Estimation

• Preferential Attachment:

$S_{XY} = k(X) \cdot k(Y)$


Similarity Based Estimation

• Adamic/Adar:

$S_{XY} = \sum_{Z \in \Gamma(X) \cap \Gamma(Y)} \frac{1}{\log k(Z)}$
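A minimal Python sketch of the four local measures (common neighbours, Jaccard's coefficient, preferential attachment, Adamic/Adar), assuming the network is stored as a dictionary that maps each node to the set of its neighbours. The toy graph `adj` and all function names are illustrative assumptions, not taken from the slides.

```python
import math


def common_neighbors(adj, x, y):
    # Size of the intersection of the two neighbour sets.
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    # Shared neighbours divided by all neighbours of either node.
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def preferential_attachment(adj, x, y):
    # Product of the two node degrees.
    return len(adj[x]) * len(adj[y])


def adamic_adar(adj, x, y):
    # Each shared neighbour z contributes 1 / log k(z). A shared neighbour
    # is linked to both x and y, so k(z) >= 2 and the logarithm is positive.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


# Hypothetical toy network: A and D are not linked but share B and C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}

print(common_neighbors(adj, "A", "D"))         # 2
print(jaccard(adj, "A", "D"))                  # 1.0
print(preferential_attachment(adj, "A", "D"))  # 4
print(adamic_adar(adj, "A", "D"))              # 2 / ln(3)
```

The natural logarithm is used here; a different base only rescales the Adamic/Adar score and does not change the ranking of pairs.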


Similarity Based Estimation

• Shortest Path:

• The minimum number of edges connecting two nodes.

• PageRank:

• A random walk on the graph assigns each node the probability that it is reached. The proximity between a pair of nodes can be determined by the sum of the two nodes' PageRank scores.
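Both global measures can be sketched in a few lines of Python. The version below is only an illustration under stated assumptions: the graph is a dictionary of neighbour sets, shortest paths are found by breadth-first search, and PageRank is computed by power iteration with a damping factor of 0.85 (a common default, not a value given in the slides). All names are hypothetical.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    # Breadth-first search; returns None when the nodes are disconnected.
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj[node]:
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None


def pagerank(adj, damping=0.85, iterations=100):
    # Power iteration on an undirected graph (each edge works both ways).
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if adj[v]:
                share = damping * rank[v] / len(adj[v])
                for nb in adj[v]:
                    new_rank[nb] += share
            else:
                # Isolated node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank


def pagerank_proximity(rank, x, y):
    # Proximity of a pair as the sum of the two PageRank scores.
    return rank[x] + rank[y]


# Hypothetical toy network: a path A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
rank = pagerank(adj)
print(shortest_path_length(adj, "A", "D"))  # 3
print(pagerank_proximity(rank, "B", "C") > pagerank_proximity(rank, "A", "D"))
```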


Maximum Likelihood Estimation

• Predefine specific rules of a network

• Requires prior knowledge of the network

• The likelihood of a link between any unconnected pair of nodes is calculated according to those rules.


Supervised Learning Model

• Construct multi-dimensional feature vectors

• Feed these vectors to classifiers to optimize a target function (model training)

• Link prediction becomes a binary classification problem


Supervised Learning Model

• Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using:

• Decision Tree

• SVM (Linear Kernel)

• K nearest neighbor

• Multilayer Perceptron

• Naive Bayes

• Bagging

• Combine many classifiers (Pavlov, M., & Ichise, R., 2007)

• Decision stump + AdaBoost

• Decision Tree + AdaBoost

• SMO + AdaBoost


Summary

• Similarity based estimation

• Does not perform well

• Maximum likelihood

• Depends on the network

• Supervised learning model

• Performs better than similarity based estimation


Workflow


[Figure: workflow diagram with the labels Classifier, Model, Features]


Graph Description

• Co-authorship graph:

• Undirected graph G (V , E)

• Node or Vertex ( Author )

• Author ID

• Author Name

• Link or Edge (Co-authorship)

• Pair of author IDs

• List of publication years, each followed by the paper title (e.g. 2004: “Introduction to …”)
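One way to hold this graph description in code is sketched below. It is an assumption-laden illustration, not the author's implementation: the class `CoauthorGraph`, its method names, and the sample authors and paper titles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CoauthorGraph:
    # Node data: author ID -> author name.
    authors: dict = field(default_factory=dict)
    # Edge data: unordered pair of author IDs -> list of (year, title).
    edges: dict = field(default_factory=dict)

    def add_author(self, author_id, name):
        self.authors[author_id] = name

    def add_paper(self, year, title, author_ids):
        # Every pair of co-authors on the paper gets (or extends) a link.
        ids = sorted(set(author_ids))
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                self.edges.setdefault(frozenset((a, b)), []).append((year, title))

    def neighbors(self, author_id):
        # All authors who share at least one paper with author_id.
        return {
            other
            for pair in self.edges
            if author_id in pair
            for other in pair
            if other != author_id
        }


# Hypothetical sample data.
g = CoauthorGraph()
for author_id, name in [(1, "Author One"), (2, "Author Two"), (3, "Author Three")]:
    g.add_author(author_id, name)
g.add_paper(2004, "Paper one", [1, 2])
g.add_paper(2006, "Paper two", [1, 2, 3])

print(g.edges[frozenset((1, 2))])  # [(2004, 'Paper one'), (2006, 'Paper two')]
print(sorted(g.neighbors(1)))      # [2, 3]
```

Using a `frozenset` as the edge key keeps the graph undirected: the pair (1, 2) and the pair (2, 1) map to the same link.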


Setting up data

• Dataset is separated into 2 time spans: 2000 – 2010 and 2010 – 2013

• The first is for training, the latter is for testing.

• Currently, there are 134,307 researchers in the network 2000 – 2013.

• Crop out authors who are not available in the testing period, leaving 104,265 researchers
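The split can be sketched as follows. Two assumptions are made here and should be read as such: papers dated 2010 or earlier are placed in the training span (the slides list 2010 in both spans), and a pair counts as a testing link only if its first collaboration falls in the testing span. The function name and the toy data are hypothetical.

```python
def split_links(edge_papers, split_year=2010):
    """Split links into training links and new testing links.

    edge_papers maps a pair of author IDs to a list of (year, title) tuples.
    """
    train_links, new_test_links = set(), set()
    for pair, papers in edge_papers.items():
        years = [year for year, _title in papers]
        if any(year <= split_year for year in years):
            # The link already exists in the training span.
            train_links.add(pair)
        elif years:
            # The first collaboration falls in the testing span.
            new_test_links.add(pair)
    return train_links, new_test_links


# Hypothetical toy data, not taken from the slides.
edge_papers = {
    frozenset((1, 2)): [(2004, "Paper one"), (2012, "Paper four")],
    frozenset((2, 3)): [(2011, "Paper two")],
    frozenset((1, 3)): [(2009, "Paper three")],
}
train_links, new_test_links = split_links(edge_papers)
print(len(train_links), len(new_test_links))  # 2 1
```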


Setting up data

• Choose a subset from 104,265 researchers

• Experiment on 937 researchers

                      2000-2010   2010-2013
Real Network
  No. of nodes        104,265     104,265
  No. of links        413,691     35,558
Experiment Network
  No. of nodes        937         937
  No. of links        3,093       57


Baseline Features

• Extract features from the network structure:

• Local similarity

• Common Neighbor

• Adamic / Adar

• Preferential Attachment

• Jaccard’s coefficient

• Global similarity

• Shortest Path

• PageRank


Baseline Features

• Feature for co-authorship network

• Keyword matching (Cohen, S., & Ebel, L., 2013)

A suggested metric measures textual relevance with a TF-IDF based function.


Proposed Features

Productivity of the authors

Observe the “history” of an author

For example, at a particular node A:

i      0      1      2      3
T_i    2000   2004   2005   2006
n      3      4      6      7
m      1      2      2      3

n: No. of shared papers
m: No. of collaborators

Changes between consecutive time points:

T0 → T1: Δn = 1, Δm = 1
T1 → T2: Δn = 2, Δm = 0
T2 → T3: Δn = 1, Δm = 1


Proposed Features

Productivity of the authors

Observe the “history” of an author

The “productivity” of node A:

[Equation: P(A), written in terms of α and the per-period values n_i, m_i and T_i; the formula did not survive extraction]

α: a constant that assigns the weight of each time period


Training set

• Set up training data

• With n nodes, there are $\frac{n(n-1)}{2}$ possible links.

• Among those, distinguish two kinds of link:

• Positive link: a link that appears in the training years.

• Negative link: any remaining link, which does not exist in the training years.

Note: Avoid biased training by balancing the number of instances between the true and false labels.

• Classify all the non-existent links

• Compare with the testing data
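As a check on the arithmetic, 937 nodes give 937 × 936 / 2 = 438,516 possible links; removing the 3,093 training links leaves 435,423 non-existent links to classify, which equals the total of the confusion matrix in the experimental results (26 + 31 + 5,588 + 429,778). The sketch below builds a balanced training set by randomly undersampling the negative links. Undersampling is an assumption: the slides ask for balance but do not say how it was achieved. All names are hypothetical.

```python
import random
from itertools import combinations


def build_training_pairs(nodes, train_links, seed=0):
    # Enumerate all n(n-1)/2 unordered node pairs.
    all_pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    positives = [p for p in all_pairs if p in train_links]
    negatives = [p for p in all_pairs if p not in train_links]
    # Assumed balancing strategy: keep as many negatives as positives.
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives, sampled, negatives


# Hypothetical toy network with 5 nodes and 2 training links.
toy_links = {frozenset((0, 1)), frozenset((1, 2))}
positives, sampled_negatives, candidates = build_training_pairs(range(5), toy_links)
print(len(positives), len(sampled_negatives), len(candidates))  # 2 2 8

# Pair counts for the experiment network reported in the slides.
n = 937
possible_links = n * (n - 1) // 2
print(possible_links, possible_links - 3093)  # 438516 435423
```

The full list of negatives is also returned because, after training, every non-existent link is classified and compared with the testing data.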


Experimental Results

• Measurement of performance

• Precision: $P = \frac{26}{26 + 5588} \approx 0.005$

• Recall: $R = \frac{26}{26 + 31} \approx 0.45$

• Harmonic mean: $F_1 = \frac{2 \times 26}{2 \times 26 + 5588 + 31} \approx 0.009$

• New links to predict: 57 links

                       Prediction
                       True Link   False Link
Actual True Link       26          31
Actual False Link      5,588       429,778
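The three scores follow directly from the confusion matrix counts: 26 true positives, 5,588 false positives and 31 false negatives. A short Python check, with a hypothetical helper name:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: correct predictions among all predicted links.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: correct predictions among all links that really appear.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=26, fp=5588, fn=31)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.005 R=0.456 F1=0.009
```

The classifier recovers close to half of the 57 new links, but for every correct prediction it also proposes more than 200 links that never appear, which is why precision and F1 are so low.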


Result Analysis

• Possible reasons

• Features

• Small set of data – sampling problem

• Instances of the negative links used for training


Research Plan

• Use weighted graph with parameters:

• No. of papers

• No. of neighbors

• No. of citations

• Focus on features that specifically target the co-authorship network:

• Citations

• Institutions

• Enlarge the experiment dataset size

Thank you


References

• Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.

• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security.

• Liben‐Nowell, D., & Kleinberg, J. (2007). The link‐prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

• Pavlov, M., & Ichise, R. (2007). Finding Experts by Link Prediction in Co-authorship Networks. FEWS, 290, 42-55.

• Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257.

• Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

• Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks. arXiv preprint arXiv:1304.6257.

• Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 959-962).


• Links per year in the training set exceed links per year in the testing set:

• In the testing period, only “new” collaborations are considered.

• Any collaboration between researchers who already have a link is disregarded.

                 2000-2010   2010-2013
No. of nodes     937         937
No. of links     3,093       57


Results with different classifiers

Classifier              Precision (%)                 Recall (%)   F1 (%)
                        (Positive Predictive Value)   (Hit rate)   (Harmonic mean)
Decision Tree           0.3                           24.6         0.5
SMO                     0.5                           45.6         0.9
Bagging                 0.4                           28.1         0.7
Naive Bayes             0.2                           77.2         0.3
Multilayer Perceptron   0.4                           47.3         0.8


Proposed Feature

• The reason for proposing this feature:

• Keep track of the researcher's tendency

• Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones

• Also give a high score to prolific researchers (based on the number of published papers)


Stochastic Block Model

• Guimerà, R., & Sales-Pardo, M., 2009

$L(A \mid M) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{l_{\alpha\beta}} (1 - Q_{\alpha\beta})^{r_{\alpha\beta} - l_{\alpha\beta}}$

$Q_{\alpha\beta}$: probability that two nodes in groups α, β are connected

$l_{\alpha\beta}$: No. of edges between groups α, β

$r_{\alpha\beta}$: No. of pairs of nodes such that one node is in α and the other is in β
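To make the likelihood concrete, the sketch below evaluates it for a small hypothetical network and partition. For each pair of groups it counts the node pairs r and the edges l, then multiplies Q^l (1 - Q)^(r - l) over all group pairs. Setting Q = l / r for each group pair is an assumption made for this illustration (it is the value that maximizes the likelihood for a fixed partition); the toy network, partition and function names are likewise hypothetical.

```python
from itertools import combinations


def block_counts(edges, groups):
    # r[key]: number of node pairs spanning the two groups in key.
    # l[key]: number of those pairs that are actually linked.
    group_of = {v: g for g, members in enumerate(groups) for v in members}
    r, l = {}, {}
    for u, v in combinations(sorted(group_of), 2):
        key = tuple(sorted((group_of[u], group_of[v])))
        r[key] = r.get(key, 0) + 1
        if frozenset((u, v)) in edges:
            l[key] = l.get(key, 0) + 1
    return r, l


def likelihood(edges, groups, q=None):
    # Product over group pairs of Q^l * (1 - Q)^(r - l).
    r, l = block_counts(edges, groups)
    if q is None:
        # Assumed choice: Q = l / r for every pair of groups.
        q = {key: l.get(key, 0) / pairs for key, pairs in r.items()}
    value = 1.0
    for key, pairs in r.items():
        links = l.get(key, 0)
        value *= q[key] ** links * (1.0 - q[key]) ** (pairs - links)
    return value


# Hypothetical toy network: a triangle {1, 2, 3}, a linked pair {4, 5},
# and one edge (3, 4) between the two groups.
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (4, 5), (3, 4)]}
groups = [{1, 2, 3}, {4, 5}]
print(likelihood(edges, groups))  # (1/6) * (5/6)**5
```

Both groups are fully connected internally, so they contribute a factor of 1; the only uncertainty comes from the six between-group pairs, of which one is linked.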


Stochastic Block Model

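[Figure: example network with nodes 1 to 7 and the node pair X, Y]

$M = \{\{1, 2, 3\}, \{4, 5, 6, 7\}\}$

[Equation: worked likelihood L for this partition; not recoverable from the transcript]

The reliability of an individual link is:

$R_{xy} = L(A_{xy} = 1 \mid A) = \frac{\int L(A_{xy} = 1 \mid M)\, L(A \mid M)\, p(M)\, dM}{\int L(A \mid M')\, p(M')\, dM'}$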