Measuring scholarly impact: Methods and practice
Link prediction with the linkpred tool
Raf Guns, University of [email protected]
If you want to follow along…
Download and install Anaconda Python from http://continuum.io/downloads
Download the example data from http://bit.ly/1HpZvIa
“A pair of scientists who have five mutual previous collaborators, for instance, are about twice as likely to collaborate as a pair with only two, and about 200 times as likely as a pair with none.” (Newman, 2001; emphasis mine)
Agenda
What is link prediction? (and why?)
Example data
The linkpred tool
Link prediction in practice
Conclusion
What is link prediction?
Networks
Networks in informetrics
Citation: papers, journals, authors, patents, …
Collaboration: authors, institutions, countries, …
Co-citation, bibliographic coupling, web links, and so on
Definitions
A network G = (V, E) consists of:
A set of nodes or vertices V
A set of links or edges E
Each link connects two nodes from V
Neighbourhood N(v) of node v: all nodes connected to v
Node degree |N(v)| of v: number of connected nodes = number of items in set N(v)
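The definitions above can be made concrete with a minimal sketch in plain Python. The toy edge list and node names are hypothetical and are not taken from the example data.

```python
# Toy undirected network; node names are hypothetical.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def neighbourhoods(edges):
    """Map each node v to its neighbourhood N(v), stored as a set."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


N = neighbourhoods(edges)

# The degree of v is the number of items in N(v).
degree = {v: len(N[v]) for v in N}

print(sorted(N["c"]))  # ['a', 'b', 'd']
print(degree["c"])     # 3
```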
Change in networks
Most networks are not static, e.g. in a collaboration network:
New authors appear
Old authors disappear
New collaborations are initiated
Previous collaborators stop collaborating
Change in networks
Some changes are more plausible than others
Change in networks
Different mechanisms have been identified
Assortativity: similar nodes are more likely to connect
Preferential attachment: well-connected nodes attract more new connections
Cf. cumulative advantage, Matthew effect
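Preferential attachment is commonly turned into a prediction score by multiplying the degrees of the two nodes (cf. the DegreeProduct predictor). The sketch below is one plain-Python reading of that idea on a hypothetical toy network. It is not linkpred's own code.

```python
from itertools import combinations

# Toy undirected network; node names are hypothetical.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)


def degree_product(u, v):
    """Preferential attachment score: product of the two node degrees."""
    return len(nbrs[u]) * len(nbrs[v])


# Candidate links are the node pairs that are not linked yet.
candidates = [(u, v) for u, v in combinations(sorted(nbrs), 2)
              if v not in nbrs[u]]

# Highest score first: pairs of well-connected nodes are ranked on top.
ranked = sorted(candidates, key=lambda pair: degree_product(*pair), reverse=True)
for u, v in ranked:
    print(u, v, degree_product(u, v))
```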
The link prediction question
Liben-Nowell and Kleinberg (2003, 2007):
“Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future?”
Link prediction steps
1. Data gathering
2. Preprocessing
3. Prediction
4. Evaluation
Steps
Why link prediction?
You want to know which links will appear in the future
Recommendation
Finding missing links
Finding ‘anomalous’ links (correct or incorrect)
Evaluating network formation and evolution models
Our example data
Data
Guns and Rousseau (2013)
Collaboration between cities in Africa and South-Asia
Topic: malaria
In three consecutive time periods
Available as three Pajek network files: http://bit.ly/1HpZvIa
1997-2001
2002-2006
2007-2011
The linkpred tool
About
https://github.com/rafguns/linkpred
Cross-platform (written in Python)
Open source: BSD license
Command-line tool!
Alternative: LPmade (https://github.com/rlichtenwalter/LPmade)
How and where to get linkpred
1. Install Anaconda Python: http://continuum.io/downloads
2. Open command-line window
3. Run command:
> pip install https://github.com/rafguns/linkpred/archive/stable.zip
4. Wait until installation is finished
Basic usage
> linkpred
Should display brief usage instructions
> linkpred --help
Displays more complete help output
Basic usage
> linkpred training-network-file --predictors predictor --output output-type
Read the network in training-network-file, predict using predictor and give output of output-type
> linkpred training-network-file test-network-file --predictors predictor --output output-type
Read the network in training-network-file, compare with test-network-file, predict using predictor and give output of output-type
Link prediction in practice
Preprocessing
Nodes may also appear and disappear
Restrict to the intersection of the node sets of training and test network
Only where a test network is available
Restrict by degree (default: only discard isolated nodes)
Directed networks: not supported
Convert to undirected first
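The preprocessing ideas above can be sketched in plain Python. This illustrates the logic only and is not linkpred's own implementation. The helper names (`node_set`, `restrict`, `to_undirected`) and the toy networks are hypothetical.

```python
def node_set(edges):
    """All nodes that occur in at least one link."""
    return {node for edge in edges for node in edge}


def restrict(edges, keep):
    """Keep only links whose two endpoints are both in `keep`."""
    return [(u, v) for u, v in edges if u in keep and v in keep]


def to_undirected(arcs):
    """Collapse directed arcs (u, v) and (v, u) into one undirected link."""
    return sorted({tuple(sorted(arc)) for arc in arcs})


# Toy training and test networks; node names are hypothetical.
training = [("a", "b"), ("b", "c"), ("c", "x")]
test = [("a", "b"), ("a", "c"), ("c", "y")]

# Restrict both networks to the nodes they have in common.
common = node_set(training) & node_set(test)
training_r = restrict(training, common)
test_r = restrict(test, common)

print(training_r)  # [('a', 'b'), ('b', 'c')]
print(test_r)      # [('a', 'b'), ('a', 'c')]
print(to_undirected([("b", "a"), ("a", "b"), ("c", "a")]))
```

Because the networks are stored as edge lists here, isolated nodes never appear in them. A representation that keeps an explicit node list would need a separate step to discard isolates.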
Prediction: choosing predictors
Local: AdamicAdar, AssociationStrength, CommonNeighbours, Cosine, DegreeProduct, Jaccard, MaxOverlap, MinOverlap, NMeasure, Pearson, ResourceAllocation
Global: GraphDistance, Katz, RootedPageRank, SimRank
Other: Community, Copy, Random
Local predictors
Tendency towards triadic closure
Number of common neighbours is a simple but powerful predictor.
Local predictors
Common neighbours
Normalizations of common neighbours Jaccard coefficient, cosine measure…
Adamic/Adar (Adamic & Adar, 2003)
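As a rough illustration of the local predictors named above, here is a sketch using their common textbook definitions for unweighted networks. linkpred's own implementations may differ in details such as weighting. The toy network is hypothetical.

```python
import math

# Toy undirected network; node names are hypothetical.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("b", "e"), ("d", "e")]

nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)


def common_neighbours(u, v):
    """Number of nodes linked to both u and v."""
    return len(nbrs[u] & nbrs[v])


def jaccard(u, v):
    """Common neighbours normalized by the size of the union of neighbourhoods."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0


def cosine(u, v):
    """Common neighbours normalized by the geometric mean of the degrees."""
    denominator = math.sqrt(len(nbrs[u]) * len(nbrs[v]))
    return len(nbrs[u] & nbrs[v]) / denominator if denominator else 0.0


def adamic_adar(u, v):
    """Common neighbours, each weighted by 1 / log(degree).

    A common neighbour of two distinct nodes has degree >= 2, so log > 0.
    """
    return sum(1 / math.log(len(nbrs[z])) for z in nbrs[u] & nbrs[v])


# a and b are not linked but share the neighbours c and d.
for predictor in (common_neighbours, jaccard, cosine, adamic_adar):
    print(predictor.__name__, round(predictor("a", "b"), 4))
```

Adamic/Adar gives rare common neighbours (low degree) more weight than hubs, which the plain count and its normalizations do not do.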
Weighted networks
In weighted networks, links have weights (e.g. number of joint papers, number of citations…)
Link weights: often ignored!
Most predictors in linkpred can use link weights
General idea: higher link weight (e.g., more common papers), stronger connection
Global predictors
Graph distance: lowest number of links needed to travel from a to b
Problem: small-world phenomenon (most node pairs are only a few links apart, so distances discriminate poorly)
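For an unweighted network, graph distance can be computed with a breadth-first search. The following is a minimal sketch. The function name and toy network are hypothetical, and how a distance is turned into a prediction score (for instance by negating it) is left open here.

```python
from collections import deque


def graph_distance(nbrs, source, target):
    """Lowest number of links between source and target (None if unreachable)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in nbrs[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy network: a path a-b-c-d plus a separate pair e-f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)

print(graph_distance(nbrs, "a", "d"))  # 3
print(graph_distance(nbrs, "a", "e"))  # None
```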
Global predictors
Katz (1953):
$W(i, j) = \sum_{k=1}^{\infty} \beta^{k} a_{ij}^{(k)}$
$a_{ij}^{(1)}$: 1 if i and j are linked, 0 otherwise
$a_{ij}^{(k)}$: number of walks with length k from i to j
$\beta$: parameter, “probability of effectiveness of a single link”
Longer walks: lower effectiveness
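The Katz measure sums over walks of every length k, weighting each walk by the parameter raised to the power k. A minimal sketch, assuming the sum is truncated at a maximum walk length: the names `beta` and `max_length` and the toy matrix are assumptions for illustration, not linkpred parameters.

```python
def katz_scores(adj, beta=0.1, max_length=5):
    """Truncated Katz scores from an adjacency matrix (list of lists of 0/1)."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    walks = [row[:] for row in adj]  # walks[i][j]: number of walks of length k
    weight = beta                    # beta ** k for the current k
    for _ in range(max_length):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * walks[i][j]
        # Walks of length k + 1: multiply by the adjacency matrix once more.
        walks = [[sum(walks[i][m] * adj[m][j] for m in range(n))
                  for j in range(n)]
                 for i in range(n)]
        weight *= beta
    return scores


# Toy path network 0 - 1 - 2 (hypothetical).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]

scores = katz_scores(adj, beta=0.5, max_length=2)
print(scores[0][2])  # 0.25: one walk of length 2, weighted by 0.5 ** 2
```

With small values of the parameter, longer walks contribute very little, so truncating the sum after a few steps is usually a reasonable approximation.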
Global predictors
Rooted PageRank
Global predictors
SimRank (Jeh & Widom, 2002)
“Objects that link to similar objects are similar themselves.”
Starting point: a node is maximally similar to itself: W(v, v) = 1
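A minimal sketch of the iterative SimRank computation for an undirected, unweighted network, following the recursive definition of Jeh and Widom: the similarity of two different nodes is the decay factor c times the average similarity of their neighbours. The fixed number of iterations and the toy network are assumptions for illustration.

```python
def simrank(nbrs, c=0.8, iterations=10):
    """Iterative SimRank on an undirected network given as {node: set of neighbours}."""
    nodes = sorted(nbrs)
    # Starting point: a node is maximally similar to itself, and to nothing else.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not nbrs[u] or not nbrs[v]:
                    new[u][v] = 0.0
                else:
                    total = sum(sim[x][y] for x in nbrs[u] for y in nbrs[v])
                    new[u][v] = c * total / (len(nbrs[u]) * len(nbrs[v]))
        sim = new
    return sim


# Toy star network: hub h is linked to a and b (hypothetical).
nbrs = {"h": {"a", "b"}, "a": {"h"}, "b": {"h"}}
sim = simrank(nbrs, c=0.8)
print(sim["a"]["b"])  # a and b both link to h, so they are similar to each other
```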
Demo
Predict
Save predictions to file, then import into e.g. Excel
Evaluation
Step 4: ‘How well does it work?’
How? Compare to a ‘known good’ test network
Four groups:
                Link             Non-link
Predicted       True positive    False positive
Not predicted   False negative   True negative
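The four groups can be counted with simple set operations. In this sketch the candidate pairs are all node pairs of a hypothetical toy network. In practice, pairs that are already linked in the training network are usually left out of the candidates.

```python
from itertools import combinations


def as_pair(u, v):
    """Order-independent representation of an undirected link."""
    return tuple(sorted((u, v)))


# Hypothetical toy data.
nodes = ["a", "b", "c", "d"]
predicted = {as_pair("a", "b"), as_pair("a", "c")}
actual = {as_pair("a", "b"), as_pair("c", "d")}  # links in the test network
all_pairs = {as_pair(u, v) for u, v in combinations(nodes, 2)}

tp = len(predicted & actual)              # predicted and a link
fp = len(predicted - actual)              # predicted but not a link
fn = len(actual - predicted)              # a link but not predicted
tn = len(all_pairs - predicted - actual)  # neither predicted nor a link

print(tp, fp, fn, tn)  # 1 1 1 3
```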
Evaluation
Simply save results to text file:
--output cache-evaluations
Create chart: recall-precision or ROC
Evaluation: recall-precision
Precision: fraction of correct predictions
Recall: fraction of correctly predicted links
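In terms of the four groups, precision is TP / (TP + FP) and recall is TP / (TP + FN). A recall-precision chart follows both values as more and more of the top-ranked predictions are accepted. The sketch below uses a hypothetical ranked list of predictions.

```python
def precision_recall(ranked, actual, k):
    """Precision and recall when only the k top-ranked predictions are accepted."""
    top = set(ranked[:k])
    tp = len(top & actual)
    precision = tp / len(top) if top else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical predictions, best first, and the links of the test network.
ranked = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "d")]
actual = {("a", "b"), ("c", "d")}

for k in range(1, len(ranked) + 1):
    print(k, precision_recall(ranked, actual, k))
```

Accepting more predictions can only keep recall the same or raise it, while precision typically drops, which is the trade-off the chart makes visible.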
Evaluation: ROC
False positive rate: fraction of non-links that are incorrectly predicted as links
True positive rate: fraction of links that are correctly predicted (= recall)
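A ROC curve plots the true positive rate, TP / (TP + FN), against the false positive rate, FP / (FP + TN), while walking down the ranked predictions. The sketch below is minimal and assumes that every actual link is among the candidate pairs and that there is at least one link and one non-link. The data and the `n_candidates` argument are hypothetical.

```python
def roc_points(ranked, actual, n_candidates):
    """(false positive rate, true positive rate) after each ranked prediction."""
    positives = len(actual)
    negatives = n_candidates - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for pair in ranked:
        if pair in actual:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical predictions, best first; 6 candidate pairs, 2 of them actual links.
ranked = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "d")]
actual = {("a", "b"), ("c", "d")}

points = roc_points(ranked, actual, n_candidates=6)
for fpr, tpr in points:
    print(fpr, tpr)
```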
Profiles
A simple way to save and reuse the configuration of a complex prediction run (options, predictors, parameters…)
Usage example:
> linkpred network-file --profile profile.yml
Format: YAML, see https://en.wikipedia.org/wiki/YAML
Example profile
predictors:
- name: AdamicAdar
  displayname: Adamic/Adar
- name: GraphDistance
  displayname: Graph distance
  parameters:
    weight: weight
- name: SimRank
  displayname: SimRank (c=0.4)
  parameters:
    c: 0.4
- name: SimRank
  displayname: SimRank (c=0.8)
  parameters:
    c: 0.8
output:
- cache-predictions
- recall-precision
Conclusion
About link prediction
Link prediction is possible because link formation is not a purely random process
Limitations:
Unaware of social and other circumstantial factors
Which predictor is ‘best’ for a concrete situation?
Trade-off between prediction accuracy and non-triviality
About linkpred
Relatively simple but powerful
Limitations:
Not suitable for very large and/or dense networks
Does not incorporate more complex setups like predictor combinations, machine learning etc.
All results can be exported for analysis in other software (cache-*)
Open source: contributions welcome!