PageRank extensions and a new way to measure scientific impact
OUTLINE
I. Context
II. What is PageRank?
III. PageRank extensions
IV. Pira Algorithm
V. Scientific impact measures
VI. Database
VII. Future works
1. SNOW
Create a relevant ranking method for dynamic contexts (forums, Twitter, Facebook, …)
Example: forum, YouTube
2. PIRA (PUBLICATION INDUCED RESEARCH ANALYSES)
Validation project for Snow
Create a new PageRank-like method to rank authors and scientific publications
Author paper graph: Author and Paper vertices, connected by Authorship and Citation edges
1. HOW DOES GOOGLE WORK?
Crawling --> DB --> Calculate each website's score (PageRank) --> DB
Website PageRank-score
Google 1000000
Yahoo 900000
…
Le Monde 1000
…
1. HOW DOES GOOGLE WORK?
Query “David Beckham” --> keywords “David” + “beckham” --> DB
Websites related to the keywords
Website Request-score
davidbeckham.com 1000
lequipe.fr 999
…
Ex: Request-score = number of keyword appearances * PR-score
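The example formula above can be sketched in a few lines. This is only an illustration of the slide's toy formula, not how any real search engine scores requests; the function name `request_score` and the whitespace-based keyword matching are assumptions.

```python
def request_score(page_text, keywords, pr_score):
    """Toy request-score: number of keyword appearances * PR-score."""
    # Assumption: keywords are matched case-insensitively against
    # whitespace-separated words of the page.
    words = page_text.lower().split()
    appearances = sum(words.count(keyword.lower()) for keyword in keywords)
    return appearances * pr_score


# Hypothetical page text with 1 "david" and 2 "beckham".
score = request_score("David Beckham news Beckham scores", ["David", "beckham"], 1000)
```

A page with a high PR-score but no keyword appearance gets a request-score of 0 under this formula.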
2. DEFINITION OF PAGERANK
Random surfer model
Score of a website (vertex) is the probability it is visited in an infinite journey
3. DAMPING FACTOR
Resemblance to a Markov chain:
State space = set of vertices
State transition = edge
Irreducibility => Stationary distribution
Irreducibility = strong connectedness
Graph is strongly connected => Score is well defined and unique
And what if the graph is not strongly connected?
[Figure: example graph on vertices A, B, C, D]
3. DAMPING FACTOR
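Damping factor df: the probability that a web surfer gets bored and decides to jump to a random website
The transition probability is then non-zero between every pair of vertices
[Figure: graph on vertices A, B, C, D before and after adding random jumps; with 4 vertices, each vertex is reached by a jump with probability df / 4, and an outgoing link is followed with probability 1 - df]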
3. DAMPING FACTOR EFFECT
Changing the damping factor can reverse the score order of two vertices:
Score(V3) > Score(V5)
Score(V3) < Score(V5)
1. EXTENDED GRAPH
Different vertices’ types, different edges’ types
More general: vertices and edges can carry properties
“Type” is shorthand for the property “type”
Vertex types:
Company website: Address, Profit, …
Personal blog: Creator, Theme, …
Edge types:
Commercial link
Partnership link
Friend link: year, place
Publicity link: Price
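A minimal sketch of such an extended graph, assuming vertices and edges simply carry a dictionary of properties in which “type” is one key among others. The class names `Vertex` and `Edge` and all property values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    properties: dict = field(default_factory=dict)

    @property
    def type(self):
        # "Type" is just shorthand for the property "type".
        return self.properties.get("type")


@dataclass
class Edge:
    source: str
    target: str
    properties: dict = field(default_factory=dict)

    @property
    def type(self):
        return self.properties.get("type")


# Hypothetical data, for illustration only.
blog = Vertex("blog.example", {"type": "Personal blog", "creator": "Alice", "theme": "Cooking"})
shop = Vertex("shop.example", {"type": "Company website", "address": "1 Example Street"})
ad = Edge("blog.example", "shop.example", {"type": "Publicity link", "price": 100})
```

Keeping the type inside the property dictionary means a ranking rule can depend on any property, not only on the type.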
2. P-WEIGHT
Assign a different probability weight (p-weight) to different edges
This value determines how much one object appreciates another
Ex: on Facebook, a “comment” edge vs. a “like” edge
This value is relative
[Figure: vertices A, B and C, with one edge of p-weight = 1 and one edge of p-weight = 2]
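Since the p-weight is relative, one natural reading is that the probability of following an edge is its p-weight divided by the sum of the p-weights of all edges leaving the same vertex. The sketch below assumes exactly that; the function name and the assignment of the weights 1 and 2 to the edges towards B and C are assumptions, not taken from the slide.

```python
def transition_probabilities(out_edges):
    """Map each target vertex to the probability of following its edge.

    out_edges maps a target vertex to the p-weight of the edge leading to it.
    """
    total = sum(out_edges.values())
    return {target: weight / total for target, weight in out_edges.items()}


# Assumed layout: A --> B with p-weight 1, A --> C with p-weight 2.
probs = transition_probabilities({"B": 1.0, "C": 2.0})
```

With these weights the visitor leaving A goes to C twice as often as to B; multiplying both weights by the same constant changes nothing.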
3. C-WEIGHT
[Figure: edge e from V1 to V2]
• Increment the counter differently during the journey: counter(V2) = counter(V2) + c-weight(e)
• Represents the appreciation a vertex gives to another
• Example: author --> paper, “wrote” edge: c-weight = 0
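The counter rule above can be sketched on a fixed journey. This is a simplified illustration, assuming a journey is given as a list of (source, edge type, target) steps and that edge types without an explicit c-weight count for 1, as in plain PageRank; the function name `credit_journey` is hypothetical.

```python
from collections import defaultdict


def credit_journey(journey, c_weight, default=1.0):
    """Apply counter(target) += c-weight(edge) for every step of a journey."""
    counters = defaultdict(float)
    for _source, edge_type, target in journey:
        counters[target] += c_weight.get(edge_type, default)
    return dict(counters)


journey = [
    ("A1", "wrote", "P1"),        # c-weight 0: P1 gets no credit from its author
    ("P1", "cite", "P2"),         # default c-weight 1
    ("P2", "isWrittenBy", "A2"),  # default c-weight 1
]
counters = credit_journey(journey, {"wrote": 0.0})
```

The visitor still walks through the “wrote” edge; only the credit given on arrival changes.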
C-WEIGHT VS P-WEIGHT
Score(V3) > Score(V5)
Score(V3) = Score(V5)
• Two ways to favor a vertex
• Are they equivalent?
4. PATH DIVERSITY
Remember the journey
Add an additional check before incrementing counter(w): compare w to the previously visited vertices
Ex: if w already appears in this list, the visitor has probably fallen into a cycle (ex: w --> w1 --> w2 --> w), so w should not receive any credit
[Figure: cycle w --> w1 --> w2 --> w]
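A minimal sketch of this check, assuming the journey remembers only the last few visited vertices. The parameter name `memory` and the choice of a bounded window are assumptions; the slide only says that w is compared to previous vertices.

```python
from collections import defaultdict, deque


def credit_with_path_diversity(visits, memory=3):
    """Credit a vertex only if it is not among the last `memory` visited ones."""
    recent = deque(maxlen=memory)  # the remembered part of the journey
    counters = defaultdict(int)
    for w in visits:
        if w not in recent:
            counters[w] += 1  # w is new on the remembered path: give credit
        recent.append(w)
    return dict(counters)


# The cycle w --> w1 --> w2 --> w: the second visit to w earns nothing.
counters = credit_with_path_diversity(["w", "w1", "w2", "w", "x"], memory=3)
```

With `memory=0` nothing is remembered and every visit is credited, which gives back the plain counter.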
1. AUTHOR PAPER GRAPH
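[Figure: a paper set and an author set; “cite” edges go from paper to paper, “wrote” edges from author to paper, “IsWrittenBy” edges from paper to author]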
The score of an author or a paper is the probability that it is visited during an infinite journey on the author paper graph
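The author paper graph can be built from a list of paper records. A minimal sketch, assuming each record lists its authors and the papers it cites; the record layout and the function name are hypothetical, and edge directions follow the edge names (author “wrote” paper, paper “isWrittenBy” author, paper “cite” paper).

```python
def build_author_paper_graph(papers):
    """Return typed edges (source, edge_type, target) of the author paper graph.

    papers maps a paper id to {"authors": [...], "cites": [...]}.
    """
    edges = []
    for paper_id, record in papers.items():
        for author in record["authors"]:
            edges.append((author, "wrote", paper_id))
            edges.append((paper_id, "isWrittenBy", author))
        for cited in record["cites"]:
            edges.append((paper_id, "cite", cited))
    return edges


# Hypothetical records: P1 cites P2.
papers = {
    "P1": {"authors": ["A1", "A2"], "cites": ["P2"]},
    "P2": {"authors": ["A3"], "cites": []},
}
edges = build_author_paper_graph(papers)
```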
2. JOURNEY EXTENSION
A. CHOOSING TYPE BEFORE CHOOSING NODE
If “wrote” and “cite” edges are treated equally: Score(A2) = Score(A1) !
With the extension “choose type first”: Score(A2) = Score(A1) / n
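A minimal sketch of “choose type first”: the visitor first picks an edge type among the types leaving the current vertex, then an edge of that type. The slide does not say how the type is picked; the sketch assumes a uniform choice at both steps, and the function name and example vertices are hypothetical.

```python
from collections import defaultdict


def type_first_probabilities(out_edges):
    """Return target --> probability when the edge type is chosen before the edge.

    out_edges is a list of (edge_type, target) pairs leaving one vertex.
    """
    by_type = defaultdict(list)
    for edge_type, target in out_edges:
        by_type[edge_type].append(target)
    probs = defaultdict(float)
    for targets in by_type.values():
        for target in targets:
            # Assumption: uniform over types, then uniform within the type.
            probs[target] += 1.0 / len(by_type) / len(targets)
    return dict(probs)


# A paper with one author and three citations.
edges = [("isWrittenBy", "A1"), ("cite", "Q1"), ("cite", "Q2"), ("cite", "Q3")]
probs = type_first_probabilities(edges)
```

Here the author A1 is reached with probability 1/2 instead of 1/4, so the number of citations no longer dilutes what the paper passes to its author.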
2B. JOURNEY EXTENSION:
C-WEIGHT FOR “WROTE” EDGE
C-weight(wrote) = 0
An author does not directly give appreciation to his or her own papers: A --> P through a “wrote” edge
2B. JOURNEY EXTENSION:
C-WEIGHT FOR “WROTE” EDGE
Without c-weight(wrote): Score(P2) > Score(P1)
With c-weight(wrote) = 0: Score(P2) = Score(P1)
2C. JOURNEY EXTENSION
P-WEIGHT FOR “WROTE” EDGE
If no p-weight, then Score(P3) < Score(P4) !
C-weight(wrote) = 0 => Score(P1) = Score(P2)
Score(P3) = Score(P4)
2C. JOURNEY EXTENSION:
P-WEIGHT FOR “WROTE” EDGE
[Figure: edge e between paper p and author a]
Solution: P-weight(e) = 1 / nbAuthor(p)
P-weight(w1) = P-weight(w2) + P-weight(w3) + P-weight(w4)
Score(P1) = Score(P2)
Score(P3) = Score(P4)
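The rule P-weight(e) = 1 / nbAuthor(p) can be checked on a small case. The sketch assumes w1 is the edge of a single-author paper and w2, w3, w4 are the edges of a three-author paper; this layout, the function name and the paper and author names are assumptions, since the slide's figure is not available.

```python
from fractions import Fraction


def wrote_p_weights(authors_of):
    """Give each (author, paper) edge the p-weight 1 / nbAuthor(paper)."""
    return {
        (author, paper): Fraction(1, len(authors))
        for paper, authors in authors_of.items()
        for author in authors
    }


authors_of = {"P1": ["A1"], "P2": ["A2", "A3", "A4"]}
weights = wrote_p_weights(authors_of)
w1 = weights[("A1", "P1")]
w2_to_w4 = sum(weights[(author, "P2")] for author in authors_of["P2"])
```

Both papers then carry the same total p-weight, whatever their number of authors.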
V. SCIENTIFIC IMPACT MEASURES
1. Author measures
a) Classic measures
b) Scenarios
c) Summary
2. Paper measures
a) Classic measures
b) Summary
1. AUTHOR MEASURES
A. CLASSIC MEASURES
Publication
Citation
H-Index, G-Index
PR-A: PageRank on Author graph
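For reference, the H-Index and G-Index can be computed from an author's per-paper citation counts. The sketch uses the usual definitions: h is the largest number such that h papers have at least h citations each, and g is the largest number such that the g most cited papers have at least g² citations in total (here g is capped at the number of papers).

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def g_index(citations):
    """Largest g such that the top g papers total at least g * g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author's five papers.
citations = [10, 8, 5, 4, 3]
h = h_index(citations)
g = g_index(citations)
```

Both indices only look at citation counts, so they cannot see who cites or how many co-authors a paper has.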
1A. CLASSIC MEASURES
PR-A: PAGERANK ON AUTHOR GRAPH
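• Author graph
• P-weight
[Figure: paper p (authors A, A’, A’’) cites paper q (authors B, B’); the single citation, of weight 1, becomes author-to-author edges, each with p-weight = 1/6]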
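One possible reading of the author graph is sketched below: every citation p --> q is spread evenly over all (author of p, author of q) pairs, which gives 1/6 per edge for 3 citing and 2 cited authors. This reading is inferred from the figure labels, so treat the rule and the function name as assumptions.

```python
from collections import defaultdict
from fractions import Fraction


def author_graph(citations, authors_of):
    """Project paper citations onto weighted author --> author edges."""
    weights = defaultdict(Fraction)
    for citing, cited in citations:
        pairs = len(authors_of[citing]) * len(authors_of[cited])
        for a in authors_of[citing]:
            for b in authors_of[cited]:
                # Assumption: each citation has weight 1, shared by all pairs.
                weights[(a, b)] += Fraction(1, pairs)
    return dict(weights)


authors_of = {"p": ["A", "A'", "A''"], "q": ["B", "B'"]}
weights = author_graph([("p", "q")], authors_of)
```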
1. AUTHOR MEASURES
B. SCENARIOS
Quality of paper
Number of co-authors
Quality of citing papers
Self citation
Score range
1B. SCENARIO: QUALITY OF PAPER
Quality of paper
score(A1) > score(A2)
Publication No
Citation Yes
H-Index Yes
PR-A Yes
Pira Yes
1C. SCENARIO: NUMBER OF CO-AUTHORS
score(A1) < score(A4)
Publication No
Citation No
H-Index No
PR-A Yes
Pira Yes
Number of co-authors
Score(A1) < Score(A2)
1D. SCENARIO: QUALITY OF CITING PAPER
score(A1) > score(A2)
Publication No
Citation No
H-Index No
PR-A No …
Pira Yes
Quality of citing papers
1E. SCENARIO: SELF-CITATION
score(A2) > score(A1)
Publication No
Citation No
H-Index No
PR-A Yes
Pira Yes
Self-citation
Score(A1) < Score(A2)
1G. SCENARIO: SCORE RANGE
Millions of authors
Score range (in general)
Publication: < 1000
Citation: < 20000
H-Index: < 100, G-Index: < 200
PR-A, Pira: infinity
Sufficient score range
Publication No
Citation No
H-Index No
PR-A Yes
Pira Yes
1. AUTHOR MEASURES
C. SUMMARY
Criteria / Measures      Publication  Citation  H-Index  PR-A  Pira
Paper quality            No           Yes       Yes      Yes   Yes
Number of co-authors     No           No        No       Yes   Yes
Citing papers' quality   No           No        No       No    Yes
Self-citation's effect   No           No        No       Yes   Yes
Domain specific          No           No        No       Yes   Yes
Score range              No           No        No       Yes   Yes
2B. SUMMARY FOR PAPER MEASURES
Criteria / Measures      Citation  PR-P  Pira
Citing papers' quality   No        Yes   Yes
Self-citation's effect   No        Yes   Yes
Domain specific          No        Yes   Yes
Score range              No        Yes   Yes
Citing authors' quality  No        No    Yes
VI. DATABASE
Aggregation of DBLP and CiteSeerX
246039 authors (73241 in DBLP) and 281207 papers (67772 in DBLP)
The theoretical scenarios have been found
VII. FUTURE WORKS
Pira
Optimization: > 10 times faster than the matrix-multiplication algorithm
Apply path diversity
Take the content into account
Create a complete search engine for the scientific world
Snow
A lot of work to do …
1D. SCENARIO: QUALITY OF CITING PAPER
score(A1) > score(A2)
Publication No
Citation No
H-Index No
PR-A No …
Pira Yes
Quality of citing papers
PR-A vs Pira
2. JOURNEY EXTENSION
A. CHOOSING TYPE BEFORE CHOOSING NODE
Score(P2) = Score(A1) + Score(A2)
Score(A2) = Score(P2) / 2
Score(A2) = Score(A1) !
2. JOURNEY EXTENSION
A. CHOOSING TYPE BEFORE CHOOSING NODE
Choose type first !
Score(P2) = Score(A2) + Score(A1) / n
Score(A2) = Score(P2) / 2
Score(A2) = Score(A1) / n
1F. SCENARIO: DOMAIN SPECIFIC
Domain specific: the average number of citations varies from domain to domain.
An average paper is cited about 6 times in life sciences and less than once in mathematics
Domain specific
Publication No
Citation No
H-Index No
PR-A Yes
Pira Yes
2. ALGORITHM
Vertex types: Post, User
wasWrittenBy, p-weight = 5
Wrote, c-weight = 0
Positive, c-weight = 2
Negative, c-weight = -2
• Path diversity = 6
• C-weight & P-weight
3. DAMPING FACTOR
The probability df that a web surfer gets bored and decides to jump to a random website
In practice, Google sets df to 0.15
The score of a website is the probability that it is visited during an infinite journey on the web graph that follows 2 rules:
1. When the visitor arrives at a website A that has at least one outgoing link (i.e. a link to at least one website):
• with probability df, the visitor picks a random website (among all websites) to jump to and restarts the journey
• with probability 1 - df, the visitor follows a randomly picked outgoing link of A
2. If the website has no outgoing links, the visitor picks a random website to restart the journey from.
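The two rules above can be turned into a small iterative computation. This is a minimal sketch, not Google's implementation: it keeps the convention of these slides, where df is the probability of jumping (0.15), and repeatedly pushes the current scores along the rules for a fixed number of rounds. The function name, the `iterations` parameter and the example graph are assumptions.

```python
def pagerank(links, df=0.15, iterations=100):
    """Scores of the random journey; df is the probability of a random jump.

    links maps each website to the list of websites it links to.
    """
    vertices = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in vertices}
        jump_mass = 0.0
        for v in vertices:
            targets = links.get(v, [])
            if targets:
                # Rule 1: jump with probability df, else follow a random link.
                jump_mass += df * score[v]
                share = (1.0 - df) * score[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Rule 2: no outgoing link, always restart at a random website.
                jump_mass += score[v]
        for v in vertices:
            nxt[v] += jump_mass / n  # a jump lands on any website uniformly
        score = nxt
    return score


# Hypothetical 4-website graph; D has no incoming link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(links)
```

In this example D is only ever reached by a random jump, so its score stays at df / 4.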
Adaptation to a Markov chain