TRANSCRIPT
An Efficient Ranking Algorithm for Scientific Research Papers
By
Fathi Mahmoud Fathi Al-Hattab
Supervisor:
Dr. Mohammad Hassan
Co-supervisor:
Dr. Yaser Al-laham
This Thesis is submitted in Partial Fulfillment of the Requirements for the
Master's Degree in Computer Science
Faculty of Graduate Studies
Zarqa University-Jordan
March, 2016
ACKNOWLEDGEMENTS
Prior to acknowledgments, I must glorify Allah the Almighty for giving me courage and patience
to carry out this work successfully.
I would like to express my deepest gratitude to my supervisors, Dr. Mohammad Hassan and
Dr. Yaser Al-laham. I gratefully acknowledge Zarqa University for the scholarship that I have
received to pursue my master's degree.
And finally, my dear father! How can I ever thank you? Your endless love and support fill me
with life. You are my source of inspiration and I am forever indebted to you. Loads of love and
thanks to you!
TABLE OF CONTENTS
LIST OF TABLES ....................................................................................................................... vi
LIST OF FIGURES .................................................................................................................... vii
LIST OF ACRONYMS ............................................................................................................. viii
Chapter 1: Introduction ............................................................................................................... 1
1.1 Overview ............................................................................................................................... 1
1.2 Problem Definition ................................................................................................................ 2
1.3 Thesis Motivation ................................................................................................................. 2
1.4 Thesis Objectives .................................................................................................................. 3
1.5 Thesis Outline ...................................................................................................................... 3
Chapter 2: Background and Literature Survey ......................................................................... 4
2.1 Outline ................................................................................................................................... 4
2.2 Information Retrieval ............................................................................................................ 4
2.2.1 Ranking Process ............................................................................................................. 5
2.3 Ranking Models .................................................................................................................... 5
2.3.1 Relevance Ranking Algorithms ..................................................................................... 6
2.3.2 Important Ranking Models ............................................................................................ 9
2.4 PageRank Algorithm ............................................................................................................. 9
2.5 SRPs Ranking Algorithms and Related Work .................................................................... 13
2.5.1 PageRank with N-Star Ranking Model ........................................................................ 13
2.5.2 PageRank with HITS Model ........................................................................................ 14
2.5.3 PageRank with Closed Frequent Keyword-set ............................................................ 14
2.5.4 PageRank with Jaccard Index ...................................................................................... 14
2.5.5 PageRank with Focused Surfer Model ........................................................................ 15
2.5.6 Modified PageRank ..................................................................................................... 15
2.5.7 PageRank with CiteRank ............................................................................................. 17
2.5.8 Random walk with Restart ........................................................................................... 17
2.6 Chapter Summary ............................................................................................................... 20
Chapter 3: Experiment and Evaluation .................................................................................... 22
3.1 Introduction ......................................................................................................................... 22
3.2 Dataset ................................................................................................................................. 22
3.3 Implementation ................................................................................................................... 23
3.3.1 Data Preparation and Extracting .................................................................................. 24
3.3.2 Extracting Citation Networks ...................................................................................... 25
3.3.3 Calculation of Scientific Research Ranking SRR Score.............................................. 27
Procedure: Scientific Research Ranking R ........................................................................... 29
3.4 Results ................................................................................................................................. 30
3.5 Evaluation ........................................................................................................................... 30
3.5.1 Distribution of Ranked SRPs among the Age of the Paper ......................................... 30
3.5.2 Distribution of Ranked SRPs among Citation ............................................................. 31
3.5.3 Recall and Precision ..................................................................................................... 33
Chapter 4: Conclusion and Future Work ................................................................................. 35
4.1 Results and Conclusion ....................................................................................................... 35
4.2 Future work ......................................................................................................................... 36
REFERENCES ............................................................................................................................ 37
Appendices .................................................................................................................... 44
LIST OF TABLES
Table 2.1: Limitations of Ranking Models ....................................................................... 12
Table 2.2: Summary of SRPs Ranking Algorithms .......................................................... 17
Table 3.1: Distribution of Paper Publication Dates .......................................................... 23
Table 3.2: Results of Precision and Recall for both PR and the proposed SRP Rank
algorithms ......................................................................................................................... 34
LIST OF FIGURES
Figure 2.1: The Basic Process in IR System ....................................................................... 4
Figure 2.2: Ranking Models ................................................................................................ 6
Figure 2.3: Example of PageRank ..................................................................................... 11
Figure 3.1: Sample of Original Dataset ............................................................................. 25
Figure 3.2: Dataset Information Arranged into Database .................................................. 24
Figure 3.3: Dataset After Preprocessing ............................................................................ 24
Figure 3.4: Sample of the Paper Citation Network ........................................................... 25
Figure 3.5: Sample of the Information Extracted from the Author-Paper Graph ............. 26
Figure 3.6: Sample of the Information Used in Ranking Score Calculation .................... 26
Figure 3.7: Distribution of the Proposed Rank Method among the Age of the Paper ...... 31
Figure 3.8: Distribution of PR among the Age of the Paper ............................................. 31
Figure 3.9: Distribution of the Proposed Rank Method among the Citation Count ......... 32
Figure 3.10: Distribution of PR among the Citation Count .............................................. 32
LIST OF ACRONYMS
IR Information Retrieval
SRP Scientific Research Papers
VSM Vector Space Model
TF Term Frequency
IDF Inverse Document Frequency
LSI Latent Semantic Indexing
SVD Singular Value Decomposition
LMIR Language Model for IR
HITS Hyperlink-Induced Topic Search
PR PageRank
CC Citation Count
RWR Random Walk with Restart
Abstract
Due to the enormous evolution of the World Wide Web and the online access now provided to
the digital contents of universities and public libraries, the WWW has become the most popular
resource for data and information. This massive volume of content makes it generally impossible
for common users to locate their desired information using traditional search engines, which are
based only on relevancy and the number of query occurrences, as the number of returned results
can be tremendous. This made it necessary to develop an efficient and effective ranking
algorithm to solve this problem. One of the most famous ranking algorithms is PageRank, which
is adopted by the Google search engine; but because the nature of scientific research papers
differs from that of web pages, PageRank may not be suitable on its own for ranking scientific
research papers, as it is biased in favor of old papers over new ones and relies heavily on citation
counts, which are not the only important factor reflecting a paper's importance. This thesis
proposes an efficient ranking method that is suitable for ranking scientific research papers and
improves the PageRank score, making it less biased, by using the date of publication and the
author's score in addition to PageRank. The results showed that the proposed method succeeded
in making the ranking results more neutral toward both old and new research papers.
الملخص
(Arabic-language abstract; the Arabic text is mis-encoded in this transcript. Its content
corresponds to the English abstract above.)
Chapter 1: Introduction
1.1 Overview
With the enormous evolution of the World Wide Web (WWW), it has become the most
popular information resource for text, media and metadata. The number of indexed web
pages keeps growing by millions of pages a day [1], and in 2015 the Indexed Web reached
more than 4.73 billion pages according to the ILK Research Group*. This huge
content of data makes it generally impossible for common users to locate their desired
information by using traditional search that is based only on relevancy and the number of
query occurrences in a document. The number of returned results can be tremendous, and
the user would spend much time finding the desired information in a long list. This
situation raised the demand for efficient and effective information retrieval systems and
ranking algorithms to solve this problem.
One of the reliable resources for information on the web is the digital academic library,
considered one of the most important sources of Scientific Research Papers (SRPs) for
students and researchers. A considerable number of universities and public libraries have
provided access to books, journals and other documents. These collections help academic
researchers become acquainted with new journal articles and conference proceedings
related to their areas of research.
* ILK Research Group. Tilburg University - Netherlands.
http://www.worldwidewebsize.com/
1.2 Problem Definition
The nature of SRPs differs from that of the usual web search: ranking papers based only on
citation analysis may not always reflect their quality. There are a number of different
viewpoints and opinions on how well citations can measure SRP quality. Some researchers
measure SRP quality by how often a paper has been cited in other SRPs and by its date of
publication. Others argue that citation counts best assess a publication's impact rather than
its quality or importance, that citation counts are only partial indicators of impact, and that
other factors such as communication practices and author visibility must also be considered
significant. One of the most famous ranking algorithms is the PageRank algorithm, which is
adopted by the Google search engine; citation count is the highest-weighted factor in the
Google Scholar engine.
It is not always the best choice to rank scientific research papers (SRPs) with citation count
as the main factor, because the nature of SRPs is different from the nature of web pages.
Factors other than citation count also determine the importance of an SRP, and taking them
into account yields better ranking results.
1.3 Thesis Motivation
This thesis was motivated by the following problems:
1. An SRP's citation count does not necessarily reflect its quality or importance to research.
2. Using the author's score as a factor in the ranking process could improve the ranking results.
3. New papers have not been available for as long as old papers, and so have not been studied,
tested and cited to the same extent.
1.4 Thesis Objectives
The main objective of this thesis is to propose an efficient ranking method that is suitable
for ranking SRPs and able to improve PageRank by including the author's score, making it
less biased against new papers.
1.5 Thesis Outline
The remainder of this thesis is organized as follows:
• Chapter 2: presents the basic concepts and processes in information retrieval (IR)
systems and the foundational concepts used in this thesis, gives a brief review of current
and popular ranking algorithms, and then presents a literature review of related work and
recent research on scientific research ranking methods.
• Chapter 3: presents and explains the dataset used and the proposed ranking method, and
evaluates the proposed algorithm's performance.
• Chapter 4: concludes the thesis work and describes the results and the recommendations
for future work.
Chapter 2: Background and Literature Review
2.1 Outline
This chapter presents the basic concepts and processes of information retrieval systems, the
current ranking models, and the foundational concepts used in this thesis. It also reviews
recent algorithms that have been used to rank scientific and academic research papers.
2.2 Information Retrieval
Information Retrieval (IR) is the process of locating and obtaining information needed by a
user from a collection of available information resources [2]. There are three vital
components in a web search engine, known as the crawler (also called a robot or spider),
the indexer, and the ranking mechanism; the basic process in IR is shown in figure (2.1).
Figure 2.1: The basic process in IR system
A general web search engine can be summarized as follows:
1. Crawling: A crawler browses the web graph following hyperlinks, fetches links, and
stores the extracted URLs in a local repository [3].
2. Indexing: The search engine indexes the pages collected by the crawler, extracts
keywords from each page and records the URL where each word has occurred [3].
3. The user submits a query [3].
4. The query is transferred in terms of keywords to the interface of the search engine and
is examined against the index [3].
5. The ranker consults the index to retrieve the documents most relevant to the query. The
relevant documents are then sorted according to their degree of relevance, importance, etc.,
and presented to the user [3].
Several sub-processes are also performed on documents and text before indexing, such as
parsing, lexical analysis, phrase detection and stemming [4].
2.2.1 Ranking Process
Ranking algorithms are at the core of information retrieval (IR) systems. They play a very
important role by identifying the most relevant pages, those most likely to satisfy the
user's needs according to some criterion [5].
2.3 Ranking Models
In general, ranking algorithms can be either query-dependent, ranking a list of documents
according to the relevance between those documents and the query (e.g., the Boolean
Model), or query-independent, ranking a list of documents based on their own importance
(e.g., PageRank) [6]. The basic current ranking models are shown in figure 2.2.
Figure 2.2: Ranking Models
2.3.1 Relevance Ranking Algorithms
The relevance ranking algorithms, also known as content-based rankers, work on the basis
of the number of matched terms, the frequency of terms, and the location of terms. They
usually take each individual document as input and compute a score measuring the match
between the document and the query. All the documents are then sorted in descending
order of their scores [7].
2.3.1.1 Boolean Ranking Model
The Boolean Model is an old and simple ranking model based on Boolean algebra and set
theory. It treats documents as bags of index terms, which are words or phrases, and uses a
Boolean algebra expression as the query; the terms are connected with the logical operators
(AND, OR and NOT) [8].
2.3.1.2 Vector Space Model
The Vector Space Model (VSM), proposed by Salton, G. in 1975, represents the documents
and the queries as vectors in a Euclidean space, where similarity can be measured using the
inner product of two vectors. Term Frequency-Inverse Document Frequency (TF-IDF)
weighting is used to obtain a more effective vector representation of the query and the
documents. The term frequency is usually normalized by the document length [9], and it is
defined as follows:
TF(t) = T / F    (2.1) [9]
where T is the number of times term t appears in a document, and F is the total number of
terms in the document (or document length).
IDF scales down frequent terms that may appear many times and scales up the rare ones; it
is defined as follows:
IDF(t) = log( N / n(t) )    (2.2) [9]
where N is the total number of documents in the collection, and n(t) is the number of
documents containing term t.
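As an illustration, the two weights from equations (2.1) and (2.2) can be computed directly in a few lines; this is a minimal sketch (the toy corpus and the function names are ours, not from the thesis, and a term absent from every document is assumed not to be queried):

```python
import math

# TF (equation 2.1): occurrences of the term divided by the document length.
def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# IDF (equation 2.2): log of total documents over documents containing the term.
def idf(term, corpus):
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

docs = [["ranking", "web", "pages"],
        ["ranking", "scientific", "papers"],
        ["web", "crawling"]]
# "ranking" occurs once in a 3-term document, so TF = 1/3;
# it appears in 2 of the 3 documents, so IDF = log(3/2).
```

Multiplying the two gives the TF-IDF weight of a term in a document, which fills one coordinate of the VSM vector.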
2.3.1.3 Latent Semantic Indexing
Latent Semantic Indexing (LSI), introduced by Deerwester, S. in 1988, uses a mathematical
technique called Singular Value Decomposition (SVD) to identify patterns in the
relationships between terms and concepts. LSI is based on the principle that words that are
used in the same contexts tend to have similar meanings [10].
2.3.1.4 BM25 Model
The BM25 model, also referred to as "Okapi BM25," by Robertson, S. E. in 1994, is based
on the probabilistic ranking principle. It ranks a set of documents by the log-odds of their
relevance, regardless of the inter-relationships between the query terms within a document
(e.g., their relative proximity). It is not a single function but a whole family of scoring
functions, with slightly different components and parameters [11]. Given a query q
containing terms t_1, …, t_M, the BM25 score of a document d is computed as
BM25(d, q) = Σ_{i=1..M} IDF(t_i) · TF(t_i, d) · (k1 + 1) / [ TF(t_i, d) + k1 · (1 − b + b · LEN(d)/avdl) ]    (2.3) [11]
where TF(t_i, d) is the term frequency of t_i in document d, LEN(d) is the length (number
of words) of document d, and avdl is the average document length in the text collection
from which documents are drawn. k1 and b are free parameters, and IDF(t_i) is the inverse
document frequency (IDF) weight of the term t_i.
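A Python sketch of equation (2.3) may make the roles of k1 and b concrete (the IDF variant and the parameter defaults used here are common choices, not necessarily those of [11]):

```python
import math

def bm25(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document against a query following equation (2.3)."""
    avdl = sum(len(d) for d in corpus) / len(corpus)   # average document length
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)         # documents containing t
        if n_t == 0:
            continue                                   # unseen term contributes nothing
        idf = math.log(len(corpus) / n_t)
        f = doc_tokens.count(t)                        # raw term frequency TF(t, d)
        # Saturating TF component, length-normalized through b.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avdl))
    return score
```

With b = 0, length normalization is disabled entirely; larger k1 lets repeated occurrences of a term keep increasing the score for longer before saturating.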
2.3.1.5 Language Model for IR
The Language Model for IR (LMIR), presented by Ponte, J. T. in 1998, is an application of
the statistical language model to information retrieval. It assigns a probability to a sequence
of terms. With a query q as input, documents are ranked by the query likelihood, i.e., the
probability that the document's language model will generate the terms in the query,
P(q|d). By further assuming independence between terms, if query q contains terms
t_1, …, t_M, one has P(q|d) = Π_{i=1..M} P(t_i|d). [12]
To learn the document's language model, a maximum likelihood method is used, usually
smoothed with a background language model estimated from the entire collection [7]. The
document's language model can be constructed as follows:
P(t_i|d) = (1 − λ) · TF(t_i, d) / LEN(d) + λ · p(t_i|C),    (2.4) [12]
where p(t_i|C) is the background language model for term t_i, and λ ∈ [0, 1] is a
smoothing factor [12].
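Equation (2.4) with this linear (Jelinek-Mercer style) smoothing can be sketched as follows; the function and variable names are ours, used only for illustration:

```python
def query_likelihood(query_terms, doc, collection, lam=0.5):
    """P(q|d) as a product of smoothed term probabilities per equation (2.4)."""
    coll_len = sum(len(d) for d in collection)
    score = 1.0
    for t in query_terms:
        p_ml = doc.count(t) / len(doc)                         # TF(t, d) / LEN(d)
        p_bg = sum(d.count(t) for d in collection) / coll_len  # background p(t|C)
        score *= (1 - lam) * p_ml + lam * p_bg
    return score
```

The smoothing term is what keeps the product from collapsing to zero when a query term does not occur in the document.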
2.3.2 Important Ranking Models
Importance ranking algorithms (query-independent), also called connectivity-based or
link-based page ranking, rank a list of documents according to their own importance, based
on link analysis techniques [7, 13]. They view the web as a graph where the web pages
form the nodes and the hyperlinks between the web pages form the edges between these
nodes [13].
2.3.2.1 HITS (Hubs and Authorities)
HITS is a link analysis algorithm. The basic idea behind it is that a web page serves two
purposes: to provide information on a topic, and to provide links to other pages giving
information on a topic [14]. Thus, a web page is considered an authority on a subject if it
provides good information about the subject, and a hub if it provides links to good
authorities on the subject. The scheme therefore assigns two scores to each page: an
authority value, which estimates the value of the content of the page, and a hub value,
which estimates the value of its links to other pages [13].
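The mutual reinforcement between hubs and authorities can be sketched as a simple iteration over a dict-based graph (a minimal illustration only; the algorithm of [14] additionally restricts the computation to a query-focused subgraph):

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the page.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On a small graph where pages A and B both link to C, the iteration makes C the top authority and A, B the top hubs, matching the intuition above.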
2.4 PageRank Algorithm
PageRank (PR) model is very useful for ranking web pages and measuring their
importance. It was first introduced by [15] as a possible model of user surfing behavior. It
results from a mathematical algorithm based on the graph, the web graph, created by all World
Wide Web pages as nodes and hyperlinks as edges. The rank value indicates an importance of a
particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank metric of all pages that link to it
("incoming links"). A page that is linked to by many pages with high PageRank receives a high
rank itself. If there are no links to a web page there is no support for that page. The simplest
formula to calculate PR is:
PR(A) = (1 − d)/N + d · ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + … )    (2.5) [15]
where L is the number of outbound links, d is the damping factor, and N is the total number
of pages on the web. It can also be expressed as follows:
PR_{ι+1}(p_i) = (1 − α)/n + α · Σ_{p_j ∈ N⁻(p_i)} PR(p_j) / od(p_j)    (2.6) [15]
where n is the total number of pages on the web, α is the damping factor, N⁻(p_i) is the set
of pages linking to p_i, and od(p_j) is the out-degree of page p_j.
Pseudo code
1. Input the graph with its links.
2. Set all PR(i) to 1.
3. Count the outbound links L(i) for each page i from 1 to N.
4. Calculate Σ_{v ∈ B_i} PR(v)/L(v) for each page i from 1 to N, where B_i includes those
pages that:
   a. have a link to page i;
   b. are different from page i.
5. Update all PR(i) for each page i from 1 to N.
6. Repeat steps 4 and 5 until the changes to PR are insignificant.
The PageRank theory holds that an imaginary surfer who is randomly clicking on links will
eventually stop clicking. The probability, at any step, that the user will continue is the
damping factor d. Various studies have tested different damping factors, but it is generally
assumed that the damping factor is set to around 0.85 [15].
The time complexity to compute one iteration of PageRank, where a PageRank value for
each web page is computed, is O(n). Since the result vector of the PageRank computation
contains a single value for each web page, the space requirement for the PageRank
algorithm is also O(n), if the space required to store the web graph is ignored.
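The pseudocode and equation (2.5) translate into a short Python sketch; this is a minimal illustration, with scores initialized uniformly so that they form a probability distribution, and a graph without dangling pages (pages with no outbound links) is assumed:

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping every page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    out = {p: len(links[p]) for p in pages}   # L(i): outbound link counts
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            # Sum PR(v)/L(v) over pages v that link to p (excluding p itself).
            incoming = sum(pr[v] / out[v] for v in pages if p in links[v] and v != p)
            new_pr[p] = (1 - d) / n + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:   # converged
            return new_pr
        pr = new_pr
    return pr
```

On a three-page cycle A → B → C → A, the fixed point is the uniform distribution 1/3 each, which shows why PageRank favors pages embedded in well-linked structures rather than isolated ones.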
Figure 2.3: Example of PR [1].
Table 2.1 shows some limitations of ranking models.
Table 2.1: Limitations of Ranking Models

Boolean Ranking Model [7, 8]:
• Cannot retrieve partial matches.
• The retrieved documents are not ranked; it cannot predict the degree of relevance.
• It either retrieves too many documents or very few documents.
• It does not use term weights.
• It can only predict whether a document is relevant to the query terms or not.

VSM [9]:
• Does not capture the semantics of the query or the document.
• Cannot denote the "clear logic view" like the Boolean model.
• Avoids the assumption that terms are independent from each other.

LSI [8]:
• Assumes that terms are independent from each other.

PageRank [15]:
• Should not be used as a standalone metric; it should be used as one parameter only.
• Favors older pages, because a new page, even a very good one, will not have many links
unless it is part of an existing site.

HITS [14]:
• Topic drift and efficiency problems occur.
• Non-relevant documents can be retrieved.
2.5 SRPs Ranking Algorithms and Related Work
Google Scholar is one of the major academic search engines, but its ranking algorithm for
academic articles is unknown, so little information is available about how it works. Beel, J.,
et al. [16] performed reverse-engineering on Google Scholar's ranking algorithm. The
results showed that it relies heavily on citation counts, which are the highest-weighted
factor. It also puts high weight on words in the title. The author's name and the journal also
have an impact on the ranking, while the frequency of the search term in the content does
not affect the ranking score. Google Scholar also seems to weight recent articles more
strongly than older articles. The study found that Google Scholar is more suitable for
finding standard articles and less suitable when searching for gems or for articles by
authors advancing a new or different view from the mainstream.
2.5.1 PageRank with N-Star Ranking Model
An N-linear ranking model is a system of N ranking scores for N classes, where the rank
scores depend on one another through a linear constraint system. Anh, V. L., et al. [17]
proposed a ranking system based on the N-linear model with PR to rank SRPs alongside
authors and conferences. The system has two models, SD4R and SD3R, the latter for
ranking datasets without citation information. The models were tested using a dataset built
from DBLP.
Sohn, B. S., et al. [18] also proposed a generalized network analysis approach to rank SRPs
using an N-star model. Based on this model and the PageRank algorithm, two different
ranking methods were derived: a query/topic-independent rank called Universal Publication
rank (UP rank), and a query/topic-dependent rank called Topic Publication rank (TP rank).
The model takes into account the mutual relationships among keywords, publications, and
citations.
2.5.2 PageRank with HITS Model
Due, M., et al. [19] proposed an extension of the PageRank and HITS ranking models. The
basic idea is to measure the relativity between papers: it measures the indirect relationships
between papers using relativity measurements instead of the simple direct citations
between papers.
Jiang, X., et al. [20] proposed MutualRank, a graph-based ranking framework that
integrates mutual reinforcement relationships among networks of papers, researchers and
venues to achieve a more synthetic, accurate and fair ranking result than previous
graph-based methods.
Wang, Y., et al. [21] proposed a PageRank-HITS framework that exploits different kinds of
information simultaneously, examines their usability in ranking scientific articles, and
captures time information in the evolving network to obtain a better ranking result.
2.5.3 PageRank with Closed Frequent Keyword-set
Shubhankar, K., et al. [22] introduced a novel algorithm that uses closed frequent
keyword-sets and a modified time-independent PageRank algorithm to detect and rank
topics in research papers based on their significance in the research community, rather than
on the popularity of the topic, which considers only the frequency of topics. Each topic is
assigned an authoritative score using the modified PageRank algorithm, and all papers
sharing a topic form one natural cluster or more.
2.5.4 PageRank with Jaccard Index
Haddadene, H., et al. [23] proposed a model of representation of the scientific production
based on the notion of similarity between articles and an adaptive PageRank algorithm.
The similarity between two documents of the database was calculated using the Jaccard
index, also known as the Jaccard similarity coefficient. An adaptation of the PageRank
algorithm was used to rank and measure the relative importance of a document.
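The Jaccard index over two documents' term sets is |A ∩ B| / |A ∪ B|; a minimal sketch (the term-set representation of a document is an assumption on our part, since [23] does not specify it here):

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| over term sets."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Two documents sharing two of four distinct terms score 0.5; identical sets score 1.0 and disjoint sets score 0.0.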
2.5.5 PageRank with Focused Surfer Model
Krapivin, M., et al. [24] introduced Focused Page Rank (FPR) to rank SRPs based on
traditional citation count, PageRank and the Focused Surfer model. This algorithm aims to
reduce the "effect of outbound links" problem: if paper P is cited many times by papers
that have high rank but contain a large number of outgoing links, P's rank may decrease,
causing a paper to be highly cited but poorly ranked by PR. FPR is a tradeoff between
PageRank and citation count, so it may serve as an agreement between the followers of
pure citation count and the followers of PageRank. The algorithm was evaluated on the
content of the CiteSeer autonomous digital library; the results showed that the FPR
algorithm suffers less from the "outbound links" problem than the basic PageRank
algorithm.
2.5.6 Modified PageRank
Sidiropoulos, A., et al. [25] proposed the SceasRank algorithm, a modified PageRank
algorithm with two parameters a and b, where b is called the direct citation enforcement
factor and a is a parameter controlling the speed at which indirect citation enforcement
converges to zero. SceasRank converges faster than algorithms that are more similar to the
PageRank algorithm, and it takes the publication dates of papers into consideration.
Sun, Y., et al. [26] proposed a popularity-weighted ranking algorithm for academic digital
libraries that is based on PageRank and uses the popularity factor of a publication venue,
overcoming the limitations of impact factors.
Chen, P., et al. [27] proposed a PageRank method to rank SRPs. The results showed that
some classical articles in the Physics domain have a small number of citations and a very
high PageRank. These articles were named "scientific gems". The existence of "scientific
gems" is explained by the PR model, which captures not only the total citation count, but
also the rank of each of the citing papers.
Walker, D., et al. [28] introduced CiteRank as an adaptation of the PageRank algorithm. It
considers the publication time of scientific articles and utilizes a random walk model to
predict the number of future citations for each article. The model reduces the bias of time
to some extent, because recent articles will be promoted to higher scores.
Sayyadi, H., et al. (2009) [29] proposed a method called FutureRank, which estimates the
future PageRank prestige score of each article using the citation network, authors, and
publication date. In this model, authorship provides additional information for ranking
recent publications: if an author is an authority (i.e., has previously published many
prestigious papers), then his/her new publications can be expected to have good quality.
Articles transfer their authority score to their authors, and an author collects the authority
scores of all of his/her publications.
Another paper, by Singh, A., et al. [30], proposed a ranking algorithm based on the citation
network. The algorithm uses a modified version of the PageRank metric and takes the time
factor into account when ranking research papers, to reduce the bias against recent papers.
Using the scores of the research papers, scores are also derived for conferences and authors
in order to rank them. The algorithms were implemented and tested on the DBLP dataset.
2.5.7 PageRank with CiteRank
Dunaiski, D., et al. [31] proposed a novel algorithm called NewRank, a combination of the
PageRank and CiteRank algorithms. It focuses on identifying influential papers that were
published recently. The ranking algorithms were evaluated against the list of the most
influential papers compiled by the ICSE selection committee.
2.5.8 Random Walk with Restart
Hwang, W.S, et al. [32] proposed a new SRP ranking algorithm that aims mainly to balance the impacts of old papers and new papers and to correct the distortion in ranking due to publication dates, which gives recent papers a poor rank, by crediting the recent papers. The algorithm uses Random Walk with Restart (RWR) on a graph whose nodes are papers and whose edges are the citations among papers, and defines an age damping factor ρ for the papers, with a parameter τ denoting the characteristic decay time.
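A minimal sketch of this idea follows, assuming a simple age-damped restart vector; the function name, the parameter values, and the exact damping form are illustrative and are not taken from [32].

```python
import numpy as np

def rwr_age_damped(adjacency, ages, restart=0.15, rho=0.9, n_iter=100):
    """Random Walk with Restart on a citation graph, restarting into an
    age-damped distribution: the restart mass for a paper decays as
    rho ** age, so recent papers are revisited more often.
    adjacency[i, j] = 1 means paper j cites paper i."""
    n = adjacency.shape[0]
    outdeg = adjacency.sum(axis=0)
    W = np.divide(adjacency, outdeg,
                  out=np.full((n, n), 1.0 / n),    # papers with no references: uniform
                  where=outdeg > 0)
    q = rho ** ages                                # age-damped restart vector
    q = q / q.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):                        # iterate toward the stationary vector
        r = (1 - restart) * (W @ r) + restart * q
    return r
```

The restart vector q is what credits recent papers: the smaller a paper's age, the larger its share of the restart probability.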
A summary of related works is shown in Table 2.2.

Table 2.2: Summary of SRPs Ranking Algorithms

Up Rank, TP Rank [18]
  Techniques: PageRank; N-star ranking model.
  Features: Considers the query/topic and the content.
  Factors: Citation; keyword; title; content.
  Dataset: -

SC4R, SC3R [17]
  Techniques: PageRank; N-star ranking model.
  Features: Evaluates papers in more detail based on the context of their relationships.
  Factors: Citations; author; venue.
  Dataset: DBLP; Microsoft Academic.

PTRA [33]
  Techniques: Citation count; mathematical calculation.
  Features: Depends highly on the time of publication to rank the papers; gives the paper age a higher impact.
  Factors: Citation; publication venue; date of publication.
  Dataset: Scholar; CiteSeerX; IEEE Xplore; DBLP.

PageRank+HITS [21]
  Techniques: PageRank; HITS.
  Features: Uses time information; ranks SRPs in a heterogeneous network.
  Factors: Citation; author; journal/conference; date of publication; prestige (article).
  Dataset: arXiv; Cora.

MutualRank [20]
  Techniques: PageRank; HITS.
  Features: Less biased against new papers; returns more relevant highly ranked papers.
  Factors: Citation; title (authority and soundness); author (importance); venue (prestige).
  Dataset: ANN.

(Haddadene H, et al. 2012) [23]
  Techniques: PageRank; Jaccard index.
  Features: Ranks papers based on the notion of similarity between articles and an adaptive PageRank algorithm.
  Factors: Citation; title; abstract.
  Dataset: -

NewRank [31]
  Techniques: PageRank; CiteRank.
  Features: Counts citations from newer papers and from popular papers more heavily.
  Factors: Citation; date of publication; date of references.
  Dataset: Microsoft Academic; hep-th.

TopicRank [22]
  Techniques: PageRank; closed frequent keyword-sets.
  Features: Clusters papers into topics; estimates the article's future prestige.
  Factors: Citation; title; date of publication; keyword; topic.
  Dataset: DBLP.

CitationRank [30]
  Techniques: PageRank.
  Features: Time-independent ranking; gives authors and conferences ranking scores based on the paper scores.
  Factors: Citation; date of publication.
  Dataset: DBLP.

YetRank [32]
  Techniques: Random Walk with Restart (RWR).
  Features: Gives a high rank to papers credited by other authoritative papers or published in premier journals or conferences; balances the impacts of old and new papers; solves the distortion in ranking due to publication dates.
  Factors: Citation; date of publication; journal/conference.
  Dataset: DBLP.

FutureRank [29]
  Techniques: PageRank.
  Features: Estimates the future PageRank prestige score for SRPs.
  Factors: Citations; authors; publication date.
  Dataset: arXiv (hep-th).

PaperRank [19]
  Techniques: PageRank; HITS.
  Features: Can find more authoritative papers than the traditional methods.
  Factors: Citation; title; keyword; text.
  Dataset: Scholar; CiteSeer.

Focused PageRank [24]
  Techniques: PageRank; focused surfer model.
  Features: Trade-off between PageRank and citation count; suffers less from the effect of outbound links.
  Factors: Citations; citation counts.
  Dataset: CiteSeer.

Scientific Gems [27]
  Techniques: PageRank.
  Features: Captures "scientific gems".
  Factors: Citations.
  Dataset: APS.

Popularity Weighted Ranking Algorithm [26]
  Techniques: PageRank.
  Features: Overcomes the limitations of impact factors.
  Factors: Citations; title; venue (popularity factor).
  Dataset: CiteSeer.

CiteRank [28]
  Techniques: PageRank; random walk model.
  Features: Overcomes the problem of the aging effect in citation networks; gives a higher rank to papers credited by other authoritative papers or published in premier journals or conferences; predicts the number of future citations for each article.
  Factors: Citation; title; date of publication.
  Dataset: hep-th; Phys. Rev.

SceasRank [25]
  Techniques: PageRank.
  Features: Gives a higher rank to papers credited by other important papers and by newer papers.
  Factors: Citations; date of publication; conference.
  Dataset: DBLP; SCEAS system.
2.6 Chapter Summary
Over the past decade, many studies have been conducted to evaluate the productivity of scientific articles by developing more efficient algorithms to rank SRPs. These algorithms guide and help researchers, students, and authors in their research in the best available way.
All relevant models have some limitations that make this type of ranking model a poor choice for general web search engines. Although importance-ranking models are relatively new, they have improved the search process and changed the way it works.
The conclusions of this chapter include:
- The PageRank algorithm was the favorite ranking model among researchers and the most suitable model for ranking SRPs.
- A number of studies combined PageRank with other methods to enhance the ranking results and overcome some of PageRank's limitations.
- Citation count was the most heavily weighted factor in most studies.
- Many studies used time as a very important factor to give newer papers a higher score, e.g. the PTRA algorithm, which depends highly on time [33], PageRank+HITS [21], NewRank [31], CitationRank [30], and YetRank [32].
- There was no unified evaluation process; each researcher used different methods, and some researchers did not evaluate their work at all. The evaluation methods included:
  o Collecting a set of recommended papers from the websites of graduate-level computational linguistics courses at 15 top universities as the evaluation benchmark [20].
  o Evaluating the results against the list of the most influential papers compiled by the International Conference on Software Engineering (ICSE) selection committee [31].
  o Using the diversity or difference between PageRank (PR) and citation count (CC) [24].
  o Evaluating the ranking results based on the 'VLDB 10 Year Award', 'SIGMOD Test of Time Award', and 'SIGMOD E.F. Codd Innovations Award' [25].
  o Comparing the ranking results with the references in the corresponding chapters of the well-known data mining book "J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd Edition, 2006" [32].
  o Comparing the distribution of database papers according to the age of publication and citation number with the PR algorithm [33].
The proposed algorithm ranks SRPs based on PageRank score, date of publication, and author score.
Chapter 3: Experiment and Evaluation
3.1 Introduction
Ranking is an important stage in any search engine. This chapter explains the steps of the proposed SRP-Rank algorithm:
- Dataset preparation and processing to extract the important information.
- Constructing the citation graphs.
- Presenting the proposed ranking method.
It also evaluates the results of the proposed ranking method. The results were compared to PR. The criteria used in the comparison are the distribution of ranked SRPs among the age of the paper and among the number of citations, recall, precision, and F-measure.
3.2 Dataset
To evaluate the proposed algorithm, a dataset containing scientific research paper metadata, such as titles and authors, is required. There are several free and paid resources that provide datasets for researchers; they differ in size and in the degree to which the SRPs are processed. While some of these datasets provide only the full text of publications without any processing, others provide partially processed data.
In this study, we used a dataset obtained from Web of Science containing the abstracts, basic metadata, and 9,583 citations for 1,189 SRPs in the history and philosophy of science field, covering publications from 1956 to 2013. Samples of the dataset are shown in figure 3.1.
Figure 3.1: Sample of Original Dataset
The dataset contains additional information such as page numbers and authors' contact addresses. The distribution of the paper publication dates is shown in table 3.1.
Table 3.1: Distribution of paper publication dates
Year 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004
# of Papers 19 31 32 34 28 26 28 19 14 12
Year 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994
# of Papers 9 8 8 9 29 15 7 10 14 17
Year 1993 1992 1991 1990 1989 1988 1987 1986 1985 1984
# of Papers 13 13 17 15 13 17 13 20 19 26
Year 1983 1982 1981 1980 1979 1978 1977 1976 1975 1974
# of Papers 17 15 18 16 23 23 28 28 25 19
Year 1973 1972 1971 1970 1969 1968 1967 1966 1965 1964
# of Papers 25 26 27 28 20 21 27 25 30 26
Year 1963 1962 1961 1960 1959 1958 1957 1956
# of Papers 19 33 23 28 25 24 17 21
3.3 Implementation
The proposed method is carried out in several stages. First, the dataset was prepared, and the required information was extracted and used to build the paper citation and author-paper graphs. Finally, the algorithm was implemented to give a ranking score to every paper in the list.
3.3.1 Data Preparation and Extraction
To extract the data that will be used in the ranking score calculation, the data first need to be processed. The dataset was arranged using the Sci2 tool, which automatically saves it as an Excel sheet, as shown in figure 3.2.
Figure 3.2: Dataset Information Arranged into Database.
The unwanted and missing fields were removed, and only the data needed by the algorithm were kept. The final database contained information about authors, titles, years of publication, citation counts, and bibliographies, as shown in figure 3.3.
Figure 3.3: Dataset after processing.
3.3.2 Extracting Citation Networks
To obtain the remaining data needed by the ranking method, the paper citation network and the author-paper network were extracted with the Sci2 tool.
Paper citation network (graph)
The paper citation network is required for the PR calculation. A sample of the paper citation network is shown in figure 3.4.
Figure 3.4: Sample of the Paper Citation Network
Author-paper network (graph)
The author-paper graph provides information about the number of works and the number of citations for each author. A sample of the information extracted from the author-paper graph is shown in figure 3.5.
Figure 3.5: Sample of the information extracted from the author-paper graph.
The collected information that will be used in the proposed ranking algorithm is shown in figure 3.6.
Figure 3.6: Sample of the information used in the ranking score calculation.
3.3.3 Calculation of the Scientific Research Ranking (SRR) Score
The final ranking score depends on the PR score (from the citation graph), the date of publication, the author citation count, and the number of the author's works.
PageRank Score
The PR calculation was done using MATLAB, based on the citation graph extracted by the Sci2 tool.
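For illustration, an equivalent PR computation can be sketched in Python using the standard power-iteration method; the damping factor d = 0.85 below is the conventional choice, not a value stated by the thesis, and papers without references are treated as linking uniformly to all papers.

```python
import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-9, max_iter=200):
    """Standard PageRank by power iteration on a citation graph.
    adjacency[i, j] = 1 means paper j cites paper i, so score flows
    from citing papers to the papers they cite."""
    n = adjacency.shape[0]
    outdeg = adjacency.sum(axis=0)                  # references made by each paper
    W = np.divide(adjacency, outdeg,
                  out=np.full((n, n), 1.0 / n),     # dangling papers: uniform links
                  where=outdeg > 0)
    pr = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = (1 - d) / n + d * (W @ pr)            # teleport + follow citations
        if np.abs(new - pr).sum() < tol:
            return new
        pr = new
    return pr
```

The resulting vector sums to one, and well-cited papers receive the largest scores.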
Date of Publication
One of PageRank's shortcomings is that it favors old papers over new ones, even when the new papers are good, because the longer a paper has been around, the more citations it accumulates. Recent papers have few citations, so they should be given some promotion in the ranking process [21]. To overcome this drawback and make the algorithm less biased and more reliable, the "PageRank score per age" is used in the algorithm. The PageRank score per age is calculated by dividing the PR score of an SRP by the log of its age, which is the number of years since it was published:

A = Y − Yi    (3.1)

where Y is the current year and Yi is the year of publication.
Author Score
The number of papers published and the citation count may reflect the productivity and popularity of an author. Different metrics are used by the research community to measure author performance, such as the h-index. In this thesis, the author score is found using the following equation:

AU = ((AW1 + ... + AWm) + (AC1 + ... + ACm)) / (N · H)    (3.2)

where AWk is the number of papers published by the paper's k-th author, ACk is the citation count of the k-th author, N is the number of authors of the current paper, and H is a constant.
The constant H is used to reduce the impact of the author score on the final ranking score and to balance this term against the other term of the equation. H was set to 10 after testing other values; it gave the most balanced results compared to the other tested values, such as 0.5, 20, and 50.
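Equation (3.2), with the thesis's choice of H = 10, can be sketched as:

```python
def author_score(works, citations, h=10):
    """Author score AU per equation (3.2): works[k] and citations[k] are
    the publication count and citation count of the paper's k-th author,
    N = len(works) is the number of authors, and H damps the score's
    impact on the final rank (H = 10 as chosen in the thesis)."""
    n = len(works)
    return (sum(works) + sum(citations)) / (n * h)
```

For example, a paper with two authors who have 3 and 5 publications and 10 and 20 citations gets AU = (8 + 30) / (2 * 10) = 1.9.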
The final ranking score is calculated using the equation:

SRP-Rank = PRi / (1 + log A) + AU    (3.3)

where A is the age of the SRP and AU is the author score.
PageRank assigns a score to each paper based on its citations. To balance citation and age without neglecting the citation metric or giving the citations (PR scores) too much weight, the PageRank score is divided by the log of the paper's age.
The pseudocode of the proposed ranking algorithm is the following:

Procedure: Scientific Research Ranking R
Required:
Ti = Title.
Ac = Authors' citation count.
Aw = Authors' number of works.
N = Number of authors of each paper.
H = Constant (set to 10).
Di = Date of publication.
PR = PageRank score.
D = Current year (2015).
1: for each paper in the dataset
2:   initialize AU, A, R to 0.0
3:   get N[current paper], Ac[current paper], Aw[current paper], Di[current paper], PR[current paper]
4:   compute AU = (Ac + Aw) / (N * H)
5:   compute A = D − Di
6:   compute Scientific Research Ranking R = PR / (1 + log(A)) + AU
7: end for
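The procedure above can be sketched in Python as follows. The logarithm base is assumed to be 10, and the age is floored at one year to avoid log(0) for papers published in the current year; both choices are assumptions, since the thesis does not specify them.

```python
import math

def srp_rank(pr, year, works, citations, h=10, current_year=2015):
    """SRP-Rank per equation (3.3): PageRank score per age plus author score.
    pr is the paper's PageRank score, year its publication year, and
    works/citations the per-author publication and citation counts."""
    age = max(current_year - year, 1)       # assumed floor: avoid log(0)
    au = (sum(works) + sum(citations)) / (len(works) * h)
    return pr / (1 + math.log10(age)) + au  # log base 10 is an assumption
```

For a paper from 2005 (age 10) with PR score 0.5 and the two-author example above, this gives 0.5 / (1 + 1) + 1.9 = 2.15.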
3.4 Results
To use the results later in the evaluation process, a query was used to narrow the list, since it contains articles on different subjects that are not all related to each other. The query used was "nineteenth"; the retrieved list contained 41 articles.
Both the proposed ranking algorithm and PR were used to rank this list so that the results could be evaluated.
3.5 Evaluation
Evaluating a ranking algorithm is a challenging procedure for several reasons:
- There is no comprehensive evaluation metric acknowledged by the academic community [31].
- There is no ground truth for an article's real rank [21].
- Which ranking algorithm behaves better is subjective: one algorithm's results may be satisfying for one user but not for another [25].
3.5.1 Distribution of Ranked SRPs among the Age of the Paper
To evaluate whether the proposed SRP-Rank method is less biased against new papers and to examine the effect of age on the distribution of the top-ranked papers, charts of all 41 papers in the ranked list were used. Figures 3.7 and 3.8 show the distribution of the proposed ranking method and the distribution of PR among the age of the paper.
Figure 3.7: Distribution of The Proposed Rank method among the Age of the Paper.
Figure 3.8: Distribution of PR among the Age of the Paper.
In figure 3.8, the distribution of ranked SRPs using PR among the age of the paper shows that the PageRank algorithm is biased against new papers. In figure 3.7, the distribution of the ranked SRPs using the proposed method shows that the proposed method is less biased against new papers, unlike PR.
3.5.2 Distribution of Ranked SRPs among the Citation
To evaluate the citation effect on the proposed method, charts of the ranked list were used. Figures 3.9 and 3.10 show the distribution of the proposed ranking method and the distribution of PR among the citation count.
Figure 3.9: Distribution of The Proposed Rank method among the citation count.
Figure 3.10: Distribution of PR among the Citation Count.
In figure 3.10, the distribution of ranked SRPs using PR among the citation count shows that the PageRank algorithm relies heavily on citation count. In figure 3.9, the distribution of the ranked SRPs using the proposed method shows that the proposed method depends less on citation count than PR. For example, the proposed SRP ranking method can give papers high ranks even when they do not have many citations, whereas PR gives a low rank to papers with no citations, as in figure 3.10.
3.5.3 Recall and Precision
In IR, precision is the ability to return relevant documents from the set of retrieved documents [34]. It is calculated by dividing the size of the overlap between the retrieved and relevant sets by the number of documents retrieved [1], as shown in the following equation:

Precision = |Relevant Set ∩ Retrieved Set| / |Retrieved Set|    (3.4) [1]

Recall is the ability to return as many of the relevant documents as possible [34]. It is calculated by dividing the size of the overlap between the retrieved and relevant sets by the number of relevant documents [1]:

Recall = |Relevant Set ∩ Retrieved Set| / |Relevant Set|    (3.5) [1]
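The set-based definitions in equations (3.4) and (3.5) can be sketched directly:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall per equations (3.4) and (3.5)."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three relevant ones gives precision 0.5 and recall 2/3.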
These measures are used in IR with binary relevance judgments (e.g., seminal/non-seminal) to measure the relevance of a set of retrieved items and to evaluate the performance of information retrieval systems. A modified version of these measures was presented by [35] to evaluate web service ranking methods. According to [35], precision is calculated by dividing the highest rank score by the total rank score of all papers in the list, as shown in equation (3.6):

Precision = Highest rank score / Total rank score of all papers    (3.6) [35]

Recall is found by dividing the highest rank score by the second-highest rank score:

Recall = Highest rank score / Score of the 2nd-highest paper    (3.7) [35]
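A sketch of the modified measures from [35], as equations (3.6) and (3.7) describe them, operating on the list of rank scores produced by an algorithm:

```python
def rank_precision(scores):
    """Modified precision per (3.6): highest rank score over the total."""
    return max(scores) / sum(scores)

def rank_recall(scores):
    """Modified recall per (3.7): highest score over the second-highest."""
    top_two = sorted(scores, reverse=True)[:2]
    return top_two[0] / top_two[1]
```

For scores [4.0, 2.0, 1.0, 1.0] this gives precision 4/8 = 0.5 and recall 4/2 = 2.0; a larger value of either measure means the top-ranked paper stands out more from the rest of the list.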
The analysis and comparison of the proposed algorithm and PR based on this evaluation metric are shown in table 3.2. This metric also shows that the proposed method offers a better solution for ranking SRPs than the basic PageRank algorithm.

Table 3.2: Precision and recall of the PageRank and the proposed SRP-Rank algorithms

            Precision      Recall
PR          0.025813562    1.018441539
SRP-Rank    0.082599556    1.266729551

The recall value of the proposed SRP ranking method shows that it gives the top-ranked papers a clearly higher rank, which provides further evidence that the proposed method distinguishes the top-ranked paper. The precision value achieved by the proposed method is also higher than that of PR: the distribution of ranks is biased towards the higher-ranked papers. This makes it possible to identify interesting papers on certain topics faster than with PR.
Chapter 4: Conclusion and Future Work
This chapter discusses the results of the proposed ranking algorithm and the future work.
4.1 Results and Conclusion
In this thesis, a scientific research paper ranking algorithm was proposed to balance the impact of the PR score on old and new papers. It aims to improve the results of PR ranking by solving the problem of favoring old papers, making it more suitable for ranking scientific research papers. The results show that the proposed ranking algorithm does not rely as heavily on citation count as PR does, and that it treats both old and new papers neutrally.
On the other hand, there is no specific evaluation method for ranking results that is recognized by the scientific community, and satisfaction with ranking results varies from person to person. An evaluation method for finding the precision and recall of ranking results, proposed by [35], was tested in this thesis. The proposed method achieved higher precision and recall than PageRank. The distribution of ranked SRPs among the age of the paper showed that the proposed SRP-Rank succeeded in giving scores that are unbiased between old and new papers, whereas PageRank gives old papers higher scores than new ones, because old papers have had more opportunities to be cited.
4.2 Future work
In terms of future work, several directions can be explored:
- Testing the proposed method on more datasets with different queries.
- Conducting a user survey to evaluate the method's results.
- Experimenting with different parameters, such as place of publication, to tune the rankings better.
- Conducting more types of evaluation of the results, including query evaluation.
REFERENCES
[1] Levene, M. (2010) An Introduction To Search Engines And Web Navigation, (2nd
ed.) New Jersey: A John Wiley & Sons, Inc.
[2] Maurya, V., Pandey, P. and Maurya L.S. (2013) Effective Information Retrieval
System. International Journal of Emerging Technology and Advanced Engineering,
vol. 3, no. 4, pp. 787-792.
[3] Manning, C. Raghavan, P. Schutze, H. (2008) Introduction to Information
Retrieval. New York: Cambridge University Press.
[4] Ceri S, Bozzon A, Brambilla M, Della Valle E, Fraternali P, Quarteroni S. Web
Information Retrieval. Heidelberg: Springer; 2013.
[5] Baeza-Yates R, and Ribeiro-Neto B, (2011), Modern Information Retrieval: The
Concepts and Technology behind Search. Boston: Addison-Wesley Professional, 2nd
Ed.
[6] Frakes W.B. and Baeza-Yates R. (1992), Information Retrieval: Data Structures
and Algorithms. New Jersey: Prentice Hall, 1st Ed.
[7] Liu T.Y. (2011), Learning to Rank for Information Retrieval. Berlin. Heidelberg:
Springer.
[8] Hiemstra D. (2000,) Using Language Models for Information Retrieval.
Netherlands: Taaluitgeverij Neslia Paniculata.
[9] Salton G, Wong, A and Yang C.S. (1975) A Vector Space Model for Automatic
Indexing. Communications of the ACM, Vol. 18, No. 11, pp 613–620.
[10] Deerwester S, Dumais S, Landauer T, Furnass G, and Beck L, (1988) Improving
Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st
Annual Meeting of the American Society for Information Science, Vol. 25, pp. 36-
40.
[11] Robertson S.E, Walker S, Jones S, Hancock-Beaulieu M, and Gatford M. (1994)
Okapi at TREC-3. Proceedings of the Third Text Retrieval Conference,
Gaithersburg, USA.
[12] Ponte J, and Croft W.B. (1998) A Language Modeling Approach to Information
Retrieval. In Proceedings of the 21st International Conference on Research and
Development in Information Retrieval, pp. 275–281.
[13] Devi P, Gupta A, and Dixit A, (2014). Comparative Study of HITS and
PageRank Link Based Ranking Algorithms. International Journal of Advanced
Research in Computer and Communication Engineering, Vol 3, Issue 2, PP. 5749 -
5754.
[14] Kleinberg J, (1999). Authoritative Sources in a Hyperlinked environment.
Journal of the ACM, Vol. 46, No. 5, pp. 604-632.
[15] Brin S, and Page L, (1998) The Anatomy of a Large-Scale Hypertextual Web
Search Engine. Computer Networks, Vol. 30, pp. 107-117.
[16] Beel, J. and Gipp, B. (2009) Google Scholar‘s Ranking Algorithm: An
Introductory Overview, 12th International Conference on Scientometrics and
Informetrics, Vol. 1, PP. 230-241.
[17] Anh, V.L. Hoang, H.V. Trung, H.L. Trung, K.L. and Jung, J.J (2014) Evaluating
Scientific Publications By N-Linear Ranking Model, ANNALES Universitatis
Scientiarum, Sectio Computatorica, Vol. 43, PP. 123-147.
[18] Sohn, B.S. and Jung, J. (2015) A Novel Ranking Model for a Large-Scale
Scientific Publication. Mobile Networks and Applications, Vol. 20, Issue 4, PP
508-520.
[19] Due M, Bai F, and Liu Y, (2009). PaperRank: A Ranking Model for Scientific
Publication. IEEE World Congress on Computer Science and Information
Engineering, Vol 4, PP. 277- 281.
[20] Jiang X, Sun X, Zhuge H (2012) Towards an Effective and Unbiased Ranking of
Scientific Literature through Mutual Reinforcements. 21st ACM Conference on
Information and Knowledge Management, Hawaii, USA, pp 714–723.
[21] Wang, Y. Tong, Y. and Zeng, M (2013) Ranking Scientific Articles by Exploiting
Citations, Authors, Journals, and Time Information. Proceedings of the
Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue,
Washington, USA, PP. 933 - 939.
[22] Shubhankar, K. Singh, A. and Pude, V. (2011), An Efficient Algorithm for Topic
Ranking and Modeling Topic Evolution. Database and Expert Systems
Applications, Vol 6860, PP 320-330.
[23] Haddadene, H. Harik, H and Salhi, S. (2012) On the Pagerank Algorithm for the
Articles Ranking. Proceedings of the World Congress on Engineering, Vol I,
London, U.K.
[24] Krapivin, M and Marchese, M. (2008), Focused Page Rank in Scientific Papers
Ranking. Digital Libraries: Universal and Ubiquitous Access to Information, Vol
5362, PP 144-153.
[25] Sidiropoulos A, and Manolopoulos Y. (2006) Generalized Comparison of Graph-
Based Ranking Algorithms for Publications and Authors. The Journal of
Systems and Software, Vol. 79, PP. 1679 – 1700.
[26] Sun, Y. and Giles C.L. (2007), Popularity Weighted Ranking for Academic
Digital Libraries. 29th European Conference on IR Research, Rome, Italy, PP.
605-612.
[27] Chen, P. Xie, H. Maslov, S. and Redner, S. (2007) Finding Scientific Gems with
Google’s PageRank Algorithm. Elsevier, Journal of Informatics, Vol. 1, PP. 8 -15.
[28] Walker D, Xie H, Yan KK, and Maslov S, (2007) Ranking Scientific Publications
Using a Model of Network Traffic. Journal of Statistical Mechanics: Theory and
Experiment.
[29] Sayyadi, H., and Getoor, L. (2009) Futurerank: Ranking Scientific Articles by
Predicting Their Future PageRank. Ninth SIAM International Conference on
Data Mining, PP. 533–544.
[30] Singh, A. Shubhankar, K. and Pudi, V. (2011) An Efficient Algorithm for
Ranking Research Papers Based on Citation Network. 3rd Conference on Data
Mining and Optimization, Putrajaya, PP. 88 -95.
[31] Dunaiski D, and Visser W, (2012). Comparing Paper Ranking Algorithms.
Proceedings of the South African Institute for Computer Scientists and Information
Technologists Conference, Pretoria, South Africa, PP. 21-30.
[32] Hwang, W.S, Chae, S.M, and Kim, W.K. (2010), Yet Another Paper Ranking
Algorithm Advocating Recent Publications. 19th International Conference on
World Wide Web, Raleigh, North Carolina, USA.
[33] Mushtaq, H. (2014) Scientific Research Paper Ranking Algorithm PTRA: A
Tradeoff between Time and Citation Network. Applied Mechanics and
Materials. Vol 551, PP. 603-611.
[34] Bose, A. Nayak, R. and Bruze, P. (2008) Improving Web Service Discovery by
using Semantic Models. 9th International Conference on Web Information
Systems Engineering. Auckland - New Zealand. PP 366 - 380.
[35] Manoharan, R. Archana, A. and Cowlagi, S. N. (2011) Hybrid Web Services
Ranking Algorithm. IJCSI International Journal of Computer Science Issues, Vol.
8, Issue 3, No. 2, PP 452 – 460.
Appendices
Sample of Dataset
FN Thomson Reuters Web of Knowledge
VR 1.0
PT J
AU Bertomeu-Sanchez, JR
AF Ramon Bertomeu-Sanchez, Jose
TI Managing Uncertainty in the Academy and the Courtroom Normal Arsenic and Nineteenth-Century
Toxicology
LA English
DT Article
ID KNOWLEDGE; SCIENCE
AB This essay explores how the enhanced sensitivity of chemical tests sometimes produced unforeseen and
puzzling problems in nineteenth-century toxicology. It focuses on the earliest uses of the Marsh test for
arsenic and the controversy surrounding "normal arsenic"-that is, the existence of traces of arsenic in healthy
human bodies. The essay follows the circulation of the Marsh test in French toxicology and its appearance in
the academy, the laboratory, and the courtroom. The new chemical tests could detect very small quantities of
poison, but their high sensitivity also offered new opportunities for imaginative defense attorneys to
undermine the credibility of expert witnesses. In this context, toxicologists had to dispel the uncertainty
associated with the new method and come up with arguments to refute the many possible criticisms of their
findings, among them the appeal to normal arsenic. Meanwhile, new descriptions of animal experiments,
autopsies, and cases of poisoning produced a steady flow of empirical data, sometimes supporting but in
many cases questioning previous conclusions about the reliability of the chemical tests. This challenging
scenario provides many clues about the complex interaction between science and the law in the nineteenth
century, particularly how expert authority, credibility, and trustworthiness were constructed, and frequently
challenged, in the courtroom.
C1 Inst Hist Med & Sci Lopez Pinero, Valencia 46003, Spain.
RP Bertomeu-Sanchez, JR (reprint author), Inst Hist Med & Sci Lopez Pinero, Pl Cisneros 4, Valencia
46003, Spain.
FU Spanish government [HAR2009-12918-C03-03]
FX This essay is part of a larger study on nineteenth-century toxicology supported by the Spanish
government (HAR2009-12918-C03-03). I am very grateful to the staff of the Bibliotheque Interuniversitaire
de Sante,
Paris, who helped me with many relevant sources for this paper, and to the Chemical Heritage Foundation
(CHF), Philadelphia, in whose library this essay took shape thanks to two short-term fellowships (July-
August
2010 and March 2011). Marjorie Gapp and Amanda Antonucci helped me with the impressive art collection
of the CHF. I am also indebted to the organizers of and participants in the meetings in which earlier versions
of the essay were discussed and to Jose Pardo Tomas and Josep Simon Castel for their insightful comments
and suggestions. The late Josep Miquel Vidal, president of the scientific section of the Institut
Menorqui d'Estudis, enthusiastically supported the development of this project. I would like also to
acknowledge the anonymous referees for Empire.
CR ASHMORE M, 1993, SOC STUD SCI, V23, P67, DOI 10.1177/030631293023001003
Barse Jules, 1843, J CHIMIE MED, V9, P571
Barse Jules, 1845, MANUEL COUR ASSISES, P151
Bertomeu Jose R., 2011, M ORFILA AUTOBIOGRAF, P192Bertomeu-Sanchez Jose R., 2006, CHEM MED
CRIME M ORF
Bertomeu-Sanchez Jose Ramon, 2006, CHEM TECHNOLOGY SOC, P300
Bertomeu-Sanchez JR, 2012, ANN SCI, V69, P1, DOI 10.1080/00033790.2011.637471
Bertrand Gabriel, 1903, RECHERCHES EXISTENCE
Bloch Magali, 1997, RECHERCHES CONT, V4, p[101, 119]
Blyth Alexander Wynter, 1884, POISONS THEIR EFFECT, P531
Borie Leonard, 1841, CATECHISME TOXICOLOG, P71
Buchner Johannes A., 1839, REP PHARM, V17, P123
Burnett D. Graham, 2007, EMPIRE, V98
Burney I, 2006, POISON DETECTION VIC
Bussy Antoine, 1840, REPONSE ECRITS M RAS
CAMPBELL WA, 1965, CHEM BRIT, V1, P198
Caventou J. B., 1839, B ACAD ROY MED BELG, V4, P275
Chauvaud, EXPERTS EXPERTISE JU, P243
Chauvaud Frederic, 2003, EXPERTS EXPERTISE JU, P192
Chauvaud Frederic, 2000, EXPERTS CRIME MED LE
Christison Robert, 1845, TREATISE POISONS, P289
Coley N. G., 1986, MED HIST, V30, p[173, 181]
Coley Noel G., 1838, J PHARM, V24, P500
Coley Noel G., 1991, MED HIST, V35, p[409, 421]
Coley Noel G., 1837, J PHARM, V23, P553
Coley Noel G., 1837, ANN PHARM CHEM, V23, P217
Collins H, 2007, RETHINKING EXPERTISE
Collins HM, 2010, TACIT EXPLICIT KNOWL
Couerbe, 1840, GAZETTE HOPITAUX, V13, P106
Couerbe Jean-Pierre, 1840, GAZETTE HOPITAUX, V13, P485
Crosland Maurice, 1992, SCI CONTROL FRENCH A
Cullen WR, 2008, IS ARSENIC AN APHRODISIAC?: THE SOCIOCHEMISTRY OF AN ELEMENT, P1,
DOI 10.1039/9781847558602
Danger F. P., 1843, CR HEBD ACAD SCI, V17, P153
Devergie Alphonse, 1845, ANN HYG PUBLIQUE MED, V33, P142
Devergie Alphonse, 1836, TRAITE THEORIQUE PRA, V1, P15
Devergie Alphonse, 1840, ANN HYG PUBL, V24, P136
Devergie Alphonse, 1836, TRAITE THEORIQUE PRA, V1, P17
Donovan James M., 2010, JURIES TRANSFORMATIO, p[5, 37]
Emsley J, 2005, ELEMENTS MURDER HIST
Engelhardt Hugo T., 1987, SCI CONTROVERSIES CA
Essig Mark, 2002, RESEARCH CORNELL U
Flandin Charles, 1846, TRAITE POISONS, V1, P734
Gautier, 1904, CR HEBD ACAD SCI, V139, P101
Gautier, 1899, CR HEBD ACAD SCI, V129, p[929, 935]
Gautier A, 1902, CR HEBD ACAD SCI, V134, P1394
Gautier Armand, 1876, ANN CHIMIE PHYS, V7, P384
Gavroglu K, 2008, HIST SCI, V46, P153
Gerber Samuel M., 1997, MORE CHEM CRIME MARS
Golan Tal, 2004, LAWS MAN LAWS NATURE
Guignard Laurence, 2010, JUGER FOLIE FOLIE CR
Hirsch Adolf G., 1842, ARSENIK, P43
Huber P. W., 1991, GALILEOS REVENGE JUN, P28
Jasanoff S, 1995, SCI BAR LAW SCI TECH
Kaiser D., 2005, DRAWING THEORIES APA
La Berge Ann F., 2004, Perspectives on Science, V12, P424
LANGMUIR I, 1989, PHYS TODAY, V42, P36, DOI 10.1063/1.881205
Latour B., 1987, SCI ACTION FOLLOW SC
Leclerc Olivier, 2005, JUGE EXPERT CONTRIBU
Lefevre Andre, 1913, EXPERTISE DEVANT JUR, P48
Lynch Michael, 2008, TRUTH MACHINE CONTEN
Machamer Peter K., 2000, SCI CONTROVERSIES PH
Marsh J., 1836, EDINBURGH NEW PHILOS, V21, P229
Mata Pedro, 1844, VADEMECUM MED CIRUGI, P636
Mercer David, 2002, CAUSATION LAW MED, P83
OLESKO KM, 1993, OSIRIS, V8, P16, DOI 10.1086/368716
Orfila, 1839, B ACAD ROY MED BELG, V4, p[178, 179]
Orfila, 1839, B ACAD ROY MED BELG, V3, p[426, 464]
Orfila, 1838, B ACAD ROY MED BELG, P161
Orfila, 1839, ARCH GEN MED, V4, p[373, 375]
Orfila Mateu, 1852, TRAITE TOXICOLOGIE, V1, P544
Orfila Mateu, 1831, TRAITE EXHUMATIONS J
Orfila Mateu, 1841, RAPPORT MOYENS CONST, P42
Orfila Mateu, 1839, B ACAD ROY MED BELG, V3, p[676, 682]
Orfila Mateu, 1840, B ACAD ROY MED BELG, V5, p[465, 474]
Orfila Mateu, 1838, B ACAD ROY MED BELG, V3, P93
Orfila Mateu, 1844, ANN HYG PUBLIQUE MED, V31, P131
Orfila Mateu, 1839, B ACAD ROYALE MED, V3, P676
Orfila Mateu, 1839, B ACAD ROY MED BELG, V3, P1049
Orfila Mateu, 1818, TRAITE POISONS, V1, P15
Orfila Mateu, 1844, ANN HYG PUBLIQUE MED, V31, p[430, 435]
Orfila Mateu, 1839, EXPERIENCE, V91, P208
Orfila Mateu, 1840, ANN HYG PUBLIQUE MED, V24, p[298, 312]
Orfila MM., 1842, ANN HYGIENE PUBLIQUE, V28, p[148, 152]
Pfaff Christian H., 1841, REP PHARM, V24, P106
Raspail Francois-Vincent, 1840, ACCUSATION EMPOISONN, P24
Raynaud Dominique, 2003, SOCIOLOGIE CONTROVER
Rees George Owen, 1841, GUYS HOSP REP, V6, P163
Reinsch Hugo, 1843, ARSENIK, P43
Rognetta, NOUVELLE METHODE TRA, P20
Schmidtmann Adolf, 1905, HDB GERICHTLICHEN ME, P913
Secord JA, 2004, EMPIRE, V95, P654, DOI 10.1086/430657
Taruffo, 2002, PRUEBA HECHOS
Taruffo Michelle, 2005, B MEX DERECHO COMPAR, V38, P1285
Taylor Alfred S., 1848, POISONS RELATION MED, P350
Topham Jonathan, 2009, POPULARIZING SCI TEC, P1
Usselman MC, 2005, ANN SCI, V62, P1, DOI 10.1080/00033790410001711922
Wagner J. H., 1952, PRO MEDICO, V21, P161
Watson K, 2004, POISONED LIVES ENGLI
Watson KD, 2011, FORENSIC MEDICINE IN WESTERN SOCIETY: A HISTORY, P1
Weisz G., 1995, MED MANDARINS FRENCH
Whorton James C., 2010, ARSENIC CENTURY VICT
World Health Organisation, 1996, TRAC EL HUM NUTR HLT, P217
NR 134
TC 0
Z9 0
PU UNIV CHICAGO PRESS
PI CHICAGO
PA 1427 E 60TH ST, CHICAGO, IL 60637-2954 USA
SN 0021-1753
Screenshot of the Data Extracted from the Dataset
Sample of the Paper Citation Network

*Vertices 52479
1 "Bertomeu Jose R., 2011, M Orfila Autobiograf, P192" localcitationcount 1
2 "Gautier, 1899, Cr Hebd Acad Sci, V129, P[929, 935]" localcitationcount 1
3 "Guignard Laurence, 2010, Juger Folie Folie Cr" localcitationcount 1
4 "Langmuir I, 1989, Phys Today, V42, P36, Doi 10.1063/1.881205" localcitationcount 1
5 "Ashmore M, 1993, Soc Stud Sci, V23, P67, Doi 10.1177/030631293023001003" localcitationcount 1
6 "Orfila, 1839, Arch Gen Med, V4, P[373, 375]" localcitationcount 1
7 "Marsh J., 1836, Edinburgh New Philos, V21, P229" localcitationcount 1
8 "Couerbe Jean-pierre, 1840, Gazette Hopitaux, V13, P485" localcitationcount 1
9 "Bertomeu-sanchez Jr, 2012, Ann Sci, V69, P1, Doi 10.1080/00033790.2011.637471" localcitationcount 1
10 "Couerbe, 1840, Gazette Hopitaux, V13, P106" localcitationcount 1
11 "Reinsch Hugo, 1843, Arsenik, P43" localcitationcount 1
12 "Taruffo Michelle, 2005, B Mex Derecho Compar, V38, P1285" localcitationcount 1
13 "Orfila, 1839, B Acad Roy Med Belg, V3, P[426, 464]" localcitationcount 1
14 "Olesko Km, 1993, Osiris, V8, P16, Doi 10.1086/368716" localcitationcount 1
15 "[anonymous], 1839, Gaz Hopitaux, V12, P409" localcitationcount 1
16 "Chauvaud Frederic, 2003, Experts Expertise Ju, P192" localcitationcount 1
17 "[anonymous], 1839, B Acad Rooy Med, V3, P683" localcitationcount 1
18 "[anonymous], 1840, Gaz Tribunaux 0606, V15, P761" localcitationcount 1
19 "[anonymous], 1841, Rev Sci, V7, P261" localcitationcount 1
20 "Donovan James M., 2010, Juries Transformatio, P[5, 37]" localcitationcount 1
21 "Lefevre Andre, 1913, Expertise Devant Jur, P48" localcitationcount 1
22 "Orfila Mateu, 1839, B Acad Roy Med Belg, V3, P[676, 682]" localcitationcount 1
23 "Rognetta, Nouvelle Methode Tra, P20" localcitationcount 1
24 "Devergie Alphonse, 1845, Ann Hyg Publique Med, V33, P142" localcitationcount 1
25 "Whorton James C., 2010, Arsenic Century Vict" localcitationcount 1
26 "Collins H, 2007, Rethinking Expertise" localcitationcount 2
27 "Jasanoff S, 1995, Sci Bar Law Sci Tech" localcitationcount 3
28 "Hirsch Adolf G., 1842, Arsenik, P43" localcitationcount 1
29 "Leclerc Olivier, 2005, Juge Expert Contribu" localcitationcount 1
30 "Machamer Peter K., 2000, Sci Controversies Ph" localcitationcount 1
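Each vertex line above follows the same pattern: a numeric id, a quoted label (author, year, abbreviated source), and a `localcitationcount` attribute. A minimal Python sketch for reading such lines (the helper name `parse_vertex` is illustrative, not part of the thesis code):

```python
import re

# One vertex line looks like:
#   <id> "<label>" localcitationcount <count>
VERTEX_RE = re.compile(r'^\s*(\d+)\s+"([^"]+)"\s+localcitationcount\s+(\d+)\s*$')

def parse_vertex(line):
    """Return (id, label, local citation count), or None if the line
    does not match the vertex pattern."""
    m = VERTEX_RE.match(line)
    if m is None:
        return None
    vid, label, count = m.groups()
    return int(vid), label, int(count)

line = '26 "Collins H, 2007, Rethinking Expertise" localcitationcount 2'
print(parse_vertex(line))  # (26, 'Collins H, 2007, Rethinking Expertise', 2)
```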
MATLAB Code to Calculate PageRank from the Paper Citation Network

% Build the sparse citation matrix C from the edge list 'cit'
% (columns 1-2: paper ids, column 3: edge weight)
n = max(max(cit(:, 1:2)));
C = sparse(cit(:, 2), cit(:, 1), cit(:, 3), n, n);

function p = calc_PageRank(C, alpha, n_iterations)
% Power-iteration PageRank with damping factor alpha
m = sum(C, 2);
C(m == 0, :) = 1;                   % dangling nodes link to every node
n = length(C);
m = sum(C, 2);
C = spdiags(1 ./ m, 0, n, n) * C;   % row-normalize to a stochastic matrix
p = repmat(1 / n, [1 n]);           % uniform initial rank vector
for i = 1:n_iterations
    p = alpha * p * C + (1 - alpha) / n;
end
end
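For readers without MATLAB, the same power iteration can be sketched in Python with NumPy. This is a translation of the MATLAB code above under the assumption that the citation matrix is small enough to hold densely; the function name `pagerank` and the toy 3-node graph are illustrative:

```python
import numpy as np

def pagerank(C, alpha=0.85, n_iterations=100):
    """Power-iteration PageRank, mirroring calc_PageRank above."""
    C = np.array(C, dtype=float)
    n = C.shape[0]
    # Dangling nodes (rows with no outgoing links) link to every node
    C[C.sum(axis=1) == 0, :] = 1.0
    # Row-normalize to obtain a stochastic transition matrix
    C = C / C.sum(axis=1, keepdims=True)
    p = np.full(n, 1.0 / n)          # uniform initial rank vector
    for _ in range(n_iterations):
        p = alpha * p @ C + (1 - alpha) / n
    return p

# Toy citation graph: paper 0 cites 1 and 2, paper 1 cites 2, paper 2 cites 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
ranks = pagerank(A)
print(ranks)  # scores sum to 1
```

Starting from a uniform vector that sums to 1, each iteration preserves that sum, so the result is a probability distribution over papers.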
Sample of the proposed method results (Top 15 ranked papers)

Author  Year  Title  Score
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  22.925
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  18.09778574
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  11.83437724
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  11.49795406
Worboys, M  2011  Practice and the Science of Medicine in the Nineteenth Century  10.01692389
Morus, IR  2006  Seeing and believing science  8.457623794
Kohlstedt, SG  2005  "Thoughts in things": modernity, history, and North American museums  7.719776119
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  6.972314671
Lucier, P  2012  The Origins of Pure and Applied Science in Gilded Age America  6.917488814
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  6.57518597
Goldstein, D  2008  Outposts of science - The knowledge trade and the expansion of scientific community in post-Civil War America  6.517109456
Cantor, G  2012  Science, Providence, and Progress at the Great Exhibition  6.417488814
Bertomeu-Sanchez, JR  2013  Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology  6.364267563
Sample of the ranked results using PR (Top 15)

Author  Year  Title  Score
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  8.75
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  8.591558442
Morus, IR  2006  Seeing and believing science  8.515853659
Lightman, B  2000  The visual theology of Victorian popularizers of science - From reverent eye to chemical retina  8.488461538
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  8.45
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  8.441666667
Klein, U  2008  The Laboratory Challenge: Some Revisions of the Standard View of Early Modern Experimentation  8.43125
Alexander, AR  2006  Tragic mathematics - Romantic narratives and the refounding of mathematics in the early nineteenth century  8.410869565
Tucker, J  2006  The historian, the picture, and the archive  8.364285714
Elshakry, M  2010  When Science Became Western: Historiographical Reflections  8.316666667
Portolano, M  2000  John Quincy Adams's rhetorical crusade for astronomy  8.316666667
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  8.314634146
Canizares-Esguerra, J  2005  Iberian colonial science  8.305555556
Nyhart, LK  1998  Civic and economic zoology in nineteenth-century Germany - The "living communities" of Karl Mobius  8.286986301
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  8.275
Sample of the ranked results using Citation Count (Top 15)

Author  Year  Title  Score
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  34
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  27
Lightman, B  2000  The visual theology of Victorian popularizers of science - From reverent eye to chemical retina  22
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  18
Morus, IR  2006  Seeing and believing science  15
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  12
Lucier, P  2009  The Professional and the Scientist in Nineteenth-Century America  12
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  11
Nyhart, LK  1998  Civic and economic zoology in nineteenth-century Germany - The "living communities" of Karl Mobius  10
Mazzotti, M  1998  The geometers of god - Mathematics and reaction in the kingdom of Naples  10
Klein, U  2008  The Laboratory Challenge: Some Revisions of the Standard View of Early Modern Experimentation  9
Schloegel, JJ|Schmidgen, H  2002  General physiology, experimental psychology, and evolutionism - Unicellular organisms as objects of psychophysiological research, 1877-1918  9
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  7
Canizares-Esguerra, J  2005  Iberian colonial science  7
Alexander, AR  2006  Tragic mathematics - Romantic narratives and the refounding of mathematics in the early nineteenth century  6