TRANSCRIPT
An Efficient Ranking Algorithm for Scientific Research Papers
By
Fathi Mahmoud Fathi Al-Hattab
Supervisor:
Dr. Mohammad Hassan
Co-supervisor:
Dr. Yaser Al-laham
This Thesis is submitted in Partial Fulfillment of the Requirements for the
Master's Degree in Computer Science
Faculty of Graduate Studies
Zarqa University-Jordan
March, 2016
ACKNOWLEDGEMENTS
Prior to acknowledgments, I must glorify Allah the Almighty for giving me courage and patience
to carry out this work successfully.
I would like to express my deepest gratitude to my supervisors, Dr. Mohammad Hassan and
Dr. Yaser Al-laham. I gratefully acknowledge Zarqa University for the scholarship that I have
received to pursue my master's degree.
And finally, my dear father! How can I ever thank you? Your endless love and support fill me
with life. You are my source of inspiration and I am forever indebted to you. Loads of love and
thanks to you!
TABLE OF CONTENTS
LIST OF TABLES ....................................................................................................................... vi
LIST OF FIGURES .................................................................................................................... vii
LIST OF ACRONYMS ............................................................................................................. viii
Chapter 1: Introduction ............................................................................................................... 1
1.1 Overview ............................................................................................................................... 1
1.2 Problem Definition ................................................................................................................ 2
1.3 Thesis Motivation ................................................................................................................. 2
1.4 Thesis Objectives .................................................................................................................. 3
1.5 Thesis Outline ...................................................................................................................... 3
Chapter 2: Background and Literature Survey ......................................................................... 4
2.1 Outline ................................................................................................................................... 4
2.2 Information Retrieval ............................................................................................................ 4
2.2.1 Ranking Process ............................................................................................................. 5
2.3 Ranking Models .................................................................................................................... 5
2.3.1 Relevance Ranking Algorithms ..................................................................................... 6
2.3.2 Important Ranking Models ............................................................................................ 9
2.4 PageRank Algorithm ............................................................................................................. 9
2.5 SRPs Ranking Algorithms and Related Work .................................................................... 13
2.5.1 PageRank with N-Star Ranking Model ........................................................................ 13
2.5.2 PageRank with HITS Model ........................................................................................ 14
2.5.3 PageRank with Closed Frequent Keyword-set ............................................................ 14
2.5.4 PageRank with Jaccard Index ...................................................................................... 14
2.5.5 PageRank with Focused Surfer Model ........................................................................ 15
2.5.6 Modified PageRank ..................................................................................................... 15
2.5.7 PageRank with CiteRank ............................................................................................. 17
2.5.8 Random walk with Restart ........................................................................................... 17
2.6 Chapter Summary ............................................................................................................... 20
Chapter 3: Experiment and Evaluation .................................................................................... 22
3.1 Introduction ......................................................................................................................... 22
3.2 Dataset ................................................................................................................................. 22
3.3 Implementation ................................................................................................................... 23
3.3.1 Data Preparation and Extracting .................................................................................. 24
3.3.2 Extracting Citation Networks ...................................................................................... 25
3.3.3 Calculation of Scientific Research Ranking SRR Score.............................................. 27
Procedure: Scientific Research Ranking R ........................................................................... 29
3.4 Results ................................................................................................................................. 30
3.5 Evaluation ........................................................................................................................... 30
3.5.1 Distribution of Ranked SRPs among the Age of the Paper ......................................... 30
3.5.2 Distribution of Ranked SRPs among Citation ............................................................. 31
3.5.3 Recall and Precision ..................................................................................................... 33
Chapter 4: Conclusion and Future Work ................................................................................. 35
4.1 Results and Conclusion ....................................................................................................... 35
4.2 Future work ......................................................................................................................... 36
REFERENCES ............................................................................................................................ 37
Appendices .................................................................................................................... 44
LIST OF TABLES
Table 2.1: Limitations of Ranking Models ....................................................................... 12
Table 2.2: Summary of SRPs Ranking Algorithms .......................................................... 17
Table 3.1: Distribution of Paper Publication Dates .......................................................... 23
Table 3.2: Results of Precision and Recall for both PR and the proposed SRP Rank
algorithms ......................................................................................................................... 34
LIST OF FIGURES
Figure 2.1: The Basic Process in IR System ....................................................................... 4
Figure 2.2: Ranking Models ................................................................................................ 6
Figure 2.3: Example of PageRank ..................................................................................... 11
Figure 3.1: Sample of Original Dataset ............................................................................. 25
Figure 3.2: Dataset Information Arranged into Database .................................................. 24
Figure 3.3: Dataset After Preprocessing ............................................................................ 24
Figure 3.4: Sample of the Paper Citation Network ........................................................... 25
Figure 3.5: Sample of the Information Extracted from the Author-Paper Graph ............. 26
Figure 3.6: Sample of the Information Used in Ranking Score Calculation .................... 26
Figure 3.7: Distribution of the Proposed Rank Method among the Age of the Paper ...... 31
Figure 3.8: Distribution of PR among the Age of the Paper ............................................. 31
Figure 3.9: Distribution of the Proposed Rank Method among the Citation Count ......... 32
Figure 3.10: Distribution of PR among the Citation Count .............................................. 32
LIST OF ACRONYMS
IR Information Retrieval
SRP Scientific Research Papers
VSM Vector Space Model
TF Term Frequency
IDF Inverse Document Frequency
LSI Latent Semantic Indexing
SVD Singular Value Decomposition
LMIR Language Model for IR
HITS Hyperlink-Induced Topic Search
PR PageRank
CC Citation Count
RWR Random Walk with Restart
Abstract
Due to the enormous evolution of the World Wide Web and the online access now provided to
the digital contents of universities and public libraries, the WWW has become the most popular
resource for data and information. This massive volume of content makes it generally impossible
for common users to locate their desired information using traditional search engines, which are
based only on relevancy and the number of query occurrences, as the number of returned results
can be tremendous. This made it necessary to develop an efficient and effective ranking
algorithm to solve this problem. One of the most famous ranking algorithms is PageRank, which
is adopted by the Google search engine; but because the nature of scientific research papers
differs from that of web pages, PageRank may not be suitable on its own for ranking scientific
research papers, as it is biased in favor of old papers over new ones and relies heavily on citation
counts, which are not the only important factor reflecting a paper's importance. This thesis
proposes an efficient ranking method that is suitable for ranking scientific research papers and
improves the PageRank score, making it less biased, by using the date of publication and the
author's score in addition to PageRank. The results showed that the proposed method succeeded
in making the ranking results more neutral toward both old and new research papers.
الملخص
(Arabic-language abstract; the Arabic text is mis-encoded in this transcript. Its content
corresponds to the English abstract above.)
Chapter 1: Introduction
1.1 Overview
With the enormous evolution of the World Wide Web (WWW), it has become the most
popular information resource for text, media and metadata. The number of indexed web
pages keeps growing by millions of pages a day [1], and in 2015 the Indexed Web reached
more than 4.73 billion pages according to the ILK Research Group*. This huge
content of data makes it generally impossible for common users to locate their desired
information by using traditional search that is based only on relevancy and the number of
query occurrences in a document. The number of returned results can be tremendous, and
the user would spend much time finding the desired information in a long list. This
situation raised the demand for efficient and effective information retrieval systems and
ranking algorithms to solve this problem.
One of the reliable resources for information on the web is the digital academic library,
considered one of the most important sources of Scientific Research Papers (SRPs) for
students and researchers. A considerable number of universities and public libraries have
provided access to books, journals and other documents. These collections help academic
researchers become acquainted with new journal articles and conference proceedings
related to their areas of research.
* ILK Research Group. Tilburg University - Netherlands.
http://www.worldwidewebsize.com/
1.2 Problem Definition
The nature of SRPs differs from that of the usual web search: ranking papers based only on
citation analysis may not always reflect their quality. There are a number of different
viewpoints and opinions on how well citations can measure SRP quality. Some researchers
measure SRP quality by how often a paper has been cited in other SRPs and by its date of
publication. Others argue that citation counts best assess a publication's impact rather than
its quality or importance, that citation counts are only partial indicators of impact, and that
other factors such as communication practices and author visibility must also be considered
significant. One of the most famous ranking algorithms is the PageRank algorithm, which is
adopted by the Google search engine; citation count is the highest-weighted factor in the
Google Scholar engine.
It is not always the best choice to rank scientific research papers (SRPs) with citation count
as the main factor, because the nature of SRPs is different from the nature of web pages.
Factors other than citation count also determine the importance of an SRP, and taking them
into account yields better ranking results.
1.3 Thesis Motivation
This thesis was motivated by the following problems:
1. An SRP's citation count does not necessarily reflect its quality or importance to research.
2. Using the author's score as a factor in the ranking process could improve the ranking results.
3. New papers have not been available for as long as old papers, and so have not been studied,
tested and cited to the same extent.
1.4 Thesis Objectives
The main objective of this thesis is to propose an efficient ranking method that is suitable
for ranking SRPs and able to improve PageRank by including the author's score, making it
less biased against new papers.
1.5 Thesis Outline
The remainder of this thesis is organized as follows:
• Chapter 2: presents the basic concepts and processes in information retrieval (IR)
systems and the foundational concepts used in this thesis, gives a brief review of current
and popular ranking algorithms, and then presents a literature review of related work and
recent research on scientific research ranking methods.
• Chapter 3: presents and explains the dataset used and the proposed ranking method, and
evaluates the proposed algorithm's performance.
• Chapter 4: concludes the thesis work and describes the results and the recommendations
for future work.
Chapter 2: Background and Literature Review
2.1 Outline
This chapter presents the basic concepts and processes of information retrieval systems, the
current ranking models, and the foundational concepts used in this thesis. It also reviews
recent algorithms that have been used to rank scientific and academic research papers.
2.2 Information Retrieval
Information Retrieval (IR) is the process of locating and obtaining information needed by a
user from a collection of available information resources [2]. There are three vital
components in a web search engine, known as the crawler (also called a robot or spider),
the indexer, and the ranking mechanism; the basic process in IR is shown in figure (2.1).
Figure 2.1: The basic process in IR system
A general web search engine can be summarized as follows:
1. Crawling: A crawler browses the web graph following hyperlinks, fetches links, and
stores the extracted URLs in a local repository [3].
2. Indexing: The search engine indexes the pages collected by the crawler, extracts
keywords from each page and records the URL where each word has occurred [3].
3. The user submits a query [3].
4. The query is transferred in terms of keywords to the interface of the search engine and
is examined against the index [3].
5. The ranker consults the index to retrieve the documents most relevant to the query. The
relevant documents are then sorted according to their degree of relevance, importance, etc.,
and presented to the user [3].
Several sub-processes are also performed on documents and text before indexing, such as
parsing, lexical analysis, phrase detection and stemming [4].
2.2.1 Ranking Process
Ranking algorithms are at the core of information retrieval (IR) systems. They play a very
important role by identifying the most relevant pages, those most likely to satisfy the
user's needs according to some criterion [5].
2.3 Ranking Models
In general, ranking algorithms can be either query-dependent, ranking a list of documents
according to the relevance between those documents and the query (e.g., the Boolean
Model), or query-independent, ranking a list of documents based on their own importance
(e.g., PageRank) [6]. The basic current ranking models are shown in figure 2.2.
Figure 2.2: Ranking Models
2.3.1 Relevance Ranking Algorithms
The relevance ranking algorithms, also known as content-based rankers, work on the basis
of the number of matched terms, the frequency of terms, and the location of terms. They
usually take each individual document as input and compute a score measuring the match
between the document and the query. All the documents are then sorted in descending
order of their scores [7].
2.3.1.1 Boolean Ranking Model
The Boolean Model is an old and simple ranking model based on Boolean algebra and set
theory. It treats documents as bags of index terms, which are words or phrases, and uses a
Boolean algebra expression as the query; the terms are connected with the logical operators
(AND, OR and NOT) [8].
2.3.1.2 Vector Space Model
The Vector Space Model (VSM), proposed by Salton, G. in 1975, represents the documents
and the queries as vectors in a Euclidean space, where similarity can be measured using the
inner product of two vectors. Term Frequency-Inverse Document Frequency (TF-IDF)
weighting is used to obtain a more effective vector representation of the query and the
documents. The term frequency is usually normalized by the document length [9], and it is
defined as follows:
TF(t) = T / F    (2.1) [9]
where T is the number of times term t appears in a document, and F is the total number of
terms in the document (or document length).
IDF scales down frequent terms that may appear many times and scales up the rare ones; it
is defined as follows:
IDF(t) = log( N / n(t) )    (2.2) [9]
where N is the total number of documents in the collection, and n(t) is the number of
documents containing term t.
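As an illustration, the two weights from equations (2.1) and (2.2) can be computed directly in a few lines; this is a minimal sketch (the toy corpus and the function names are ours, not from the thesis, and a term absent from every document is assumed not to be queried):

```python
import math

# TF (equation 2.1): occurrences of the term divided by the document length.
def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# IDF (equation 2.2): log of total documents over documents containing the term.
def idf(term, corpus):
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

docs = [["ranking", "web", "pages"],
        ["ranking", "scientific", "papers"],
        ["web", "crawling"]]
# "ranking" occurs once in a 3-term document, so TF = 1/3;
# it appears in 2 of the 3 documents, so IDF = log(3/2).
```

Multiplying the two gives the TF-IDF weight of a term in a document, which fills one coordinate of the VSM vector.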
2.3.1.3 Latent Semantic Indexing
Latent Semantic Indexing (LSI), introduced by Deerwester, S. in 1988, uses a mathematical
technique called Singular Value Decomposition (SVD) to identify patterns in the
relationships between terms and concepts. LSI is based on the principle that words that are
used in the same contexts tend to have similar meanings [10].
2.3.1.4 BM25 Model
The BM25 model, also referred to as "Okapi BM25," by Robertson, S. E. in 1994, is based
on the probabilistic ranking principle. It ranks a set of documents by the log-odds of their
relevance, regardless of the inter-relationships between the query terms within a document
(e.g., their relative proximity). It is not a single function but a whole family of scoring
functions, with slightly different components and parameters [11]. Given a query q
containing terms t_1, …, t_M, the BM25 score of a document d is computed as
BM25(d, q) = Σ_{i=1..M} IDF(t_i) · TF(t_i, d) · (k1 + 1) / [ TF(t_i, d) + k1 · (1 − b + b · LEN(d)/avdl) ]    (2.3) [11]
where TF(t_i, d) is the term frequency of t_i in document d, LEN(d) is the length (number
of words) of document d, and avdl is the average document length in the text collection
from which documents are drawn. k1 and b are free parameters, and IDF(t_i) is the inverse
document frequency (IDF) weight of the term t_i.
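A Python sketch of equation (2.3) may make the roles of k1 and b concrete (the IDF variant and the parameter defaults used here are common choices, not necessarily those of [11]):

```python
import math

def bm25(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document against a query following equation (2.3)."""
    avdl = sum(len(d) for d in corpus) / len(corpus)   # average document length
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)         # documents containing t
        if n_t == 0:
            continue                                   # unseen term contributes nothing
        idf = math.log(len(corpus) / n_t)
        f = doc_tokens.count(t)                        # raw term frequency TF(t, d)
        # Saturating TF component, length-normalized through b.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avdl))
    return score
```

With b = 0, length normalization is disabled entirely; larger k1 lets repeated occurrences of a term keep increasing the score for longer before saturating.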
2.3.1.5 Language Model for IR
The Language Model for IR (LMIR), presented by Ponte, J. T. in 1998, is an application of
the statistical language model to information retrieval. It assigns a probability to a sequence
of terms. With a query q as input, documents are ranked by the query likelihood, i.e., the
probability that the document's language model will generate the terms in the query,
P(q|d). By further assuming independence between terms, if query q contains terms
t_1, …, t_M, one has P(q|d) = Π_{i=1..M} P(t_i|d). [12]
To learn the document's language model, a maximum likelihood method is used, usually
smoothed with a background language model estimated from the entire collection [7]. The
document's language model can be constructed as follows:
P(t_i|d) = (1 − λ) · TF(t_i, d) / LEN(d) + λ · p(t_i|C),    (2.4) [12]
where p(t_i|C) is the background language model for term t_i, and λ ∈ [0, 1] is a
smoothing factor [12].
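Equation (2.4) with this linear (Jelinek-Mercer style) smoothing can be sketched as follows; the function and variable names are ours, used only for illustration:

```python
def query_likelihood(query_terms, doc, collection, lam=0.5):
    """P(q|d) as a product of smoothed term probabilities per equation (2.4)."""
    coll_len = sum(len(d) for d in collection)
    score = 1.0
    for t in query_terms:
        p_ml = doc.count(t) / len(doc)                         # TF(t, d) / LEN(d)
        p_bg = sum(d.count(t) for d in collection) / coll_len  # background p(t|C)
        score *= (1 - lam) * p_ml + lam * p_bg
    return score
```

The smoothing term is what keeps the product from collapsing to zero when a query term does not occur in the document.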
2.3.2 Important Ranking Models
Importance ranking algorithms (query-independent), also called connectivity-based or
link-based page ranking, rank a list of documents according to their own importance, based
on link analysis techniques [7, 13]. They view the web as a graph where the web pages
form the nodes and the hyperlinks between the web pages form the edges between these
nodes [13].
2.3.2.1 HITS (Hubs and Authorities)
HITS is a link analysis algorithm. The basic idea behind it is that a web page serves two
purposes: to provide information on a topic, and to provide links to other pages giving
information on a topic [14]. Thus, a web page is considered an authority on a subject if it
provides good information about the subject, and a hub if it provides links to good
authorities on the subject. The scheme therefore assigns two scores to each page: an
authority value, which estimates the value of the content of the page, and a hub value,
which estimates the value of its links to other pages [13].
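The mutual reinforcement between hubs and authorities can be sketched as a simple iteration over a dict-based graph (a minimal illustration only; the algorithm of [14] additionally restricts the computation to a query-focused subgraph):

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the page.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On a small graph where pages A and B both link to C, the iteration makes C the top authority and A, B the top hubs, matching the intuition above.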
2.4 PageRank Algorithm
PageRank (PR) model is very useful for ranking web pages and measuring their
importance. It was first introduced by [15] as a possible model of user surfing behavior. It
results from a mathematical algorithm based on the graph, the web graph, created by all World
Wide Web pages as nodes and hyperlinks as edges. The rank value indicates an importance of a
particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank metric of all pages that link to it
("incoming links"). A page that is linked to by many pages with high PageRank receives a high
rank itself. If there are no links to a web page there is no support for that page. The simplest
formula to calculate PR is:
PR(A) = (1 − d)/N + d · ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + … )    (2.5) [15]
where L is the number of outbound links, d is the damping factor, and N is the total number
of pages on the web. It can also be expressed as follows:
PR_{ι+1}(p_i) = (1 − α)/n + α · Σ_{p_j ∈ N⁻(p_i)} PR(p_j) / od(p_j)    (2.6) [15]
where n is the total number of pages on the web, α is the damping factor, N⁻(p_i) is the set
of pages linking to p_i, and od(p_j) is the out-degree of page p_j.
Pseudo code
1. Input the graph with its links.
2. Set all PR(i) to 1.
3. Count the outbound links L(i) for each page i from 1 to N.
4. Calculate Σ_{v ∈ B_i} PR(v)/L(v) for each page i from 1 to N, where B_i includes those
pages that:
   a. have a link to page i;
   b. are different from page i.
5. Update all PR(i) for each page i from 1 to N.
6. Repeat steps 4 and 5 until the changes to PR are insignificant.
The PageRank theory holds that an imaginary surfer who is randomly clicking on links will
eventually stop clicking. The probability, at any step, that the user will continue is the
damping factor d. Various studies have tested different damping factors, but it is generally
assumed that the damping factor is set to around 0.85 [15].
The time complexity to compute one iteration of PageRank, where a PageRank value for
each web page is computed, is O(n). Since the result vector of the PageRank computation
contains a single value for each web page, the space requirement for the PageRank
algorithm is also O(n), if the space required to store the web graph is ignored.
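The pseudocode and equation (2.5) translate into a short Python sketch; this is a minimal illustration, with scores initialized uniformly so that they form a probability distribution, and a graph without dangling pages (pages with no outbound links) is assumed:

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping every page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    out = {p: len(links[p]) for p in pages}   # L(i): outbound link counts
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            # Sum PR(v)/L(v) over pages v that link to p (excluding p itself).
            incoming = sum(pr[v] / out[v] for v in pages if p in links[v] and v != p)
            new_pr[p] = (1 - d) / n + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:   # converged
            return new_pr
        pr = new_pr
    return pr
```

On a three-page cycle A → B → C → A, the fixed point is the uniform distribution 1/3 each, which shows why PageRank favors pages embedded in well-linked structures rather than isolated ones.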
Figure 2.3: Example of PR [1].
Table 2.1 shows some limitations of ranking models.
Table 2.1: Limitations of Ranking Models

Boolean Ranking Model [7, 8]:
• Cannot retrieve partial matches.
• The retrieved documents are not ranked; it cannot predict the degree of relevance.
• It either retrieves too many documents or very few documents.
• It does not use term weights.
• It can only predict whether a document is relevant to the query terms or not.

VSM [9]:
• Does not capture the semantics of the query or the document.
• Cannot denote the "clear logic view" like the Boolean model.
• Avoids the assumption that terms are independent from each other.

LSI [8]:
• Assumes that terms are independent from each other.

PageRank [15]:
• Should not be used as a standalone metric; it should be used as one parameter only.
• Favors older pages, because a new page, even a very good one, will not have many links
unless it is part of an existing site.

HITS [14]:
• Topic drift and efficiency problems occur.
• Non-relevant documents can be retrieved.
2.5 SRPs Ranking Algorithms and Related Work
Google Scholar is one of the major academic search engines, but its ranking algorithm for
academic articles is unknown, so little information is available about how it works. Beel, J.,
et al. [16] performed reverse-engineering on Google Scholar's ranking algorithm. The
results showed that it relies heavily on citation counts, which are the highest-weighted
factor. It also puts high weight on words in the title. The author's name and the journal also
have an impact on the ranking, while the frequency of the search term in the content does
not affect the ranking score. Google Scholar also seems to weight recent articles more
strongly than older articles. The study found that Google Scholar is more suitable for
finding standard articles and less suitable when searching for gems or for articles by
authors advancing a new or different view from the mainstream.
2.5.1 PageRank with N-Star Ranking Model
An N-linear ranking model is a system of N ranking scores for N classes, where the rank
scores depend on one another through a linear constraint system. Anh, V. L., et al. [17]
proposed a ranking system based on the N-linear model with PR to rank SRPs alongside
authors and conferences. The system has two models, SD4R and SD3R, the latter for
ranking datasets without citation information. The models were tested using a dataset built
from DBLP.
Sohn, B. S., et al. [18] also proposed a generalized network analysis approach to rank SRPs
using an N-star model. Based on this model and the PageRank algorithm, two different
ranking methods were derived: a query/topic-independent rank called Universal Publication
rank (UP rank), and a query/topic-dependent rank called Topic Publication rank (TP rank).
The model takes into account the mutual relationships among keywords, publications, and
citations.
2.5.2 PageRank with HITS Model
Due, M., et al. [19] proposed an extension of the PageRank and HITS ranking models. The
basic idea is to measure the relativity between papers: it measures the indirect relationships
between papers using relativity measurements instead of the simple direct citations
between papers.
Jiang, X., et al. [20] proposed MutualRank, a graph-based ranking framework that
integrates mutual reinforcement relationships among networks of papers, researchers and
venues to achieve a more synthetic, accurate and fair ranking result than previous
graph-based methods.
Wang, Y., et al. [21] proposed a PageRank-HITS framework that exploits different kinds of
information simultaneously, examines their usability in ranking scientific articles, and
captures time information in the evolving network to obtain a better ranking result.
2.5.3 PageRank with Closed Frequent Keyword-set
Shubhankar, K., et al. [22] introduced a novel algorithm that uses closed frequent
keyword-sets and a modified time-independent PageRank algorithm to detect and rank
topics in research papers based on their significance in the research community, rather than
on the popularity of the topic, which considers only the frequency of topics. Each topic is
assigned an authoritative score using the modified PageRank algorithm, and all papers
sharing a topic form one natural cluster or more.
2.5.4 PageRank with Jaccard Index
Haddadene, H., et al. [23] proposed a model of representation of the scientific production
based on the notion of similarity between articles and an adaptive PageRank algorithm.
The similarity between two documents of the database was calculated using the Jaccard
index, also known as the Jaccard similarity coefficient. An adaptation of the PageRank
algorithm was used to rank and measure the relative importance of a document.
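The Jaccard index over two documents' term sets is |A ∩ B| / |A ∪ B|; a minimal sketch (the term-set representation of a document is an assumption on our part, since [23] does not specify it here):

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| over term sets."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Two documents sharing two of four distinct terms score 0.5; identical sets score 1.0 and disjoint sets score 0.0.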
2.5.5 PageRank with Focused Surfer Model
Krapivin, M., et al. [24] introduced Focused Page Rank (FPR) to rank SRPs based on
traditional citation count, PageRank and the Focused Surfer model. This algorithm aims to
reduce the "effect of outbound links" problem: if paper P is cited many times by papers
that have high rank but contain a large number of outgoing links, P's rank may decrease,
causing a paper to be highly cited but poorly ranked by PR. FPR is a tradeoff between
PageRank and citation count, so it may serve as an agreement between the followers of
pure citation count and the followers of PageRank. The algorithm was evaluated on the
content of the CiteSeer autonomous digital library; the results showed that the FPR
algorithm suffers less from the "outbound links" problem than the basic PageRank
algorithm.
2.5.6 Modified PageRank
Sidiropoulos, A., et al. [25] proposed the SceasRank algorithm, a modified PageRank
algorithm with two parameters a and b, where b is called the direct citation enforcement
factor and a is a parameter controlling the speed at which indirect citation enforcement
converges to zero. SceasRank converges faster than algorithms that are more similar to the
PageRank algorithm, and it takes the publication dates of papers into consideration.
Sun, Y., et al. [26] proposed a popularity-weighted ranking algorithm for academic digital
libraries that is based on PageRank and uses the popularity factor of a publication venue,
overcoming the limitations of impact factors.
Chen, P., et al. [27] proposed a PageRank method to rank SRPs. The results showed that
some classical articles in the Physics domain have a small number of citations and a very
high PageRank. These articles were named "scientific gems". The existence of "scientific
gems" is explained by the PR model, which captures not only the total citation count, but
also the rank of each of the citing papers.
Walker, D., et al. [28] introduced CiteRank as an adaptation of the PageRank algorithm. It
considers the publication time of scientific articles and utilizes a random walk model to
predict the number of future citations for each article. The model reduces the bias of time
to some extent, because recent articles will be promoted to higher scores.
Sayyadi, H., et al. (2009) [29] proposed a method called FutureRank, which estimates the
future PageRank prestige score of each article using the citation network, authors, and
publication date. In this model, authorship provides additional information for ranking
recent publications: if an author is an authority (i.e., has previously published many
prestigious papers), then his/her new publications can be expected to have good quality.
Articles transfer their authority score to their authors, and an author collects the authority
scores of all of his/her publications.
Another paper, by Singh, A., et al. [30], proposed a ranking algorithm based on the citation
network. The algorithm uses a modified version of the PageRank metric and takes the time
factor into account when ranking research papers, to reduce the bias against recent papers.
Using the scores of the research papers, scores are also derived for conferences and authors
in order to rank them. The algorithms were implemented and tested on the DBLP dataset.
2.5.7 PageRank with CiteRank
Dunaiski, D., et al. [31] proposed a novel algorithm called NewRank, a combination of the
PageRank and CiteRank algorithms. It focuses on identifying influential papers that were
published recently. The ranking algorithms were evaluated against the list of the most
influential papers compiled by the ICSE selection committee.
2.5.8 Random Walk with Restart
Hwang, W.S, et al. [32] proposed a new SRP ranking algorithm that aims mainly to balance the impacts of old papers and new papers and to correct the distortion in ranking due to publication dates, which gives recent papers a poor rank, by crediting the recent papers. The algorithm uses Random Walk with Restart (RWR) on a graph whose nodes are papers and whose edges are the citations among papers, and defines an age damping factor ρ for the papers, with a parameter τ denoting the characteristic decay time.
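A minimal sketch of this idea follows, assuming a simple age-damped restart vector; the function name, the parameter values, and the exact damping form are illustrative and are not taken from [32].

```python
import numpy as np

def rwr_age_damped(adjacency, ages, restart=0.15, rho=0.9, n_iter=100):
    """Random Walk with Restart on a citation graph, restarting into an
    age-damped distribution: the restart mass for a paper decays as
    rho ** age, so recent papers are revisited more often.
    adjacency[i, j] = 1 means paper j cites paper i."""
    n = adjacency.shape[0]
    outdeg = adjacency.sum(axis=0)
    W = np.divide(adjacency, outdeg,
                  out=np.full((n, n), 1.0 / n),    # papers with no references: uniform
                  where=outdeg > 0)
    q = rho ** ages                                # age-damped restart vector
    q = q / q.sum()
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):                        # iterate toward the stationary vector
        r = (1 - restart) * (W @ r) + restart * q
    return r
```

The restart vector q is what credits recent papers: the smaller a paper's age, the larger its share of the restart probability.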
A summary of related works is shown in Table 2.2.

Table 2.2: Summary of SRPs Ranking Algorithms

Up Rank, TP Rank [18]
  Techniques: PageRank; N-star ranking model.
  Features: Considers the query/topic and the content.
  Factors: Citation; keyword; title; content.
  Dataset: -

SC4R, SC3R [17]
  Techniques: PageRank; N-star ranking model.
  Features: Evaluates papers in more detail based on the context of their relationships.
  Factors: Citations; author; venue.
  Dataset: DBLP; Microsoft Academic.

PTRA [33]
  Techniques: Citation count; mathematical calculation.
  Features: Depends highly on the time of publication to rank the papers; gives the paper age a higher impact.
  Factors: Citation; publication venue; date of publication.
  Dataset: Scholar; CiteSeerX; IEEE Xplore; DBLP.

PageRank+HITS [21]
  Techniques: PageRank; HITS.
  Features: Uses time information; ranks SRPs in a heterogeneous network.
  Factors: Citation; author; journal/conference; date of publication; prestige (article).
  Dataset: arXiv; Cora.

MutualRank [20]
  Techniques: PageRank; HITS.
  Features: Less biased against new papers; returns more relevant highly ranked papers.
  Factors: Citation; title (authority and soundness); author (importance); venue (prestige).
  Dataset: ANN.

(Haddadene H, et al. 2012) [23]
  Techniques: PageRank; Jaccard index.
  Features: Ranks papers based on the notion of similarity between articles and an adaptive PageRank algorithm.
  Factors: Citation; title; abstract.
  Dataset: -

NewRank [31]
  Techniques: PageRank; CiteRank.
  Features: Counts citations from newer papers and from popular papers more heavily.
  Factors: Citation; date of publication; date of references.
  Dataset: Microsoft Academic; hep-th.

TopicRank [22]
  Techniques: PageRank; closed frequent keyword-sets.
  Features: Clusters papers into topics; estimates the article's future prestige.
  Factors: Citation; title; date of publication; keyword; topic.
  Dataset: DBLP.

CitationRank [30]
  Techniques: PageRank.
  Features: Time-independent ranking; gives authors and conferences ranking scores based on the paper scores.
  Factors: Citation; date of publication.
  Dataset: DBLP.

YetRank [32]
  Techniques: Random Walk with Restart (RWR).
  Features: Gives a high rank to papers credited by other authoritative papers or published in premier journals or conferences; balances the impacts of old and new papers; solves the distortion in ranking due to publication dates.
  Factors: Citation; date of publication; journal/conference.
  Dataset: DBLP.

FutureRank [29]
  Techniques: PageRank.
  Features: Estimates the future PageRank prestige score for SRPs.
  Factors: Citations; authors; publication date.
  Dataset: arXiv (hep-th).

PaperRank [19]
  Techniques: PageRank; HITS.
  Features: Can find more authoritative papers than the traditional methods.
  Factors: Citation; title; keyword; text.
  Dataset: Scholar; CiteSeer.

Focused PageRank [24]
  Techniques: PageRank; focused surfer model.
  Features: Trade-off between PageRank and citation count; suffers less from the effect of outbound links.
  Factors: Citations; citation counts.
  Dataset: CiteSeer.

Scientific Gems [27]
  Techniques: PageRank.
  Features: Captures "scientific gems".
  Factors: Citations.
  Dataset: APS.

Popularity Weighted Ranking Algorithm [26]
  Techniques: PageRank.
  Features: Overcomes the limitations of impact factors.
  Factors: Citations; title; venue (popularity factor).
  Dataset: CiteSeer.

CiteRank [28]
  Techniques: PageRank; random walk model.
  Features: Overcomes the problem of the aging effect in citation networks; gives a higher rank to papers credited by other authoritative papers or published in premier journals or conferences; predicts the number of future citations for each article.
  Factors: Citation; title; date of publication.
  Dataset: hep-th; Phys. Rev.

SceasRank [25]
  Techniques: PageRank.
  Features: Gives a higher rank to papers credited by other important papers and by newer papers.
  Factors: Citations; date of publication; conference.
  Dataset: DBLP; SCEAS system.
2.6 Chapter Summary
Over the past decade, many studies have been conducted to evaluate the productivity of scientific articles by developing more efficient algorithms to rank SRPs. These algorithms guide and help researchers, students, and authors in their research in the best available way.
All relevant models have some limitations that make this type of ranking model a poor choice for general web search engines. Although importance-ranking models are relatively new, they have improved the search process and changed the way it works.
The conclusions of this chapter include:
- The PageRank algorithm was the favorite ranking model among researchers and the most suitable model for ranking SRPs.
- A number of studies combined PageRank with other methods to enhance the ranking results and overcome some of PageRank's limitations.
- Citation count was the most heavily weighted factor in most studies.
- Many studies used time as a very important factor to give newer papers a higher score, e.g. the PTRA algorithm, which depends highly on time [33], PageRank+HITS [21], NewRank [31], CitationRank [30], and YetRank [32].
- There was no unified evaluation process; each researcher used different methods, and some researchers did not evaluate their work at all. The evaluation methods included:
  o Collecting a set of recommended papers from the websites of graduate-level computational linguistics courses at 15 top universities as the evaluation benchmark [20].
  o Evaluating the results against the list of the most influential papers compiled by the International Conference on Software Engineering (ICSE) selection committee [31].
  o Using the diversity or difference between PageRank (PR) and citation count (CC) [24].
  o Evaluating the ranking results based on the 'VLDB 10 Year Award', 'SIGMOD Test of Time Award', and 'SIGMOD E.F. Codd Innovations Award' [25].
  o Comparing the ranking results with the references in the corresponding chapters of the well-known data mining book "J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd Edition, 2006" [32].
  o Comparing the distribution of database papers according to the age of publication and citation number with the PR algorithm [33].
The proposed algorithm ranks SRPs based on PageRank score, date of publication, and author score.
Chapter 3: Experiment and Evaluation
3.1 Introduction
Ranking is an important stage in any search engine. This chapter explains the steps of the proposed SRP-Rank algorithm:
- Dataset preparation and processing to extract the important information.
- Constructing the citation graphs.
- Presenting the proposed ranking method.
It also evaluates the results of the proposed ranking method. The results were compared to PR. The criteria used in the comparison are the distribution of ranked SRPs among the age of the paper and among the number of citations, recall, precision, and F-measure.
3.2 Dataset
To evaluate the proposed algorithm, a dataset containing scientific research paper metadata, such as titles and authors, is required. There are several free and paid resources that provide datasets for researchers; they differ in size and in the degree to which the SRPs are processed. While some of these datasets provide only the full text of publications without any processing, others provide partially processed data.
In this study, we used a dataset obtained from Web of Science containing the abstracts, basic metadata, and 9,583 citations for 1,189 SRPs in the history and philosophy of science field, covering publications from 1956 to 2013. Samples of the dataset are shown in figure 3.1.
Figure 3.1: Sample of Original Dataset
The dataset contains additional information such as page numbers and authors' contact addresses. The distribution of the paper publication dates is shown in table 3.1.
Table 3.1: Distribution of paper publication dates
Year 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004
# of Papers 19 31 32 34 28 26 28 19 14 12
Year 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994
# of Papers 9 8 8 9 29 15 7 10 14 17
Year 1993 1992 1991 1990 1989 1988 1987 1986 1985 1984
# of Papers 13 13 17 15 13 17 13 20 19 26
Year 1983 1982 1981 1980 1979 1978 1977 1976 1975 1974
# of Papers 17 15 18 16 23 23 28 28 25 19
Year 1973 1972 1971 1970 1969 1968 1967 1966 1965 1964
# of Papers 25 26 27 28 20 21 27 25 30 26
Year 1963 1962 1961 1960 1959 1958 1957 1956
# of Papers 19 33 23 28 25 24 17 21
3.3 Implementation
The proposed method is carried out in several stages. First, the dataset was prepared, and the required information was extracted and used to build the paper citation and author-paper graphs. Finally, the algorithm was implemented to give a ranking score to every paper in the list.
3.3.1 Data Preparation and Extraction
To extract the data that will be used in the ranking score calculation, the data first need to be processed. The dataset was arranged using the Sci2 tool, which automatically saves it as an Excel sheet, as shown in figure 3.2.
Figure 3.2: Dataset Information Arranged into Database.
The unwanted and missing fields were removed, and only the data needed by the algorithm were kept. The final database contained information about authors, titles, years of publication, citation counts, and bibliographies, as shown in figure 3.3.
Figure 3.3: Dataset after processing.
3.3.2 Extracting Citation Networks
To obtain the remaining data needed by the ranking method, the paper citation network and the author-paper network were extracted with the Sci2 tool.
Paper citation network (graph)
The paper citation network is required for the PR calculation. A sample of the paper citation network is shown in figure 3.4.
Figure 3.4: Sample of the Paper Citation Network
Author-paper network (graph)
The author-paper graph provides information about the number of works and the number of citations for each author. A sample of the information extracted from the author-paper graph is shown in figure 3.5.
Figure 3.5: Sample of the information extracted from the author-paper graph.
The collected information that will be used in the proposed ranking algorithm is shown in figure 3.6.
Figure 3.6: Sample of the information used in the ranking score calculation.
3.3.3 Calculation of the Scientific Research Ranking (SRR) Score
The final ranking score depends on the PR score (from the citation graph), the date of publication, the author citation count, and the number of the author's works.
PageRank Score
The PR calculation was done using MATLAB, based on the citation graph extracted by the Sci2 tool.
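For illustration, an equivalent PR computation can be sketched in Python using the standard power-iteration method; the damping factor d = 0.85 below is the conventional choice, not a value stated by the thesis, and papers without references are treated as linking uniformly to all papers.

```python
import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-9, max_iter=200):
    """Standard PageRank by power iteration on a citation graph.
    adjacency[i, j] = 1 means paper j cites paper i, so score flows
    from citing papers to the papers they cite."""
    n = adjacency.shape[0]
    outdeg = adjacency.sum(axis=0)                  # references made by each paper
    W = np.divide(adjacency, outdeg,
                  out=np.full((n, n), 1.0 / n),     # dangling papers: uniform links
                  where=outdeg > 0)
    pr = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = (1 - d) / n + d * (W @ pr)            # teleport + follow citations
        if np.abs(new - pr).sum() < tol:
            return new
        pr = new
    return pr
```

The resulting vector sums to one, and well-cited papers receive the largest scores.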
Date of Publication
One of PageRank's shortcomings is that it favors old papers over new ones, even when the new papers are good, because the longer a paper has been around, the more citations it accumulates. Recent papers have few citations, so they should be given some promotion in the ranking process [21]. To overcome this drawback and make the algorithm less biased and more reliable, the "PageRank score per age" is used in the algorithm. The PageRank score per age is calculated by dividing the PR score of an SRP by the log of its age, which is the number of years since it was published:

A = Y − Yi    (3.1)

where Y is the current year and Yi is the year of publication.
Author Score
The number of papers published and the citation count may reflect the productivity and popularity of an author. Different metrics are used by the research community to measure author performance, such as the h-index. In this thesis, the author score is found using the following equation:

AU = ((AW1 + ... + AWm) + (AC1 + ... + ACm)) / (N · H)    (3.2)

where AWk is the number of papers published by the paper's k-th author, ACk is the citation count of the k-th author, N is the number of authors of the current paper, and H is a constant.
The constant H is used to reduce the impact of the author score on the final ranking score and to balance this term against the other term of the equation. H was set to 10 after testing other values; it gave the most balanced results compared to the other tested values, such as 0.5, 20, and 50.
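Equation (3.2), with the thesis's choice of H = 10, can be sketched as:

```python
def author_score(works, citations, h=10):
    """Author score AU per equation (3.2): works[k] and citations[k] are
    the publication count and citation count of the paper's k-th author,
    N = len(works) is the number of authors, and H damps the score's
    impact on the final rank (H = 10 as chosen in the thesis)."""
    n = len(works)
    return (sum(works) + sum(citations)) / (n * h)
```

For example, a paper with two authors who have 3 and 5 publications and 10 and 20 citations gets AU = (8 + 30) / (2 * 10) = 1.9.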
The final ranking score is calculated using the equation:

SRP-Rank = PRi / (1 + log A) + AU    (3.3)

where A is the age of the SRP and AU is the author score.
PageRank assigns a score to each paper based on its citations. To balance citation and age without neglecting the citation metric or giving the citations (PR scores) too much weight, the PageRank score is divided by the log of the paper's age.
The pseudocode of the proposed ranking algorithm is the following:

Procedure: Scientific Research Ranking R
Required:
Ti = Title.
Ac = Authors' citation count.
Aw = Authors' number of works.
N = Number of authors of each paper.
H = Constant (set to 10).
Di = Date of publication.
PR = PageRank score.
D = Current year (2015).
1: for each paper in the dataset
2:   initialize AU, A, R to 0.0
3:   get N[current paper], Ac[current paper], Aw[current paper], Di[current paper], PR[current paper]
4:   compute AU = (Ac + Aw) / (N * H)
5:   compute A = D − Di
6:   compute Scientific Research Ranking R = PR / (1 + log(A)) + AU
7: end for
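The procedure above can be sketched in Python as follows. The logarithm base is assumed to be 10, and the age is floored at one year to avoid log(0) for papers published in the current year; both choices are assumptions, since the thesis does not specify them.

```python
import math

def srp_rank(pr, year, works, citations, h=10, current_year=2015):
    """SRP-Rank per equation (3.3): PageRank score per age plus author score.
    pr is the paper's PageRank score, year its publication year, and
    works/citations the per-author publication and citation counts."""
    age = max(current_year - year, 1)       # assumed floor: avoid log(0)
    au = (sum(works) + sum(citations)) / (len(works) * h)
    return pr / (1 + math.log10(age)) + au  # log base 10 is an assumption
```

For a paper from 2005 (age 10) with PR score 0.5 and the two-author example above, this gives 0.5 / (1 + 1) + 1.9 = 2.15.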
3.4 Results
To use the results later in the evaluation process, a query was used to narrow the list, since it contains articles on different subjects that are not all related to each other. The query used was "nineteenth"; the retrieved list contained 41 articles.
Both the proposed ranking algorithm and PR were used to rank this list so that the results could be evaluated.
3.5 Evaluation
Evaluating a ranking algorithm is a challenging procedure for several reasons:
- There is no comprehensive evaluation metric acknowledged by the academic community [31].
- There is no ground truth for an article's real rank [21].
- Which ranking algorithm behaves better is subjective: one algorithm's results may be satisfying for one user but not for another [25].
3.5.1 Distribution of Ranked SRPs among the Age of the Paper
To evaluate whether the proposed SRP-Rank method is less biased against new papers and to examine the effect of age on the distribution of the top-ranked papers, charts of all 41 papers in the ranked list were used. Figures 3.7 and 3.8 show the distribution of the proposed ranking method and the distribution of PR among the age of the paper.
Figure 3.7: Distribution of The Proposed Rank method among the Age of the Paper.
Figure 3.8: Distribution of PR among the Age of the Paper.
In figure 3.8, the distribution of ranked SRPs using PR among the age of the paper shows that the PageRank algorithm is biased against new papers. In figure 3.7, the distribution of the ranked SRPs using the proposed method shows that the proposed method is less biased against new papers, unlike PR.
3.5.2 Distribution of Ranked SRPs among the Citation
To evaluate the citation effect on the proposed method, charts of the ranked list were used. Figures 3.9 and 3.10 show the distribution of the proposed ranking method and the distribution of PR among the citation count.
Figure 3.9: Distribution of The Proposed Rank method among the citation count.
Figure 3.10: Distribution of PR among the Citation Count.
In figure 3.10, the distribution of ranked SRPs using PR among the citation count shows that the PageRank algorithm relies heavily on citation count. In figure 3.9, the distribution of the ranked SRPs using the proposed method shows that the proposed method depends less on citation count than PR. For example, the proposed SRP ranking method can give papers high ranks even when they do not have many citations, whereas PR gives a low rank to papers with no citations, as in figure 3.10.
3.5.3 Recall and Precision
In IR, precision is the ability to return relevant documents from the set of retrieved documents [34]. It is calculated by dividing the size of the overlap between the retrieved and relevant sets by the number of documents retrieved [1], as shown in the following equation:

Precision = |Relevant Set ∩ Retrieved Set| / |Retrieved Set|    (3.4) [1]

Recall is the ability to return as many of the relevant documents as possible [34]. It is calculated by dividing the size of the overlap between the retrieved and relevant sets by the number of relevant documents [1]:

Recall = |Relevant Set ∩ Retrieved Set| / |Relevant Set|    (3.5) [1]
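The set-based definitions in equations (3.4) and (3.5) can be sketched directly:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall per equations (3.4) and (3.5)."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three relevant ones gives precision 0.5 and recall 2/3.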
These measures are used in IR with binary relevance judgments (e.g., seminal/non-seminal) to measure the relevance of a set of retrieved items and to evaluate the performance of information retrieval systems. A modified version of these measures was presented by [35] to evaluate web service ranking methods. According to [35], precision is calculated by dividing the highest rank score by the total rank score of all papers in the list, as shown in equation (3.6):

Precision = Highest rank score / Total rank score of all papers    (3.6) [35]

Recall is found by dividing the highest rank score by the second-highest rank score:

Recall = Highest rank score / Score of the 2nd-highest paper    (3.7) [35]
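A sketch of the modified measures from [35], as equations (3.6) and (3.7) describe them, operating on the list of rank scores produced by an algorithm:

```python
def rank_precision(scores):
    """Modified precision per (3.6): highest rank score over the total."""
    return max(scores) / sum(scores)

def rank_recall(scores):
    """Modified recall per (3.7): highest score over the second-highest."""
    top_two = sorted(scores, reverse=True)[:2]
    return top_two[0] / top_two[1]
```

For scores [4.0, 2.0, 1.0, 1.0] this gives precision 4/8 = 0.5 and recall 4/2 = 2.0; a larger value of either measure means the top-ranked paper stands out more from the rest of the list.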
The analysis and comparison of the proposed algorithm and PR based on this evaluation metric are shown in table 3.2. This metric also shows that the proposed method offers a better solution for ranking SRPs than the basic PageRank algorithm.

Table 3.2: Precision and recall of the PageRank and the proposed SRP-Rank algorithms

            Precision      Recall
PR          0.025813562    1.018441539
SRP-Rank    0.082599556    1.266729551

The recall value of the proposed SRP ranking method shows that it gives the top-ranked papers a clearly higher rank, which provides further evidence that the proposed method distinguishes the top-ranked paper. The precision value achieved by the proposed method is also higher than that of PR: the distribution of ranks is biased towards the higher-ranked papers. This makes it possible to identify interesting papers on certain topics faster than with PR.
Chapter 4: Conclusion and Future Work
This chapter discusses the results of the proposed ranking algorithm and the future work.
4.1 Results and Conclusion
In this thesis, a scientific research paper ranking algorithm was proposed to balance the impact of the PR score on old and new papers. It aims to improve the results of PR ranking by solving the problem of favoring old papers, making it more suitable for ranking scientific research papers. The results show that the proposed ranking algorithm does not rely as heavily on citation count as PR does, and that it treats both old and new papers neutrally.
On the other hand, there is no specific evaluation method for ranking results that is recognized by the scientific community, and satisfaction with ranking results varies from person to person. An evaluation method for finding the precision and recall of ranking results, proposed by [35], was tested in this thesis. The proposed method achieved higher precision and recall than PageRank. The distribution of ranked SRPs among the age of the paper showed that the proposed SRP-Rank succeeded in giving scores that are unbiased between old and new papers, whereas PageRank gives old papers higher scores than new ones, because old papers have had more opportunities to be cited.
4.2 Future work
In terms of future work, several directions can be explored:
- Testing the proposed method on more datasets with different queries.
- Conducting a user survey to evaluate the method's results.
- Experimenting with different parameters, such as place of publication, to tune the rankings better.
- Conducting more types of evaluation of the results, including query evaluation.
REFERENCES
[1] Levene, M. (2010) An Introduction To Search Engines And Web Navigation, (2nd
ed.) New Jersey: A John Wiley & Sons, Inc.
[2] Maurya, V., Pandey, P. and Maurya L.S. (2013) Effective Information Retrieval
System. International Journal of Emerging Technology and Advanced Engineering,
vol. 3, no. 4, pp. 787-792.
[3] Manning, C. Raghavan, P. Schutze, H. (2008) Introduction to Information
Retrieval. New York: Cambridge University Press.
[4] Ceri S, Bozzon A, Brambilla M, Della Valle E, Fraternali P, Quarteroni S. Web
Information Retrieval. Heidelberg: Springer; 2013.
[5] Baeza-Yates R, and Ribeiro-Neto B, (2011), Modern Information Retrieval: The
Concepts and Technology behind Search. Boston: Addison-Wesley Professional, 2nd
Ed.
[6] Frakes W.B. and Baeza-Yates R. (1992), Information Retrieval: Data Structures
and Algorithms. New Jersey: Prentice Hall, 1st Ed.
[7] Liu T.Y. (2011), Learning to Rank for Information Retrieval. Berlin. Heidelberg:
Springer.
[8] Hiemstra D. (2000,) Using Language Models for Information Retrieval.
Netherlands: Taaluitgeverij Neslia Paniculata.
[9] Salton G, Wong, A and Yang C.S. (1975) A Vector Space Model for Automatic
Indexing. Communications of the ACM, Vol. 18, No. 11, pp 613–620.
[10] Deerwester S, Dumais S, Landauer T, Furnass G, and Beck L, (1988) Improving
Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st
Annual Meeting of the American Society for Information Science, Vol. 25, pp. 36-
40.
[11] Robertson S.E, Walker S, Jones S, Hancock-Beaulieu M, and Gatford M. (1994)
Okapi at TREC-3. Proceedings of the Third Text Retrieval Conference,
Gaithersburg, USA.
[12] Ponte J, and Croft W.B. (1998) A Language Modeling Approach to Information
Retrieval. In Proceedings of the 21st International Conference on Research and
Development in Information Retrieval, pp. 275–281.
[13] Devi P, Gupta A, and Dixit A, (2014). Comparative Study of HITS and
PageRank Link Based Ranking Algorithms. International Journal of Advanced
Research in Computer and Communication Engineering, Vol 3, Issue 2, PP. 5749 -
5754.
[14] Kleinberg J, (1999). Authoritative Sources in a Hyperlinked environment.
Journal of the ACM, Vol. 46, No. 5, pp. 604-632.
[15] Brin S, and Page L, (1998) The Anatomy of a Large-Scale Hypertextual Web
Search Engine. Computer Networks, Vol. 30, pp. 107-117.
[16] Beel, J. and Gipp, B. (2009) Google Scholar‘s Ranking Algorithm: An
Introductory Overview, 12th International Conference on Scientometrics and
Informetrics, Vol. 1, PP. 230-241.
[17] Anh, V.L. Hoang, H.V. Trung, H.L. Trung, K.L. and Jung, J.J (2014) Evaluating
Scientific Publications By N-Linear Ranking Model, ANNALES Universitatis
Scientiarum, Sectio Computatorica, Vol. 43, PP. 123-147.
[18] Sohn, B.S. and Jung, J. (2015) A Novel Ranking Model for a Large-Scale
Scientific Publication. Mobile Networks and Applications, Vol. 20, Issue 4, PP
508-520.
[19] Due M, Bai F, and Liu Y, (2009). PaperRank: A Ranking Model for Scientific
Publication. IEEE World Congress on Computer Science and Information
Engineering, Vol 4, PP. 277- 281.
[20] Jiang X, Sun X, Zhuge H (2012) Towards an Effective and Unbiased Ranking of
Scientific Literature through Mutual Reinforcements. 21st ACM Conference on
Information and Knowledge Management, Hawaii, USA, pp 714–723.
[21] Wang, Y. Tong, Y. and Zeng, M (2013) Ranking Scientific Articles by Exploiting
Citations, Authors, Journals, and Time Information. Proceedings of the
Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue,
Washington, USA, PP. 933 - 939.
[22] Shubhankar, K. Singh, A. and Pude, V. (2011), An Efficient Algorithm for Topic
Ranking and Modeling Topic Evolution. Database and Expert Systems
Applications, Vol 6860, PP 320-330.
[23] Haddadene, H. Harik, H and Salhi, S. (2012) On the Pagerank Algorithm for the
Articles Ranking. Proceedings of the World Congress on Engineering, Vol I,
London, U.K.
[24] Krapivin, M and Marchese, M. (2008), Focused Page Rank in Scientific Papers
Ranking. Digital Libraries: Universal and Ubiquitous Access to Information, Vol
5362, PP 144-153.
[25] Sidiropoulos A, and Manolopoulos Y. (2006) Generalized Comparison of Graph-
Based Ranking Algorithms for Publications and Authors. The Journal of
Systems and Software, Vol. 79, PP. 1679 – 1700.
[26] Sun, Y. and Giles C.L. (2007), Popularity Weighted Ranking for Academic
Digital Libraries. 29th European Conference on IR Research, Rome, Italy, PP.
605-612.
[27] Chen, P. Xie, H. Maslov, S. and Redner, S. (2007) Finding Scientific Gems with
Google’s PageRank Algorithm. Elsevier, Journal of Informatics, Vol. 1, PP. 8 -15.
[28] Walker D, Xie H, Yan KK, and Maslov S, (2007) Ranking Scientific Publications
Using a Model of Network Traffic. Journal of Statistical Mechanics: Theory and
Experiment.
[29] Sayyadi, H., and Getoor, L. (2009) Futurerank: Ranking Scientific Articles by
Predicting Their Future PageRank. Ninth SIAM International Conference on
Data Mining, PP. 533–544.
[30] Singh, A. Shubhankar, K. and Pudi, V. (2011) An Efficient Algorithm for
Ranking Research Papers Based on Citation Network. 3rd Conference on Data
Mining and Optimization, Putrajaya, PP. 88 -95.
[31] Dunaiski D, and Visser W, (2012). Comparing Paper Ranking Algorithms.
Proceedings of the South African Institute for Computer Scientists and Information
Technologists Conference, Pretoria, South Africa, PP. 21-30.
[32] Hwang, W.S, Chae, S.M, and Kim, W.K. (2010), Yet Another Paper Ranking
Algorithm Advocating Recent Publications. 19th International Conference on
World Wide Web, Raleigh, North Carolina, USA.
[33] Mushtaq, H. (2014) Scientific Research Paper Ranking Algorithm PTRA: A
Tradeoff between Time and Citation Network. Applied Mechanics and
Materials. Vol 551, PP. 603-611.
[34] Bose, A. Nayak, R. and Bruze, P. (2008) Improving Web Service Discovery by
using Semantic Models. 9th International Conference on Web Information
Systems Engineering. Auckland - New Zealand. PP 366 - 380.
[35] Manoharan, R. Archana, A. and Cowlagi, S. N. (2011) Hybrid Web Services
Ranking Algorithm. IJCSI International Journal of Computer Science Issues, Vol.
8, Issue 3, No. 2, PP 452 – 460.
Appendices
Sample of Dataset
FN Thomson Reuters Web of Knowledge
VR 1.0
PT J
AU Bertomeu-Sanchez, JR
AF Ramon Bertomeu-Sanchez, Jose
TI Managing Uncertainty in the Academy and the Courtroom Normal Arsenic and Nineteenth-Century
Toxicology
LA English
DT Article
ID KNOWLEDGE; SCIENCE
AB This essay explores how the enhanced sensitivity of chemical tests sometimes produced unforeseen and
puzzling problems in nineteenth-century toxicology. It focuses on the earliest uses of the Marsh test for
arsenic and the controversy surrounding "normal arsenic"-that is, the existence of traces of arsenic in healthy
human bodies. The essay follows the circulation of the Marsh test in French toxicology and its appearance in
the academy, the laboratory, and the courtroom. The new chemical tests could detect very small quantities of
poison, but their high sensitivity also offered new opportunities for imaginative defense attorneys to
undermine the credibility of expert witnesses. In this context, toxicologists had to dispel the uncertainty
associated with the new method and come up with arguments to refute the many possible criticisms of their
findings, among them the appeal to normal arsenic. Meanwhile, new descriptions of animal experiments,
autopsies, and cases of poisoning produced a steady flow of empirical data, sometimes supporting but in
many cases questioning previous conclusions about the reliability of the chemical tests. This challenging
scenario provides many clues about the complex interaction between science and the law in the nineteenth
century, particularly how expert authority, credibility, and trustworthiness were constructed, and frequently
challenged, in the courtroom.
C1 Inst Hist Med & Sci Lopez Pinero, Valencia 46003, Spain.
RP Bertomeu-Sanchez, JR (reprint author), Inst Hist Med & Sci Lopez Pinero, Pl Cisneros 4, Valencia
46003, Spain.
FU Spanish government [HAR2009-12918-C03-03]
FX This essay is part of a larger study on nineteenth-century toxicology supported by the Spanish
government (HAR2009-12918-C03-03). I am very grateful to the staff of the Bibliotheque Interuniversitaire
de Sante,
Paris, who helped me with many relevant sources for this paper, and to the Chemical Heritage Foundation
(CHF), Philadelphia, in whose library this essay took shape thanks to two short-term fellowships (July-
August
2010 and March 2011). Marjorie Gapp and Amanda Antonucci helped me with the impressive art collection
of the CHF. I am also indebted to the organizers of and participants in the meetings in which earlier versions
of the essay were discussed and to Jose Pardo Tomas and Josep Simon Castel for their insightful comments
and suggestions. The late Josep Miquel Vidal, president of the scientific section of the Institut
Menorqui d'Estudis, enthusiastically supported the development of this project. I would like also to
acknowledge the anonymous referees for Empire.
CR ASHMORE M, 1993, SOC STUD SCI, V23, P67, DOI 10.1177/030631293023001003
Barse Jules, 1843, J CHIMIE MED, V9, P571
Barse Jules, 1845, MANUEL COUR ASSISES, P151
Bertomeu Jose R., 2011, M ORFILA AUTOBIOGRAF, P192Bertomeu-Sanchez Jose R., 2006, CHEM MED
CRIME M ORF
Bertomeu-Sanchez Jose Ramon, 2006, CHEM TECHNOLOGY SOC, P300
Bertomeu-Sanchez JR, 2012, ANN SCI, V69, P1, DOI 10.1080/00033790.2011.637471
Bertrand Gabriel, 1903, RECHERCHES EXISTENCE
Bloch Magali, 1997, RECHERCHES CONT, V4, p[101, 119]
Blyth Alexander Wynter, 1884, POISONS THEIR EFFECT, P531
Borie Leonard, 1841, CATECHISME TOXICOLOG, P71
Buchner Johannes A., 1839, REP PHARM, V17, P123
Burnett D. Graham, 2007, EMPIRE, V98
Burney I, 2006, POISON DETECTION VIC
Bussy Antoine, 1840, REPONSE ECRITS M RAS
CAMPBELL WA, 1965, CHEM BRIT, V1, P198
Caventou J. B., 1839, B ACAD ROY MED BELG, V4, P275
Chauvaud, EXPERTS EXPERTISE JU, P243
Chauvaud Frederic, 2003, EXPERTS EXPERTISE JU, P192
Chauvaud Frederic, 2000, EXPERTS CRIME MED LE
Christison Robert, 1845, TREATISE POISONS, P289
Coley N. G., 1986, MED HIST, V30, p[173, 181]
Coley Noel G., 1838, J PHARM, V24, P500
Coley Noel G., 1991, MED HIST, V35, p[409, 421]
Coley Noel G., 1837, J PHARM, V23, P553
Coley Noel G., 1837, ANN PHARM CHEM, V23, P217
Collins H, 2007, RETHINKING EXPERTISE
Collins HM, 2010, TACIT EXPLICIT KNOWL
Couerbe, 1840, GAZETTE HOPITAUX, V13, P106
Couerbe Jean-Pierre, 1840, GAZETTE HOPITAUX, V13, P485
Crosland Maurice, 1992, SCI CONTROL FRENCH A
Cullen WR, 2008, IS ARSENIC AN APHRODISIAC?: THE SOCIOCHEMISTRY OF AN ELEMENT, P1,
DOI 10.1039/9781847558602
Danger F. P., 1843, CR HEBD ACAD SCI, V17, P153
Devergie Alphonse, 1845, ANN HYG PUBLIQUE MED, V33, P142
Devergie Alphonse, 1836, TRAITE THEORIQUE PRA, V1, P15
Devergie Alphonse, 1840, ANN HYG PUBL, V24, P136
Devergie Alphonse, 1836, TRAITE THEORIQUE PRA, V1, P17
Donovan James M., 2010, JURIES TRANSFORMATIO, p[5, 37]
Emsley J, 2005, ELEMENTS MURDER HIST
Engelhardt Hugo T., 1987, SCI CONTROVERSIES CA
Essig Mark, 2002, RESEARCH CORNELL U
Flandin Charles, 1846, TRAITE POISONS, V1, P734
Gautier, 1904, CR HEBD ACAD SCI, V139, P101
Gautier, 1899, CR HEBD ACAD SCI, V129, p[929, 935]
Gautier A, 1902, CR HEBD ACAD SCI, V134, P1394
Gautier Armand, 1876, ANN CHIMIE PHYS, V7, P384
Gavroglu K, 2008, HIST SCI, V46, P153
Gerber Samuel M., 1997, MORE CHEM CRIME MARS
Golan Tal, 2004, LAWS MAN LAWS NATURE
Guignard Laurence, 2010, JUGER FOLIE FOLIE CR
Hirsch Adolf G., 1842, ARSENIK, P43
Huber P. W., 1991, GALILEOS REVENGE JUN, P28
Jasanoff S, 1995, SCI BAR LAW SCI TECH
Kaiser D., 2005, DRAWING THEORIES APA
La Berge Ann F., 2004, Perspectives on Science, V12, P424
LANGMUIR I, 1989, PHYS TODAY, V42, P36, DOI 10.1063/1.881205
Latour B., 1987, SCI ACTION FOLLOW SC
Leclerc Olivier, 2005, JUGE EXPERT CONTRIBU
Lefevre Andre, 1913, EXPERTISE DEVANT JUR, P48
Lynch Michael, 2008, TRUTH MACHINE CONTEN
Machamer Peter K., 2000, SCI CONTROVERSIES PH
Marsh J., 1836, EDINBURGH NEW PHILOS, V21, P229
Mata Pedro, 1844, VADEMECUM MED CIRUGI, P636
Mercer David, 2002, CAUSATION LAW MED, P83
OLESKO KM, 1993, OSIRIS, V8, P16, DOI 10.1086/368716
Orfila, 1839, B ACAD ROY MED BELG, V4, p[178, 179]
Orfila, 1839, B ACAD ROY MED BELG, V3, p[426, 464]
Orfila, 1838, B ACAD ROY MED BELG, P161
Orfila, 1839, ARCH GEN MED, V4, p[373, 375]
Orfila Mateu, 1852, TRAITE TOXICOLOGIE, V1, P544
Orfila Mateu, 1831, TRAITE EXHUMATIONS J
Orfila Mateu, 1841, RAPPORT MOYENS CONST, P42
Orfila Mateu, 1839, B ACAD ROY MED BELG, V3, p[676, 682]
Orfila Mateu, 1840, B ACAD ROY MED BELG, V5, p[465, 474]
Orfila Mateu, 1838, B ACAD ROY MED BELG, V3, P93
Orfila Mateu, 1844, ANN HYG PUBLIQUE MED, V31, P131
Orfila Mateu, 1839, B ACAD ROYALE MED, V3, P676
Orfila Mateu, 1839, B ACAD ROY MED BELG, V3, P1049
Orfila Mateu, 1818, TRAITE POISONS, V1, P15
Orfila Mateu, 1844, ANN HYG PUBLIQUE MED, V31, p[430, 435]
Orfila Mateu, 1839, EXPERIENCE, V91, P208
Orfila Mateu, 1840, ANN HYG PUBLIQUE MED, V24, p[298, 312]
Orfila MM., 1842, ANN HYGIENE PUBLIQUE, V28, p[148, 152]
Pfaff Christian H., 1841, REP PHARM, V24, P106
Raspail Francois-Vincent, 1840, ACCUSATION EMPOISONN, P24
Raynaud Dominique, 2003, SOCIOLOGIE CONTROVER
Rees George Owen, 1841, GUYS HOSP REP, V6, P163
Reinsch Hugo, 1843, ARSENIK, P43
Rognetta, NOUVELLE METHODE TRA, P20
Schmidtmann Adolf, 1905, HDB GERICHTLICHEN ME, P913
Secord JA, 2004, EMPIRE, V95, P654, DOI 10.1086/430657
Taruffo, 2002, PRUEBA HECHOS
Taruffo Michelle, 2005, B MEX DERECHO COMPAR, V38, P1285
Taylor Alfred S., 1848, POISONS RELATION MED, P350
Topham Jonathan, 2009, POPULARIZING SCI TEC, P1
Usselman MC, 2005, ANN SCI, V62, P1, DOI 10.1080/00033790410001711922
Wagner J. H., 1952, PRO MEDICO, V21, P161
Watson K, 2004, POISONED LIVES ENGLI
Watson KD, 2011, FORENSIC MEDICINE IN WESTERN SOCIETY: A HISTORY, P1
Weisz G., 1995, MED MANDARINS FRENCH
Whorton James C., 2010, ARSENIC CENTURY VICT
World Health Organisation, 1996, TRAC EL HUM NUTR HLT, P217
NR 134
TC 0
Z9 0
PU UNIV CHICAGO PRESS
PI CHICAGO
PA 1427 E 60TH ST, CHICAGO, IL 60637-2954 USA
SN 0021-1753
Screenshot of the Data Extracted from the Dataset
Sample of the Paper Citation Network

*Vertices 52479
1 "Bertomeu Jose R., 2011, M Orfila Autobiograf, P192" localcitationcount 1
2 "Gautier, 1899, Cr Hebd Acad Sci, V129, P[929, 935]" localcitationcount 1
3 "Guignard Laurence, 2010, Juger Folie Folie Cr" localcitationcount 1
4 "Langmuir I, 1989, Phys Today, V42, P36, Doi 10.1063/1.881205" localcitationcount 1
5 "Ashmore M, 1993, Soc Stud Sci, V23, P67, Doi 10.1177/030631293023001003" localcitationcount 1
6 "Orfila, 1839, Arch Gen Med, V4, P[373, 375]" localcitationcount 1
7 "Marsh J., 1836, Edinburgh New Philos, V21, P229" localcitationcount 1
8 "Couerbe Jean-pierre, 1840, Gazette Hopitaux, V13, P485" localcitationcount 1
9 "Bertomeu-sanchez Jr, 2012, Ann Sci, V69, P1, Doi 10.1080/00033790.2011.637471" localcitationcount 1
10 "Couerbe, 1840, Gazette Hopitaux, V13, P106" localcitationcount 1
11 "Reinsch Hugo, 1843, Arsenik, P43" localcitationcount 1
12 "Taruffo Michelle, 2005, B Mex Derecho Compar, V38, P1285" localcitationcount 1
13 "Orfila, 1839, B Acad Roy Med Belg, V3, P[426, 464]" localcitationcount 1
14 "Olesko Km, 1993, Osiris, V8, P16, Doi 10.1086/368716" localcitationcount 1
15 "[anonymous], 1839, Gaz Hopitaux, V12, P409" localcitationcount 1
16 "Chauvaud Frederic, 2003, Experts Expertise Ju, P192" localcitationcount 1
17 "[anonymous], 1839, B Acad Rooy Med, V3, P683" localcitationcount 1
18 "[anonymous], 1840, Gaz Tribunaux 0606, V15, P761" localcitationcount 1
19 "[anonymous], 1841, Rev Sci, V7, P261" localcitationcount 1
20 "Donovan James M., 2010, Juries Transformatio, P[5, 37]" localcitationcount 1
21 "Lefevre Andre, 1913, Expertise Devant Jur, P48" localcitationcount 1
22 "Orfila Mateu, 1839, B Acad Roy Med Belg, V3, P[676, 682]" localcitationcount 1
23 "Rognetta, Nouvelle Methode Tra, P20" localcitationcount 1
24 "Devergie Alphonse, 1845, Ann Hyg Publique Med, V33, P142" localcitationcount 1
25 "Whorton James C., 2010, Arsenic Century Vict" localcitationcount 1
26 "Collins H, 2007, Rethinking Expertise" localcitationcount 2
27 "Jasanoff S, 1995, Sci Bar Law Sci Tech" localcitationcount 3
28 "Hirsch Adolf G., 1842, Arsenik, P43" localcitationcount 1
29 "Leclerc Olivier, 2005, Juge Expert Contribu" localcitationcount 1
30 "Machamer Peter K., 2000, Sci Controversies Ph" localcitationcount 1
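Each vertex line above follows the same pattern: a numeric id, a quoted label (author, year, abbreviated source), and a `localcitationcount` attribute. A minimal Python sketch for reading such lines (the helper name `parse_vertex` is illustrative, not part of the thesis code):

```python
import re

# One vertex line looks like:
#   <id> "<label>" localcitationcount <count>
VERTEX_RE = re.compile(r'^\s*(\d+)\s+"([^"]+)"\s+localcitationcount\s+(\d+)\s*$')

def parse_vertex(line):
    """Return (id, label, local citation count), or None if the line
    does not match the vertex pattern."""
    m = VERTEX_RE.match(line)
    if m is None:
        return None
    vid, label, count = m.groups()
    return int(vid), label, int(count)

line = '26 "Collins H, 2007, Rethinking Expertise" localcitationcount 2'
print(parse_vertex(line))  # (26, 'Collins H, 2007, Rethinking Expertise', 2)
```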
MATLAB Code to Calculate PageRank from the Paper Citation Network

% Build the sparse citation matrix C from the edge list 'cit'
% (columns 1-2: paper ids, column 3: edge weight)
n = max(max(cit(:, 1:2)));
C = sparse(cit(:, 2), cit(:, 1), cit(:, 3), n, n);

function p = calc_PageRank(C, alpha, n_iterations)
% Power-iteration PageRank with damping factor alpha
m = sum(C, 2);
C(m == 0, :) = 1;                   % dangling nodes link to every node
n = length(C);
m = sum(C, 2);
C = spdiags(1 ./ m, 0, n, n) * C;   % row-normalize to a stochastic matrix
p = repmat(1 / n, [1 n]);           % uniform initial rank vector
for i = 1:n_iterations
    p = alpha * p * C + (1 - alpha) / n;
end
end
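For readers without MATLAB, the same power iteration can be sketched in Python with NumPy. This is a translation of the MATLAB code above under the assumption that the citation matrix is small enough to hold densely; the function name `pagerank` and the toy 3-node graph are illustrative:

```python
import numpy as np

def pagerank(C, alpha=0.85, n_iterations=100):
    """Power-iteration PageRank, mirroring calc_PageRank above."""
    C = np.array(C, dtype=float)
    n = C.shape[0]
    # Dangling nodes (rows with no outgoing links) link to every node
    C[C.sum(axis=1) == 0, :] = 1.0
    # Row-normalize to obtain a stochastic transition matrix
    C = C / C.sum(axis=1, keepdims=True)
    p = np.full(n, 1.0 / n)          # uniform initial rank vector
    for _ in range(n_iterations):
        p = alpha * p @ C + (1 - alpha) / n
    return p

# Toy citation graph: paper 0 cites 1 and 2, paper 1 cites 2, paper 2 cites 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
ranks = pagerank(A)
print(ranks)  # scores sum to 1
```

Starting from a uniform vector that sums to 1, each iteration preserves that sum, so the result is a probability distribution over papers.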
Sample of the proposed method results (Top 15 ranked papers)

Author  Year  Title  Score
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  22.925
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  18.09778574
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  11.83437724
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  11.49795406
Worboys, M  2011  Practice and the Science of Medicine in the Nineteenth Century  10.01692389
Morus, IR  2006  Seeing and believing science  8.457623794
Kohlstedt, SG  2005  "Thoughts in things": modernity, history, and North American museums  7.719776119
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  6.972314671
Lucier, P  2012  The Origins of Pure and Applied Science in Gilded Age America  6.917488814
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  6.57518597
Goldstein, D  2008  Outposts of science - The knowledge trade and the expansion of scientific community in post-Civil War America  6.517109456
Cantor, G  2012  Science, Providence, and Progress at the Great Exhibition  6.417488814
Bertomeu-Sanchez, JR  2013  Managing Uncertainty in the Academy and the Courtroom: Normal Arsenic and Nineteenth-Century Toxicology  6.364267563
Sample of the ranked results using PR (Top 15)

Author  Year  Title  Score
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  8.75
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  8.591558442
Morus, IR  2006  Seeing and believing science  8.515853659
Lightman, B  2000  The visual theology of Victorian popularizers of science - From reverent eye to chemical retina  8.488461538
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  8.45
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  8.441666667
Klein, U  2008  The Laboratory Challenge: Some Revisions of the Standard View of Early Modern Experimentation  8.43125
Alexander, AR  2006  Tragic mathematics - Romantic narratives and the refounding of mathematics in the early nineteenth century  8.410869565
Tucker, J  2006  The historian, the picture, and the archive  8.364285714
Elshakry, M  2010  When Science Became Western: Historiographical Reflections  8.316666667
Portolano, M  2000  John Quincy Adams's rhetorical crusade for astronomy  8.316666667
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  8.314634146
Canizares-Esguerra, J  2005  Iberian colonial science  8.305555556
Nyhart, LK  1998  Civic and economic zoology in nineteenth-century Germany - The "living communities" of Karl Mobius  8.286986301
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  8.275
Sample of the ranked results using Citation Count (Top 15)

Author  Year  Title  Score
Hankins, TL  1999  Blood, dirt, and nomograms - A particular history of graphs  34
Dror, OE  1999  The affect of experiment - The turn to emotions in Anglo-American physiology, 1900-1940  27
Lightman, B  2000  The visual theology of Victorian popularizers of science - From reverent eye to chemical retina  22
Dear, P  2005  What is the history of science the history of? Early modern roots of the ideology of modern science  18
Morus, IR  2006  Seeing and believing science  15
Kevles, DJ  2007  Patents, protections, and privileges - The establishment of intellectual property in animals and plants  12
Lucier, P  2009  The Professional and the Scientist in Nineteenth-Century America  12
Hankins, TL  2006  A "large and graceful sinuosity" - John Herschel's graphical method  11
Nyhart, LK  1998  Civic and economic zoology in nineteenth-century Germany - The "living communities" of Karl Mobius  10
Mazzotti, M  1998  The geometers of god - Mathematics and reaction in the kingdom of Naples  10
Klein, U  2008  The Laboratory Challenge: Some Revisions of the Standard View of Early Modern Experimentation  9
Schloegel, JJ|Schmidgen, H  2002  General physiology, experimental psychology, and evolutionism - Unicellular organisms as objects of psychophysiological research, 1877-1918  9
Bowler, PJ  2008  What Darwin disturbed - The biology that might have been  7
Canizares-Esguerra, J  2005  Iberian colonial science  7
Alexander, AR  2006  Tragic mathematics - Romantic narratives and the refounding of mathematics in the early nineteenth century  6