Bibliometrics and preference modelling

Thierry Marchant, Ghent University

Some academic rankings

Top 5% Authors, as of April 2008

Average Rank Score

Rank  Author             Nb W  Nb C  Sc C  ANb C  ASc C  Nb P  Sc P  ANb P    h
   1  R. J. Barro          52     1     2      1      1    26    11     15    2
   2  J. E. Stiglitz        5     5     6      4      9     3     2      2   14
   3  A. Shleifer          21     2     1      6      8    10     5     54    1
   4  J. J. Heckman        12     4     4      3      3     4     3      5    3
   5  P. C. B. Phillips     2    30    69     26     59     2     1      3   52
   6  R. E. Lucas Jr.     737    10     8      2      2   156    60     61   33
   7  M. L. Gertler       192     3     3      9     11   208    95    270   11
   8  M. S. Feldstein      19    71    39     40     23    67    21     25   79
   9  E. C. Prescott      109     9     5     11      4   127    82    152   23
  10  J. Tirole            14    20    24     21     27     6     4      8    4

Outline

• Why rank?

• Which attributes?

• Some popular rankings.

• How can we motivate a ranking?

• The axiomatic approach.

• Comparing peers and apples

Why rank?

Why rank universities?

• To choose one for studying (bachelor student).
• To attract good students (good university).
• To obtain subsidies (good university).
• To allocate subsidies (government).
• To allocate students to various universities according to their score at an exam (government).
• ...

Why rank departments?

• To choose one for studying (doctoral student).
• To attract good students (good department).
• To obtain subsidies (good department).
• To allocate subsidies (government).
• To allocate students to various departments according to their score at an exam (government).
• ...

Why rank scientists?

• To determine the salary (university).
• To award a scientific distinction (scientific society).
• To hire a new scientist (university).
• To choose a thesis director (student).
• To evaluate a department or university (...).
• To evaluate a journal (...).
• To allocate subsidies (government).
• ...

Why rank journals?

• To choose one for publishing (scientist).
  – To maximize the dissemination of one’s results.
  – To maximize one’s value.
• To evaluate a scientist (...).
• To evaluate a department (...).
• To evaluate a university (...).
• To improve one’s image (good publisher).
• ...

Why rank articles?

• To select articles (scientist).
• To evaluate a scientist (...).
• To evaluate a department (...).
• To evaluate a university (...).
• To evaluate a journal (...).
• ...

Focus in this talk

• Rankings of scientists
• Rankings of departments
• Rankings of universities
• Rankings of journals
• Rankings of articles

Which attributes?

Many relevant attributes

Quality
– Evaluation by peers
– Quality of the journals
– Citations (#, authors, journals, +/-)
– Coauthors
– Patents
– Awards
– Budget

Quantity
– Number of papers
– Number of books
– Number of pages
– Coauthors (#)
– Number of patents
– Citations (#)
– Awards
– Budget
– Number of thesis students

Various
– Age
– Career length
– Country
– Nationality
– Discipline
– Century
– University

Bibliometric attributes

Why use bibliometric attributes?

• Cheap
• Objective?
• Reliable?

Some popular rankings of scientists

Some popular rankings

• Number of publications
• Total number of citations
• Maximal number of citations
• Number of publications with at least a citations
• Average number of citations
• The same ones weighted by
  – Number of authors
  – Number of pages
  – Impact factor
• The same ones corrected for age
• h-index, g-index, hc-index, hI-index, R-index, A-index, …

The h-index

• Published in 2005 by the physicist J. E. Hirsch.
• 462 citations in March 2009; 1267 in May 2013.
• Adopted by Web of Science (ISI, Thomson).
• The h-index of a scientist is the largest natural number x such that at least x of his/her papers have at least x citations each.

[Figure: distribution of the number of citations of an example scientist; h-index = 6.]
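The definition of the h-index translates directly into a few lines of code. The following Python sketch is only illustrative: the function name `h_index` and the sample citation counts are assumptions, not data from the talk.

```python
def h_index(citations):
    """Largest x such that at least x papers have at least x citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # the first `position` papers all have >= position citations
        else:
            break
    return h


# Hypothetical citation counts of one scientist (illustrative data only).
sample = [16, 12, 9, 8, 7, 6, 4, 3, 1, 0]
print(h_index(sample))  # prints 6
```

Six papers have at least 6 citations each, but there are not seven papers with at least 7 citations, so the h-index of this sample is 6.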

How to justify a ranking?

• THE true and universal ranking does not exist.



Example: two departments, one with 50 scientists and 2000 citations, the other with 3 scientists and 180 citations.



• If one knows the true ranking, one may compute some correlation between the true one and another one.




Assessing the Accuracy of the h- and g-Indexes for Measuring Researchers’ Productivity, Journal of the American Society for Information Science and Technology, 64(6):1224–1234, 2013.

“The analysis quantifies the shifts in ranks that occur when researchers’ productivity rankings by simple indicators such as the h- or g-indexes are compared with those by more accurate FSS.”




• Assume a law linking the numbers of papers and citations to the quality of the scientist (an unobserved variable) and his/her age. This law may be probabilistic. Then derive an estimate of the quality of a scientist from his/her data (papers and citations).




• Analyze the mathematical properties of rankings.

Characterization of scoring rules

Definitions

• Set of journals : J = { j, k, l, …}

• Paper: a paper in journal j with x citations and a coauthors is represented by the triplet (j,x,a).

• Scientist: mapping f from J×N×N to N. The number f(j,x,a) represents the number of publications of author f in journal j with x citations and a coauthors.

• Set of scientists: set X of all mappings from J×N×N to N such that Σj∈J Σx∈N Σa∈N f(j,x,a) is finite.

• Bibliometric ranking : weak order ≥ on X (complete and transitive relation).

Scoring rules

• Scoring rule : a bibliometric ranking is a scoring rule if there exists a real-valued mapping u defined on J×N×N such that f ≥ g iff

Σj Σx Σa f(j,x,a) u(j,x,a) ≥ Σj Σx Σa g(j,x,a) u(j,x,a)

• Examples:

• u(j,x,a) = 1 (# papers)

• u(j,x,a) = x (# citations)

• u(j,x,a) = x/(a+1) (# citations weighted by # authors)

• u(j,x,a) = IF(j) (# papers weighted by impact factor)

• …
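The four example scoring rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a scientist is stored as a `Counter` mapping triplets (j, x, a) to numbers of papers, and the journal names, impact factors and the two scientists `f` and `g` are hypothetical.

```python
from collections import Counter

# Hypothetical impact factors, for illustration only.
IMPACT_FACTOR = {"j": 2.5, "k": 1.0}


def u_papers(j, x, a):
    return 1                 # every paper counts for one point


def u_citations(j, x, a):
    return x                 # one point per citation


def u_shared_citations(j, x, a):
    return x / (a + 1)       # citations divided by the number of authors


def u_impact_factor(j, x, a):
    return IMPACT_FACTOR[j]  # papers weighted by the journal's impact factor


def score(f, u):
    """Sum of u(j, x, a) over all papers of scientist f."""
    return sum(count * u(j, x, a) for (j, x, a), count in f.items())


# f: two papers in j (10 citations, 1 coauthor each), one paper in k (3 citations, no coauthor).
f = Counter({("j", 10, 1): 2, ("k", 3, 0): 1})
# g: three papers in k (8 citations, no coauthor each).
g = Counter({("k", 8, 0): 3})

for u in (u_papers, u_citations, u_shared_citations, u_impact_factor):
    print(u.__name__, score(f, u), score(g, u))
```

With these illustrative data, f and g tie on the number of papers, g is ahead on citations (with or without author weighting), and f is ahead on impact factor, so the choice of u determines the ranking.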

Axioms

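• Independence: for all f, g in X, all j in J, all x, a in N, we have f ≥ g iff f + 1_{j,x,a} ≥ g + 1_{j,x,a}, where 1_{j,x,a} denotes the scientist whose only publication is one paper in journal j with x citations and a coauthors.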

[Figure: Independence illustrated. If f > g, then f plus one paper in j with x citations and a coauthors > g plus the same paper.]

• Archimedeanness: for all f, g, h, e in X with f > g, there is a natural number n such that e + nf ≥ h + ng.

[Figure: Archimedeanness illustrated. Initially e < h. Adding f (10 papers with 20 citations) to e and g (1 paper with 1 citation) to h repeatedly (four times in the figure) eventually yields e + nf ≥ h + ng.]
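For a given ranking, the number n required by Archimedeanness can be searched for directly. The Python sketch below is an illustration under assumptions: scientists are plain lists of citation counts (so list concatenation plays the role of +), the data are hypothetical, and f is read as 10 papers with 20 citations each. It compares the total number of citations (a scoring rule) with the maximal number of citations.

```python
def smallest_n(e, f, g, h, value, limit=1000):
    """Smallest n <= limit with value(e + n*f) >= value(h + n*g), or None."""
    for n in range(limit + 1):
        if value(e + n * f) >= value(h + n * g):
            return n
    return None


# Hypothetical scientists, given as lists of citation counts per paper.
e = [1]          # one paper with 1 citation
h = [1000]       # one paper with 1000 citations
f = [20] * 10    # 10 papers with 20 citations each
g = [1]          # 1 paper with 1 citation

print(smallest_n(e, f, g, h, sum))  # total citations: prints 6
print(smallest_n(e, f, g, h, max))  # maximal citations: prints None
```

With the total number of citations, six copies of f are enough for e to catch up with h. With the maximal number of citations, e + nf never exceeds 20 while h + ng stays at 1000, so no n is found within the search limit.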

• Independence is not satisfied by the maximal number of citations or by the h-index: with the h-index, the ranking of two scientists can be reversed by adding the same 2 papers to both.

• Archimedeanness is not satisfied by the maximal number of citations, the h-index or lexicographic rankings.
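The reversal of the h-index ranking when the same 2 papers are added to both scientists can be reproduced on a small example. The Python sketch below uses hypothetical citation counts chosen for illustration; they are not taken from the talk.

```python
def h_index(citations):
    """Largest x such that at least x papers have at least x citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for position, count in enumerate(ranked, start=1) if count >= position)


f = [4, 4, 4, 4]   # 4 papers with 4 citations each
g = [6, 6, 6]      # 3 papers with 6 citations each
extra = [6, 6]     # the same 2 papers added to both scientists

print(h_index(f), h_index(g))                  # prints 4 3: f is ranked above g
print(h_index(f + extra), h_index(g + extra))  # prints 4 5: g is now ranked above f
```

Independence would require the comparison between f and g to be unaffected by the two common papers, which is not the case here.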

Result

• Theorem : A bibliometric ranking satisfies Independence and Archimedeanness iff it is a scoring rule. Furthermore u is unique up to a positive affine transformation.

• Proof:

• (X, +, ≥) is an extensive measurement structure as in [Luce, 2000].

• (X, +) is a cancellative (f + g = f + h ⇒ g = h) monoid. It can be extended to a group (X’, +) by the Grothendieck construction. (X’, +, ≥) is an Abelian and Archimedean linearly ordered group. It is isomorphic to a subgroup of the ordered group of real numbers (Hölder).

Special case: u(j,x,a) = x /(a+1).

• Transfer: for all j in J, all x, y, a in N, we have 1_{j,x,a} + 1_{j,y+1,a} ~ 1_{j,x+1,a} + 1_{j,y,a} (u affine in # citations).

• Condition Zero: for all j in J, all a in N, there is f in X such that f + 1_{j,0,a} ~ f (u linear in # citations).

• Journals Do Not Matter: for all j, j’ in J, all a, x in N, 1_{j,x,a} ~ 1_{j’,x,a} (u independent of journal).

• No Reward for Association: for all j in J, all m, x in N with m > 1, 1_{j,x,0} ~ m 1_{j,x,m-1} (u inversely proportional to # authors).
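For a scoring rule, each of the four conditions above becomes an identity on u, and these identities can be checked numerically for u(j,x,a) = x/(a+1). The Python sketch below is only a sanity check on a small grid of values (an assumption, as are the two journal names); it is not a proof.

```python
from fractions import Fraction
from itertools import product


def u(j, x, a):
    # Special case from the slide: citations divided by the number of authors.
    return Fraction(x, a + 1)


journals = ["j", "k"]  # hypothetical journal names
grid = range(6)        # small range of values for x, y and a

# Transfer: u(j,x,a) + u(j,y+1,a) = u(j,x+1,a) + u(j,y,a)
transfer = all(u(j, x, a) + u(j, y + 1, a) == u(j, x + 1, a) + u(j, y, a)
               for j, x, y, a in product(journals, grid, grid, grid))

# Condition Zero: a paper without citations adds nothing, i.e. u(j,0,a) = 0
condition_zero = all(u(j, 0, a) == 0 for j, a in product(journals, grid))

# Journals Do Not Matter: u(j,x,a) = u(j',x,a)
journals_do_not_matter = all(u("j", x, a) == u("k", x, a)
                             for x, a in product(grid, grid))

# No Reward for Association: u(j,x,0) = m * u(j,x,m-1) for m > 1
no_reward = all(u(j, x, 0) == m * u(j, x, m - 1)
                for j, x, m in product(journals, grid, range(2, 7)))

print(transfer, condition_zero, journals_do_not_matter, no_reward)
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are not blurred by floating-point rounding.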

Characterization of conjugate scoring rules for scientists and departments

Introduction

• Consider two departments each consisting of two scientists. The scientists in department A both have 4 papers, each one cited 4 times. The scientists in department B both have 3 papers, each one cited 6 times.

• Both scientists in department A have an h-index of 4 and are therefore better than both scientists in department B, with an h-index of 3. Yet, department A has an h-index of 4 and is therefore worse than department B with an h-index of 6. Hence, the “best” department contains the “worst” scientists.
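The example can be checked numerically. In the Python sketch below, the helper names are assumptions, and the h-index of a department is computed on the pooled papers of its scientists, as in the example.

```python
def h_index(citations):
    """Largest x such that at least x papers have at least x citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for position, count in enumerate(ranked, start=1) if count >= position)


def department_h_index(department):
    """h-index of the pooled papers of all scientists in the department."""
    return h_index([count for scientist in department for count in scientist])


department_a = [[4] * 4, [4] * 4]  # two scientists, 4 papers cited 4 times each
department_b = [[6] * 3, [6] * 3]  # two scientists, 3 papers cited 6 times each

print([h_index(s) for s in department_a], department_h_index(department_a))  # [4, 4] 4
print([h_index(s) for s in department_b], department_h_index(department_b))  # [3, 3] 6
```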

Definitions

• Scientist: mapping f from N to N. The number f(x) represents the number of publications of scientist f with x citations.

• Set of scientists: set X of all mappings from N to N such that Σx∈N f(x) is finite.

• Ranking of scientists : weak order ≥s on X.

• Department : vector of scientists

• Set of all departments denoted by Y.

• Ranking of departments : weak order ≥d on Y.

Scoring rules

• Scoring rule : a ranking of scientists is a scoring rule if there exists a real-valued mapping u defined on N such that f ≥s g iff

Σx f(x) u(x) ≥ Σx g(x) u(x)

• Scoring rule : a ranking of departments is a scoring rule if there exists a real-valued mapping v defined on N such that (f1, …, fk) ≥d (g1, …, gl) iff

Σi Σx fi(x) v(x) ≥ Σj Σx gj(x) v(x)

• Conjugate scoring rules : ≥s and ≥d are conjugate scoring rules if u = v.

Axioms

• Consistency: if fi ≥s gi, for i = 1, … , k, then (f1, …, fk) ≥d (g1, …, gk) . In addition, if fi >s gi, for some i, then (f1, …, fk) >d (g1, …, gk) .

• Totality: if (f1, …, fk) and (g1, …, gl) are such that Σi fi = Σj gj , then (f1, …, fk) ~d (g1, …, gl) .

• Dummy : (f1, …, fk) ~d (f1, …, fk, 0) .
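The three axioms above can be checked on small examples for a pair of conjugate scoring rules. The Python sketch below is illustrative only: it assumes u(x) = x (one point per citation), stores a scientist as a `Counter` mapping numbers of citations to numbers of papers, and uses hypothetical departments.

```python
from collections import Counter


def u(x):
    # Assumed partial score: one point per citation.
    return x


def score_scientist(f):
    return sum(count * u(x) for x, count in f.items())


def score_department(department):
    # Conjugate rule: the same u is used for scientists and for departments.
    return sum(score_scientist(f) for f in department)


d1 = [Counter({4: 2}), Counter({6: 1})]
d2 = [Counter({4: 1, 6: 1}), Counter({4: 1})]  # same pooled papers as d1
d3 = [Counter({4: 2}), Counter({6: 1, 1: 1})]  # second scientist strictly better than in d1

# Totality: departments with the same pooled papers are equivalent.
same_pool = sum(d1, Counter()) == sum(d2, Counter())
totality = same_pool and score_department(d1) == score_department(d2)

# Dummy: adding a scientist without papers changes nothing.
dummy = score_department(d1) == score_department(d1 + [Counter()])

# Consistency: each scientist of d3 is at least as good as in d1, one strictly better.
consistency = (all(score_scientist(a) >= score_scientist(b) for a, b in zip(d3, d1))
               and score_department(d3) > score_department(d1))

print(totality, dummy, consistency)
```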

Result

• Theorem : ≥s and ≥d satisfy Consistency, Totality, Dummy and Archimedeanness of ≥s iff they are conjugate scoring rules. Furthermore u is unique up to a positive affine transformation.

Discussion

• Axiomatic analysis of more rankings is needed.

• Axiomatic analysis of indices is different but also relevant.

• Consistency is important (e.g. h-index for scientists and IF for journals).

Literature

• Scientometrics

• Journal of Informetrics

• Journal of the American Society for Information Science and Technology

Comparing peers and apples

Comparing scientists of different ages

[Figure: two scientists of different ages, one with h-index = a and the other with h-index = b, where a > b.]

• Instead of the h-index, use an index that is independent of time.

• For instance, the average number of citations per paper, i.e. Σx∈N x f(x) / Σx∈N f(x).

• Problem: suppose f has one paper with 50 citations and g has 10 papers with 40 citations each. Then f has the higher average (50 versus 40), although g has published far more.

• Alternatively, divide the h-index by the length of the career.

• Problem: the h-index is not a linear function of time.
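The problem with the average number of citations per paper can be made concrete. The Python sketch below assumes that the 10 papers of g have 40 citations each (one reading of the example); the function names are illustrative.

```python
def average_citations(citations):
    """Average number of citations per paper."""
    return sum(citations) / len(citations)


def h_index(citations):
    """Largest x such that at least x papers have at least x citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for position, count in enumerate(ranked, start=1) if count >= position)


f = [50]       # one paper with 50 citations
g = [40] * 10  # assumption: 10 papers with 40 citations each

print(average_citations(f), average_citations(g))  # prints 50.0 40.0
print(h_index(f), h_index(g))                      # prints 1 10
```

Under this reading, the average ranks f above g, whereas the h-index and the total number of citations both rank g above f.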

Comparing across disciplines

• The average number of citations per paper is 80 times larger in medicine than in mathematics.
• Any comparison of scientists across disciplines using an index based on citations is therefore flawed.
• Field normalization: for a given index, compute the distribution of the index in each field (medicine, physics, economics, mathematics, literature, …). Then define the normalized index of a scientist as his/her percentile.
• Problem: the definition of a field is arbitrary. The average number of citations per paper is 20 times larger in physics than in mathematics, but only 2-3 times larger in theoretical physics.
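Field normalization by percentile can be sketched as follows. This Python example is a minimal illustration: the index values per field are hypothetical, and the percentile is taken here as the share of scientists in the field whose index is at most the given value, which is one of several possible conventions.

```python
from bisect import bisect_right


def percentile(value, field_values):
    """Share (in %) of the scientists of a field whose index is <= value."""
    ranked = sorted(field_values)
    return 100 * bisect_right(ranked, value) / len(ranked)


# Hypothetical index values (e.g. h-indices) observed in two fields.
fields = {
    "medicine": [5, 20, 40, 60, 80],
    "mathematics": [1, 2, 4, 6, 10],
}

print(percentile(40, fields["medicine"]))    # prints 60.0
print(percentile(6, fields["mathematics"]))  # prints 80.0
```

In this illustration the mathematician with a raw index of 6 obtains a higher normalized index than the medical researcher with a raw index of 40. The result depends entirely on how the fields are delimited, which is the problem raised in the last bullet.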

Source field normalization

• Papers in medicine are often cited. This implies that they have long reference lists. Papers in mathematics have short reference lists.

• Instead of defining disciplines or fields, use the length of the reference list to normalize. Thus, divide the number of citations received by a paper by the length of the reference list.

Distributions

Lotka’s law

Proportion of scientists with n papers: F(n) = C / n^a, with a ≃ 2 and C depending on the field.

Non universal power law

Peterson, Pressé and Dill, Proceedings of the National Academy of Sciences, 107, 2010.

Direct citations: the probability that a new paper will randomly cite paper A is P_direct = 1/N, with N the total number of published papers.

Indirect citations: the author of the new paper may first find a paper B and learn of paper A via B’s reference list. P_indirect = k/(Nn), with k the number of existing citations to A and n the average length of the reference list.

Non universal power law (ctd)

Fraction of the N papers with k citations :
