“The Rise and Rise of Citation Analysis”
Tanmoy Chakraborty
CNeRG, IIT-Kgp, India
Twofold Research Interests
• Analyzing communities/clusters in complex networks
• Studying different aspects of citation network
In collaboration with
Suhansanu Kumar, Pawan Gowel, Animesh Mukherjee, Niloy Ganguly
Mixed Sentiment
• Sense and non-sense about citation analysis (*6860) --- T. Opthof, Cardiovascular Research, 97
• The rise and rise of citation analysis (*1399) --- Lokman I. Meho, Phy. Res., 07
• Does citation pay? (*887) --- Fowler & Aksnes, Scientometrics, 07
• Think beyond citation analysis (*1009) --- Sarli et al., NIPS, 10
Raw Citation Count
Used to assess:
• Quality of a paper
• Prominence of a researcher
• Success of a collaboration/group
• Quality of a conference/journal
• Quality of an institute
• Impact of a research area
Only Citation Count
Sooner or later, you will definitely be subjected to such an analysis
Bibliometrics: Raw Citation Count
• Journal Impact Factor
• Immediacy factor
• Eigenfactor
• Altmetric
• 5-year Impact Factor
Citation
Common assumption
Publication Universe
• Crawled the entire Microsoft Academic Search
• Papers only in the Computer Science domain
• Basic preprocessing
Basic statistics of papers from 1960-2010
Statistic                           Value
Number of valid entries             3,473,171
Number of authors                   1,186,412
Number of unique venues             6,143
Avg. number of papers per author    5.18
Avg. number of authors per paper    2.49
Publication Universe
Available metadata for each paper
Title
Unique ID
Author names (named-entity disambiguated)
Year of publication
Publication venue (named-entity disambiguated)
Related research field(s)
References
Keywords
Abstract
Citation context
Citation Profile
An exhaustive analysis of the citation profiles
• Consider papers with at least 10 years of citation history, and use at most 20 years of that history
• Scale the entries of each citation profile to the range 0-1
• Use peak-detection heuristics
» Each peak should be at least 75% of the maximum peak
» Two consecutive peaks should be separated by at least 3 years
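
The heuristics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, scaling is assumed to mean dividing by the maximum yearly count, a peak is assumed to be a local maximum, and when two candidate peaks are closer than 3 years the taller one is assumed to win.

```python
def scale_profile(counts):
    """Scale yearly citation counts to 0-1 (assumption: divide by the maximum)."""
    top = max(counts)
    if top == 0:
        return [0.0 for _ in counts]
    return [c / top for c in counts]


def find_peaks(counts, min_rel_height=0.75, min_gap=3):
    """Return the year indices (0 = publication year) that count as peaks.

    A candidate is a local maximum whose scaled height is at least
    min_rel_height; candidates closer than min_gap years to a taller,
    already accepted peak are dropped.
    """
    profile = scale_profile(counts)
    last = len(profile) - 1
    candidates = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("-inf")
        right = profile[i + 1] if i < last else float("-inf")
        # Strictly above the left neighbour, so a plateau is counted once.
        if value > left and value >= right and value >= min_rel_height:
            candidates.append(i)

    kept = []
    # Accept the tallest candidates first, then enforce the minimum gap.
    for i in sorted(candidates, key=lambda idx: (-profile[idx], idx)):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


# Hypothetical yearly citation counts for two papers.
print(find_peaks([1, 8, 3, 2, 2, 7, 10, 4, 1, 0]))  # [1, 6]
print(find_peaks([0, 9, 2, 10, 1]))                 # [3]
```

In the second call the peak in year 1 is dropped because it lies only 2 years from the taller peak in year 3.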
Five Universal Citation Profiles
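[Figure: the five citation profiles (Peak_Int, Peak_Mul, Peak_Late, MonDec, MonIncr). Each panel shows the average behavior; Q1 and Q3 represent the first and third quartiles of the data points.]
Another category, ‘Oth’: papers receiving fewer than one citation per year on average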
Five Universal Citation Profiles
A deeper look
Immediate Questions
• Is the Journal Impact Factor (JIF) formula correct?
• JIF at year 2000, Eugene Garfield (1975):
(# of citations received in 2000 by papers published in that journal in 1998 and 1999)
divided by
(# of papers published in the journal in 1998 and 1999)
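
As a worked instance of the formula above, here is a small Python sketch. The function name, the dictionary layout, and all numbers are hypothetical and chosen only for illustration.

```python
def impact_factor(year, citations, papers_per_year, window=2):
    """Journal Impact Factor for `year` over the preceding `window` years.

    citations:       {(citing_year, cited_publication_year): citation count}
    papers_per_year: {publication_year: number of papers in the journal}
    """
    prior_years = range(year - window, year)  # 1998 and 1999 for year=2000
    cites = sum(citations.get((year, y), 0) for y in prior_years)
    papers = sum(papers_per_year.get(y, 0) for y in prior_years)
    if papers == 0:
        raise ValueError("no papers published in the window")
    return cites / papers


# Hypothetical journal: citations received in 2000, keyed by cited year.
citations = {(2000, 1997): 50, (2000, 1998): 120, (2000, 1999): 80}
papers_per_year = {1997: 30, 1998: 60, 1999: 40}

print(impact_factor(2000, citations, papers_per_year))  # 2.0
```

The 50 citations to the 1997 papers are ignored by the 2-year window, which is the kind of late-arriving credit the questions on the following slides are concerned with.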
Immediate Questions
What does JIF really imply?
• Importance of the recent papers in the current time period?
• Relevance of the journal itself in the current time period?
• Why the last 2 years? Why not all the citations at the current time?
Immediate Questions
JIF overlooks the importance of Peak_Late and MonIncr
[Figure annotations: Over-consider, Under-consider]
More on the Categories
1. Are they biased on the year of publication? (Aging factor)
Category     Mean year   Year deviation
Peak_Int     1994        5.19
Peak_Mul     1992        6.68
Peak_Late    1992        6.54
MonDec       1994        5.43
MonIncr      1993        5.36
Ans: No (same age)
More on the Categories
2. Are they biased on journals/conferences?
                         Peak_Int   Peak_Mul   Peak_Late   MonDec   MonIncr
% of conference papers   65         39.03      39.89       60.73    25.26
% of journal papers      35         60.97      60.11       39.27    74.74
More on the Categories
3. Are they affected by self-citation?
Transition matrix showing the transition of categories after removing self-citations (row: original category, column: category after removal)
             Peak_Int   Peak_Mul   Peak_Late   MonDec   MonIncr   Oth
Peak_Int     0.72       0.10       0.03        0.01     0         0.15
Peak_Mul     0.02       0.81       0.04        0        0.01      0.11
Peak_Late    0.01       0.06       0.86        0        0.01      0.06
MonDec       0.05       0.14       0           0.41     0         0.35
MonIncr      0          0.02       0.01        0        0.88      0.09
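Most affected by self-citations: MonDec (only 0.41 of its papers keep their category)
Least affected by self-citations: MonIncr (0.88 of its papers keep their category)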
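
A transition matrix of this kind can be built by labelling every paper twice, once with all citations and once with self-citations removed, and then row-normalising the counts. The sketch below assumes that procedure; the function name and the toy labels are hypothetical.

```python
from collections import Counter

ROWS = ["Peak_Int", "Peak_Mul", "Peak_Late", "MonDec", "MonIncr"]
COLS = ROWS + ["Oth"]


def transition_matrix(before, after, rows=ROWS, cols=COLS):
    """Fraction of papers moving from category r (before) to category c (after).

    before[i] and after[i] are the categories of paper i with and without
    self-citations. Every non-empty row sums to 1.
    """
    counts = Counter(zip(before, after))
    matrix = {}
    for r in rows:
        total = sum(counts[(r, c)] for c in cols)
        matrix[r] = {c: (counts[(r, c)] / total if total else 0.0) for c in cols}
    return matrix


# Toy labels for five papers (hypothetical).
before = ["MonDec", "MonDec", "MonDec", "MonDec", "Peak_Int"]
after = ["MonDec", "MonDec", "Oth", "Peak_Mul", "Peak_Int"]

m = transition_matrix(before, after)
print(m["MonDec"]["MonDec"], m["MonDec"]["Oth"])  # 0.5 0.25
```

The diagonal entry of a row is the share of papers whose category survives the removal of self-citations, so a low diagonal value marks a category that depends heavily on them.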
More on the Categories
4. What about Peak_Mul?
Peak_Mul might be an intermediary between Peak_Int and Peak_Late
[Figure: average peak height versus years after publication for Peak_Int, Peak_Mul and Peak_Late. Values annotated on the plot: 2.5, 5.3, 5.1, 3.1, 4.2, 5.3, 12.1, 10.8]
3.1 + 2.5 = 5.6
(12.1 - 5.3) = 6.8 ~ (10.8 - 4.2)
Where does this classification help?
• To improve bibliometrics in scientific research
• Various prediction systems
» Future citation prediction system
» Predicting emerging fields/topics
» Predicting future star/seminal papers
• Paper search and recommender systems
On Predicting Future Citation Count at the Time of Publication
Problem Definition
Traditional Framework
Yan et al., JCDL, 2012
Problems in Traditional Framework
• Considers the statistics of the first few years after publication (proved to be very effective)
• Lacks a time dimension in prediction
• Suffers a lot from outlier points during regression
Problems in Traditional Framework: how to tackle
• Considers the statistics of the first few years after publication (proved to be very effective)
» Try to predict citations as early as possible (maybe at the time of publication)
• Lacks a time dimension in prediction
» Should consider the time dimension
• Suffers a lot from outlier points during regression
» Reduce outlier points as much as possible
Our Framework: 2-stage Model
Features
Author-centric
• Productivity (Max/Avg)
• H-index (Max/Avg)
• Versatility (Max/Avg)
• Sociality (Max/Avg)
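
As an illustration of one author-centric feature, the sketch below computes the standard h-index (the largest h such that h papers have at least h citations each) and then the Max/Avg pair. It assumes "Max/Avg" means the maximum and the average over the authors of a paper; the function names and citation counts are hypothetical.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def max_avg(values):
    """Aggregate a per-author feature over a paper's authors (assumed meaning)."""
    return max(values), sum(values) / len(values)


# Hypothetical citation counts of the prior papers of three co-authors.
coauthors = [
    [10, 8, 5, 4, 3],  # h-index 4
    [6, 1],            # h-index 1
    [3, 3, 3],         # h-index 3
]

h_values = [h_index(papers) for papers in coauthors]
print(h_values)           # [4, 1, 3]
print(max_avg(h_values))  # max is 4, average is 8/3
```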
Performance Evaluation
(i) Coefficient of determination (R2): the higher, the better
(ii) Mean squared error (θ): the lower, the better
(iii) Pearson correlation coefficient (ρ): the higher, the better
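
The three measures can be computed as follows. This is a plain-Python sketch using the textbook definitions; the slides do not state which variant of R2 was used, so the form 1 - SS_res/SS_tot is an assumption, and the sample values are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def r_squared(actual, predicted):
    """Coefficient of determination, assumed to be 1 - SS_res / SS_tot."""
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_squared_error(actual, predicted):
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def pearson(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sd_a = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sd_a * sd_p)


# Hypothetical citation counts: every prediction is too high by exactly 1.
actual = [1, 2, 3, 4]
predicted = [2, 3, 4, 5]

print(mean_squared_error(actual, predicted))   # 1.0
print(round(r_squared(actual, predicted), 3))  # 0.2
print(round(pearson(actual, predicted), 3))    # 1.0
```

The example shows why the measures are reported together: a constant offset leaves the correlation perfect while the squared error and R2 still expose the bias.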
Performance of SVM
Confusion Matrix
Performance of Regression Model
Feature Analysis
Conclusion
• Five universal citation profiles
• Several analyses of these categories
• Can help to reframe existing bibliometrics
• Can be a generic way in machine learning
• Can enhance the performance of the existing systems
Thank You