network analysis in bibliometrics - lovro Šubelj @...
TRANSCRIPT
networkanalysis inbibliometrics
LovroŠubeljUniversityofLjubljana,
FacultyofComputerandInformationScience
CWTS‘17
UniversityofLjubljana
• since1919 271st inCWTSLeidenRanking2017
• 26members 23faculties&3academies
• 40,110students&5,730staffin2016
FacultyofComputerandInformationScience
• since1996cs studysince1973• ≈1,300students&≈180staff• BSc,MSc,PhD cs,prog,math,mm
• research cs,db,is,dm,ml,ai,nets
talkoutline
1. reliabilityofbibliographicdatabasesŠubelj,L.,Fiala,D.,&Bajec,M.(2014).ScientificReports,4,6496.Šubelj,L.,Bajec,M.,Boshkoska, B.M.,etal.(2015).PLoS ONE,10(5),e0127390.
2. modelingpapercitationnetworksŠubelj,L.,&Bajec,M.(2013).InProceedingsoftheLSNA‘13,pp.527–530.Šubelj,L.,Žitnik,S.,&Bajec,M.(2014).InProceedingsoftheNetSci’14,p.1.
3. clusteringpapercitationnetworksŠubelj,L.,VanEck,N.J.,&Waltman,L.(2016).PLoS ONE,11(4),e0154404.
bibliographicdatabasesreliability
• databasesbasisforresearch&evaluation
• databasescandiffersubstantiallydifferentdatabasesoftengivequitedifferentconclusions
• content&structure candiffersubstantiallycoverage,timespan,features,accuracy,acquisitionetc.
• onlyinformalnotionsontheirreliabilityparticularcaseofreliabilityofstructureofcitationnetworks
structureofcitationnetworks
• statisticsofcitationnetworks• mostly consistentwithoutliersoutliersduetodataacquisitioninmostcases
• comparisonoveronestatistic• comparisonovermanystatistics?sameprobleminmachinelearningcommunity
methodology ofdatabasecomparison
• networkstatistics— residuals— databaserank• meanranks ofdatabasesovermanystatistics• residualssince“truedatabase”isnotknowndatabasereliabilityseenasconsistencywithotherdatabases
Studentized statistics residuals x̂ij
Two-tailed Student statistics t-testsH0 : x̂ij = 0 at P -value = 0.1
Student t-distribution with d.f. N � 2
9⇢ij : H1
8x̂ij : H0
Pairwise Spearman correlations ⇢ij
Two-tailed Fisher independence z-testsH0 : ⇢ij = 0 at P -value = 0.01
Standard normal distribution
8⇢ij : H0
9x̂ij : H1
Residuals mean ranks Ri
One-tailed Friedman rank testH0 : Ri = Rj at P -value = 0.1
�2-distribution with d.f. N � 1
H0
H1
Residuals mean ranks Ri
Two-tailed Nemenyi post-hoc testH0 : Ri = Rj at P -value = 0.1
Studentized range with d.f. N25
H0
1
2 3
4
comparisonofcitationnetworks
• comparisonofdifferentcitationnetworksresultsrobusttoselectionofnetworks,statistics,patternsetc.
• comparisonofdifferentinformationnetworks
P -value = 0.1
1 2
3
4
5
6
WoS
Cora
arXiv APS
PubMed
DBLP
P -value = 0.1
1 2
3
4
5
6
Cora
arXiv
WoS PubMed
DBLP
APS
A P!P B A$A
P -value = 0.1
1 2
3
4
5
6
DBLP
WoS
Cora APS
PubMed
arXiv
C A�A
comparisonofbibliographicnetworks
• Apapercitation networks informationnetworks
• Cauthorcollaboration networks socialnetworks• Bauthorcitationnetworkssocial-informationnetworks
P -value = 0.1
1 2 3 4
5 6
WoS
Cora
arXiv APS
PubMed
DBLP
P -value = 0.1
1 2 3 4
5 6
Cora
arXiv
WoS PubMed
DBLP
APS
A P!P B A$A
P -value = 0.1
1 2 3 4
5 6
DBLP
WoS
Cora APS
PubMed
arXiv
C A�A
A B
Cthereisno
“best”database!
talkoutline
1. reliabilityofbibliographicdatabasesŠubelj,L.,Fiala,D.,&Bajec,M.(2014).ScientificReports,4,6496.Šubelj,L.,Bajec,M.,Boshkoska, B.M.,etal.(2015).PLoS ONE,10(5),e0127390.
2. modelingpapercitationnetworksŠubelj,L.,&Bajec,M.(2013).InProceedingsoftheLSNA‘13,pp.527–530.Šubelj,L.,Žitnik,S.,&Bajec,M.(2014).InProceedingsoftheNetSci’14,p.1.
3. clusteringpapercitationnetworksŠubelj,L.,VanEck,N.J.,&Waltman,L.(2016).PLoS ONE,11(4),e0154404.
modelsofcitationnetworks
• generativemodels ofcitationnetworkstoreasonaboutstructure,evolution,dynamics,futureetc.
• manypossibleapplications inbibliometrics
y z
a
x
i
y z
a
x
i
y z
a
x
i
forestfirenetworkmodel
• eachnewnodei formslinksasfollows1. i selectsinitialambassadora andlinkstoa2. i selectsitsneighborsy,z andlinkstoy,z3. y,z aretakenasnewambassadorsofi
y z
a
x
i
wv
y z
a
x
i
wv
forestfirecitationmodel
• eachnewpaperi citesasfollows1. i selectsinitialpapera andcitesa2. i selectsitsreferencesy,z andcitesy,z3. y,z aretakenasnewreadingfori
• thenauthorsreadallcited papers andvice-versa• only≈20%referencesread (Simkin &Roychowdhury,2003)
y z
a
x
i
wv
y z
a
x
i
wv
realisticcitationmodel
• eachnewpaperi citesasfollows1. i selectsinitialpapera andcancitea2. i selectsitsreferencesy,z andcancitey,z3. somereferencesaretakenasnewreadingfori
• read&citedpapersmodeledindependently
y z
a
x
i
wv
y z
a
x
i
wv
directedcitationmodel
• directeddynamicsmuchmorecomplicated• modelreproducesWoS citationnetworks• clearoptima (peak)inmodelparameters
talkoutline
1. reliabilityofbibliographicdatabasesŠubelj,L.,Fiala,D.,&Bajec,M.(2014).ScientificReports,4,6496.Šubelj,L.,Bajec,M.,Boshkoska, B.M.,etal.(2015).PLoS ONE,10(5),e0127390.
2. modelingpapercitationnetworksŠubelj,L.,&Bajec,M.(2013).InProceedingsoftheLSNA‘13,pp.527–530.Šubelj,L.,Žitnik,S.,&Bajec,M.(2014).InProceedingsoftheNetSci’14,p.1.
3. clusteringpapercitationnetworksŠubelj,L.,VanEck,N.J.,&Waltman,L.(2016).PLoS ONE,11(4),e0154404.
clustering citationnetworks
• clusteringpapers basedondirectcitationrelationsresearchareasortopicsofpapers
• systematiccomparison oflargenumberofmethodsnetworkclusteringandpartitioning
thereisno“best”method!