A Comparison of On-line Computer Science Citation Databases
Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles
v.petricek@cs.ucl.ac.uk
http://www.cs.ucl.ac.uk/staff/V.Petricek
Motivation
Autonomous databases have advantages compared to manually constructed ones
- Easier maintenance
- Lower cost
Is it really an equivalent solution that is just cheaper?
Does the automated acquisition introduce any bias?
Talk Overview
Datasets
Acquisition bias and models
CS Citation Distribution
Conclusions
Future Work
Datasets - DBLP
DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors.
Each entry is manually inserted by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.
Datasets - CiteSeer
CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains over 716,797 documents.
In contrast, each entry in CiteSeer is automatically entered from an analysis of documents found on the Web.
Datasets – Publication year
[Figure: publications by publication year, CiteSeer and DBLP]
Declining CiteSeer maintenance
Increased DBLP funding
Author bias
CiteSeer papers have a higher average number of authors
Both databases show growing team sizes
Author bias
Crossover for low number of authors
CiteSeer has higher proportion of multiauthor papers than DBLP
(for number of authors <4)
Author bias
Hypothesis: “Papers with a higher number of authors are more likely to be included in CiteSeer”
Crawler suffers from acquisition bias due to
- Submission
- Crawling
Models - CiteSeer
CiteSeer Submission model
Probability of a document being submitted grows with number of authors
- Each coauthor submits the publication with probability β
- Probabilities independent for coauthors
citeseer_s(i) = (1 - (1 - β)^i) * all(i)
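The submission model can be tried out numerically. The following is a minimal sketch, assuming that i is the number of authors and that all(i) is the number of all publications with i authors, so that the factor 1 - (1 - β)^i is the chance that at least one coauthor submits the paper. The function name and the value β = 0.2 are illustrative only and do not come from the talk.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Probability that at least one of `num_authors` coauthors submits
    the paper, when each submits independently with probability `beta`."""
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # Complement of "no coauthor submits the paper".
    return 1.0 - (1.0 - beta) ** num_authors


if __name__ == "__main__":
    beta = 0.2  # hypothetical value, chosen only for illustration
    for i in range(1, 6):
        print(i, round(submission_inclusion_probability(i, beta), 4))
```

Under this sketch the inclusion probability rises with every additional author, which is the bias the hypothesis describes.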
Models - CiteSeer
CiteSeer crawler model
- Probability of crawling a document grows with number of its online copies
- Probability of a document being online grows with number of authors
- Probabilities independent between authors
- Publication published online with probability δ
- Publication found by crawler with probability γ
citeseer_c(i) = (1 - (1 - γδ)^i) * all(i)
Both models result in an equivalent type of bias
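The crawler model can be checked with a small simulation. This is a sketch under stated assumptions, not the authors' code: each author independently puts a copy online with probability δ, the crawler independently finds each online copy with probability γ, and a paper counts as included once any copy is found. The function names, parameter values and trial count are hypothetical.

```python
import random


def crawler_inclusion_closed_form(num_authors: int, gamma: float, delta: float) -> float:
    """Closed form from the crawler model: 1 - (1 - gamma * delta) ** i."""
    return 1.0 - (1.0 - gamma * delta) ** num_authors


def simulate_crawler_inclusion(num_authors: int, gamma: float, delta: float,
                               trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the fraction of papers the crawler includes."""
    rng = random.Random(seed)
    included = 0
    for _ in range(trials):
        for _ in range(num_authors):
            online = rng.random() < delta             # this author posts a copy
            if online and rng.random() < gamma:       # the crawler finds that copy
                included += 1
                break                                 # one found copy is enough
    return included / trials


if __name__ == "__main__":
    gamma, delta = 0.5, 0.4  # illustrative values only
    for i in (1, 2, 3, 5):
        print(i,
              round(crawler_inclusion_closed_form(i, gamma, delta), 4),
              round(simulate_crawler_inclusion(i, gamma, delta), 4))
```

Because γ and δ only enter through their product, the crawler model has the same form as the submission model with β replaced by γδ, which is why both give the same type of bias.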
Coverage
Can we estimate the coverage of DBLP?
Can we estimate the coverage of CiteSeer?
Can we estimate the coverage of CS literature?
We need a model of DBLP acquisition method
Models - DBLP
DBLP model
- Publication included in DBLP with probability α
- α is a parameter reflecting DBLP “coverage” of CS literature
dblp(i) = α * all(i)
Coverage
citeseer(i) = (1-(1- β )^i) * all(i)
dblp(i) = α * all(i)
r(i) = dblp(i) / citeseer(i)
r(i) = α / (1-(1- β )^i)
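The ratio is useful because the unknown all(i) cancels. The derivation can be written out as follows; the two special cases on the last line are added here as a worked consequence of the formula and are not stated on the slide.

```latex
\begin{align*}
r(i) &= \frac{\mathrm{dblp}(i)}{\mathrm{citeseer}(i)}
      = \frac{\alpha\,\mathrm{all}(i)}{\bigl(1-(1-\beta)^{i}\bigr)\,\mathrm{all}(i)}
      = \frac{\alpha}{1-(1-\beta)^{i}}, \\
r(1) &= \frac{\alpha}{\beta}, \qquad
\lim_{i\to\infty} r(i) = \alpha \quad (0 < \beta \le 1).
\end{align*}
```

Under this model the ratio for papers with many authors flattens out near α, so α and β can both be estimated from the observed ratios alone.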
Results
• r(i) = α / (1-(1- β )^i)
α ≈ 0.3
DBLP covers approx. 30% of CS literature
CiteSeer covers approx. 40%
CS literature ≈ 2M publications
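The arithmetic behind these figures can be reproduced from the dataset sizes quoted earlier in the talk. This is a rough sketch that assumes every DBLP entry and every CiteSeer document counts as one distinct CS publication; the constant and function names are hypothetical.

```python
DBLP_ENTRIES = 550_000        # "over 550,000" references (DBLP slide)
CITESEER_DOCUMENTS = 716_797  # documents in CiteSeer (CiteSeer slide)
ALPHA = 0.3                   # estimated DBLP coverage (Results slide)


def estimate_total_literature(dblp_entries: int, alpha: float) -> float:
    """Invert dblp = alpha * all to estimate the size of the literature."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return dblp_entries / alpha


def estimate_coverage(entries: int, total: float) -> float:
    """Fraction of the estimated literature held by a database."""
    return entries / total


if __name__ == "__main__":
    total = estimate_total_literature(DBLP_ENTRIES, ALPHA)
    print(f"estimated CS literature: {total:,.0f} publications")
    print(f"estimated CiteSeer coverage: {estimate_coverage(CITESEER_DOCUMENTS, total):.2f}")
```

With these inputs the total comes out at about 1.8 million publications and the CiteSeer coverage at about 0.39, consistent with the rounded figures of 2M and 40%.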
Citation distribution
Citation distribution
Studied before
Follow a power law
Redner, Laherrere et al., Lehmann and others
Mostly physics community
We use a subset of CiteSeer and DBLP papers that have citation information
Citation distribution
Power law
Sparse data for high number of citations
Citation distribution
Exponential binning
Data aggregated in exponentially increasing ‘bins’
Equivalent to constant bins on a logarithmic scale
Easier interpolation
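Exponential binning can be sketched in a few lines. The version below is one possible reading, not the authors' procedure: bin k covers citation counts in [base^k, base^(k+1)), and each bin count is divided by the bin width so that wide and narrow bins are comparable. The function name, the base of 2 and the width normalisation are all assumptions.

```python
from collections import Counter


def exponential_bins(citation_counts, base=2.0):
    """Aggregate citation counts into bins [base**k, base**(k+1)).

    Returns a list of (bin_start, bin_end, density) tuples, where density
    is the number of papers in the bin divided by the bin width.
    Papers with zero citations are skipped because they cannot be placed
    on a logarithmic axis.
    """
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    papers_per_bin = Counter()
    for count in citation_counts:
        if count < 1:
            continue
        k = 0
        # Find the largest k with base**k <= count.
        while base ** (k + 1) <= count:
            k += 1
        papers_per_bin[k] += 1
    result = []
    for k in sorted(papers_per_bin):
        start, end = base ** k, base ** (k + 1)
        result.append((start, end, papers_per_bin[k] / (end - start)))
    return result


if __name__ == "__main__":
    sample = [0, 1, 1, 2, 3, 4, 7, 8]  # made-up citation counts
    for start, end, density in exponential_bins(sample):
        print(f"[{start:g}, {end:g}): {density:g}")
```

Since every bin spans the same distance on a logarithmic axis, the sparse tail of highly cited papers is pooled into a few well-populated points.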
Citation distribution
Distribution of citations more uneven in CS than in Physics
Significant differences between DBLP and CiteSeer
Power-law slope by number of citations:

| # citations | Lehmann | DBLP | CiteSeer |
|---|---|---|---|
| < 50 | -1.29 | -1.876 | -1.504 |
| > 50 | -2.32 | -3.509 | -3.074 |
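A slope of this kind is the gradient of a straight line on a log-log plot. The talk does not say how the slopes were fitted, so the sketch below assumes an ordinary least-squares fit of log10(frequency) against log10(citations), done separately below and above 50 citations. The function names and the synthetic data are hypothetical.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    return sxy / sxx


def piecewise_slopes(xs, ys, breakpoint=50):
    """Fit separate slopes for x < breakpoint and x > breakpoint."""
    low_x = [x for x in xs if x < breakpoint]
    low_y = [y for x, y in zip(xs, ys) if x < breakpoint]
    high_x = [x for x in xs if x > breakpoint]
    high_y = [y for x, y in zip(xs, ys) if x > breakpoint]
    return loglog_slope(low_x, low_y), loglog_slope(high_x, high_y)


if __name__ == "__main__":
    # Synthetic data: exponent -1.5 below 50 citations, -3 above.
    xs = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    ys = [1000.0 * x ** -1.5 if x < 50 else 1.0e6 * x ** -3.0 for x in xs]
    print(piecewise_slopes(xs, ys))
```

A more negative slope means the frequency of papers falls off faster as the citation count grows.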
Citation distribution
CiteSeer contains fewer low cited papers than DBLP
No model yet
Lawrence
- “Online or invisible?”
Conclusions - authors
CiteSeer and DBLP have very different acquisition methods
Significant bias against papers with a low number of authors (fewer than 4) in CiteSeer.
Single author papers appear to be disadvantaged with regard to the CiteSeer acquisition method.
Two probabilistic models for paper acquisition in CiteSeer, both resulting in the same type of bias
- Crawler model
- Submission model
Conclusions - coverage
Simple model of DBLP coverage predicts coverage of approx 30% of the entire Computer Science literature.
This gives a CiteSeer coverage of approx. 40% and a total of around 2M CS papers.
Conclusions - citations
CiteSeer and DBLP citation distributions are different
Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics.
CiteSeer contains fewer low cited papers
Future Work
Repeat experiments on most recent CiteSeer data
Other methods to estimate computer science literature size and trends
- Overlap of CiteSeer and DBLP
Bias introduced by bibliography parsing
Collaborative network analysis
Connection to internet surveys?
Thank you