![Page 1: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/1.jpg)
Efficient blocking method for a large scale citation matching
Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
![Page 2: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/2.jpg)
Citation matching
• Note: it's an instance of the data linkage problem
References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...

| ID | Title | Author |
|---|---|---|
| 14 | De revolutionibus... | Copernicus |
| 11 | Στοιχεῖα | Εὐκλείδης |
![Page 3: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/3.jpg)
Why important?
• Clickable interfaces
• Bibliometrics (think: H-index)
• Further analysis (e.g. similarities)
Why difficult?
• Citation extraction errors (in both digital-born and retro-born docs)
• Countless citation styles used inconsistently
• Typos and other human errors
![Page 4: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/4.jpg)
The Problem
[Figure: a list of references to be linked to records in a document table with columns ID, Title, Author]
![Page 5: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/5.jpg)
Naïve approach
For 1.3M documents and 12M citations, that is 15.6 × 10¹² comparisons
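The figure above is simply the product of the two collection sizes. A small sketch of the arithmetic follows; the throughput of one million comparisons per second is a purely hypothetical assumption, used only to give a feeling for the scale.

```python
documents = 1_300_000
citations = 12_000_000

# Naive approach: compare every citation with every document.
comparisons = documents * citations
print(f"{comparisons:.3e}")

# Hypothetical throughput (an assumption, not a measured value).
comparisons_per_second = 1_000_000
days = comparisons / comparisons_per_second / 86_400
print(f"about {days:.0f} days on a single worker at that rate")
```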
![Page 6: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/6.jpg)
Select the best candidates
• I'll present a method of candidate selection and how to implement it using Apache Hadoop
![Page 7: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/7.jpg)
Blocking
![Page 8: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/8.jpg)
Fingerprints
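[Figure: references and document records (ID, Title, Author) labelled with fingerprints AAAA, BBBB, CCCC, EEEE, FFFF; the fingerprints AAAA and CCCC occur on more than one record]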
![Page 9: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/9.jpg)
Workflow
[Figure: MapReduce workflow. Map: each citation yields (hash, citation ID) records and each document yields (hash, document ID) records. Reduce: records sharing a hash are joined into (citation ID, document ID) pairs.]
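The map and reduce steps of this workflow can be illustrated with a minimal in-memory sketch. It is plain Python rather than Hadoop code, and the function names, the sample records and `toy_hash` are hypothetical stand-ins, not part of the presented implementation.

```python
from collections import defaultdict

def map_phase(citations, documents, hash_fn):
    """Emit (hash, (kind, id)) for every hash of every record."""
    for cid, text in citations.items():
        for h in hash_fn(text):
            yield h, ("citation", cid)
    for did, text in documents.items():
        for h in hash_fn(text):
            yield h, ("document", did)

def reduce_phase(emitted):
    """Group records by hash and pair citations with documents in each bucket."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, (kind, record_id) in emitted:
        buckets[h][kind].add(record_id)
    pairs = set()
    for bucket in buckets.values():
        for cid in bucket["citation"]:
            for did in bucket["document"]:
                pairs.add((cid, did))
    return pairs

def toy_hash(text):
    # Hypothetical hash function: every lowercased token is a hash.
    return set(text.lower().split())

citations = {"c1": "Copernicus De revolutionibus", "c2": "Newton Philosophiae naturalis"}
documents = {14: "De revolutionibus Copernicus", 11: "Stoicheia Euclid"}
pairs = reduce_phase(map_phase(citations, documents, toy_hash))
print(pairs)  # {('c1', 14)}
```

Only records that share at least one hash are ever paired, which is what avoids the all-against-all comparison.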
![Page 10: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/10.jpg)
Workflow with tuning
• Before:
  • Compute bucket sizes
  • Reject too big ones
  • Use DistributedCache to disseminate the results
• After:
  • For each citation choose only the most popular candidates
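The two tuning steps can be sketched as follows. This is an in-memory illustration under stated assumptions: bucket size is taken to be the number of documents sharing a hash, and "most popular" is taken to mean the documents sharing the most hashes with a citation. The function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict

def candidate_pairs(citation_hashes, document_hashes, max_bucket_size, max_candidates):
    # "Before": compute bucket sizes and reject the buckets that are too big.
    buckets = defaultdict(set)
    for did, hashes in document_hashes.items():
        for h in hashes:
            buckets[h].add(did)
    allowed = {h: docs for h, docs in buckets.items() if len(docs) <= max_bucket_size}

    # "After": for each citation keep only the most popular candidates,
    # i.e. the documents that were reached through the most hashes.
    result = {}
    for cid, hashes in citation_hashes.items():
        votes = Counter()
        for h in set(hashes):
            for did in allowed.get(h, ()):
                votes[did] += 1
        result[cid] = [did for did, _ in votes.most_common(max_candidates)]
    return result

document_hashes = {
    1: {"rough", "sets", "pawlak"},
    2: {"rough", "paths"},
    3: {"rough", "guide"},
}
citation_hashes = {"c1": {"pawlak", "rough", "sets"}}

# The "rough" bucket holds three documents, so a limit of 2 rejects it.
print(candidate_pairs(citation_hashes, document_hashes, 2, 30))  # {'c1': [1]}
```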
![Page 11: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/11.jpg)
Hash functions
![Page 12: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/12.jpg)
Normalisation
• Lowercase
• Remove
  • diacritics
  • punctuation marks
• Filter out tokens shorter than 3 characters (except numbers)
![Page 13: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/13.jpg)
Normalisation
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
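The normalisation steps can be sketched in a few lines. This is a minimal version under assumptions: the hand-written table for letters such as "ł", which Unicode decomposition leaves unchanged, is illustrative and not exhaustive, and the name `normalise` is hypothetical.

```python
import re
import unicodedata

# Letters that NFKD does not decompose; mapped by hand (assumed, not exhaustive).
_EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})

def normalise(text):
    text = text.translate(_EXTRA).lower()
    # Decompose accented letters and drop the combining marks (diacritics).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation and any other non-alphanumeric run with a space.
    text = re.sub(r"[\W_]+", " ", text)
    # Keep numbers of any length, drop other tokens shorter than 3 characters.
    return [t for t in text.split() if t.isdigit() or len(t) >= 3]

citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
print(" ".join(normalise(citation)))
# pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
```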
![Page 14: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/14.jpg)
Examples
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
{ author: "Zdzisław Pawlak", year: "1982", title: "Rough sets", journal: "International Journal of Computer & Information Sciences", volume: "11", issue: "5", pages: "341–356"}
![Page 15: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/15.jpg)
Baseline
Citation hashes: pawlak, zdzislaw, 1982, rough, ..., internat, ...
Document hashes: zdzislaw, pawlak, 1982, rough, ..., international, journal, ...
![Page 16: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/16.jpg)
Bigrams
• For documents, we use only the author and title fields
Citation hashes: pawlak zdzislaw, zdzislaw 1982, 1982 rough, rough sets, ...
Document hashes: zdzislaw pawlak, rough sets
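A bigram hash is a pair of adjacent normalised tokens. The sketch below assumes that document bigrams are built separately for the author and title fields, which matches the example hashes shown; the function names are hypothetical.

```python
def bigrams(tokens):
    """Return every pair of adjacent tokens joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def document_bigrams(author_tokens, title_tokens):
    # Assumption: fields are processed independently, so no bigram spans two fields.
    return bigrams(author_tokens) + bigrams(title_tokens)

citation_tokens = "pawlak zdzislaw 1982 rough sets".split()
print(bigrams(citation_tokens))
# ['pawlak zdzislaw', 'zdzislaw 1982', '1982 rough', 'rough sets']

print(document_bigrams(["zdzislaw", "pawlak"], ["rough", "sets"]))
# ['zdzislaw pawlak', 'rough sets']
```

In this example the citation and the document share the hash "rough sets", so they would land in the same bucket.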
![Page 17: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/17.jpg)
name-year
• For citation:
  • name: any of first 4 distinct text tokens
  • year: any number between 1900 and 2050
Citation hashes: pawlak#1982, zdzislaw#1982, rough#1982, sets#1982
Document hashes: zdzislaw#1982, pawlak#1982
+ approximate variant: zdzislaw#1981, pawlak#1981, zdzislaw#1983, pawlak#1983
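The name-year rule can be sketched as below. The sketch treats a "text token" as any non-numeric token and models the approximate variant as adding the neighbouring years, as the 1981 and 1983 examples suggest; it takes no position on which side of the match the variant is applied to. The function name is hypothetical.

```python
def name_year_hashes(tokens, approximate=False):
    # name: any of the first 4 distinct text (non-numeric) tokens
    names = []
    for t in tokens:
        if not t.isdigit() and t not in names:
            names.append(t)
        if len(names) == 4:
            break
    # year: any number between 1900 and 2050
    years = {int(t) for t in tokens if t.isdigit() and 1900 <= int(t) <= 2050}
    if approximate:
        # Assumed reading of the approximate variant: also accept year - 1 and year + 1.
        years |= {y + d for y in years for d in (-1, 1)}
    return {f"{name}#{year}" for name in names for year in years}

tokens = "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
print(sorted(name_year_hashes(tokens)))
# ['pawlak#1982', 'rough#1982', 'sets#1982', 'zdzislaw#1982']
```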
![Page 18: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/18.jpg)
name-year-pages
• For citation:
  • pages: any sorted pair of numbers, not year
Citation hashes: pawlak#1982#5#11, pawlak#1982#5#341, pawlak#1982#..., pawlak#1982#341#356, zdzislaw#..., zdzislaw#1982#341#356, rough#..., sets#...
Document hashes: zdzislaw#1982#341#356, pawlak#1982#341#356
+ approximate & optimistic variant
![Page 19: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/19.jpg)
Intermezzo: citation parsing
| Pawlak | , | Zdzisław | ( | 1982 | ) | . |
|---|---|---|---|---|---|---|
| author | other | author | other | year | other | other |
...
...
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
![Page 20: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/20.jpg)
name-year-num^n
• n = 1..3
• For citation:
  • num^n: any sorted tuple of numbers, not year
Citation hashes: pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#..., rough#..., sets#...
Document hashes: pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#...
+ approximate variant
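The sorted tuples of numbers can be enumerated with `itertools.combinations`. This is a minimal sketch assuming that names, year and numbers have already been extracted; the function name and its parameters are hypothetical.

```python
from itertools import combinations

def name_year_num_hashes(names, year, numbers, n):
    # Numbers other than the year, deduplicated and sorted so every tuple is sorted.
    others = sorted({x for x in numbers if x != year})
    return {
        "#".join([name, str(year), *map(str, combo)])
        for name in names
        for combo in combinations(others, n)
    }

hashes = name_year_num_hashes(["pawlak"], 1982, [11, 5, 341, 356], 3)
print(sorted(hashes))
```

With n = 2 the same function produces hashes of the same shape as the name-year-pages examples, such as pawlak#1982#341#356.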
![Page 21: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/21.jpg)
Evaluation
![Page 22: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/22.jpg)
Test dataset
<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal"><name><surname>Jemal</surname><given-names>A</given-names></name>, <name><surname>Bray</surname><given-names>F</given-names></name>, <name><surname>Center</surname><given-names>MM</given-names></name>, <name><surname>Ferlay</surname><given-names>J</given-names></name>, <name><surname>Ward</surname><given-names>E</given-names></name>, <etal>et al</etal> (<year>2011</year>) <article-title>Global cancer statistics</article-title>. <source>CA Cancer J Clin</source><volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage><pub-id pub-id-type="pmid">21296855</pub-id></mixed-citation></ref>
![Page 25: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/25.jpg)
Test dataset
2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global cancer statistics. CA Cancer J Clin 61: 69–90
![Page 26: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/26.jpg)
Test dataset
• Based on the Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, out of which 321k are resolvable
![Page 27: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/27.jpg)
Metrics
• Recall: the percentage of true citation → document links that are maintained by the heuristic
• Precision: the percentage of citation → document links returned by the algorithm that are correct
• Intermediate data: total number of hashes and pairs generated (before selecting the most popular ones)
• Candidate pairs: number of pairs returned by the heuristic for further assessment
• F-measure not included intentionally
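Recall and precision as defined above can be computed from two sets of citation → document links. The sketch below uses made-up identifiers purely for illustration; the function name is hypothetical.

```python
def precision_recall(candidate_pairs, true_links):
    candidate_pairs = set(candidate_pairs)
    true_links = set(true_links)
    correct = candidate_pairs & true_links
    # Precision: share of returned links that are correct.
    precision = len(correct) / len(candidate_pairs) if candidate_pairs else 0.0
    # Recall: share of true links that survived candidate selection.
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall

true_links = {("c1", 14), ("c2", 11)}
candidates = {("c1", 14), ("c1", 11), ("c1", 7), ("c2", 3)}
precision, recall = precision_recall(candidates, true_links)
print(precision, recall)  # 0.25 0.5
```

Low precision is expected at this stage, since the candidates are only passed on for further assessment.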
![Page 28: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/28.jpg)
Limits
• Candidate documents per citation
  • 30
  • no limit
• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit
![Page 29: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/29.jpg)
Recall

| hash | precision | recall | intermediate data | to assess |
|---|---|---|---|---|
| bigrams (10000, 30) | 0.4% | 98.2% | 285,908,900 | 79,329,459 |
| baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777 |
| bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883 |
| name-year (approx.) | 0.0% | 92.4% | 928,068,651 | 862,357,212 |
| name-year (strict) | 0.1% | 90.2% | 322,015,088 | 290,940,929 |
| baseline (10000, 10) | 0.9% | 88.7% | 221,212,080 | 49,747,843 |
| name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933 |
| name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129 |
| name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403 |
| name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080 |
| baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677 |
![Page 30: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/30.jpg)
Precision

| hash | precision | recall | intermediate data | to assess |
|---|---|---|---|---|
| name-year-pages (strict, optimistic) | 98.4% | 7.3% | 4,787,215 | 23,734 |
| name-year-num^3 (strict) | 84.0% | 43.4% | 257,639,965 | 166,128 |
| name-year-pages (approx., optimistic) | 78.2% | 7.8% | 42,478,742 | 32,182 |
| name-year-pages (strict, pessimistic) | 53.7% | 42.5% | 132,809,210 | 254,208 |
| name-year-num^3 (approx.) | 17.6% | 47.1% | 617,193,035 | 860,314 |
| name-year-num^2 (strict) | 14.8% | 66.6% | 141,885,270 | 1,444,074 |
| bigrams (10, 10) | 11.8% | 65.6% | 84,042,160 | 1,784,228 |
![Page 31: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/31.jpg)
Recall/intermediate data

| hash | precision | recall | intermediate data | to assess |
|---|---|---|---|---|
| name-year (strict, 1000, 30) | 2.5% | 77.9% | 28,463,067 | 9,940,403 |
| name-year (approx., 1000, 30) | 1.4% | 75.6% | 40,726,102 | 17,098,080 |
| name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734 |
| name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129 |
| bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883 |
| bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997 |
| baseline (1000, 30) | 0.9% | 73.2% | 115,822,141 | 26,083,677 |
| name-year-num (approx., 1000, 30) | 1.2% | 88.5% | 170,633,938 | 23,591,933 |
| baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560 |
| name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181 |
| baseline (10000, 30) | 0.3% | 97.9% | 221,212,080 | 114,223,777 |
![Page 32: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/32.jpg)
Recall vs. intermediate data
![Page 33: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/33.jpg)
Recall/to assess

| hash | precision | recall | intermediate data | to assess |
|---|---|---|---|---|
| name-year-pages (strict, optimistic, 1000, 30) | 98.4% | 7.3% | 4,787,215 | 23,734 |
| name-year-num^3 (strict, 1000, 30) | 84.0% | 43.4% | 257,637,645 | 165,995 |
| name-year-pages (approx., optimistic, 1000, 30) | 78.5% | 7.8% | 42,478,742 | 32,042 |
| name-year-pages (strict, pessimistic, 1000, 30) | 56.3% | 42.5% | 132,792,590 | 242,261 |
| name-year-num^3 (approx., 1000, 30) | 19.1% | 47.1% | 617,046,925 | 794,284 |
| name-year-num^2 (strict, 1000, 30) | 18.4% | 66.6% | 141,553,137 | 1,165,181 |
| bigrams (10, 30) | 11.8% | 65.6% | 84,042,160 | 1,793,997 |
| name-year-pages (approx., pessimistic, 1000, 30) | 9.9% | 45.8% | 172,447,469 | 1,483,980 |
| name-year-num (strict, 1000, 30) | 3.6% | 88.3% | 85,756,601 | 7,864,129 |
| name-year-num^2 (approx., 1000, 30) | 3.2% | 69.8% | 359,051,798 | 7,023,337 |
| baseline (100, 30) | 3.2% | 44.0% | 91,175,101 | 4,458,560 |
| bigrams (100, 30) | 2.9% | 92.7% | 94,693,721 | 10,446,883 |
![Page 34: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/34.jpg)
Recall vs. to assess
![Page 35: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/35.jpg)
Combination
![Page 36: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/36.jpg)
Lost citations

| Hash | Lost fraction |
|---|---|
| name-year (approx., 1000, 30) | 12.4% |
| name-year-num^2 (approx., 1000, 30) | 12.3% |
| name-year (strict, 1000, 30) | 9.8% |
| name-year-pages (approx., pessimistic, 1000, 30) | 9.0% |
| baseline (10000, 10) | 6.7% |
| name-year-num (approx., 1000, 30) | 6.0% |
| name-year (strict) | 5.8% |
| name-year-num^2 (strict, 1000, 30) | 5.6% |
| name-year (approx.) | 5.1% |
| name-year-num (strict, 1000, 30) | 4.4% |
| name-year-num^3 (approx., 1000, 30) | 4.2% |
| baseline (1000, 30) | 3.7% |
![Page 37: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/37.jpg)
Results

| Hash sequence | Recall | Intermediate data | To assess |
|---|---|---|---|
| bigrams (10000, 30) | 98.17% | 285,908,900 | 79,329,459 |
| name-year-pages (strict, optimistic); name-year (strict, 1000, 30); name-year (strict, 10000, 30); bigrams (10000, 30) | 87.64% | 187,394,452 | 41,152,278 |
| name-year-pages (strict, optimistic); name-year-pages (strict, pessimistic); bigrams (100, 30); bigrams (10000, 30) | 96.86% | 333,701,109 | 29,818,635 |
| name-year-pages (strict, optimistic); bigrams (100, 30); bigrams (10000, 30) | 97.76% | 202,590,413 | 30,582,488 |
| name-year-pages (strict, optimistic); name-year-num^3 (strict); bigrams (10, 10); bigrams (100, 30); bigrams (10000, 30) | 97.73% | 398,895,930 | 25,123,164 |
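One way to read a "hash sequence" is sketched below. It rests on an assumption the slides do not spell out: the heuristics are applied one after another, and a citation that receives any candidates from one heuristic is not passed to the later ones. All names and sample data are hypothetical.

```python
def combine(citation_ids, heuristics):
    """Apply heuristics in order; each one sees only citations still without candidates."""
    remaining = set(citation_ids)
    pairs = set()
    for heuristic in heuristics:
        # Each heuristic maps a set of citation ids to {citation id: set of document ids}.
        found = heuristic(set(remaining))
        for cid, docs in found.items():
            if docs:
                pairs.update((cid, did) for did in docs)
                remaining.discard(cid)
    return pairs, remaining

# Hypothetical heuristics: a precise one that only resolves c1, then a permissive one.
def strict(cids):
    return {cid: {14} for cid in cids if cid == "c1"}

def permissive(cids):
    return {cid: {11, 14} for cid in cids}

pairs, remaining = combine({"c1", "c2"}, [strict, permissive])
print(sorted(pairs))  # [('c1', 14), ('c2', 11), ('c2', 14)]
print(remaining)      # set()
```

Under this reading, a citation whose candidates from an early heuristic do not include the right document cannot be recovered by a later one, which would make the ordering of the sequence matter.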
![Page 38: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/38.jpg)
Future work
• Other combinations
  • After fine-grained assessment
  • Various hash functions at the same time
• Further efficiency tuning
  • Limit number of generated hashes
![Page 39: Efficient blocking method for a large scale citation matching](https://reader033.vdocument.in/reader033/viewer/2022052908/55944d871a28ab4a6f8b46ba/html5/thumbnails/39.jpg)
CoAnSys Project
• An open source framework for mining very large collections of scientific publications
• Contains implementation of the presented workflow
• http://coansys.ceon.pl/