Named Entity Recognition for Digitised Historical Texts
DESCRIPTION
Named Entity Recognition for Digitised Historical Texts, by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK); presented by Thomas Packer. A Need for Named Entity Recognition. Goal and High-Level Process: automatically index historical texts so they are searchable.

TRANSCRIPT
Named Entity Recognition for Digitised Historical Texts
by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball
(UK)
presented by Thomas Packer
A Need for Named Entity Recognition
Goal and High-Level Process

• Automatically index historical texts so they are searchable.

1. Scanning
2. OCR
3. Text processing
4. Named Entity Recognition (NER)
5. Indexing
6. Search and Retrieval
7. Highlighting
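The stages above can be sketched as a toy end-to-end pipeline. Every function body here is an illustrative stand-in, not the authors' system; the helper names (`ocr`, `preprocess`, and so on) are hypothetical.

```python
# Hypothetical sketch of the indexing pipeline (stages 2-5 above).
# None of these helpers come from the paper; they only show the ordering.

def ocr(page_image):
    # Stand-in for a real OCR engine reading a scanned page.
    return "Mr. Stratford spoke in the Town of London"

def preprocess(text):
    # Stand-in for the text-processing / clean-up stage.
    return text.split()

def recognise_entities(tokens):
    # Toy rule: "Mr." followed by a capitalised word is a person name.
    entities = []
    for i, tok in enumerate(tokens[:-1]):
        if tok == "Mr." and tokens[i + 1][0].isupper():
            entities.append(("PERSON", tokens[i + 1]))
    return entities

def build_index(entities):
    # Map each entity string to the labels it was found with.
    index = {}
    for label, name in entities:
        index.setdefault(name, []).append(label)
    return index

def run_pipeline(page_image):
    tokens = preprocess(ocr(page_image))
    return build_index(recognise_entities(tokens))

print(run_pipeline(None))  # -> {'Stratford': ['PERSON']}
```

The later search, retrieval, and highlighting stages would then consume the resulting index.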
Challenges: Document Structure
Challenges: Placement of Tokens and Capitalization
Challenges: OCR Errors Abound
Challenges Related to ML: Capitalization
• CoNLL 2003 NER shared task.
• Systems trained and tested on English and German.
• Highest F-score for English: 88.8.
• Highest F-score for German: 72.4.
• Capitalization is likely the main reason for the gap: it is usually a key feature in ML-based NER, and German capitalizes all nouns, making it less informative there.
Challenges Related to ML: POS
• (An untested guess:) POS tagging would likely perform poorly on OCR text.
• Poor POS tagging would in turn degrade ML-based NER, which typically uses POS tags as features.
Challenges Related to ML: Hand Labeling
• Quality of OCR text and kinds of OCR errors vary greatly from document to document.
• Traditional ML requires hand labeling for each document type.
Data
• Annotated data:
– British parliamentary proceedings
– Files: 91 (1800s); 46 (1600s)
– Person names: 2,751 (1800s); 3,199 (1600s)
– Location names: 2,021 (1800s); 164 (1600s)
• Document split:
– Training: none
– Dev-test: 45 files (1800s)
– Test: 46 files (1800s); 46 files (1600s)
Inter-Annotator Agreement (IAA), Balanced F-Score
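Agreement between two annotators can be measured as a balanced F-score by treating one annotator's entities as gold and the other's as system output. A minimal sketch, assuming entities are (start, end, label) tuples; the function and the sample data are illustrative, not from the paper:

```python
def balanced_f_score(annotator_a, annotator_b):
    """IAA as balanced F-score: annotator_a plays gold, annotator_b
    plays system output. Entities are (start, end, label) tuples."""
    a, b = set(annotator_a), set(annotator_b)
    if not a or not b:
        return 0.0
    agreed = len(a & b)
    precision = agreed / len(b)
    recall = agreed / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative annotations: two agreements, one disagreement each.
gold = {(0, 2, "PERSON"), (5, 7, "LOC"), (9, 10, "PERSON")}
other = {(0, 2, "PERSON"), (5, 7, "LOC"), (12, 13, "LOC")}
print(round(balanced_f_score(gold, other), 3))  # -> 0.667
```

The "balanced" F-score weights precision and recall equally (F1), which makes the measure symmetric in practice when both annotators label comparable amounts of text.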
Process
OCR Engine → XML → Clean-up → XML → Extraction → XML

• Clean-up: UTF-8 conversion, re-tokenization, token IDs, mark-up of noise.
• Extraction: specialist person, generalist person, specialist place, and generalist place grammars.
Clean-up
• Convert to UTF-8
• Separate trailing whitespace and punctuation from tokens
• Add IDs to tokens
• Mark up noise:
– Marginal notes
– Quotes
– Unusual characters
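The clean-up steps above can be sketched in a few lines. This is illustrative code, not the authors' implementation; the ID scheme and punctuation set are assumptions.

```python
import re

def clean_tokens(raw_text):
    """Toy clean-up: split trailing punctuation into its own token,
    then attach an ID to every token (both details are assumptions)."""
    tokens = []
    for raw in raw_text.split():
        # Non-greedy word part, then any run of trailing punctuation.
        m = re.match(r"^(.*?)([.,;:!?]*)$", raw)
        word, punct = m.group(1), m.group(2)
        if word:
            tokens.append(word)
        if punct:
            tokens.append(punct)
    # Attach IDs after re-tokenization.
    return [(f"t{i}", tok) for i, tok in enumerate(tokens)]

print(clean_tokens("Earl of Warwick,"))
# -> [('t0', 'Earl'), ('t1', 'of'), ('t2', 'Warwick'), ('t3', ',')]
```

A real system would also need to protect abbreviations such as "Mr." from having their periods split off.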
Extraction: Four Grammars
• Specialist names (monarchs, earls, lords, dukes, churchmen), e.g. Earl of Warwick.
• Common names, e.g. Mr. Stratford.
• High-confidence place names, e.g. Town of London.
• Lower-confidence place names.
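The four grammars could be approximated with patterns like the following. These regexes are toy stand-ins built from the slide's examples, not the paper's actual rules:

```python
import re

# Toy versions of the four grammars, listed from higher to lower
# precision. Patterns are illustrative, not the authors' grammars.
GRAMMARS = [
    ("specialist-person", re.compile(r"\b(?:Earl|Duke|Lord|Bishop) of [A-Z]\w+")),
    ("common-person",     re.compile(r"\bMr\. [A-Z]\w+")),
    ("high-conf-place",   re.compile(r"\b(?:Town|City) of [A-Z]\w+")),
    ("low-conf-place",    re.compile(r"\b[A-Z]\w+shire\b")),
]

def extract(text):
    matches = []
    for name, pattern in GRAMMARS:
        for m in pattern.finditer(text):
            matches.append((name, m.group()))
    return matches

text = "The Earl of Warwick met Mr. Stratford near the Town of London."
for grammar, span in extract(text):
    print(grammar, "->", span)
```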
Extraction: Several Lexicons
• Male Christian names
• Female Christian names
• Surnames
• Earldom place names
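Lexicon lookup amounts to set membership per token. A minimal sketch; the entries below are invented examples, the real lexicons would be loaded from files:

```python
# Toy lexicons; every entry here is an illustrative example.
LEXICONS = {
    "male_christian_name":   {"John", "Thomas", "Richard"},
    "female_christian_name": {"Mary", "Elizabeth"},
    "surname":               {"Stratford", "Grover"},
    "earldom_place":         {"Warwick", "Essex"},
}

def lexicon_labels(token):
    """Return the name of every lexicon the token appears in."""
    return [name for name, entries in LEXICONS.items() if token in entries]

print(lexicon_labels("Warwick"))  # -> ['earldom_place']
print(lexicon_labels("Thomas"))   # -> ['male_christian_name']
```

The grammars can then condition on these labels, e.g. a specialist-person rule requiring an earldom place name after "Earl of".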
Extraction: Rule Application
• Apply higher-precision rules first:
– Specialist rules before generalist rules.
– Person rules before place rules.
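The precedence scheme can be sketched as follows: once a span is claimed by an earlier, higher-precision rule, later rules cannot relabel it. The rules themselves are toy stand-ins, not the paper's:

```python
import re

# Toy rules in precision order: specialist person before generalist place.
RULES = [
    ("person-specialist", re.compile(r"Earl of [A-Z]\w+")),
    ("place-generalist",  re.compile(r"\b[A-Z]\w+\b")),
]

def apply_rules(text):
    claimed = []  # (start, end, label) of accepted matches
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            # A later match is dropped if it overlaps an earlier one.
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _ in claimed)
            if not overlaps:
                claimed.append((m.start(), m.end(), label))
    return sorted(claimed)

text = "Earl of Warwick visited London"
for start, end, label in apply_rules(text):
    print(label, "->", text[start:end])
```

Here the generalist place rule would happily match "Warwick" on its own, but the specialist person rule has already claimed "Earl of Warwick", so only "London" is left for it.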
Output
Results
Conclusions: Good & Bad
• Good:
– Blind test set (the developers did not look at it).
– Inter-annotator agreement (IAA) was computed.
– Many entities were labeled.
• Bad:
– The reported IAA scores were apparently not recomputed after an annotation error was discovered (the annotations were corrected, but the IAA score was not).
– Why is extraction tied to XML processing?
Future Work
• Improve rules and lexicons.
• Correct more OCR errors with ML.
• Extract organizations and dates.
• Extract relations and events.
Successful Named Entity Recognition
The Inconvenience of Good Named Entity Recognition
Questions