IMPACT Final Conference - Gregory Crane
![Page 1: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/1.jpg)
OCR and the Transformation of the Humanities
Gregory Crane and David Bamman, Tufts University
Bruce Robertson, Mount Allison University
John Darlington and Brian Fuchs, Imperial College London
![Page 2: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/2.jpg)
Three basic changes
![Page 3: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/3.jpg)
Three basic changes
1. Transformation of scale of questions
![Page 4: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/4.jpg)
Three basic changes
1. Transformation of scale of questions
– Breadth and Depth
![Page 5: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/5.jpg)
Three basic changes
1. Transformation of scale of questions
– Breadth and Depth
2. Student researchers and citizen scholars
![Page 6: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/6.jpg)
Three basic changes
1. Transformation of scale of questions
– Breadth and Depth
2. Student researchers and citizen scholars
– Not enough professors and library professionals
![Page 7: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/7.jpg)
Three basic changes
1. Transformation of scale of questions
– Breadth and Depth
2. Student researchers and citizen scholars
– Not enough professors and library professionals
3. Globalization of cultural heritage
![Page 8: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/8.jpg)
Three basic changes
1. Transformation of scale of questions
– Breadth and Depth
2. Student researchers and citizen scholars
– Not enough professors and library professionals
3. Globalization of cultural heritage
– Not enough expertise in Europe + North America
![Page 9: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/9.jpg)
Towards Dynamic Variorum Editions
Gregory Crane and David Bamman, Tufts University
Bruce Robertson, Mount Allison University
John Darlington and Brian Fuchs, Imperial College London
![Page 10: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/10.jpg)
Thanks to …
• Digging into Data Phase 1
• National Endowment for the Humanities
• JISC (UK)
• SSHRC (Canada)
• National Science Foundation
• Mellon Foundation
• Google Digital Humanities
• Cantus Foundation
• German Research Foundation
![Page 11: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/11.jpg)
The Dynamic Variorum as grand challenge
![Page 12: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/12.jpg)
The Dynamic Variorum as grand challenge
• How do you build self-organizing collections?
![Page 13: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/13.jpg)
What is a variorum?
• Short for cum notis variorum, “with notes of different people”
![Page 14: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/14.jpg)
New Variorum Shakespeare Series
![Page 15: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/15.jpg)
New Variorum Shakespeare Series
![Page 16: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/16.jpg)
New Variorum Shakespeare Series
“New” = 140 years old
![Page 17: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/17.jpg)
New Variorum Shakespeare Series
“New” = 140 years old
“New” vs. 1821 Shakespeare Variorum
![Page 18: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/18.jpg)
Heinsius’ Claudian
![Page 19: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/19.jpg)
Heinsius’ Claudian
![Page 20: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/20.jpg)
Heinsius’ Claudian
![Page 21: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/21.jpg)
NVS 2011
![Page 22: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/22.jpg)
NVS 2011
![Page 23: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/23.jpg)
What was in the 1873 NVS Macbeth?
![Page 24: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/24.jpg)
What was in the 1873 NVS Macbeth?
• Index
![Page 25: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/25.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
![Page 26: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/26.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
![Page 27: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/27.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
![Page 28: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/28.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
![Page 29: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/29.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
![Page 30: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/30.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
![Page 31: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/31.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
![Page 32: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/32.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
![Page 33: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/33.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
![Page 34: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/34.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
![Page 35: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/35.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
![Page 36: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/36.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
![Page 37: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/37.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
![Page 38: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/38.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
![Page 39: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/39.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
• Bibliographies
![Page 40: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/40.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
• Bibliographies
![Page 41: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/41.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
• Bibliographies
• Running Text
![Page 42: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/42.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
• Bibliographies
• Running Text
• Multiple Versions
![Page 43: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/43.jpg)
What was in the 1873 NVS Macbeth?
• Index
• [Table of contents]
• Sources
• Adaptations
• General Topics
• Bibliographies
• Running Text
• Multiple Versions
• Annotations
![Page 44: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/44.jpg)
Brown’s Intermedia c. 1990
![Page 45: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/45.jpg)
The problem…
• Not feasible to summarize scholarship on any major canonical author by manual means
• An issue in 1665 and in 1905 but much worse now…
• How do we generate a Variorum edition from the very large collections that make this such a challenge? How do we make scale an advantage?
![Page 46: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/46.jpg)
Shakespeare as an easy case…
![Page 47: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/47.jpg)
Shakespeare as an easy case…
![Page 48: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/48.jpg)
Shakespeare as an easy case…
c. 500 years of English ….
![Page 49: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/49.jpg)
Greco-Roman World
![Page 50: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/50.jpg)
Greco-Roman World
From Rabat to Kandahar …
![Page 51: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/51.jpg)
c. 100 CE papyrus from Euclid (c. 300 BCE)
http://www.math.ubc.ca/~cass/Euclid/papyrus/papyrus.html
![Page 52: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/52.jpg)
800-1000 CE: Greek into Arabic
Hunayn Ibn Ishaq (809–873), Arabic version of the Prognosticon from the Hippocratic Corpus
http://www.nlm.nih.gov/exhibition/odysseyofknowledge/
![Page 53: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/53.jpg)
c. 1200-1300: Arabic into Latin
Medieval Translation of the Prognosticon from Arabic into Latin
![Page 54: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/54.jpg)
Return of Greek sources c. 1500
First edition of Dioscorides' Greek text, printed in Venice in 1499 by Aldo Manuzio (ca. 1447–1515)
![Page 55: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/55.jpg)
Status as of October 2011
• What do you do with a billion words?
– 2000 years of Latin
• How do you integrate data across languages?
– Projecting markup over noisy data
• How do you trace ideas?
– Detecting changes within and across languages
• How do you get the data you need?
– Customizing OCR for a pre-modern language
• How do you scale up your services?
– From workflows to Cloud-based design
![Page 56: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/56.jpg)
Disciplines and Speakers
• David Bamman, Tufts University
– Computational Linguistics
• Bruce Robertson, Mount Allison University
– Digital Classics
• Brian Fuchs, Imperial College London
– Software Engineering
![Page 57: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/57.jpg)
1. Computational Linguistics
David Bamman
Tufts University (Carnegie Mellon University)
United States
![Page 58: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/58.jpg)
Overview: Publications
– Bamman, David and Gregory Crane (2011), “Measuring Historical Word Sense Variation,” Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011). Nominee, Best Paper Award.
– Bamman, David, Alison Babeu, and Gregory Crane (2010), “Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection,” Proceedings of the 10th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2010). Winner, Best Paper Award.
– Bamman, David and David Smith (forthcoming), “Extracting Two Thousand Years of Latin from a Million Book Library,” Journal of Computing and Cultural Heritage.
![Page 59: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/59.jpg)
![Page 60: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/60.jpg)
2000+ Years of Latin
![Page 61: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/61.jpg)
Goal: Tracking Language Change
• Lexical change (new vocabulary, shifts in word sense)
• Syntactic change (including the influence of the author’s L1 on the Latin syntax)
• Topical change (the rise of new genres)
• Identifying the spread of variation across authors
![Page 62: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/62.jpg)
Corpus development
• Data source
– 1.2M books from the Internet Archive (snapshot of collection from 2009)
– 25,886 works catalogued as Latin
• Metadata problems
1. Language identification (many of these works are not Latin)
2. Historical date info (dates of publication != dates of composition)
![Page 63: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/63.jpg)
25,886 works catalogued as Latin in the IA, charted by “date.”
![Page 64: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/64.jpg)
![Page 65: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/65.jpg)
![Page 66: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/66.jpg)
Language ID
• Language ID to identify which of these works actually have Latin as a major language.
– Trained a language classifier (alias-i LingPipe) on:
  • 24 editions of Wikipedia
  • Perseus classical corpus
  • Known badly-OCR’d Greek in the IA
• Results
– 10,263 of 25,886 books catalogued as Latin are not recognizably so (mostly Greek)
– 6,790 books not catalogued as Latin in the 1.2M collection are in fact so (98% precision)
– Net: 22,413 Latin books containing 2.97 billion words
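The net figure follows from the two corrections: 25,886 − 10,263 + 6,790 = 22,413. The slide names alias-i LingPipe as the classifier; the sketch below is not LingPipe's API but a minimal, self-contained illustration of the general idea of character n-gram language identification. The class name, the add-one smoothing, and the two toy training strings are assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, padded string."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageID:
    """Toy character n-gram language classifier with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def score(self, language, text):
        """Log-probability of the text under the language's n-gram model."""
        v = len(self.vocab) + 1  # +1 leaves room for unseen n-grams
        counts, total = self.counts[language], self.totals[language]
        return sum(math.log((counts[g] + 1) / (total + v))
                   for g in char_ngrams(text, self.n))

    def classify(self, text):
        return max(self.counts, key=lambda lang: self.score(lang, text))


# Hypothetical toy training data; the real classifier was trained on
# Wikipedia editions, the Perseus corpus, and badly OCR'd Greek.
lid = NgramLanguageID()
lid.train("latin", "gallia est omnis divisa in partes tres quarum unam incolunt belgae")
lid.train("english", "all gaul is divided into three parts one of which the belgae inhabit")

# Net Latin books: catalogued as Latin, minus false positives, plus recovered books.
net_latin_books = 25886 - 10263 + 6790
```

Running every book's text through such a classifier, rather than trusting the catalogue's language field, is what allows both corrections on the slide: dropping mislabelled books and recovering unlabelled ones.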
![Page 67: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/67.jpg)
Composition dating
• With undergraduate students in Classics, established dates of composition for each Latin text. So far, considered 10,398 of them:
– 7,055 dated
– 3,343 excluded as not representative of language use, e.g., reference works (dictionaries, catalogues, lists of manuscripts)
• From these 7,055 works, we extract just the Latin to create a dated historical corpus of 389 million words.
![Page 68: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/68.jpg)
25,886 works catalogued as Latin in the IA, charted by “date.”
![Page 69: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/69.jpg)
7,055 Latin works in the IA, charted by date of composition.
![Page 70: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/70.jpg)
“America”
(1066)
![Page 71: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/71.jpg)
“de”
(2,955,462)
![Page 72: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/72.jpg)
“oratio”
![Page 73: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/73.jpg)
“lead” vs. “iron”
![Page 74: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/74.jpg)
Polysemy

| Lead | Iron |
|---|---|
| (verb) cause to go | (verb) to smooth w. an iron |
| (verb) be in command | (noun) element Fe |
| (noun) position of advantage | (noun) tool with flat steel base used to smooth clothes |
| (noun) chief part in play | (noun) golf club |
| (noun) element Pb | |
| (noun) graphite in pencil | |

Oratio
(noun) Speech
(noun) Prayer
Words have many senses.
![Page 75: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/75.jpg)
Measuring sense variation
Method: Train broad-coverage word sense disambiguation using aligned parallel texts
English/French (Diab and Resnik 02), English/Chinese (Chan and Ng 05, Ng et al. 03), English/Portuguese (Specia et al. 05), English/Vietnamese (Dinh 02).
Parallel text alignment
1. Identify translations (130 English translations manually identified by students from a representative range of dates)
2. Word align Latin text <-> English text (ca. 1.3M words)
3. Induce a sense inventory from the alignment
Word sense disambiguation
1. Train a WSD classifier on noisily aligned texts
2. Automatically classify remaining 387M words
3. Track lexical change
![Page 76: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/76.jpg)
WSD via parallel texts
• SMT based on Brown et al (1990)
• Different senses for a word in one language are translated by different words in another.
• “Bank” (English)
– financial institution = French “banque”
– side of a river = French “rive” (e.g., la rive gauche)
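As a rough illustration of how a sense inventory might be induced from word alignments, the sketch below groups the translations observed for each source word and keeps the frequent ones as candidate senses. The function name, the `min_count` threshold, and the toy pairs are assumptions for the example, not the project's actual procedure.

```python
from collections import Counter, defaultdict


def induce_senses(aligned_pairs, min_count=2):
    """Treat each sufficiently frequent translation of a source word as one candidate sense.

    aligned_pairs is an iterable of (source_word, target_word) tuples taken
    from a word-aligned parallel corpus.
    """
    translations = defaultdict(Counter)
    for source, target in aligned_pairs:
        translations[source][target] += 1
    return {
        source: [t for t, c in counts.most_common() if c >= min_count]
        for source, counts in translations.items()
    }


# Hypothetical alignment output for English "bank" in an English-French corpus.
pairs = [
    ("bank", "banque"), ("bank", "rive"), ("bank", "banque"),
    ("bank", "rive"), ("bank", "banc"),  # "banc" occurs once: likely alignment noise
]
inventory = induce_senses(pairs)
```

The frequency threshold is one simple way to cope with noisy alignments: a translation seen only once is more likely an alignment error than a distinct sense.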
![Page 77: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/77.jpg)
(Dynamic Lexicon)
![Page 78: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/78.jpg)
(Bootstrapping multilingual digital libraries)
+
Projecting XML markup across editions and translations (Bamman and Crane 2010)
1. Alignment of the source document with the target document in a cascading process: document -> sentence -> word
2. Projection of XML tags in the source document to the target document in a way that exploits the linguistic similarity of the text pair.
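The projection step can be sketched as follows, assuming the word alignment is already available as a mapping from source token positions to target token positions. The span representation, the function names, and the example tag are illustrative assumptions; the published method handles much noisier input.

```python
def project_spans(source_spans, alignment):
    """Project (tag, start, end) spans over source tokens onto target tokens.

    alignment maps a source token index to a list of aligned target token
    indices. A projected span covers the smallest target range that contains
    every token aligned to the source span.
    """
    projected = []
    for tag, start, end in source_spans:
        targets = [t for s in range(start, end + 1) for t in alignment.get(s, [])]
        if targets:  # spans with no aligned tokens cannot be projected
            projected.append((tag, min(targets), max(targets)))
    return projected


def render(tokens, spans):
    """Wrap projected spans in XML-style tags (assumes non-overlapping spans)."""
    opens = {start: tag for tag, start, _ in spans}
    closes = {end: tag for tag, _, end in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in opens:
            out.append(f"<{opens[i]}>")
        out.append(token)
        if i in closes:
            out.append(f"</{closes[i]}>")
    return " ".join(out)


# Hypothetical example: a place-name tag on the Latin side, projected to English.
latin = ["Gallia", "est", "omnis", "divisa"]
english = ["All", "Gaul", "is", "divided"]
alignment = {0: [1], 1: [2], 2: [0], 3: [3]}
latin_spans = [("placeName", 0, 0)]

english_spans = project_spans(latin_spans, alignment)
tagged = render(english, english_spans)
```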
![Page 79: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/79.jpg)
2. Parallel text alignment
• Sentence level: Moore’s Bilingual Sentence Aligner (Moore 2002)
– aligns sentences that are 1-1 translations of each other w/ high precision (98.5% on a corpus of 10K English-Hindi sentences)
• Word level: MGIZA++ (Gao and Vogel 2008)
– parallel version of: GIZA++ (Och and Ney 2003) - implementation of IBM Models 1-5.
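For readers unfamiliar with the IBM models, the simplest of them (Model 1) can be written in a few lines. This is a teaching sketch, not MGIZA++ or GIZA++: it omits the NULL word and everything in Models 2-5, and the three toy sentence pairs are invented for the example.

```python
from collections import defaultdict


def ibm_model1(sentence_pairs, iterations=10):
    """Estimate translation probabilities t(target | source) with EM (IBM Model 1).

    sentence_pairs is a list of (source_tokens, target_tokens) tuples.
    Returns a dict-like object keyed by (target_word, source_word).
    """
    target_vocab = {tw for _, tgt in sentence_pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform start

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        # E-step: distribute each target word over the source words in its sentence.
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate the translation probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# Hypothetical toy corpus (Latin -> English).
pairs = [
    (["villa", "magna"], ["large", "farm"]),
    (["villa", "parva"], ["small", "farm"]),
    (["oratio", "magna"], ["large", "speech"]),
]
t = ibm_model1(pairs)
```

Even on three sentences, the model learns that "villa" goes with "farm" and "magna" with "large", because those pairings recur across sentences while the others do not.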
![Page 80: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/80.jpg)
3. Sense induction
![Page 81: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/81.jpg)
4. WSD Training
Source word: oratione (oratio)
Sense label: prayer
Training context: ad spem pertinent, quae in … dominica continentur
![Page 82: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/82.jpg)
5. WSD Classification
• For all words without an aligned translation, use the surrounding context to determine the most likely sense.
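One of the classifiers evaluated on the following slides is a Bayes classifier. A minimal bag-of-words naive Bayes sketch of context-based sense classification is given below; the class name, the add-one smoothing, and the toy contexts (only loosely modelled on the training example above) are assumptions, not the project's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Toy naive Bayes sense classifier over bag-of-words contexts."""

    def __init__(self):
        self.sense_counts = Counter()            # sense -> number of training examples
        self.word_counts = defaultdict(Counter)  # sense -> context word counts
        self.vocab = set()

    def train(self, examples):
        """examples is a list of (context_words, sense_label) pairs."""
        for context, sense in examples:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(context)
            self.vocab.update(context)

    def classify(self, context):
        n = sum(self.sense_counts.values())
        v = len(self.vocab)

        def log_prob(sense):
            total = sum(self.word_counts[sense].values())
            lp = math.log(self.sense_counts[sense] / n)  # prior
            for word in context:  # add-one smoothed likelihoods
                lp += math.log((self.word_counts[sense][word] + 1) / (total + v))
            return lp

        return max(self.sense_counts, key=log_prob)


# Hypothetical training contexts for "oratio", labelled via aligned translations.
wsd = NaiveBayesWSD()
wsd.train([
    ("ad spem pertinent dominica".split(), "prayer"),
    ("deus precor dominica".split(), "prayer"),
    ("senatu habuit cicero".split(), "speech"),
    ("forum cicero dixit".split(), "speech"),
])
```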
![Page 83: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/83.jpg)
5a. WSD static evaluation

| System | villa | pastor | miles | scientia | oratio | Average |
|---|---|---|---|---|---|---|
| 5-gram LM | 54.8% | 69.2% | 90.2% | 73.7% | 61.4% | 69.9% |
| 6-gram LM | 58.3% | 61.5% | 91.2% | 65.8% | 63.8% | 68.1% |
| Bayes | 63.5% | 62.3% | 92.6% | 70.2% | 48.0% | 67.3% |
| Token Unigram LM | 63.5% | 62.4% | 92.6% | 70.2% | 48.0% | 67.3% |
| Token Bigram LM | 64.3% | 62.4% | 92.6% | 70.2% | 48.8% | 67.7% |
| TF/IDF | 64.3% | 60.7% | 82.8% | 70.2% | 49.6% | 65.5% |
| KNN | 64.3% | 73.5% | 84.4% | 63.2% | 40.1% | 65.1% |
| MFS Baseline | 60.9% | 66.7% | 92.6% | 79.0% | 60.6% | 72.0% |

• Created held-out test set of 105 instances of 5 Latin nouns with known shifts in meaning sampled uniformly from 21 centuries. Evaluated 7 different WSD classifiers + simple baseline of most frequent sense overall (MFS).
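The most-frequent-sense baseline and the accuracy measure used in the table are simple enough to state directly. The sketch below uses invented labels; only the definitions are taken from the slide.

```python
from collections import Counter


def most_frequent_sense(training_labels):
    """Return the single sense that occurs most often in the training data."""
    return Counter(training_labels).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Fraction of test instances whose predicted sense matches the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for one noun.
training_labels = ["speech"] * 6 + ["prayer"] * 4
gold = ["speech", "prayer", "speech", "speech", "prayer"]

mfs = most_frequent_sense(training_labels)
mfs_accuracy = accuracy([mfs] * len(gold), gold)
```

Because the baseline always predicts the same sense, it can score well on static accuracy whenever one sense dominates the test set, which is why the table alone does not settle which system is most useful.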
![Page 84: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/84.jpg)
5b. WSD time series evaluation
![Page 85: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/85.jpg)
5b. WSD time series evaluation
![Page 86: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/86.jpg)
5b. WSD time series evaluation
• Evaluated via mean square error between gold standard time series and automatically classified one.

| System | villa | pastor | miles | scientia | oratio | Average |
|---|---|---|---|---|---|---|
| 5-gram LM | .056 | .034 | .052 | .044 | .137 | .065 |
| 6-gram LM | .053 | .053 | .052 | .022 | .022 | .040 |
| Bayes | .047 | .060 | .055 | .040 | .228 | .086 |
| Token Unigram LM | .047 | .060 | .055 | .044 | .230 | .086 |
| Token Bigram LM | .047 | .060 | .055 | .044 | .230 | .087 |
| TF/IDF | .037 | .050 | .049 | .040 | .189 | .073 |
| KNN | .101 | .028 | .054 | .039 | .248 | .094 |
| MFS Baseline | .228 | .170 | .014 | .091 | .338 | .178 |
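A sketch of the time series comparison, assuming each series records the proportion of instances carrying a given sense in each period (the slide does not spell out how the series are built, so that representation is an assumption, as are the toy labels):

```python
def sense_proportions(labels_by_period, sense):
    """Proportion of instances labelled with `sense` in each time period."""
    return [labels.count(sense) / len(labels) for labels in labels_by_period]


def mean_square_error(gold_series, predicted_series):
    """Mean of squared differences between two equally long time series."""
    if len(gold_series) != len(predicted_series):
        raise ValueError("series must cover the same periods")
    squared = [(g - p) ** 2 for g, p in zip(gold_series, predicted_series)]
    return sum(squared) / len(squared)


# Hypothetical gold labels for "oratio" in three successive periods.
gold_labels = [
    ["speech", "speech"],
    ["speech", "prayer"],
    ["prayer", "prayer"],
]
gold_series = sense_proportions(gold_labels, "prayer")
predicted_series = [0.1, 0.5, 0.8]  # made-up classifier output
mse = mean_square_error(gold_series, predicted_series)
```

A baseline that always predicts one sense yields a flat series, so it cannot follow a shift in usage; that is consistent with MFS having the highest average error in the table despite its strong static accuracy.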
![Page 87: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/87.jpg)
“oratio”
![Page 88: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/88.jpg)
6. Tracking lexical change: “oratio”
![Page 89: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/89.jpg)
Acknowledgments
• This work was supported by grants from:
– The Digging into Data Challenge (“Towards Dynamic Variorum Editions”)
– The National Science Foundation (IIS-910884, “Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR”)
– The National Endowment for the Humanities (PR-50013-08, “The Dynamic Lexicon: Cyberinfrastructure and the Automated Analysis of Historical Languages”)
• Thanks are also due to research assistants Alison Darling, Elise Goodman-Tuchmayer, Daniel Libatique, Lee Marmor, John Owen and Erin Shanahan.
![Page 90: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/90.jpg)
2. Digital Classics
Bruce Robertson
Mount Allison University
Canada
![Page 91: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/91.jpg)
Digitizing and Viewing Difficult Texts: Lessons From Ancient Greek
• The 19th century provides a vast array of editions of Greek text, many still very useful
– Yet they could not be accessed digitally
• What tools and workflows might help us digitize diverse texts such as these?
• What applications can we create to make the resulting OCR data useful to researchers and students?
![Page 92: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/92.jpg)
Diversity of 19th Century Fonts and Layout
![Page 93: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/93.jpg)
![Page 94: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/94.jpg)
Character Classification and the Modern Undergraduate
• Performing optical character recognition requires a great deal of 'training'
• This is perfectly suited to the undergraduate researcher
• Ph.D. student asks: "why isn't this part of the beginning Greek curriculum?"
• It introduces students to the beauty and heritage of the typography of their subject
• It immediately engages them in a vital research project
(True of all languages where learning a new character set is a preliminary skill)
![Page 95: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/95.jpg)
Results
![Page 96: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/96.jpg)
![Page 97: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/97.jpg)
http://www.youtube.com/watch?v=OIjaq7ds2J8
![Page 98: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/98.jpg)
![Page 99: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/99.jpg)
Lessons Learned
• Undergraduates provide excellent middle-tier academic labour
• Shared dictionary data will be fundamental to a cloud-based approach
• Include as many languages as possible from the beginning
![Page 100: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/100.jpg)
Future Work
• Continue to improve Greek OCR engine based on 'Gamera'
• Integrate visualization tools that aid students of the language
• Implement many dictionaries: English, French, Latin, etc.
• Integrate other crowd-sourcing opportunities so interested viewers can:
– Verify or correct dubious OCR results
– Identify the grammar or syntax of words
![Page 101: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/101.jpg)
3. Software Engineering
John Darlington, Brian Fuchs
Imperial College London
United Kingdom
![Page 102: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/102.jpg)
ICL’s role in DVE
• High-throughput infrastructure for
– OCR for Greek and Latin
– Text-based feature extraction
• E-Science utility computing infrastructure
• High-level functional interfaces for e-Science.
![Page 103: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/103.jpg)
DVE: Context at SCG
• E-Science Frameworks
– Grid
– Cloud
– Parallel Processing
– Functional / Declarative approaches
• Internet services and economics
– Healthcare
– Music
– Mobile Applications
– Transport
![Page 104: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/104.jpg)
OCR parallel challenge
• The key to OCR at scale is minimising the need for eyeballs.
• i.e. “ground truth”: manual checking against the original.
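Where a manually checked ground-truth sample does exist, OCR quality is commonly summarised as a character error rate. The sketch below is a generic illustration using Levenshtein distance; it is not taken from the project's tooling, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance to the ground truth, normalised by ground-truth length."""
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical OCR output with one misread character (v read as u).
cer = character_error_rate("arma uirumque cano", "arma virumque cano")
```

Because every ground-truth character has to be typed or checked by a person, a measure like this can only be computed on a small sample, which is the bottleneck the slide describes.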
![Page 105: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/105.jpg)
Rapid OCR using MapReduce + a Cloud IaaS
![Page 106: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/106.jpg)
Infrastructure: State of play
– 6-node static Hadoop testbed
– 160-node Eucalyptus cluster on old Opteron chips
– 20 dual quad-core machines with 16TB storage on fibre
– Stack assembled and deployed
– Initial training sets tested
![Page 107: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/107.jpg)
Throughput Infrastructure
• Boschetti Aligner
– OCR post-processing for Greek/Latin developed at PDL
– multiple sequence alignment dynamic algorithm (like BLAST, Clustal, Mr. Bayes)
– Bayesian classifier to select the most probable sequence of characters
– spell-checking filtered by OCR evidence
![Page 108: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/108.jpg)
Throughput Infrastructure
• MapReduce
  – = functional Map/Fold.
  – Made famous by Google, but developed by others.
  – Map, then reduce
  – Map: apply a function in parallel to a bunch of key/value pairs.
  – Reduce: apply a function in parallel to each group of similar k/v pair outputs from Map.
![Page 109: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/109.jpg)
Throughput Infrastructure
• MapReduce
  – E.g. count occurrences of words in docs
  – Map(docname, doc.txt) -> ‘mittitur’:1, ‘cura’:1
  – Reduce(word:count) -> ‘mittitur’:23, ‘cura’:10, …
![Page 110: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/110.jpg)
Throughput Infrastructure
• MapReduce
  – E.g. count occurrences of words in docs
  – Map:
    • Count the words in 1000 documents (in parallel)
    • Map(docname, doc.txt) -> ‘mittitur’:1, ‘cura’:1
  – Reduce:
    • Group the output by word, and add up occurrences (in parallel)
    • Reduce(word:count) -> ‘mittitur’:23, ‘cura’:10, …
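The word-count example can be sketched in plain Python, with no cluster involved. This is only an illustration of the Map, group, Reduce pattern: the function names (`map_words`, `shuffle`, `reduce_counts`) and the two tiny sample documents are assumptions made for the sketch, not part of Hadoop or of the project's code.

```python
from collections import defaultdict
from functools import reduce

def map_words(docname, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in text.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between Map and Reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_counts(word, counts):
    """Reduce step: fold the counts for one word into a single total."""
    return word, reduce(lambda a, b: a + b, counts, 0)

# Hypothetical sample documents, standing in for OCR output files.
docs = {
    "doc1.txt": "cura mittitur cura",
    "doc2.txt": "mittitur",
}

# On a cluster, each map call and each reduce call could run on a different node.
mapped = [pair for name, text in docs.items() for pair in map_words(name, text)]
totals = dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())
print(totals)
```

Because no map call depends on another document and no reduce call depends on another word, both steps can be spread across as many nodes as are available.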
![Page 111: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/111.jpg)
Throughput Infrastructure
• Eucalyptus
  – Open Source Cloud Computing
  – UC, Santa Barbara spin-off
  – Compatible with Amazon EC2 / S3
  – Supported in Ubuntu as of 10.04.
![Page 112: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/112.jpg)
Throughput Infrastructure
• Hadoop
  – Apache Distributed File System for MapReduce jobs.
  – MapReduce Engine: co-ordinates MapReduce
![Page 113: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/113.jpg)
Cluster provisioning
• Create an image with the whole stack
• Deploy the image as many times as nodes are required
• Push required config data to the nodes
• Turn on
• Keep storage separate (i.e. don’t use HDFS to store data)
![Page 114: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/114.jpg)
OCR parallel methods
• Run parallel jobs on the same scans
• Score results
• Use highest score in the next round
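A minimal sketch of the run, score, select loop, assuming everything that is not on the slide: the three `config_*` functions are hypothetical stand-ins for real OCR configurations (they only mangle a string), `LEXICON` is a toy word list, and the score is simply the share of output words found in that list.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for OCR configurations: each turns a "scan" into text.
def config_a(scan):
    return scan.replace("u", "v")

def config_b(scan):
    return scan

def config_c(scan):
    return scan.replace("c", "e")

LEXICON = {"cura", "mittitur"}  # assumed toy dictionary

def score(text):
    """Share of output words that appear in the dictionary."""
    words = text.split()
    return sum(w in LEXICON for w in words) / len(words) if words else 0.0

def best_run(scan, configs):
    """Run every configuration on the same scan in parallel and keep the top scorer."""
    def run(cfg):
        return cfg(scan)
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(run, configs))
    scored = [(score(out), cfg.__name__, out) for cfg, out in zip(configs, outputs)]
    return max(scored)

winner = best_run("cura mittitur", [config_a, config_b, config_c])
print(winner)
```

The winning configuration name is what would be carried into the next round, for example as the starting point for trying further filters.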
![Page 115: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/115.jpg)
OCR parallel methods
• 3 different OCR engines per page
• x different filters per page
• x different filters per section of page.
= c. 30 runs per scan.
![Page 116: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/116.jpg)
OCR vote and error prediction methods
Courtesy: Federico Boschetti
![Page 117: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/117.jpg)
Alignment voting
• Map: Run three OCR engines/training sets on each page
  – Gamera
  – Tesseract: training set 1
  – Tesseract: training set 2
• Reduce:
  – spell check and compare
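The Reduce step might look roughly like the sketch below. It is a deliberate simplification and not the Boschetti aligner: it assumes the three outputs are already aligned token by token, whereas the real aligner works through multiple sequence alignment. `LEXICON` and the sample outputs are invented, and the rule of preferring a dictionary word over the majority reading is an assumption of this sketch.

```python
from collections import Counter

LEXICON = {"cura", "mittitur"}  # assumed toy spell-check word list

def vote(aligned_outputs):
    """Pick one token per position from outputs already aligned token by token."""
    result = []
    for candidates in zip(*aligned_outputs):
        counts = Counter(candidates)
        # Prefer tokens the spell checker accepts, then tokens more engines agree on.
        best = max(counts, key=lambda tok: (tok in LEXICON, counts[tok]))
        result.append(best)
    return " ".join(result)

# Invented outputs for one line of one page.
outputs = [
    "cura mittitvr".split(),  # e.g. Gamera
    "cvra mittitur".split(),  # e.g. Tesseract, training set 1
    "cura mittitvr".split(),  # e.g. Tesseract, training set 2
]
voted = vote(outputs)
print(voted)
```

In the second position two engines agree on a misreading, and the dictionary check is what lets the single correct reading win.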
![Page 118: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/118.jpg)
Training set voting
• Map: Run random pages on all available training sets.
• Reduce: Check against dictionary, and score.
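As a rough sketch of training set voting, the code below samples pages at random, runs each training set over the sample, and ranks the sets by mean dictionary score. The two "training sets" are hypothetical stand-in functions that operate on strings rather than page images, and `LEXICON`, the page contents, and the fixed seed are all assumptions for illustration.

```python
import random

LEXICON = {"cura", "mittitur", "et"}  # assumed toy dictionary

def dictionary_score(text):
    """Share of words in the text that appear in the dictionary."""
    words = text.split()
    return sum(w in LEXICON for w in words) / len(words) if words else 0.0

def rank_training_sets(pages, training_sets, sample_size, seed=0):
    """Map: recognise a random sample of pages with every training set.
    Reduce: average the dictionary scores per training set and rank them."""
    sample = random.Random(seed).sample(pages, sample_size)
    scores = {}
    for name, recognise in training_sets.items():
        page_scores = [dictionary_score(recognise(page)) for page in sample]
        scores[name] = sum(page_scores) / len(page_scores)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical stand-ins: "set1" misreads u as v, "set2" reads every page correctly.
training_sets = {
    "set1": lambda page: page.replace("u", "v"),
    "set2": lambda page: page,
}
pages = ["cura et cura", "mittitur", "et mittitur cura"]
ranking = rank_training_sets(pages, training_sets, sample_size=2)
print(ranking)
```

Scoring against a dictionary rather than against hand-checked text is what keeps this step free of manual ground truth.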
![Page 119: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/119.jpg)
Tiling
• Map: Run several filters over different parts of a page to compensate for local minima = blotches
• Reduce: score the output and compare.
![Page 120: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/120.jpg)
Why Eucalyptus?
• Scalable
  – Amazon/NGS hybrid possibilities
• Reusable
  – Very fast start-up/tear-down.
• Configurable
  – Quickly configure custom throughput clusters
![Page 121: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/121.jpg)
Why MapReduce?
• “Shared Nothing” architecture
• = suited to “dumb” processes like page OCR
Why Hadoop?
• Easy to integrate with other FS’s, e.g. S3
• Excellent customisation options
• Most flexible implementation of MapReduce (cf. GridGain)
![Page 122: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/122.jpg)
Why not MapReduce?
• Requires extensive refactoring.
• Only a subset of functional possibilities.
Why not Hadoop?
• Filesystem is slooooowwww….
• Resource intensive.
• Headnode is a bottleneck…
![Page 123: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/123.jpg)
Challenges for the future
• Feature Extraction, e.g.
  – Named Entities
  – Part of Speech tagging
  – Multi-lingual alignment
• Iteration is hard with distributed systems!
![Page 124: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/124.jpg)
Conclusions
![Page 125: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/125.jpg)
Three conclusions
• Increased intellectual range
![Page 126: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/126.jpg)
Three conclusions
• Increased intellectual range
  – Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
![Page 127: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/127.jpg)
![Page 128: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/128.jpg)
Plato’s Republic and the Guardians
The Islamic Republic of Iran and the Guardianship of Islamic Jurists
![Page 129: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/129.jpg)
Sometimes Greek philosophy does have an impact...
Plato’s Republic and the Guardians
The Islamic Republic of Iran and the Guardianship of Islamic Jurists
![Page 130: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/130.jpg)
Three conclusions
• Increased intellectual range
  – Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
• Cultural heritage -> network of cultures
![Page 131: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/131.jpg)
Three conclusions
• Increased intellectual range
  – Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
• Cultural heritage -> network of cultures
  – We share Greco-Roman Cultural Heritage
![Page 132: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/132.jpg)
Students of Greek and Latin
![Page 133: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/133.jpg)
Students of Greek and Latin
![Page 134: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/134.jpg)
Students of Greek and Latin
![Page 135: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/135.jpg)
How do we work together?
![Page 136: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/136.jpg)
Three conclusions
• Increased intellectual range
  – Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
• Cultural heritage -> network of cultures
  – We share Greco-Roman Cultural Heritage
• Decentralized Lab Culture in the Humanities
![Page 137: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/137.jpg)
Three conclusions
• Increased intellectual range
  – Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
• Cultural heritage -> network of cultures
  – We share Greco-Roman Cultural Heritage
• Decentralized Lab Culture in the Humanities
  – Even/esp. hard subjects need contributions from student researchers and citizen scholars
![Page 138: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/138.jpg)
Student Researchers
Tufts
![Page 139: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/139.jpg)
Student Researchers
Tufts
Holy Cross
![Page 140: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/140.jpg)
Student Researchers
Tufts
Furman
Holy Cross
![Page 141: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/141.jpg)
Student Researchers
Tufts
Furman
Holy Cross
Houston
![Page 142: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/142.jpg)
Student Researchers
Tufts
Furman
Holy Cross
Mount Allison
Houston
![Page 143: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/143.jpg)
Huge Open Collections
• Provide the net public with physical access to unprecedented bodies of cultural heritage
• Researchers and automated systems provide initial intellectual access BUT…
• These alone cannot succeed without student researchers and citizen scholars
![Page 144: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/144.jpg)
Three basic changes
1. Transformation of scale of questions
   Breadth and Depth
2. Student researchers and citizen scholars
   Not enough professors and library professionals
3. Globalization of cultural heritage
   Not enough expertise in Europe + North America
![Page 145: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/145.jpg)
We can (if we choose) transform our ability to advance the intellectual life of society
![Page 146: IMPACT Final Conference - Gregory Crane](https://reader037.vdocument.in/reader037/viewer/2022102922/546fd655b4af9f0e648b462b/html5/thumbnails/146.jpg)
Thank you!