bratislava ws - depuydt - inl - lexicon building_pdf

A gentle introduction to lexicon building and lexicon application

Katrien Depuydt (Institute for Dutch Lexicology, Leiden)

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Upload: impact-centre-of-competence

Post on 11-May-2015



Outline

- What is a lexicon
- Lexica in IMPACT
- Lexicon building and lexicon application tools
- Results so far with focus on Dutch

IMPACT workshop, Bratislava, May 7, 2010 2

What is a lexicon?

Lexicon vs. electronic dictionary (1)

An electronic dictionary is:
- of course, digitized full text (no images)
- primarily for human use
- ideally searchable, with explicitly (XML) tagged information: lemma, part of speech, meaning, quotations etc.

Example: online Oxford English Dictionary

Dictionary XML (example)

Lexicon vs. electronic dictionary (2)

A computational lexicon is:
- of course, in structured digital format (XML, relational database)
- primarily for use in computer applications
- provided with explicitly coded information (e.g. lemma, part of speech, morphology, semantics, syntax…)

Used for (for instance):
- Linguistic annotation
- Enhanced retrieval (basic: inflected forms; advanced: synonyms etc.)
- Syntactic parsing, machine translation


Lexica in IMPACT

The OCR lexicon

An OCR lexicon is:
- a verified list of words in a language
- based on a corpus, dated to enable relevant selection
- preferably with frequency information
- preferably from the same period/text type as the documents you want OCR'd (selection!)

OCR lexicon example

From WNT attestation lexicon    From DBNL historical corpus
absoluut 8                      wechgerukt 5
absoluyt 2                      wechgeschickt 6
absoluyter 1                    wechgeven 6
absolveren 3                    wech-gevoerde 11
absolverende 1                  wechgevoerde 14
absorbeeren 1                   wech-gevoert 59
absorbeert 1                    wechgevoert 98
absorberen 1                    wechgeworpen 21
absorptie 3                     wechghenomen 12
absoute 2                       wechghevoert 7
abstineeren 1                   wechginck 5
abstinencie 1                   wechloopen 6
abstinentie 2                   wechneemt 11
abstineren 1                    wechneme 6
abstrackheyt 1                  wech-nemen 20
abstract 7                      wechnemen 74
abstracta 1                     wechneminge 12
abstracte 7                     wech-neminge 6
abstracten 4                    wechrapen 6
abstractheid 1                  wechrucken 6
abstractie 1                    wechruiming 7
abstractiën 1                   wecht 7
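A frequency-annotated word list of this kind can be derived from a corpus in a few lines. The following is a minimal sketch only: the tokenizer regex and the name `build_frequency_list` are illustrative assumptions, not part of the IMPACT tools, and the result is a raw list that still needs the verification step an OCR lexicon requires.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, allowing internal hyphens/apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def build_frequency_list(texts, min_freq=1):
    """Count lower-cased word types over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(text))
    # Keep types at or above the frequency threshold, sorted alphabetically.
    return sorted((w, n) for w, n in counts.items() if n >= min_freq)

# Toy input, not taken from the DBNL corpus.
sample = ["Hy heeft het wech-gevoert.", "Wechgevoert en wech-gevoert"]
freq_list = build_frequency_list(sample)
for word, freq in freq_list:
    print(word, freq)
```

Hyphenated and unhyphenated spellings are kept apart here, as in the DBNL column above.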

The IR lexicon

Main information categories:
- wordforms (list of words), plus
  - frequency information
  - quotations (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (// dictionary entry) assigned to spelling variants and morphological variants of the same word

The modern lemma forms are the main search keys for retrieval. This is standard practice in corpus linguistics and modern historical lexicography.
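The information categories above can be pictured as a small data structure in which every written form points back to its modern lemma. This is a minimal sketch: `LexicalEntry` and `build_index` are hypothetical names, and the fields only loosely mirror the categories listed here, not the actual IMPACT database schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    lemma_id: int
    modern_lemma: str
    pos: str
    written_forms: set = field(default_factory=set)

def build_index(entries):
    """Map each written form to the modern lemmata it has been assigned to."""
    index = defaultdict(set)
    for entry in entries:
        for form in entry.written_forms:
            index[form.lower()].add(entry.modern_lemma)
    return index

# Example entry: historical form 'tuyld' filed under the modern lemma 'aantuilen'.
entry = LexicalEntry(219490, "aantuilen", "VRB", {"tuyld"})
index = build_index([entry])
print(index["tuyld"])
```

A set of lemmata per form is used because one historical spelling may belong to more than one lemma.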

<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen:Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>

How to build and apply these lexica?

Lexicon building

Build a lexicon with the aims of
- benefiting OCR and OCR postcorrection
- improving retrieval, by building a lexicon of variants with the modern lemma as the main entry key

What this involves:
- Tools for lexicon building
- Tools for using the lexicon (lexicon deployment) for enrichment
- Lexicon cookbook
- Best practice and tools to use lexica in OCR

!!! No lexicon will ever contain all variants found in historical text

Types of variation (orthographical and other)

uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

I (most of these can be dealt with by means of patterns)

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

II (some of these can be dealt with by patterns and/or fuzzy matching, others can only be handled by explicit listing)

The “hypothetical” vs. the witnessed lexicon (1)

Mechanisms
- to extend the lexicon
- to assess the plausibility of “hypothetical” words without previous attestations, i.e. words we have not seen before.

The “hypothetical” vs. the witnessed lexicon (2)

Unknown inflected forms of registered lemmata: automatic expansion from the lemma to the full paradigm of word forms: paradigmatic expansion or reverse lemmatization

New spellings of known words can be dealt with by developing a good model of the historical spelling. (The database structure provides for the storage of orthographic variant patterns.)

Previously unseen compounds can be dealt with by means of a good model of word formation. (work scheduled for 2010)


[Diagram. Labels: Transformation Patterns; Witnessed Modern Word; Historical Variant 1; Historical Variant 2; Hypothetical Modern Word; Virtual lexicon of generated word forms]

What is needed for lexicon building

- Build models of linguistic variation (inflection, orthography)
- Collect variants

Approach
- Cycle: model helps to construct lexicon, and vice versa (induction of rules/patterns)
- Combination of manual work and computational linguistics
- Lexicon building toolkit to support development, containing both computational linguistic tools and tools supporting manual work


Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt

Spelling variation tools (pattern-based)

Language-independent approach:
- Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …
- Pattern weights are computed from example material

Additional approaches possible:
- Use of aligned data (parallel historical text and modern version)
- Unsupervised pattern weighting (=~ text profiling from TR5)
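Supervised pattern induction can be illustrated with a character alignment of each word pair. This is a sketch under stated assumptions, not the IMPACT implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in aligner, the word pairs are toy data, and the induced patterns are minimal edits (a/e rather than aa/ae).

```python
from collections import Counter
from difflib import SequenceMatcher

def induce_patterns(pairs):
    """Count character-level rewrites (modern -> historical) over word pairs."""
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Record the differing substrings as one pattern.
                counts[(modern[i1:i2], historical[j1:j2])] += 1
    return counts

def pattern_weights(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Toy training pairs (modern word, historical word).
pairs = [("jaar", "jaer"), ("haar", "haer"), ("zee", "see")]
weights = pattern_weights(induce_patterns(pairs))
for (modern_part, historical_part), w in weights.items():
    print(f"{modern_part}/{historical_part}: {w:.2f}")
```

Relative frequency is only one possible weighting; the slide does not say how the weights are computed from the example material.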


Lemmatization

Reduction of historical word forms to modern lemma:

Historical word → (1: pattern matching) → standard (“modern”) spelling → (2: lemmatizer) → lemma form
Dystels → (1) → distels → (2) → distel

When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.

But:
1) We will not have full form information for many lemmata (especially the historical ones)
2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
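The two steps can be sketched as candidate generation followed by lexicon lookup. Everything named here is a hypothetical illustration: the rewrite table `PATTERNS` and the toy full-form lexicon `FULL_FORMS` are invented for the example, and the real tools also weight the patterns.

```python
from itertools import product

# Hypothetical historical -> modern rewrite options (illustrative only).
PATTERNS = {"y": ["y", "i"], "ae": ["ae", "aa"], "ck": ["ck", "k"]}

# Toy modern full-form lexicon: word form -> lemma.
FULL_FORMS = {"distel": "distel", "distels": "distel"}

def modern_candidates(word):
    """Step 1: generate possible modern spellings of a historical word."""
    word = word.lower()
    options, i = [], 0
    keys = sorted(PATTERNS, key=len, reverse=True)  # try longest pattern first
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                options.append(PATTERNS[key])
                i += len(key)
                break
        else:
            # No pattern applies here: keep the character unchanged.
            options.append([word[i]])
            i += 1
    return {"".join(parts) for parts in product(*options)}

def lemmatize(word):
    """Step 2: look the candidates up in the full-form lexicon."""
    return {FULL_FORMS[c] for c in modern_candidates(word) if c in FULL_FORMS}

print(lemmatize("Dystels"))
```

The lookup in step 2 only works for forms present in the full-form lexicon, which is exactly the limitation points 1) and 2) describe.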

Lemmatization and reverse lemmatization

We also need a lemmatization process for these situations.
- A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form.

But:
- Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)

We adopt the solution of actually building a “hypothetical modern full form lexicon” containing the most plausible paradigmatic expansions of lemmata.

This construction is carried out by means of a statistical reverse lemmatizer.
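The idea of reverse lemmatization can be sketched as learning suffix rewrites from known (lemma, form) pairs and applying the most frequent ones to a new lemma. This is a deliberately crude sketch, not the statistical reverse lemmatizer used in the project: the function names and training pairs are assumptions, and a realistic model would at least condition on part of speech and on the ending of the lemma.

```python
import os
from collections import Counter

def suffix_rule(lemma, form):
    """Return (lemma_suffix, form_suffix) after stripping the common prefix."""
    prefix = os.path.commonprefix([lemma, form])
    return lemma[len(prefix):], form[len(prefix):]

def train(pairs):
    """Count how often each suffix rewrite occurs in the training pairs."""
    return Counter(suffix_rule(lemma, form) for lemma, form in pairs)

def expand(lemma, rules, top_n=2):
    """Generate hypothetical inflected forms using the most frequent rules."""
    forms = []
    for (old, new), _count in rules.most_common():
        if lemma.endswith(old):
            forms.append(lemma[: len(lemma) - len(old)] + new)
        if len(forms) == top_n:
            break
    return forms

# Toy training material (lemma, inflected form).
pairs = [("boek", "boeken"), ("stoel", "stoelen"),
         ("distel", "distels"), ("werken", "werkt")]
rules = train(pairs)
print(expand("wereld", rules))
```

The generated forms are hypothetical until they are attested in real text.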


Attestation

From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
- Automatic selection of candidate attestations
- Manual work: verification and correction

Two approaches
- Dictionary based (INL): Woordenboek der Nederlandsche Taal
- Corpus based (LMU, INL): Dutch DBNL corpus

IMPACT Dictionary Attestation Tool

Lexicon building at work: verifying attestations in historical dictionaries

Task: find the variants of a headword as they occur in the quotations

Headword: work

Quotations, containing the variants:
• We are working on what works.
• Depart from me, ye that worke iniquity.
• She worcketh knittinge of stockings.

IMPACT Dictionary Attestation Tool

Task: find the variants of a headword as they occur in the quotations

Automatically (preprocessing)
• match literally, e.g. work: work, Work
• match using existing lexica and lists, e.g. work: works, worked, wrought
• approximate matching, e.g. work: worke

By hand (using the tool)
• correct automatic mismatches, e.g. works: words, worms
• find missed matches, e.g. work: worketh, wrowght

[Diagram labels: electronic historical dictionary; database with lemmata and quotations]
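The automatic preprocessing cascade can be sketched as three checks per token, tried in order. This is an illustration under assumptions: the function name, the similarity measure (`difflib.SequenceMatcher.ratio`) and the threshold of 0.8 are not taken from the tool itself.

```python
import re
from difflib import SequenceMatcher

def find_attestations(headword, known_forms, quotation, threshold=0.8):
    """Label each token in a quotation that plausibly attests the headword."""
    hits = []
    for token in re.findall(r"\w+", quotation):
        low = token.lower()
        if low == headword:
            hits.append((token, "literal"))
        elif low in known_forms:
            hits.append((token, "lexicon"))
        elif SequenceMatcher(None, headword, low).ratio() >= threshold:
            hits.append((token, "approximate"))
    return hits

# Known inflected forms of 'work', as on the slide.
KNOWN = {"works", "worked", "wrought"}

print(find_attestations("work", KNOWN, "We are working on what works."))
print(find_attestations("work", KNOWN, "Depart from me, ye that worke iniquity."))
print(find_attestations("work", KNOWN, "She worcketh knittinge of stockings."))
```

With this threshold the third quotation yields no candidate, so "worcketh" is a missed match of the kind that has to be found by hand in the tool.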

IMPACT Attestation Tool

[Screenshot of the tool. Callouts: lemma headword; quotations, sorted by uncertainty; up-to-date overview of what is done and needs to be done; done by this user so far]

IMPACT Lexicon Tool

Task: find and verify attestations in a historical corpus

Automatically (preprocessing = apply lemmatizer)
• match literally, e.g. work: work, Work
• match using existing lexica and lists, e.g. work: works, worked, wrought
• match using the spelling variation module, e.g. uiterlijk: uyterlick

By hand (using the tool)
• assign correct lemma, e.g. was (N) / zijn (V)
• group tokens belonging together, e.g. konings zoon: koningszoon
• select attestations

Corpus-based lexicon building: IMPACT Lexicon Tool

General vocabulary vs. named entities

- Tools for lexicon building described so far: applicable to general lexicon
- Tools for NE recognition, classification and variant matching
  - library requirement
  - distinguish general vocabulary from NEs
  - avoid unpleasant mixups like Abimelech / apemelk! (b/p; i/e; e/0; k/ch)

Improvement of state of the art / innovation

- We use existing computational linguistic approaches, but figure out how to apply them to historical language
- We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
  - Data selection and acquisition
  - Manual work
  - Computational linguistics tools

Some results so far with focus on Dutch

Measuring results for Dutch

We use the ground truth data developed in the project:
- Evaluation of EE tools
- Evaluation of lexicon coverage
- Evaluation of lexicon usage in IR (2010)
- Evaluation of OCR and lexicon usage in OCR (2010)
- Evaluation of benefit of lexicon building for OCR (for which type of material / quality of OCR does this make sense) (2010-11)

Dutch ground truth data

Type and genre                   # words
Gold Standard Book               300k
Random Set Book                  340k
Random Set Staten Generaal       2.5M
Gold Standard Staten Generaal    500k
Gold Standard Newspapers 1       3.4M
Gold Standard Newspapers 2       170k
Random Set Newspapers            3.2M
total                            13.1M

Efficiency of lexicon building

Dictionary-based lexicon building using historical dictionary: Woordenboek der Nederlandsche Taal
- Lemmata: 220211, quotations: 1524366
- Tempo: 1725 quotations/hour; 231 lemmata/hour

Reverse lemmatization

- Reminder: build hypothetical (non-attested) word forms in a “quick and dirty” way to use in lemmatization and corpus-based lexicon building
- Using simple statistical algorithms and a simple approach to inflection

Results:

Lexicon                               Accuracy
Small Dutch lexicon (JVKlex)          96.6%
French lexicon (Morphalou)            99.4%
Polish lexicon, verbs (Morfologik)    98.7%

Lexicon coverage (1: ground truth books)

Lexicon                                                         Type coverage   Token coverage
1. Modern lexicon (e-Lex)                                       46%             76%
2. EE3.3                                                        56%             84%
1 + 2                                                           63%             89%
Type frequency list historical corpus, top 200K (freq >= 19)    70%             93%
Type frequency list historical corpus, top 500K (freq >= 5)     78%             95%
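The two measures differ in what they count: type coverage is the share of distinct word forms found in the lexicon, token coverage the share of running words. A minimal sketch with toy data (the function name `coverage` and the example sentence are illustrative assumptions):

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    counts = Counter(t.lower() for t in tokens)
    covered = [w for w in counts if w in lexicon]
    type_cov = len(covered) / len(counts)
    token_cov = sum(counts[w] for w in covered) / sum(counts.values())
    return type_cov, token_cov

# Toy text with two historical spellings missing from the lexicon.
tokens = "de werelt is de wereld en de weerelt".split()
lexicon = {"de", "is", "en", "wereld"}
type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Token coverage comes out higher because frequent words such as "de" are covered and count once per occurrence, which is the same effect visible in the tables.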

Lexicon coverage (2: gt newspapers 18th-19th c.)

Lexicon                                           Type coverage   Token coverage
1. Modern lexicon (e-Lex)                         40%             83%
2. EE3.3                                          41%             84%
1 + 2                                             51%             89%
Type frequency list historical corpus, top 200K   52%             93%
Type frequency list historical corpus, top 500K   62%             95%

Lexicon coverage (3: gt Parl. Papers 19th c.)

Lexicon                                      Type coverage   Token coverage
1. Modern lexicon (e-Lex)                    51%             89%
2. EE3.3                                     47%             88%
1 + 2                                        58%             93%
Type frequency historical corpus, top 200K   59%             96%
Type frequency historical corpus, top 500K   68%             97%

Lexicon coverage (4: gt Parl. Papers 20th c.)

Lexicon                                      Type coverage   Token coverage
1. Modern lexicon (e-Lex)                    70%             93%
2. EE3.3                                     66%             93%
1 + 2                                        76%             96%
Type frequency historical corpus, top 200K   74%             97%
Type frequency historical corpus, top 500K   81%             98%

Lexicon coverage (5: Genesis, 1637 bible)

Lexicon                                      Type coverage   Token coverage
1. Modern lexicon (e-Lex)                    31%             61%
2. EE3.3                                     62%             83%
1 + 2                                        65%             89%
Type frequency historical corpus, top 200K   76%             97%
Type frequency historical corpus, top 500K   87%             98.6%

Lexicon coverage (6: Hooft, historiën)

Lexicon                                      Type coverage   Token coverage
1. Modern lexicon (e-Lex)                    26%             67%
2. EE3.3                                     47%             88%
1 + 2                                        50%             90%
Type frequency historical corpus, top 200K   44%             93%
Type frequency historical corpus, top 500K   58%             96%

Conclusion from this evaluation

The evident next step for Dutch lexicon building is corpus-based work.

First target: cover the top 200000 from the historical corpus.
– Contains 97885 types not in the witnessed historical EE3.3 lexicon
– Roughly 24% of these are covered by the modern lexicon
– Roughly 25% are names
– This leaves about 45000 common words to look into.

Measuring effect of lexicon use in IR

- Example: improved recall for retrieval in a historical corpus of about 150 million tokens. Searching for wereld using only the modern lexicon yields 23396 hits; using the current EE3.3 lexicon we get 34339 hits.
- Simple IR will be part of the demonstrators
- Hard to evaluate IR results proper without special datasets
- We have measured up to now either lemmatization or modern to historical word form matching accuracy
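The recall gain comes from expanding a query on the modern lemma to all word forms the lexicon lists for it. A minimal sketch with toy data; `expand_query` and `count_hits` are hypothetical helpers, and only the two hit counts in the last lines are taken from the slide.

```python
def expand_query(lemma, lemma_to_forms):
    """Return the modern lemma plus all variants registered for it."""
    return {lemma} | lemma_to_forms.get(lemma, set())

def count_hits(tokens, forms):
    """Count corpus tokens matching any of the given word forms."""
    return sum(1 for token in tokens if token.lower() in forms)

# Toy lexicon and corpus; the variants appear in the 'wereld' list of this talk.
lemma_to_forms = {"wereld": {"werelt", "weerelt", "waereld"}}
corpus = "De werelt ende de wereld ende de waereld".split()

modern_only = count_hits(corpus, {"wereld"})
expanded = count_hits(corpus, expand_query("wereld", lemma_to_forms))
print(modern_only, expanded)

# Relative gain for the figures reported on the slide.
gain = (34339 - 23396) / 23396
print(f"{gain:.1%} more hits")
```

For the reported figures this amounts to roughly 47% more hits for wereld.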


Lemmatization

- Combination of lookup, matching of spelling variation, reverse lemmatization
- As yet no good evaluation set for IMPACT (current work)
- Evaluation on “type” level
- We will use other material here (1637 Genesis, 97144 tokens)

Approach
- Restrict to “ordinary words” (no names, numbers, clitic combinations)
- Ambiguous lemmatization (context is not used) (avg. 5 suggestions per word)
- Ranking based on frequency and pattern weights

Result

6265 distinct types. 5991 (95.7%) had at least one correct suggestion. Average rank of correct suggestions: 1.23

– 5222 types found in current EE3.3 (83%)
– 65 additional types in modern lexicon
– 49 types without any match
– 969 types (15%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions

Real and hypothetical lexicon coverage (Hooft, historiën)

Result (again restricting to ‘ordinary’ words): 36332 distinct types. Avg rank of correct suggestions: 1.23

– 20087 types found in current EE3.3 (55%)
– 1061 additional types in modern lexicon
– 2411 types without any match (7%)
– 12773 types (35%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions (probably about 75% of the highest-ranking approximate matches are correct)

Evaluation of TR results

Using Finereader SDK (version 9)
- External dictionary interface for experimentation
- Not completely straightforward how to apply this:
  - Translation of corpus frequencies to weights on a scale 0-100
  - Other details: hyphenated words, case-sensitivity, …
  - Workaround to circumvent the long s problem

Lexicon data used
- Corpus-based type-frequency list
- EE3.2 deliverable lexicon
- Finereader internal lexicon
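The slide does not say how corpus frequencies were translated to the 0-100 scale. One plausible choice, shown here purely as an assumption, is logarithmic scaling relative to the most frequent word, so that rare words still receive a non-zero weight. The function name is hypothetical and no Finereader SDK call is involved.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight in 0..100 on a log scale."""
    if freq <= 0:
        return 0
    if max_freq <= 1:
        return 100
    # log(freq + 1) keeps a frequency of 1 above zero; max_freq maps to 100.
    return round(100 * math.log(freq + 1) / math.log(max_freq + 1))

# Toy frequencies, with 1000 as the highest frequency in the list.
for freq in (1, 5, 98, 1000):
    print(freq, frequency_to_weight(freq, 1000))
```

A linear mapping would push almost all historical variants to weight 0, which is why a log scale is used in this sketch.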

OCR evaluation

1. Character accuracy
2. Word accuracy
3. In case of block alignment problems, a simple alternative is bag-of-words accuracy

1. and 2. presuppose a good alignment of OCR with ground truth.

We will use word accuracy, or the simpler alternative 3. when there are alignment problems
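The difference between measures 2 and 3 can be sketched as follows. The definitions are simplified assumptions for illustration (the project's evaluation tooling is not shown in the talk): word accuracy aligns the two token sequences in order, while the bag-of-words variant only compares word counts. The example strings are taken from the long s example later in this talk.

```python
from collections import Counter
from difflib import SequenceMatcher

def word_accuracy(truth, ocr):
    """Share of ground-truth words reproduced in order (needs alignment)."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)

def bag_of_words_accuracy(truth, ocr):
    """Share of ground-truth words found in the OCR output, ignoring order."""
    overlap = Counter(truth) & Counter(ocr)
    return sum(overlap.values()) / len(truth)

truth = "de tweede de stilste en veiligste".split()
ocr = "de tweede de ftillie en veiligde".split()

print(word_accuracy(truth, ocr))
print(bag_of_words_accuracy(truth, ocr))
# Reordered text blocks do not affect the bag-of-words measure.
print(bag_of_words_accuracy(truth, list(reversed(ocr))))
```

Because it ignores order, the bag-of-words measure stays usable when text blocks are misaligned, at the price of not noticing words that end up in the wrong place.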


OCR results

Lexicon configurations:
A. With ABBYY internal Dutch lexicon
B. With combination of corpus-based historical lexicon and EE3.2 deliverable (case insensitive, taking hyphenation into account)
C. With combination of corpus-based historical lexicon and EE3.2 deliverable, improved deployment

Dataset                                                          A       B       C
DPO35 (word accuracy)                                            88.8%   90.9%   94.4%
Parliamentary papers, 1826-27 selection (bag of words recall)    90.9%   94.9%   94.9%

‘The Book’

“Kort begrip der waereld-historie voor de jeugd”, J.F. Martinet, minister at Zutphen, from 1789.

Why this book?
- Representative font and amount of spelling variation etc. for late 18th century Dutch
- It has the “long s problem”: …. = stilste not ftilfte

The long s problem: an example

OCR at start of project:

A. De eerde was de gevaarlykflti om de verlei¬
ding aan 't Hof; de tweede de ftillie en veiligde;
de derde de zwaarde, daar hy byna drie millioenen
harde en onbefchaafde Menfchen beftieren moest.

Results April 2010:

A. De eerste was de gevaarlykste om de verlei-
ding aan 't Hof; de tweede de stilste en veiligste;
de derde de zwaarste, daar hy byna drie millioenen
harde en onbeschaafde Menschen bestieren moest.

Workaround: “integrated postcorrection”: tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon.

In this way we keep it from turning into “eerde”.
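The workaround can be sketched in two parts: derive the "f-spelled" variants that are offered to the OCR engine, and keep a table to map them back afterwards. This is a sketch under assumptions: the rule that every non-final s is read as f is a simplification, and all names are hypothetical rather than part of the actual deployment.

```python
def long_s_variant(word):
    """Replace every non-final 's' by 'f', mimicking how OCR reads a long s."""
    return word[:-1].replace("s", "f") + word[-1:]

def build_postcorrection_table(lexicon):
    """Map each f-spelled variant back to its real spelling."""
    table = {}
    for word in lexicon:
        variant = long_s_variant(word)
        # Skip variants that are real words themselves, to avoid false corrections.
        if variant != word and variant not in lexicon:
            table[variant] = word
    return table

def postcorrect(tokens, table):
    """Restore the real spelling of recognized f-spelled variants."""
    return [table.get(token, token) for token in tokens]

lexicon = {"de", "eerste", "stilste", "veiligste"}
table = build_postcorrection_table(lexicon)

# The keys of `table` are what the engine would be told to accept.
print(sorted(table))
print(postcorrect(["de", "eerfte", "ftilfte"], table))
```

Skipping variants that collide with real words is a design choice in this sketch, made to limit the false friends the postcorrection step could otherwise introduce.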

Future work

- Compound analysis
- Irregular historical white space use (“impacttok++”) (cf. attestations)
- Corpus based lexicon extension
- Testing and optimization with ground truth data
- Improve the TR lexicon by extending the IR lexicon and removing false friends from the DBNL-corpus based TR lexicon
- Continue work on the best way to deploy lexica in OCR, with help from ABBYY