Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Automated Assignment of Topics to OCRed Historical Texts Florian Fink, Christoph Ringlstetter, Klaus U. Schulz CIS - Center for Information and Language Processing University of Munich

Upload: impact-centre-of-competence

Post on 22-Nov-2014



Page 1: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Automated Assignment of Topics to OCRed Historical Texts

Florian Fink, Christoph Ringlstetter, Klaus U. Schulz

CIS - Center for Information and Language Processing

University of Munich

Page 2: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Motivation

Standard (modern) repositories in libraries
• documents come with metadata describing the subjects and topics covered in the texts (deep subject classification, e.g. UDC)
• subjects are often the primary key for bringing order to large repositories
• supporting users interested in particular fields

OCRed historical texts from the digitization tsunami
• mostly poor metadata, no subject classification
• no order on the whole collection, only keyword search
• no overview: what IS the collection about, what can I hope to find?

Can we automatically find subjects/topics covered?

Page 3: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Automated topic assignment

Task: automatically compute all topics/fields that adequately describe the contents of a given document, and add a hierarchical order to the topics.

Challenges
• huge number of topics and fields, encyclopedic coverage
• hierarchical order, from general fields to very specific topics:
  science -> mathematics -> algebra -> group theory -> permutation groups

Comparison: document classification
• small number of given disjoint fields (e.g., politics, science, sports, ...)
• task: find the best label(s) for the document

Not only a „replacement“ for manual topic assignment but

Page 4: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

New Visions!

• assigning topics to document parts on all levels of granularity (chapters, pages, paragraphs, ...)

• horizontal access – automated linking of documents and document parts using the topics found

• detecting „topic reuse“, parallelisms and differences across repositories and subrepositories

• time lines & trend analysis

• ……..

Page 5: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Method used

TopicZoom
• university spin-off founded by our group in 2008
• topic assignment to texts (head hunting, trend analysis, ...)

Background technology
• huge semantic net: 120,000 nodes (topics, persons, organizations, events, geographic locations, time periods)
• ordered as a directed acyclic graph
• topic names come with linguistic variants; many multi-word expressions
• German (main focus) and English

Free web service
• users send texts (manually or via an XML interface)
• receive the topics found in the texts
• ranked using two relevance scores
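The "directed acyclic graph" ordering can be illustrated with a minimal sketch. The toy net below is hypothetical (it reuses only the chain science -> ... -> permutation groups from the Challenges slide) and is not the TopicZoom data structure or API; it only shows how all more general topics of a node could be collected.

```python
from collections import deque

# Hypothetical toy fragment of a topic net: child -> list of parents.
# In a DAG a node may have more than one parent; this chain has one each.
PARENTS = {
    "permutation groups": ["group theory"],
    "group theory": ["algebra"],
    "algebra": ["mathematics"],
    "mathematics": ["science"],
    "science": [],
}

def ancestors(topic, parents=PARENTS):
    """Return all more general topics reachable from `topic`, nearest first."""
    seen, order = set(), []
    queue = deque(parents.get(topic, []))
    while queue:
        node = queue.popleft()
        if node in seen:  # a node reachable via two paths is listed once
            continue
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, []))
    return order
```

For example, `ancestors("group theory")` walks upward through algebra and mathematics to science.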

Page 6: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Example

Weight  Degree of Generality  Significance  Topic
1       8                     7.31492196    South Africa
1       4                     5.26957792    Elections
1       7                     4.60475280    African countries
1       6                     4.45792943    Africa
1       3                     3.91069886    Political events
1       2                     1.84870472    Politics

“The 2014 South African general election will be held on 7 May 2014 to elect a new National Assembly and new provincial legislatures in each province.” (Wikipedia)
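A minimal sketch of how such a result list might be filtered and ranked on the client side. The rows are the values from the example above; the `TopicHit` type, the function name, the cutoff 4.0 and the "greater or equal" comparison are illustrative assumptions, not part of the TopicZoom service.

```python
from typing import NamedTuple

class TopicHit(NamedTuple):
    weight: int
    generality: int
    significance: float
    topic: str

# Rows copied from the example slide.
HITS = [
    TopicHit(1, 8, 7.31492196, "South Africa"),
    TopicHit(1, 4, 5.26957792, "Elections"),
    TopicHit(1, 7, 4.60475280, "African countries"),
    TopicHit(1, 6, 4.45792943, "Africa"),
    TopicHit(1, 3, 3.91069886, "Political events"),
    TopicHit(1, 2, 1.84870472, "Politics"),
]

def select_topics(hits, threshold):
    """Keep hits with significance >= threshold, most significant first."""
    kept = [h for h in hits if h.significance >= threshold]
    return sorted(kept, key=lambda h: h.significance, reverse=True)

# Assumed cutoff of 4.0, chosen only for illustration.
top = [h.topic for h in select_topics(HITS, 4.0)]
```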

Page 7: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Questions asked

Can this technology be used to bring order to collections of OCRed historical texts?

• How is topic assignment affected by OCR errors?

• How is topic assignment affected by historical orthography?

• TopicZoom hierarchy („modern topics“) suitable for topics found in historical texts?

Page 8: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Historical corpus - Zedler lexicon

Johann Heinrich Zedler „Grosses vollständiges Universallexicon aller Wissenschafften und Künste“ (Great Complete Encyclopedia of All Sciences and Arts)

• largest and most famous 18th century German encyclopedia
• 64 volumes plus four supplements
• ca. 284,000 articles
• 63,000 two-column pages
• article sizes extremely unbalanced
• accessible on the web

Images (tif) received from Bavarian State Library

Page 9: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Experiment

• started with scans of 14 pages of Zedler
• prepared three versions:
  1. OCRed page (Finereader)
  2. ground truth
  3. ground truth with modernized orthography
• manually assigned topics to the 14 pages
• automated topic assignment for the three versions of each page
• looked at recall and precision obtained for the three page versions
• analysis of results and problems

OCR quality
• percentage of correctly recognized words (tokens): average 75.03%, for words of length > 3: 71.12%
• for OCR versus ground truth with modernized orthography: average 68.37%, for words of length > 3: 62.31%
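The slides do not say how the word accuracy was computed. One plausible sketch, assuming whitespace tokenization and an alignment of OCR tokens to ground-truth tokens with Python's `difflib.SequenceMatcher` (both are assumptions, as are the function name and the sample sentence):

```python
from difflib import SequenceMatcher

def word_accuracy(ocr_tokens, truth_tokens, min_len=0):
    """Percentage of ground-truth tokens longer than `min_len` characters
    that are matched exactly by an aligned OCR token."""
    matcher = SequenceMatcher(a=truth_tokens, b=ocr_tokens, autojunk=False)
    correct = set()
    for block in matcher.get_matching_blocks():
        # block.a is the start index in truth_tokens, block.size the run length
        correct.update(range(block.a, block.a + block.size))
    relevant = [i for i, tok in enumerate(truth_tokens) if len(tok) > min_len]
    if not relevant:
        return 0.0
    return 100.0 * sum(i in correct for i in relevant) / len(relevant)

# Hypothetical sample: one OCR error ("Zengen" for "Zeugen").
truth = "die Zeugen vor dem Gericht".split()
ocr = "die Zengen vor dem Gericht".split()
all_words = word_accuracy(ocr, truth)              # all tokens
long_words = word_accuracy(ocr, truth, min_len=3)  # tokens of length > 3
```

Restricting the count to longer words tends to lower the figure, since short function words are recognized more reliably, which matches the direction of the numbers reported above.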

Page 10: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Zedler manually assigned topics

Average: 25 topics assigned per page

• Main topic (lemma) „Zeugen“ (witnesses): law and justice, contracts, last will, marriage, rights, courts, judges, handicapped persons, laws, children, teenagers, corruption, civil law, childhood, adolescence.

• Several lemmata: peoples, plague, language, gypsies, eviction, paper production, hunting helpers, hunting, mines, mining, grammar, rhetoric, Zeugma (city), bridges, Roman Empire, Romans, Euphrates, Alexander the Great, nations, France, Spain, Netherlands.

• Main topic (lemma) historiography („giving witness“): history, historiography, historians, Heinrich Cornelius Agrippa, Jews, diluvian, genesis, Adam and Eve, biblical figures, Persia, Romulus and Remus, Jesus Christ, Arabs, Koran, Bible, Fables, Mecca, Mosques, The Franks, Christianity, Paganism, Plutarch.

• ………….

Page 11: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Recall – average values

[Figure: average recall for the three page versions (OCRed, ground truth, modernized ground truth) at TopicZoom significance thresholds 1.0, 0.6, 0.3 and 0.0; a 50% mark is labeled for each of the four thresholds.]

Recall: percentage of manually assigned topics found among the computed topics
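The recall definition can be written down as a small sketch. The thresholds 1.0, 0.6, 0.3 and 0.0 are the ones named on the slide; the topic sets, the significance values, the function name and the "greater or equal" comparison are made-up assumptions for illustration.

```python
def recall_at(manual, computed, threshold):
    """Share of manually assigned topics found among the computed topics
    whose significance is at least `threshold`."""
    manual = set(manual)
    if not manual:
        return 0.0
    kept = {topic for topic, sig in computed.items() if sig >= threshold}
    return len(manual & kept) / len(manual)

# Hypothetical data for one page.
manual_topics = {"law and justice", "contracts", "children", "teenagers"}
computed_topics = {
    "law and justice": 0.9,
    "contracts": 0.4,
    "childhood": 0.7,
    "rivers": 0.1,
}

recalls = {t: recall_at(manual_topics, computed_topics, t)
           for t in (1.0, 0.6, 0.3, 0.0)}
```

Lowering the threshold admits more computed topics, so recall can only stay equal or rise.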

Page 12: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Notion of recall not fully adequate

Often, for a missed topic, a closely related one is found in the answer set. E.g. page 1: the topics “children” and “teenagers” are missed, but “childhood” and “adolescence” are found. Intuitively, the “felt recall” is larger than the computed recall.

[Figure: manually assigned spatial areas vs. computed spatial areas. Recall: 20%]
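One way to make the "felt recall" idea concrete is a relaxed recall that also counts a manual topic as found when a closely related topic was computed. This is only a sketch of the intuition, not a measure used in the talk; the `RELATED` map and the function name are hypothetical.

```python
# Hypothetical relatedness map: manual topic -> topics accepted as near matches.
RELATED = {
    "children": {"childhood"},
    "teenagers": {"adolescence"},
}

def relaxed_recall(manual, computed, related=RELATED):
    """Recall where a manual topic counts as found if it, or one of its
    related topics, occurs among the computed topics."""
    manual, computed = set(manual), set(computed)
    if not manual:
        return 0.0
    hits = sum(
        1 for topic in manual
        if topic in computed or related.get(topic, set()) & computed
    )
    return hits / len(manual)

manual_page = {"children", "teenagers", "courts"}
computed_page = {"childhood", "adolescence", "courts"}

strict = relaxed_recall(manual_page, computed_page, related={})  # exact matches only
relaxed = relaxed_recall(manual_page, computed_page)             # near matches allowed
```

With an empty relatedness map the function reduces to ordinary recall, so the gap between the two values shows how much of the loss is due to near misses.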

Page 13: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Real problems for recall

• Very rare topics
  Zedler treats rare topics such as “civet” and “camphor” that are not represented in the TopicZoom semantic net.

• Changing world
  Topics from parts of the world that have dramatically changed: old professions, habits, techniques etc., e.g. “paper production”, “hunting helpers”, “perfume manufacture”, “potency means”, “brick oil”. Many old professions (“Drechsler”, turner) are now very popular family names.

Page 14: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Average precision values

[Figure: average precision for the three page versions (OCR, ground truth, ground truth with modernized orthography); each computed topic is classified as correct, questionable, or wrong.]

Threshold: significance 0.6
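With three judgement classes, precision becomes a distribution rather than a single number. A minimal sketch of computing the shares for one page; the labels are the three classes from the slide, while the function name and the sample judgements are made up.

```python
from collections import Counter

LABELS = ("correct", "questionable", "wrong")

def label_shares(judgements):
    """Percentage of computed topics judged correct, questionable, wrong."""
    counts = Counter(judgements)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}

# Hypothetical judgements for the topics computed for one page.
shares = label_shares(["correct", "correct", "questionable", "wrong"])
```

Whether "questionable" topics count toward precision is a choice of the evaluator, so reporting all three shares keeps that choice visible.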

Page 15: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Problems for precision

• Wrong time periods
  OCR had problems recognizing years -> wrong time periods assigned

• Wrong resolution of ambiguous words
  Words in the texts confused with the names of small villages -> several wrong topics

• Language changes beyond the level of orthography
  • e.g. “Flüsse” (rivers) used twice for liquids of the nose and the eyes -> several wrong topics (rivers and more general geographic objects)
  • e.g. “Verstopfung” (main modern meaning: constipation) referring to problems of the brain, nose, and ears (interpretation hardly found in modern texts) -> several wrong topics, all related to diseases of the digestive tract
  • e.g. “Blattern” used for a problem of the eyes. In modern language and in the TopicZoom net, “Blattern” is a synonym for “Pocken” (smallpox) -> several wrong topics

Page 16: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Summary

Unavoidable subjectivity of evaluation
• manually assigned topics
• classifying computed topics into correct, questionable, wrong
Do not rely primarily on the numbers! Form your own impression!

Automated topic assignment
• valuable and useful if some errors are considered acceptable
• insufficient if errors cannot be tolerated
• combination with social tagging (e.g., error elimination)?
Significant improvements, in particular for precision, would be possible with minor modifications of the underlying semantic net.

Page 17: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Future work

• extend empirical basis

• realize easy improvements

• combine with social tagging

• look at new visions
  • assigning topics to document parts
  • interlink documents based on topical similarity
  • detection of topic parallelism
  • time line analysis and topic trends

Page 18: Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

Thanks for your attention!

… special thanks to Bavarian State Library …