the cadastral register of the town wismar (1677 – 1838) structure mining in historical documents...
Post on 26-Mar-2015
217 Views
Preview:
TRANSCRIPT
The cadastral register of the town Wismar (1677 – 1838)
Structure Mining in historical documents
Manja NeliusMeike Klettke
Department of Computer Science Database Research Group
University of Rostock
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 2
Overview
1. Available documents Cadastral register Additional information
Index of persons List of abbreviations
2. Algorithms Overview of the approach Analysis of (implicit) structure, layout and of the regular parts Association of markup in the historical texts Rule-based semantic analysis Structuring of information
3. Results4. Conclusion & Literature
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 3
Motivation
In the Department of History at the University of Rostock (Prof. Papay)◦ Historical geographical information systems are developed,
underlying databases◦ Cadastral registers contain lots of information for these geographical
information systems (in texts)– Characteristics of historical texts:
– No unique orthography– Varying spelling (temporal and regional), also the spelling of proper
names varies– Influences from different languages– Usage of Latin words (for instances saint feasts (Heiligenfeiertagen) for
dates)– Varying use of capitalisation
◦ Inserting data in the databases is till now a manual process◦ With a diploma thesis: trying to automate this process
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 4
Available documents
Cadastral register of the town Wismar (1677 – 1838) ◦ Contains information about the history of the town◦ Reflects changes of the building development◦ Shows ownership structure
Similar Cadastral register are available for the towns Rostock and Stralsund◦ Different categories and different structures in the registers◦ Then-mayor (burgomaster) fixed the structure that‘s why it is different
in each town
2002 all Cadastral registers had been (manually) digitalised - aim: preserving of sources◦ That means no changes on the original texts◦ but addition of an index for persons and a list of usual abbreviations
Street
Categories
Neighbour objects
Alt Wismarsche Strasse vom thor her. Norderseite1 Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 1962 siehe Grundbuch Nr. 15.3 Haus4 jus protimiseos an dem dahinten belegenen Garten. Medard. 1673.5 ut No. praeced. siehe Grundbuch Nr. 15.Peter Koppe. adjud. et aedif. Tr. Regum 1657.Georg Gammelkern. empt. Medard. 1673.[…]
6 400 Rthlr. Joh. Christoff Müller. Clem. 1677. delirt Mattaei 1688.600 m. l. die Stadt Cämmerey t. Joh. mit 5 procent zu vertzinsen.[…]
Contig. Oldeböterstr. Ost Grundbuch Nr. 310.
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 6
List of persons
Information about a personDifferent alternative spellings of a family rnameReferences onto another spelling of a family nameFamily name with additional information about a person (first name)and the cadastral number (where the person is mentioned)
Gagzow, David 1207Gahde siehe GadeGahrtz, Gehrtz-, Agneta Cristiane Hedwig (Witwe, geborene Haase) 22, 29, 263
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 7
List of abbreviations
Abbreviations of phrases
Alternative abbreviations Explanations
infr.: infra (unten)Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember)intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen)inter., interim. : interimistisch
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 8
Overview
Combining analysis of layout, structure and texts Stepwise enrichment of the original documents with markup
(markup contains the meaning of the text fractions)
generation of XML documents (bottom up)
Storage in databases
HTMLTXTTXT
XMLXMLXML
DOC RTFRTF
Transformation of the input format
List of abbreviations
List of persons
Cadastral register
Analysis of layout and structure
Exploitation of regular expressions
Storage in database tables
Structuring of information
Mapping rules
Dictionaries
Semantic rules
Rules for structuring
Grammars
NormalisationRules for replacement
Semantic Analysis
Analysis of full texts
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 10
Analysis of layout and structure
Usage of layout characteristics◦ Bold and italic fonts (HTML-Markup)◦ Usage of X-Fetch Wrappers
Analysis of structural characteristics◦ Numbers at the beginning of rows determine the categories◦ cadastral item is divided into the different categories◦ Implementation uses the parser generator ANTLR
adding of XML-Markup into the
cadastral items
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 11
Exploitation of regular expressions
Some parts of the input documents are regular (list of abbreviations, list of persons, category 1 of cadastral items)
Example:
Grundbuch Nr. 16 ( 16 ), fol. 15v , neues Stadtbuch Nr. 196
A Grammar describes these expressions
adding of further XML-Markup into the
cadastral items
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 12
HTMLTXTTXT
XMLXMLXML
DOC RTFRTF
Transformation of the input format
List of abbreviations
List of persons
Cadastral register
Analysis of layout and structure
Exploitation of regular expressions
Storage in database tables
Structuring of information
Mapping rules
Dictionaries
Semantic rules
Rules for structuring
Grammars
NormalisationRules for replacement
Semantic Analysis
Analysis of full texts
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 13
Analysis of the historical full texts /1
Usage of 23 different dictionaries◦ Some of them generated from the list of persons (first names, family
names)◦ Some created by hand (streets, saint feasts, professions, stop words)
Same method that was used in the GETESS project, developed from the DFKI Saarbrücken, AG Prof. Uszkoreit
Determining the similarity between terms in the cadastral register and in the dictionaries ◦ with phonetic encoding and ◦ phonetic similarity search
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 14
Analysis of the historical full texts /2
phoneticencoding
Word distance with Levenshtein distance
Cadastral item dictionariesTerm 1
Term 2
phonetic encoding
Term 1‘ Term 2‘
Phonetic encoding norms all spellings that sound similar Example: Friedrich VRYEDRYCH Substitution rules for that had been defined (because that were
not available for historical texts)
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 15
HTMLTXTTXT
XMLXMLXML
DOC RTFRTF
Transformation of the input format
List of abbreviations
List of persons
Cadastral register
Analysis of layout and structure
Exploitation of regular expressions
Storage in database tables
Structuring of information
Mapping rules
Dictionaries
Semantic rules
Rules for structuring
Grammars
NormalisationRules for replacement
Semantic Analysis
Analysis of full texts
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 16
semantic analysis with rules
Association of terms to dictionaries:◦ unique◦ ambiguous (two or more meanings) semantic◦ no association found rules
Different semantic rules (with priorities) are applied In the rules information about the context are used
(predecessor –successor) Example:
◦ (a rule in informal description) ◦ token: number, has no associated meaning◦ predecessor token: date
meaning year is associated to the token
<Feiertag value="Jacobi" pos="6“/> <Jahr value="1693" pos="7"/>
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 17
HTMLTXTTXT
XMLXMLXML
DOC RTFRTF
Transformation of the input format
List of abbreviations
List of persons
Cadastral register
Analysis of layout and structure
Exploitation of regular expressions
Storage in database tables
Structuring of information
Mapping rules
Dictionaries
Semantic rules
Rules for structuring
Grammars
NormalisationRules for replacement
Semantic Analysis
Analysis of full texts
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 18
Structuring of information
Up to now: meanings are associated to terms of the cadastral register
Now: bottom up structuring of information Example:
<Eigentuemer>
<Person>
<Vorname value=„Peter" pos="1"/>
<Nachname value=„Koppe" pos="2"/>
</Person>
<Erwerbsart> …</Erwerbsart>
…
</Eigentuemer>
Process base on rules
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 19
HTMLTXTTXT
XMLXMLXML
DOC RTFRTF
Transformation of the input format
List of abbreviations
List of persons
Cadastral register
Analysis of layout and structure
Exploitation of regular expressions
Storage in database tables
Structuring of information
Mapping rules
Dictionaries
Semantic rules
Rules for structuring
Grammars
NormalisationRules for replacement
Semantic Analysis
Analysis of full texts
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 20
Result of this process
<Eintrag> <Grundbuchnummer> <GbNr value="16"/> <SchNr value="16"/> <FolNr value="15v"/> <SbNr value="196"/> ... </Grundbuchnummer> ... <GegQualitaet> <Bauwerk value="Haus" pos="1"/> </GegQualitaet> ... <Eigentuemer> <Person> <Vorname value="Peter" pos="1"/> <Nachname value="Koppe" pos="2"/> </Person> <Erwerbsart value="adjudicatio" pos="3"/> <Stoppwort value="et" pos="4"/> <Erwerbsart value="aedificatio" pos="5"/> <unbekannt value="Tr" pos="6"/> <unbekannt value="Regum" pos="7"/> <Zahl value="1657" pos="8"/> </Eigentuemer>
... <Eigentuemer> <Person> <Vorname value="Georg" pos="1"/> <Nachname value="Gammelkern" pos="2"/> </Person> <Erwerbsart value="emptio" pos="3"/> <unbekannt value="Medard" pos="4"/> <Zahl value="1673" pos="5"/> </Eigentuemer> <Eigentuemer> <Person> <Vorname value="Johan" pos="1"/> <Nachname value="Faber" pos="2"/> </Person> <Erwerbsart value="emptio" pos="3"/> <Zeitangabe> <Wochentag value="Veneris" pos="4"/> <Feiertag value="Jacobi" pos="6"> <Zeitbezug value="ante" pos="5"/> </Feiertag> <Jahr value="1693" pos="7"/> </Zeitangabe> </Eigentuemer> ...
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 21
Instead of a benchmark
01020304050607080
tokens
with
unique
markup
tokens
without
markup
token
with two
or more
meanings
after full textanalysis
afterapplication ofsemantic rules
Semantic rules assign numbers a meaning (most number are four-digit numbers: year or cadastral numbers)
Solves ambiguous meanings (first name vs. date, family name vs. profession)
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 22
Conclusion
Approach bases on ◦ extensible rules and◦ dictionaries
flexible tool Extraction of information is realised in a stepwise process Check mechanism supports error detection An evaluation of the results is difficult because we cannot
compare the results of the process with correct results
All methods that are represented here had been developed in the diploma thesis from Manja Nelius
Future work:◦ Semantic analysis and information structuring in one step:◦ Matching between XML documents with incomplete markup and a
schema
Dagstuhl, 4.-8.12.2006 Manja Nelius, Meike Klettke
University of RostockFolie 23
Literatur
Ernst Münch: Das Wismarer Grundbuch ( 1677/80 - 1838 ), Verlag Schmidt-Römhild, Rostock, 2002
Hans-Jürgen Martin, Geschichtlicher Abriß der Rechtschreibung, http://www.schriftdeutsch.de/orth-his.htm, 2004
Justin Zobel and Philip W. Dart: Phonetic string matching: lessons from information retrieval, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval table of contents, 1996
Justin Zobel, Philip W. Dart, Finding Approximate Matches in Large Lexicons, Software --- Practice and Experience, Volume 25, Number 3, 1995
Norbert Fuhr: Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, http://www.is.informatik.uni-duisburg.de/projects/rsnsr/, 2005
M. Abolhassani and N. Fuhr and N. Gövert, Information Extraction and Automatic Markup for XML documents, in Intelligent Search on XML Data. Applications, Languages, Models, Implementations, and Benchmarks, ed. Henk M. Blanken and Torsten Grabs and Hans-Jörg Schek and Ralf Schenkel and Gerhard Weikum, Lecture Notes in Computer Science, Vol 2818, 2003
Arnaud Sahuguet, Fabien Azavant: Building Light-Weight Wrappers for Legacy Web Data-Sources using W4F, 25th Conference on Very Large Database Systems, Edingurgh, UK, 1999
Terence Parr: ANTLR -- ANother Tool for Language Recognition, http://www.antlr.org, 2004 X-Fetch Suite, Republica Corporation, www.x-fetch.com Kai-Uwe Sattler, Stefan Conrad, Gunter Saake, Datenintegration und Mediatoren, In Web &
Datenbanken, d.punkt Verlag, Heidelberg, 2003
top related