Lexicography and computer science: a harmless drudgery?
Judith Knapp, Andrea Abel
European Academy Bozen - Bolzano
Content
• Learners' Difficulties and Needs
• Pedagogical Lexicography Today: A Short Overview
• ELDIT: Linguistic-lexicographic Background & Live Demo
• Data Model
• Implementation
• Content Authoring
• ELDIT and WordManager
• ELDIT and the TreeTagger
• Literature
• Conclusion
Learners' difficulties and needs
Problems with foreign language use (decoding and encoding)
Problems:
• Syntagmatic level
• Paradigmatic level
• Semantic level
PROBLEMS WITH SYNONYMS AND SIMILAR WORDS
Italian words for "meeting":
• convegno
• riunione
• incontro
• assemblea
Ex:
• assemblea condominiale (condominium meeting)
• assemblea d'affari (business meeting)
DIFFICULTIES WITH WORD COMBINATIONS
Collocations: fixed combinations of words (arbitrary, unpredictable)
Ex:
• to brush one's teeth
• lavarsi i denti
• sich die Zähne putzen
Grammatical constructions: formed according to the rules of grammar, partly arbitrary
Ex:
• to ask sb sth
• chiedere qlco a qlcu
• jemanden etwas fragen
Problems with dictionary use
Metalanguage:
• Abbreviations
• Technical terms
• Other "codes"
• Descriptive language
ABBREVIATIONS
Italian    German     Meaning
agg.       Adj.       adjective
art.       Art.       article
tr.        tr.        transitive verb
determ.    best.      definite article
pron.      Pron.      pronoun
femm.      w./Fem.    feminine
ant.       veralt.    archaic
volg.      vulg.      vulgar
region.    landsch.   regional
mus.       Mus.       music
sociol.    Soziol.    sociology
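TECHNICAL TERMS
Italian          German            Meaning
aggettivo        Adjektiv          adjective
articolo         Artikel           article
ausiliare        Hilfsverb         auxiliary verb
transitivo       transitiv         transitive
determinativo    bestimmt          definite
pronome          Pronomen          pronoun
femminile        weiblich          feminine
antico           veraltet          archaic
volgare          vulgär            vulgar
dialetto         landschaftlich    dialect / regional
musica           Musik             music
sociologia       Soziologie        sociology
Categories of terms: grammar, language variation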
OTHER "CODES"
International Phonetic Alphabet (IPA) or other transcription systems
• focus
• shake
• chiesa [chiè-sa]
Syntactic information (valency) provided in coded or abbreviated form
Ex.:
(a) geben; [...] Vt j-m etw. g (Langenscheidt)
(b) give 2 Vnn, Vn (Cobuild)
(c) dare 17. N-V-N1 (N2/a N3) (Blumenthal/Rovere)
UNDERSTANDING THE DEFINITION...
"I have to look in the dictionary to find out what a virgin is. [...] The dictionary says: virgin, woman (usually young) who is and remains in a state of untouched chastity. Now I have to look up untouched and chastity, and all I find here is that untouched means the opposite of touched, and chastity means chaste, and that means free from unlawful sexual intercourse. Now I have to look up intercourse [...] and I don't know what that means, and I am simply tired of being sent from one word to another in the heavy dictionary like a complete idiot, and all because the people who wrote the dictionary don't want the likes of us to learn anything. I only want to know where I came from, but when you ask someone, they tell you to ask someone else, or they send you from word to word." (McCourt 1998: 412-413; English rendering of the German translation)
Problems with dictionary use
Formal problems:
• Search
• Presentation
Problems with searching
• Time consuming
  - 2000 pages
  - Small characters
  - Difficult metalanguage
• Complex expressions
  - Collocations ("Zähne putzen")
  - Idiomatic expressions
• …
Problems with presentation
• Limited space
• Linear presentation order
• Organisation of the dictionary
• Organisation of the entries
Solutions
Pedagogical Dictionaries
Target group: language learners
Functions: encoding & decoding
General characteristics:
- (usually) monolingual
- selective regarding macrostructure (limited number of entries)
- exhaustive regarding microstructure (detailed information for each entry)
ELDIT
Elektronisches Lern(er)wörterbuch Deutsch-Italienisch
Dizionario elettronico per apprendenti Italiano-Tedesco
(Electronic learners' dictionary German-Italian)
http://www.eurac.edu/eldit
Three main characteristics:
1. typologically innovative:
• a monolingual dictionary (German or Italian): definitions, collocations, idiomatic expressions, examples … in the target language
&
• a bilingual dictionary (German and Italian): translation equivalents, explanations in L1
= a "cross-lingual" dictionary German-Italian
2. well-defined target group:
• beginners to intermediate students (Waystage level A1 up to Threshold level B1): basic vocabulary of ~ 3,000 entry words for each language
• addressed to the linguistic layman: limited use of metalanguage, abbreviations and symbols
3. designed solely for computer use:
• not a conversion of a paper dictionary into an electronic dictionary
• exploits the possibilities of the electronic medium (multimedia & hypertext)
• modular structure: contains detailed information that is usually spread across different types of dictionaries
[Diagram: learners' difficulties and needs, now with solutions. Solution labels: Simple, Use of L1, Multimedia; Definitions, Examples, ...; Electronic search possibilities; Hypertext and hyperlinks; Sound files, Verb patterns; Avoiding, Explaining]
SOLUTIONS ...
Descriptive language
1. Simple
2. Multiple descriptions
3. Hypertext
1. Simple =
a) Limited defining vocabulary
b) Easy syntax
d) Avoid circularity
2. Multiple descriptions =
a) Definitions
b) Lexicographic examples
c) Word fields
d) L1 (semantic equivalents)
[e) images]
Semantic Level:
Semantic information:
1. Definitions
2. Examples
3. Word fields
4. Equivalents
das Haus
Hypernyms: das Gebäude
Coordinates: das Haus, die Villa, das Schloss, die Wohnung ...
Kinds of ...: das Hochhaus, das Bauernhaus ...
1. a) Ein Haus ist ein Gebäude, in dem Menschen wohnen. (casa)
      Sie wohnt mit ihrer Familie in einem zweistöckigen Haus am Stadtrand.
   b) Ein Haus ist das Gebäude, in dem man ständig lebt und in das man regelmäßig zurückkehrt. Es ist der Ort, wo man daheim ist.
      Sie verlässt das Haus jeden Morgen um sieben Uhr, um zur Arbeit zu fahren.
2. Das Haus sind die Bewohner eines Hauses (1a). (casa)
....
3. Hypertext =
a) Click on unknown words inside the definition
b) Click on the semantic equivalents
c) Click on any information you're interested in
[Diagram repeated, now adding solutions for the syntagmatic level: Collocations, Examples, ...]
Syntagmatic level:
1. Collocations
2. Idiomatic Expressions
3. Verb Valency
Verb Valency
- Definition: "Valency refers to the capacity of a verb to take a specific number and type of arguments" (Bianco)
- Theoretical origin: dependency grammar (Lucien Tesnière)
Verb Valency: a problem for learners and researchers
• verb constructions are largely arbitrary and unpredictable
• number of obligatory and optional elements
• distinction between transitivity and intransitivity
• …
The description of verb valency in different dictionary types
1. General monolingual dictionaries
• fragen: [jemdn.] unvermittelt, ... etw. fragen (Duden Deutsches Universalwörterbuch)
• chiedere: v.tr. (2 argom.) (Disc)
• chiedere: v.tr. (Devoto/Oli)
2. Special mono- and bilingual verb valency dictionaries
• fragen: 01a v 1b C (Bianco)
• chiedere: N- V- N1 (N2/a N3) (Blumenthal/Rovere)
3. (Monolingual) learners' dictionaries
• fragen: Vt/i (j-n) (etw.) f. (Langenscheidt)
• fragen: tr K jd fragt jdn [nach etw dat] (Pons Basiswörterbuch)
• chiedere: tr. (Dib)
Description of Verb Valency in ELDIT
I. Learner-friendly description:
- Explicit way of describing verb valency
- Contrast with coded forms such as: N-V-N1-(N2); v.tr. (2 argom.); Vt/i (etw.) (über j-n/etw.) r.
II. Multimedia:
- Visualization of information to support comprehension (colors and animations instead of metalanguage)
III. Semiotic didactics:
Functions of the different colors:
- they indicate the parts of the sentence
- they show which parts of the verbs belong together
- they show the correspondence between patterns and examples
IV. Additional explanations for the learner:
- Visible notes to describe semantic restrictions
- Variations for realizing single parts of the sentence
[Diagram repeated, now adding solutions for the paradigmatic level: Lexical fields, Three-dimensional graphics]
PARADIGMATIC RELATIONS
• Word field theory:
"A word field is a group of words that are closely adjacent to one another in content and that, by virtue of their interdependence, mutually assign each other their functions." (Trier 1968/1973: 189, late definition; English rendering of the German original)
• Existing projects
- WordNet (GermaNet, Italian WordNet)
- Alexia
- Kirrkirr
Paradigmatic relations in ELDIT
• Ca. 150 words per language
• interactive graphic representation
• spatial arrangement and colors for the representation of paradigmatic lexical relations
• explicit description of the semantic relations between the lexical units and the lemma (no metalanguage)
• definitions and examples for describing similarities/differences of meaning, register, authentic context
Lexical fields in ELDIT
Types of meaning relations:
• hierarchical relations (hyperonymy/hyponymy; holonymy/meronymy)
• non-hierarchical relations (similarity: synonyms, quasi-synonyms …; contrast: gradable and non-gradable antonyms; converse terms)
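The idea of describing relations explicitly, without metalanguage, can be illustrated with a minimal sketch. It assumes a tiny word field for "das Haus" taken from the semantic-level example; the relation labels, the English sentence templates and the function name describe are hypothetical and are not ELDIT's actual data model.

```python
# Toy word field for "das Haus"; relation labels are internal and never shown to the learner.
WORD_FIELD = {
    "lemma": "das Haus",
    "relations": [
        ("das Gebäude", "hypernym"),
        ("die Villa", "coordinate"),
        ("das Hochhaus", "hyponym"),
    ],
}

# Hypothetical learner-facing templates that avoid technical terms.
TEMPLATES = {
    "hypernym": "{lemma} is a kind of {other}.",
    "hyponym": "{other} is a kind of {lemma}.",
    "coordinate": "{lemma} and {other} are similar things of the same kind.",
}


def describe(field):
    """Turn each (word, relation) pair into a plain sentence about the lemma."""
    return [
        TEMPLATES[relation].format(lemma=field["lemma"], other=other)
        for other, relation in field["relations"]
    ]


descriptions = describe(WORD_FIELD)
for sentence in descriptions:
    print(sentence)
```

Keeping the relation type as data and the wording in templates would let the same field be rendered in the learner's L1 or in the target language.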
Other modules
• Inflection
• Word family
• N.B.
Data model
• Needs for an innovative presentation
• A detailed data model
Implementation
– Hierarchically structured data
– Many changes were expected
– Communication with linguists
Use of XML
– XML and XML editor
  • Hierarchic structure
  • Communication with linguists
– Java Servlet technology
– DXML or JDOM
– Dynamic generation of HTML
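Dynamic generation of HTML from the XML data can be sketched as follows. The slides name Java servlets with DXML or JDOM; this sketch substitutes Python's standard xml.etree module purely for illustration. The function name example_to_html and the choice of <p> and <b> as output markup are assumptions; the <example>/<w> structure follows the XML shown in the authoring slides.

```python
import xml.etree.ElementTree as ET
from html import escape


def example_to_html(xml_text):
    """Render an <example> element as one HTML paragraph, bolding emphasized words."""
    root = ET.fromstring(xml_text)
    parts = []
    for w in root.iter("w"):
        token = escape(w.text or "")
        if w.get("style") == "emphasized":
            token = f"<b>{token}</b>"
        parts.append(token)
    # Tokens are joined with single spaces; punctuation spacing is not handled here.
    return "<p>" + " ".join(parts) + "</p>"


SAMPLE = '<example><w>Meine</w><w style="emphasized">haben</w><w>.</w></example>'
html_out = example_to_html(SAMPLE)
print(html_out)
```

Because presentation is derived from the data at request time, a change to the data model only requires changing the rendering function, not the stored entries.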
Content Authoring
• Content authoring
  – Difficult
  – Time consuming
  – Error prone
• In ELDIT:
  – Innovative presentation
  – Efficient interface (real-world system)
  – Research of linguists
“Efficient” Authoring Interface
• Semi-structured Data
• Automatic full-structuring
• Automatic enriching
Semi-structured Data
Automatic full-structuring
<example>
  <w>Meine</w>
  <w>Eltern</w>
  <w style="emphasized">haben</w>
  <w style="emphasized">das</w>
  <w style="emphasized">Haus</w>
  <w>vor</w>
  <w>50</w>
  <w>Jahren</w>
  <w style="emphasized">gebaut</w>
  <w>.</w>
</example>
<prebasuf>
  <article>die</article>
  <praefix>Be</praefix>
  <basis>haus</basis>
  <suffix>ung</suffix>
</prebasuf>
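A minimal sketch of the full-structuring step, assuming the author types a compact form such as "die Be_haus_ung" (the notation that appears in the derivation example of this deck) with exactly one article and three underscore-separated parts. The function name structure_prebasuf is hypothetical; this is not ELDIT's actual implementation.

```python
import xml.etree.ElementTree as ET


def structure_prebasuf(text):
    """Expand 'article prefix_base_suffix' into a fully structured <prebasuf> element."""
    article, word = text.split()  # assumes exactly two whitespace-separated tokens
    parts = word.split("_")
    if len(parts) != 3:
        raise ValueError("expected prefix_base_suffix, got: " + word)
    root = ET.Element("prebasuf")
    for tag, value in zip(("article", "praefix", "basis", "suffix"), (article, *parts)):
        ET.SubElement(root, tag).text = value
    return root


xml_text = ET.tostring(structure_prebasuf("die Be_haus_ung"), encoding="unicode")
print(xml_text)
```

The author only has to mark the morpheme boundaries; the element names are filled in mechanically, which is where the time saving would come from.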
Automatic Enriching
By using Computational Linguistics tools
• WordManager
• TreeTagger
• PhraseManager, WordNet, Parser, …
<derivation>
  <prebasuf>die Be_haus_ung</prebasuf>
  <translation>la dimora</translation>
</derivation>
<derivation id="de.n.haus.1.deriv2">
  <pattern id="de.n.haus.1.deriv2.patt0" base="Behausung" ctag="N" lexref="">
    <article base="der" ctag="art" lexref="de.g.articles.1.item1">die</article>
    <praefix explref="de.prae.h.be">Be</praefix>
    <basis>haus</basis>
    <suffix explref="de.suff.h.ung">ung</suffix>
  </pattern>
  <translation id="de.n.haus.1.deriv2.trans0">
    <w id="de.n.haus.1.deriv2.trans0.w0" type="content" base="il" ctag="art" lexref="it.g.articles.1.item2">la</w>
    <w id="de.n.haus.1.deriv2.trans0.w1" type="content" base="dimora" ctag="N" lexref="it.n.dimora.1">dimora</w>
  </translation>
</derivation>
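The enrichment of the translation element could look roughly like the sketch below. The TOY_LEXICON dictionary is a hypothetical stand-in for the output of the computational linguistics tools (lemma, category and entry reference per word form); the id scheme simply mirrors the pattern visible in the enriched XML above, and the function name enrich_translation is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table standing in for lemmatizer/tagger output.
TOY_LEXICON = {
    "la": {"base": "il", "ctag": "art", "lexref": "it.g.articles.1.item2"},
    "dimora": {"base": "dimora", "ctag": "N", "lexref": "it.n.dimora.1"},
}


def enrich_translation(text, parent_id, index=0):
    """Wrap each word of a plain translation in a <w> element with id, lemma and link."""
    trans_id = f"{parent_id}.trans{index}"
    trans = ET.Element("translation", {"id": trans_id})
    for i, token in enumerate(text.split()):
        # Unknown words keep their surface form and get an empty link.
        info = TOY_LEXICON.get(token.lower(), {"base": token, "ctag": "", "lexref": ""})
        w = ET.SubElement(trans, "w", {"id": f"{trans_id}.w{i}", "type": "content", **info})
        w.text = token
    return trans


enriched = enrich_translation("la dimora", "de.n.haus.1.deriv2")
print(ET.tostring(enriched, encoding="unicode"))
```

Words with an empty lexref are the natural candidates for later manual correction.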
ELDIT and WordManager
• WordManager
• WM Transducers
• WordManager in ELDIT
WordManager - 1992
– System for reusable morphological dictionaries
– Information about a word:
  • Inflection (declension and conjugation)
  • Word formation (derivation and composition)
  • Orthography (old and new for German)
  • …
– German, Italian, English
Lemmatizer: Häusern → haus (Cat N)
Inflection Analyzer: Häusern → haus (Cat N)(Gender N)(Num PL)(Case Dat)
Inflection Generator: Haus →
  haus (Cat N)(Gender N)(Num SG)(Case Nom),
  hauses (Cat N)(Gender N)(Num SG)(Case Gen),
  häuser (Cat N)(Gender N)(Num PL)(Case Nom),
  häusern (Cat N)(Gender N)(Num PL)(Case Dat)
  …
Word Formation Analyzer: kennenlernen →
  kennen (Cat V)(Aux haben)
  lernen (Cat V)(Aux haben)
Word Formation Generator: bosco →
  abbracciabosco (Cat N)(Gen M)
  boscaglia (Cat N)(Gen F)
  boscaiolo (Cat N)(Gen M)
  …
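The three inflection services (lemmatizer, analyzer, generator) can be pictured as different queries over the same morphological data. The sketch below is a toy full-form table for "Haus" only; it is not WordManager's API, and the names Reading, analyze, lemmatize and generate are hypothetical.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    form: str
    lemma: str
    cat: str
    gender: str
    num: str
    case: str


# Toy full-form table; the genitive singular of "Haus" is "Hauses".
TABLE = [
    Reading("haus", "haus", "N", "N", "SG", "Nom"),
    Reading("hauses", "haus", "N", "N", "SG", "Gen"),
    Reading("häuser", "haus", "N", "N", "PL", "Nom"),
    Reading("häusern", "haus", "N", "N", "PL", "Dat"),
]


def analyze(word):
    """Inflection analyzer: all readings of a word form."""
    return [r for r in TABLE if r.form == word.lower()]


def lemmatize(word):
    """Lemmatizer: only lemma and category of each reading."""
    return sorted({(r.lemma, r.cat) for r in analyze(word)})


def generate(lemma):
    """Inflection generator: all listed forms of a lemma."""
    return [r.form for r in TABLE if r.lemma == lemma.lower()]


print(lemmatize("Häusern"))
print(analyze("Häusern"))
print(generate("Haus"))
```

A real system derives these forms from rules rather than listing them, but the query interface is the same idea.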
WM Transducers - 2000
WM in ELDIT
• Search (Lemmatizer)
• Links and additional examples (Lemmatizer)
• Exercises (Analyzer)
• Conjugation tables (Generator)
ELDIT and TreeTagger
• ELDIT Text Corpus
• Development
• Tagging
• Manual Corrections
Development
• MS Word (Goethe Institut of Milan)
• HTML
• Simple XML
Tagging
• POS tagging (→ TreeTagger)
• XML with links
• Iterative Correction by frequency of unlinked words
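Correcting iteratively by frequency means fixing the unlinked words that occur most often first. A minimal sketch, assuming the tagger output is available as (word, lemma) pairs and the dictionary as a set of known lemmas; the function name and the sample data are hypothetical.

```python
from collections import Counter


def unlinked_by_frequency(tokens, known_lemmas):
    """Count word forms whose lemma has no dictionary entry, most frequent first."""
    counts = Counter(
        word.lower() for word, lemma in tokens if lemma not in known_lemmas
    )
    return counts.most_common()


# Hypothetical tagger output: the lemma equals the surface form for unrecognized words.
tokens = [
    ("bel", "bel"), ("casa", "casa"), ("bel", "bel"), ("vuol", "vuol"),
    ("bel", "bel"), ("vuol", "vuol"), ("pur", "pur"),
]
worklist = unlinked_by_frequency(tokens, known_lemmas={"casa"})
print(worklist)
```

After each round of corrections the count is rerun, so the list shrinks from the top.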
Corrections
• Old German spelling rules valid until 1998
• The Italian verb “sono” (they are) was always tagged with “sonare” (=suonare, make music) instead of with “essere” (to be).
• The verb “sia” (he may be) was always recognized as a conjunction and tagged with “sia” (as well as) instead of with “essere” (to be).
• Many conjugated forms of “avere” were tagged with “riavere” (to get something back) instead of with “avere” (to have).
• Many conjugated forms of “andare” were tagged with “riandare” (to go back) instead of with “andare”.
• Abbreviated forms of Italian words (such as “bel”, “vuol”, “pur”, “fin”) were tagged as nouns and with the original form as lemma.
• Some Italian words which exist both as nouns and as past participles (such as the word “successo” (the success, it happened)) were tagged with the wrong word class.
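Systematic tagging errors of this kind lend themselves to rule-based post-correction. The sketch below covers only the lemma errors for "sono", "avere" and "andare" listed above; the rule tables, the "does not start with ri" heuristic and the function name correct_lemma are assumptions for illustration, not the procedure actually used in ELDIT.

```python
# Specific (word form, wrong lemma) pairs mapped to the intended lemma.
FORM_OVERRIDES = {
    ("sono", "sonare"): "essere",
}

# Lemmas the tagger wrongly prefers for forms that carry no "ri" prefix.
PREFIX_FIXES = {
    "riavere": "avere",
    "riandare": "andare",
}


def correct_lemma(word, lemma):
    """Return the corrected lemma for a tagged word, or the original lemma."""
    key = (word.lower(), lemma)
    if key in FORM_OVERRIDES:
        return FORM_OVERRIDES[key]
    # Heuristic: a form without "ri" cannot belong to the prefixed verb.
    if lemma in PREFIX_FIXES and not word.lower().startswith("ri"):
        return PREFIX_FIXES[lemma]
    return lemma


print(correct_lemma("sono", "sonare"))
print(correct_lemma("hanno", "riavere"))
print(correct_lemma("vanno", "riandare"))
```

Cases that depend on context, such as "sia" as verb or conjunction and "successo" as noun or participle, would need more than a lookup table.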
Literature
• http://www.eurac.edu/about/collaborators/JKnapp/index.htm
→ Publications (some linguistic ones, too)
→ PhD theses (Andrea Abel: Uni Innsbruck; Judith Knapp: Uni Hannover)
Conclusion
• syntagmatic, paradigmatic, pragmatic, polysemy, homography, homonymy, holonymy, hyponymy, hyperonymy, semiotic, ludativ, …
• Goal-based scenarios, blended learning, …
• TEI, CES, NLP, lemmatizing, POS tagging, …
• File server, web server, data model, HTTP request, client, protocol, port, …
• ∫, √, ∂u, ∆v, −∞, +∞