EMELD Workshop on Digitizing Lexical Information
Modeling Lexical Entries in Bilingual Dictionaries
—Or— Exegeting the UML Model
Mike Maxwell
Linguistic Data Consortium
Three Levels of Abstraction
• File formats
• Data models
• Ontologies
Conceptual Structure vs. Views
• Data model = conceptual/underlying structure
• View = layout, formatting
• Examples of views:
  – Page layout
  – Definition numbers
  – Alphabetization
  – Filtered subsets
Conceptual Structure vs. Views
• Spanish-English and English-Spanish sides of bilingual dictionary: View
• Spanish lexical entries, English lexical entries, and relations between them: Underlying structure
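A minimal Python sketch of this distinction, assuming a hypothetical in-memory store (the names `Entry`, `links`, and `render_side` and the sample words are illustrative, not part of the UML model). Both sides of the printed dictionary are generated as views over the same entries and relations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    lang: str   # language code, e.g. "es" or "en"
    form: str   # citation form

# Underlying structure: lexical entries plus relations between them.
# (Hypothetical sample data.)
perro, dog, hound = Entry("es", "perro"), Entry("en", "dog"), Entry("en", "hound")
links = [(perro, dog), (perro, hound)]

def render_side(links, source_lang):
    """Build one side of the dictionary as a view: alphabetized headwords
    in source_lang, each with its equivalents in the other language."""
    side = {}
    for a, b in links:
        for src, tgt in ((a, b), (b, a)):
            if src.lang == source_lang:
                side.setdefault(src.form, []).append(tgt.form)
    return [(head, sorted(side[head])) for head in sorted(side)]

spanish_english = render_side(links, "es")
english_spanish = render_side(links, "en")
```

Nothing is stored twice: adding one relation to `links` updates both sides the next time they are rendered.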
UML Models
• What is UML?
  “The Unified Modeling Language™ (UML) is the industry-standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems. It simplifies the complex process of software design, making a ‘blueprint’ for construction.” (http://www.rational.com/uml/index.jsp)
• Blueprint language
• We’ll use small subset
UML Models
• Objects
• Classes
• Attributes
• Links
  – Composition
  – Association
• Class hierarchy
UML Models
• Normalization
  – Data item appears once
  – Attribute (‘field’) holds one type of data
• Strings
  – MultiUnicode
  – MultiString
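The slide does not define the string types. The sketch below assumes that a MultiUnicode value holds one plain-text alternative per writing system; that reading, the method names, and the writing-system codes are assumptions made for illustration, not the model's documented interface:

```python
from dataclasses import dataclass, field

@dataclass
class MultiUnicode:
    """Assumed meaning: the same item written in several writing systems,
    one plain string per writing system."""
    alternatives: dict = field(default_factory=dict)  # writing-system code -> text

    def set(self, ws, text):
        # Each writing system has at most one alternative, so a data item
        # appears once and the attribute holds one type of data.
        self.alternatives[ws] = text

    def get(self, ws, default=None):
        return self.alternatives.get(ws, default)

citation = MultiUnicode()
citation.set("es", "perro")
citation.set("es-fonipa", "ˈpero")  # hypothetical code for a phonetic writing system
```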
SIL-developed Model
• Bilingual lexicon (one-way: full information for vernacular language only)
• Developed for LinguaLinks
• Modified for Fieldworks
• Embedded in larger model of language description (http://fieldworks.sil.org/ModelDoc/)
Lexicon [see Lexicon.gif]
• Front matter, appendices, …
• Lexical entries
  – Lexemes (stems, roots, words)
  – Affixes
  – Larger constructs (idioms etc.)
Lexical Entry [see LexEntries.gif]
• Kinds of lexical entries
  – Major Entry
  – Subentry
  – Minor Entry
Major Entries
• LexMajorEntry
• For morphemes and non-compositional word-level “things”
  – Stems, roots, affixes (not a theoretical statement!)
  – But citation forms can be words
Subentries
• LexSubentry
  – Subclass of LexMajorEntry
• For multi-morphemic constructs:
  – Derivatives
  – Compounds
  – Idioms
  – Sayings
  – Phrasal verbs
Subentries (cont’d)
• Points to morphemes (etc.) of which it is composed
• Does not “belong” to morphemes (LexMajorEntries) of which it is composed
Minor Entries
• LexMinorEntry
  – Subclass of LexMajorEntry (but usually much simpler)
• For irregular forms (oxen, been, went)
• Belong to a LexMajorEntry (but alphabetization is a view!)
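The three kinds of entry can be sketched as Python dataclasses. The class names and the subclass relations follow the slides. The attribute names (`citation_form`, `minor_entries`, `components`), the helper `alphabetized`, and the sample words other than "oxen" are assumptions made for illustration. Ownership ("belongs to") is modeled as composition, and "points to" as a plain reference:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LexMajorEntry:
    citation_form: str
    # Composition: minor entries belong to (are owned by) a major entry.
    minor_entries: List["LexMinorEntry"] = field(default_factory=list)

@dataclass
class LexSubentry(LexMajorEntry):
    # Association: a subentry points to the entries it is composed of,
    # but does not belong to them.
    components: List[LexMajorEntry] = field(default_factory=list)

@dataclass
class LexMinorEntry(LexMajorEntry):
    pass  # usually much simpler than a full major entry

black, bird = LexMajorEntry("black"), LexMajorEntry("bird")
blackbird = LexSubentry("blackbird", components=[black, bird])
ox = LexMajorEntry("ox")
ox.minor_entries.append(LexMinorEntry("oxen"))

def alphabetized(entries):
    """Alphabetization is a view: flatten owned minor entries and sort."""
    flat = []
    for entry in entries:
        flat.append(entry.citation_form)
        flat.extend(m.citation_form for m in entry.minor_entries)
    return sorted(flat)

headwords = alphabetized([black, bird, blackbird, ox])
```

In the view, "oxen" appears in its alphabetical place even though the underlying structure stores it under "ox".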
Parts of Lexical Entries
• Lexica est omnis divisa in partes tres (plus a label):
  – Citation form (= the label)
  – Forms
  – Morphosyntactic information
  – Senses
• No provision for etymology
Parts of lexical entries: Citation Form
• = Lemma, Headword, Canonical Form
• CitationForm attribute
• MultiUnicode
Parts of lexical entries: Forms [see MoForm.gif]
• Pronunciations
  LexPronunciation (written form + sound)
• Allomorphs
  MoForm (written form, morph type, phonological context…)
• Underlying Form
  MoForm
Parts of lexical entries: Morphosyntactic Information [see MSI.gif]
• MoStemMsi (for Stems/Roots, whether bound or free)
  – Part of speech
  – Inherent morphosyntactic features
  – Inflection class (= paradigm/declension)
  – Exception features
Parts of lexical entries: Morphosyntactic Information
• MoInflectionalAffixMsi (for Inflectional Affixes)
  – Morphosyntactic features
  – Exception features
Parts of lexical entries: Morphosyntactic Information
• MoDerivationalAffixMsi (for Derivational Affixes)
  – From/to POS
  – From/to morphosyntactic features
  – From/to inflection classes
  – From/to exception features
Parts of lexical entries: Senses [see LexSenses.gif]
• LexSense:
  – Definition
  – Gloss
  – Scientific name
  – Pictures
  – Example sentences
  – Sub-senses (more LexSense objects)
Parts of lexical entries: Senses
• LexSense (cont’d):
  – Morphosyntactic information: points to a ‘MorphosyntaxInfo’ object
  – This ‘MorphosyntaxInfo’ object can be shared among different senses of the same LexEntry:
    run = to jog
    run = to go (to the store)
    (both can be nouns or intransitive verbs)
Parts of lexical entries: Senses
• LexSense (cont’d):
  – Use of shared ‘MorphosyntaxInfo’ object allows flexibility via views:
    The particular way in which definitions and other features of the dictionary article are presented comprise the macrostructure. Are definitions arranged by part-of-speech?… (Landau, Dictionaries: The Art and Craft of Lexicography, p. 99)
– A view!
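A minimal sketch of the shared object and of one view built on it. The class names come from the slides; the attribute names, the helper `by_part_of_speech`, and the two noun glosses are hypothetical. Senses hold a reference to a MorphosyntaxInfo object, not a copy, so grouping by part of speech becomes a presentation choice made at display time:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MorphosyntaxInfo:
    part_of_speech: str

@dataclass
class LexSense:
    gloss: str
    msi: MorphosyntaxInfo  # a reference to a shared object, not a copy

@dataclass
class LexEntry:
    citation_form: str
    senses: List[LexSense] = field(default_factory=list)

verb = MorphosyntaxInfo("intransitive verb")
noun = MorphosyntaxInfo("noun")
run = LexEntry("run", [
    LexSense("to jog", verb),
    LexSense("a jog", noun),                 # hypothetical noun gloss
    LexSense("to go (to the store)", verb),
    LexSense("a trip (to the store)", noun), # hypothetical noun gloss
])

def by_part_of_speech(entry):
    """One possible view: senses arranged by part of speech."""
    groups = {}
    for sense in entry.senses:
        groups.setdefault(sense.msi.part_of_speech, []).append(sense.gloss)
    return groups

grouped = by_part_of_speech(run)
```

A different view could list the senses in their stored order and print the part of speech next to each one; the underlying data would not change.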
Parts of lexical entries: Senses
• LexSense (cont’d):
  – Points to set of ‘ReversalIndexEntry’ objects
    • Can be shared among senses belonging to the same or other LexEntries
    • Many-to-many relation between LexSenses and ReversalIndexEntries
Parts of lexical entries: Senses
• ReversalIndexEntry: impoverished LexEntry
  – Name (= citation form)
  – POS
  – Sub-entries
Allows for reversal entries like:
  Green (adj.)
    to be green: yax
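The "green" example can be sketched as follows. The class names come from the slides. The attribute names and the helper `vernacular_forms` are assumptions, and storing the vernacular form directly on the sense is a simplification for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReversalIndexEntry:
    name: str                      # = citation form
    pos: Optional[str] = None
    subentries: List["ReversalIndexEntry"] = field(default_factory=list)

@dataclass
class LexSense:
    vernacular_form: str           # simplification: form kept on the sense
    # Each sense points to a set of reversal entries; a reversal entry
    # may be pointed to by many senses (many-to-many).
    reversal_entries: List[ReversalIndexEntry] = field(default_factory=list)

to_be_green = ReversalIndexEntry("to be green")
green = ReversalIndexEntry("green", pos="adj.", subentries=[to_be_green])
yax = LexSense("yax", reversal_entries=[to_be_green])

def vernacular_forms(reversal_entry, senses):
    """Collect the vernacular forms whose senses point at this reversal entry."""
    return sorted(
        s.vernacular_form
        for s in senses
        if any(r is reversal_entry for r in s.reversal_entries)
    )

under_to_be_green = vernacular_forms(to_be_green, [yax])
under_green = vernacular_forms(green, [yax])
```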
Relationships among Senses: Synonyms [see LexSets.gif]
• LexSimpleSet
  One set per group of synonyms (asymmetry in model?)
• ‘Members’ = LexSetItems, in turn pointing to a LexSense (LexSetItems are a throw-away class?)
Relationships among Senses: Antonyms and other Binary Relations
• LexPairRelations, owning sets of LexPairs
• Allows:
  – Directed relations (e.g. individual-group), or
  – Undirected relations (e.g. antonyms)
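A rough sketch of the difference between directed and undirected pair relations. The class here is singular, `LexPairRelation`, and its `directed` flag, its `related` method, and the sample words are all hypothetical. In the model the pairs point to LexSenses; plain strings stand in for them here:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LexPairRelation:
    name: str
    directed: bool
    # Owned pairs; strings stand in for the LexSenses of the real model.
    pairs: List[Tuple[str, str]] = field(default_factory=list)

    def related(self, a, b):
        """True if a stands in this relation to b."""
        if (a, b) in self.pairs:
            return True
        # An undirected relation holds in both directions.
        return (not self.directed) and (b, a) in self.pairs

antonym = LexPairRelation("antonym", directed=False, pairs=[("hot", "cold")])
member_of = LexPairRelation("individual-group", directed=True,
                            pairs=[("wolf", "pack")])
```

Storing each undirected pair once and checking both orders at query time keeps the data normalized.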
Relationships among Senses: Part-Whole, Generic-Specific
• LexTreeRelations, owning sequence of LexTreeItems
• Outline structure:
  (animal (mammal (dog cat)) (reptile (snake turtle)))
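The outline above can be held as a tree of items and printed as an indented outline. In this sketch the class name follows the slide, while the attribute names and the `outline` helper are assumptions, and labels stand in for the senses the items would point to:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LexTreeItem:
    label: str  # stands in for a pointer to a LexSense
    children: List["LexTreeItem"] = field(default_factory=list)

animal = LexTreeItem("animal", [
    LexTreeItem("mammal", [LexTreeItem("dog"), LexTreeItem("cat")]),
    LexTreeItem("reptile", [LexTreeItem("snake"), LexTreeItem("turtle")]),
])

def outline(item, depth=0):
    """Return the tree as indented lines, generic terms before specifics."""
    lines = ["  " * depth + item.label]
    for child in item.children:
        lines.extend(outline(child, depth + 1))
    return lines

animal_outline = outline(animal)
```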
Relationships among Senses: Scales
• LexScale (relation not specified: asymmetry in model)
  – Negative-neutral-positive scales (tiny, small; medium; big, huge)
  – Positive (or neutral) scales (inch, foot, yard, furlong) (January, … December)
Dialects
• Q: What can vary between dialects?
• A: Anything
Dialects
• Modeling dialects
  – Separate encodings
  – Separate lexicons
  – Mark objects for dialect (what level of granularity?)
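The third option, marking objects for dialect, might look like the sketch below. The class `MarkedForm`, its `dialects` field, the helper `for_dialect`, the dialect codes, and the sample spellings are all hypothetical. The marking here sits on individual forms, which is only one possible level of granularity:

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass
class MarkedForm:
    form: str
    # None means the object is valid in every dialect.
    dialects: Optional[FrozenSet[str]] = None

def for_dialect(items, dialect):
    """Filtered subset (a view): items unmarked or marked for this dialect."""
    return [i for i in items if i.dialects is None or dialect in i.dialects]

# Hypothetical sample data.
forms = [
    MarkedForm("color", frozenset({"en-US"})),
    MarkedForm("colour", frozenset({"en-GB"})),
    MarkedForm("red"),
]

us_forms = [f.form for f in for_dialect(forms, "en-US")]
gb_forms = [f.form for f in for_dialect(forms, "en-GB")]
```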