Language Data Resources: Word senses, Sense-tagged corpora

Upload: gabriel-terry

Post on 18-Dec-2015


Language Data Resources

Word senses

Sense-tagged corpora

• Lev V. Ščerba: And indeed, every sufficiently complex word must actually become the subject of a scientific monograph; therefore it is hard to expect in the near future the completion of a good dictionary.

Word sense disambiguation

• The different meanings of polysemous words are known as “senses”, and the process of deciding which one is being used in a particular context is called “word sense disambiguation”

Lexical Acquisition Bottleneck

• In NLP, many systems do not perform in practice as well as they could with adequate dictionary resources, due to the cost of production, adaptation, and maintenance of these resources

• Solutions

– Reusing existing dictionaries and ontologies as lexicons

– Deriving disambiguation information directly from corpora

Usefulness of WSD

• NLP tools:

– Systems – carry out some task of “interest for its own sake” (e.g. MT, IR); applications potentially interesting for non-linguists

– Components – interesting for linguists and language engineers; e.g. WSD

Early approaches

• Preference semantics – 1970s

– Selectional constraints (e.g. ANIMATE for subject of “to drink”)

• Word experts – 1980s

– Hand-crafted disambiguators constructed for each word separately

– Limited applicability

• Polaroid words

– Gradual disambiguation (grammar, parser, lexicon, semantic interpreter, knowledge representation language)
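
The selectional-constraint idea listed under preference semantics can be sketched as a simple feature check. This is only a minimal illustration: the toy noun features, the constraint table, and the function name `satisfies_subject_constraint` are assumptions made for the example and do not come from any of the systems named above.

```python
# Toy illustration of a selectional constraint: the verb "drink" is assumed
# to require an ANIMATE subject. All entries below are invented.

# Hypothetical semantic features for a few nouns.
NOUN_FEATURES = {
    "dog": {"ANIMATE"},
    "child": {"ANIMATE", "HUMAN"},
    "stone": set(),
}

# Hypothetical features that each verb requires of its subject.
SUBJECT_CONSTRAINTS = {
    "drink": {"ANIMATE"},
}


def satisfies_subject_constraint(verb: str, subject: str) -> bool:
    """Return True if the subject carries every feature the verb requires."""
    required = SUBJECT_CONSTRAINTS.get(verb, set())
    available = NOUN_FEATURES.get(subject, set())
    return required <= available  # subset test


print(satisfies_subject_constraint("drink", "dog"))    # True
print(satisfies_subject_constraint("drink", "stone"))  # False
```

A verb with no entry in the table imposes no constraint here, so any subject is accepted for it.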

Dictionary Based Approaches

• Since the 1980s – dictionary publishers started to produce “Machine Readable Dictionaries” (now: machine-tractable dictionaries)

• Wider polysemy than in the systems described so far

Two claims about sense distribution

• One sense per discourse

– There is a very strong tendency for multiple uses of a word to share the same sense in a well-written discourse

• One sense per collocation

– With a high probability an ambiguous word has only one sense in a given collocation
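
Both claims can be turned into simple heuristics, sketched below. This is a rough illustration under stated assumptions: the sense labels, the collocation table, and both function names are hypothetical, and majority voting is just one plausible way to exploit the one-sense-per-discourse tendency.

```python
from collections import Counter


def apply_one_sense_per_discourse(tags):
    """Relabel all occurrences of one word in a discourse with the majority sense.

    `tags` holds one sense label per occurrence; None marks an occurrence
    that a (hypothetical) tagger left undecided.
    """
    decided = [t for t in tags if t is not None]
    if not decided:
        return list(tags)  # nothing to propagate
    majority, _ = Counter(decided).most_common(1)[0]
    return [majority] * len(tags)


# Hypothetical collocation table: a neighbouring word that fixes the sense.
COLLOCATION_SENSE = {
    ("plant", "manufacturing"): "plant/factory",
    ("plant", "flowering"): "plant/flora",
}


def sense_from_collocation(word, neighbour):
    """Return the sense tied to this collocation, or None if it is unknown."""
    return COLLOCATION_SENSE.get((word, neighbour))


discourse = ["plant/factory", None, "plant/factory", "plant/flora"]
print(apply_one_sense_per_discourse(discourse))
print(sense_from_collocation("plant", "flowering"))
```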

Taxonomy of WSD Algorithms

• Knowledge based

• Corpus based

– Tagged corpora

– Untagged corpora (bananaelephant)

• Hybrid approaches

Word Senses and Lexicons

Sense tagging = attaching senses from some lexicon to words in text
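
The definition above can be sketched as a function that pairs each token with a sense identifier from a lexicon. This is a minimal sketch, not a real tagger: the lexicon, the sense identifiers such as `bank#1`, and the first-listed-sense chooser are all assumptions made for illustration.

```python
# Toy sense inventory. Identifiers are invented:
#   bank#1 = financial institution, bank#2 = side of a river
#   interest#1 = charge on a loan,  interest#2 = curiosity
LEXICON = {
    "bank": ["bank#1", "bank#2"],
    "interest": ["interest#1", "interest#2"],
}


def first_sense(word, senses, context):
    """Placeholder chooser: always pick the first listed sense."""
    return senses[0]


def sense_tag(tokens, choose=first_sense):
    """Attach a sense from LEXICON to each token; None if the word is not listed."""
    tagged = []
    for token in tokens:
        senses = LEXICON.get(token.lower())
        sense = choose(token, senses, tokens) if senses else None
        tagged.append((token, sense))
    return tagged


print(sense_tag(["The", "bank", "raised", "interest", "rates"]))
```

Any actual disambiguation method would replace `first_sense`; the output format (token, sense) stays the same.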

Sense-enumerative dictionary

Deficiencies of dictionaries

• Omissions and oversights

• Coverage of names

• Ghost words – Dord = density (D or d)

• Differentiating senses (P. Hanks: A serious problem for computer applications is that dictionaries compiled for human users focus on giving lists of meanings for each entry, without saying much about how one meaning may be distinguished from another in text)

Two levels of sense distinction

• Homography

– Two senses of a word are homographic when there is no obvious semantic relation between them (e.g. a ball – a dance or a rounded object)

– Risk of amateur etymology

• Polysemy

Distinguishing senses

• P.Hanks: No generally agreed criteria exist for what counts as a sense, or for how to distinguish one sense from another

• Zeugma: Arthur and his driving license expired last Thursday.

• Polysemy vs. vagueness (e.g. mountain)

The Bank Model

• Assumption A – Words have a finite set of clearly distinct, well-defined senses

• Assumption B – Native speakers of … know instantly and effortlessly which meaning applies in a given situation

• Criticism of the bank model: Kilgarriff (“I don’t believe in word senses”), Pustejovsky (Generative lexicon), and many others…

NLP Lexicons

• Longman Dictionary of Contemporary English (LDOCE) – three-level embedded structure for sense distinctions (homographs, senses, optional subsenses)

• Roget’s Thesaurus

• Cambridge International Dictionary of English

• COBUILD English Language Dictionary

• WordNet
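
The three-level structure mentioned for LDOCE (homographs, senses, optional subsenses) can be pictured as nested records. The sketch below is an assumption-laden illustration: the class names, the field names, and the glosses for “ball” are invented and do not reproduce the actual LDOCE data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    definition: str
    subsenses: List[str] = field(default_factory=list)  # optional third level


@dataclass
class Homograph:
    senses: List[Sense]


@dataclass
class Entry:
    headword: str
    homographs: List[Homograph]


# Invented glosses, for illustration only.
ball = Entry(
    headword="ball",
    homographs=[
        Homograph(senses=[Sense("a rounded object", ["a round object used in games"])]),
        Homograph(senses=[Sense("a formal dance")]),
    ],
)


def count_senses(entry: Entry) -> int:
    """Count second-level senses across all homographs of an entry."""
    return sum(len(h.senses) for h in entry.homographs)


print(count_senses(ball))  # 2
```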

Thesaurus

Ontology

• There is little agreement on what an ontology is… In general, an ontology can be described as an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them.

• Aristotle: genus (category to which something belongs) and differentiae (properties that uniquely distinguish the category members from their parent and from one another)

• Nodes (concepts) in the hierarchy related by subsumption
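
Subsumption in such a hierarchy can be checked by walking from a concept up through its ancestors. The sketch below assumes a small invented hierarchy with a single parent per node; the concept names and the function `subsumes` are hypothetical.

```python
# Hypothetical parent links (child -> parent) forming a small hierarchy.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def subsumes(general: str, specific: str) -> bool:
    """Return True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)  # None once the root is passed
    return False


print(subsumes("animal", "dog"))  # True
print(subsumes("dog", "animal"))  # False
```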

Ontologies in different traditions

• Philosophical• Cognitive • Artificial intelligence• Lexical semantics• Lexicography• Information science

Princeton WordNet

• Lexical semantic network structured around the notion of synsets

• Synset – a group of literals of the same part of speech that are mutually interchangeable in a certain context (“set of synonyms”)

• http://www.cogsci.princeton.edu/~wn/w3wn.html

• Inspired by psycholinguistic theories of human lexical memory

• Broad coverage, rich lexical information, freely available

• Too fine-grained for practical NLP tasks

• Relations between two synsets: hyponymy, hyperonymy, meronymy …
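
A synset-based network can be sketched as a list of synsets plus relation triples between them. This is a simplified illustration and not the WordNet database format: the synset identifiers, the example literals, and the helper functions are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Synset:
    synset_id: str
    pos: str                   # part of speech shared by all literals
    literals: Tuple[str, ...]  # interchangeable in some context


# Invented miniature network.
SYNSETS = [
    Synset("s1", "n", ("car", "automobile")),
    Synset("s2", "n", ("motor vehicle",)),
    Synset("s3", "n", ("car", "railway car")),
]

# Hypothetical relation triples: (source synset, relation, target synset).
RELATIONS = [("s1", "hypernym", "s2")]


def synsets_for(literal):
    """All synsets containing the literal; more than one means polysemy."""
    return [s for s in SYNSETS if literal in s.literals]


def related(synset_id, relation):
    """Targets reachable from a synset through the given relation."""
    return [t for s, r, t in RELATIONS if s == synset_id and r == relation]


print([s.synset_id for s in synsets_for("car")])
print(related("s1", "hypernym"))
```

Relations are stored between synsets rather than between words, which is the design choice the slide attributes to WordNet.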

EuroWordNet (i)

• Multilingual database containing several monolingual wordnets structured along the same lines as the Princeton WordNet 1.5

• English, Dutch, German, Spanish, French, Italian, Czech, Estonian

• Inter-Lingual-Index

• http://www.hum.uva.nl/~ewn

EuroWordNet (ii)

Princeton WordNet 1.5: note, observe, make a remark, remark

EuroWordNet (Czech): prohodit, poznamenat, připomenout

EuroWordNet (German): anmerken, bemerken
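
The role of the Inter-Lingual-Index can be sketched as a shared key that links synsets of different monolingual wordnets. The sketch below reuses the example words from this slide, but the record identifier `ili-0001`, the dictionary layout, and the function `translate_via_ili` are assumptions, not the actual EuroWordNet database design.

```python
# Each monolingual wordnet links its synsets to shared ILI records.
# The identifier "ili-0001" is invented for this example.
ILI_LINKS = {
    "en": {"ili-0001": ["note", "observe", "make a remark", "remark"]},
    "cs": {"ili-0001": ["prohodit", "poznamenat", "připomenout"]},
    "de": {"ili-0001": ["anmerken", "bemerken"]},
}


def translate_via_ili(word, source, target):
    """Collect target-language literals that share an ILI record with `word`."""
    results = []
    for ili_id, literals in ILI_LINKS[source].items():
        if word in literals:
            results.extend(ILI_LINKS[target].get(ili_id, []))
    return results


print(translate_via_ili("remark", "en", "cs"))
print(translate_via_ili("bemerken", "de", "en"))
```

Because every wordnet points at the index rather than at each other, adding a language needs only one new set of links.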

Sense tagged corpora

• “interest” corpus – 2k sentences containing the word “interest”

• SENSEVAL

– http://www.senseval.org

– WSD evaluation exercise, first run in 1998

• SEMCOR

– http://multisemcor.itc.it/semcor.php

– Subset of the English Brown corpus, 700k words

– More than 200k words sense-tagged according to Princeton WordNet 1.6

Final remarks

• Similarity of POS- and sense tagging

• Mapping lexical resources