Kivik 2013 NLP. Corpus processing, Ontologies 1
The contribution of NLP
Corpus processing
Ontologies and terminologies
What is NLP?
• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
  – Language Technology
NLP and linguistics
LING -> NLP: supply ideas, interpret results
NLP -> LING: test theories, expose gaps
NLP, in addition: turn into technology
Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
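The loop above (write rules, program them, apply them to a lexicon, inspect the errors) can be sketched in a few lines of Python. This is only an illustration: the three spelling rules are a deliberately incomplete toy grammar of English verb inflection, and the function name `inflect` and the stem list `LEXICON` are hypothetical, not from the slides.

```python
VOWELS = "aeiou"

def inflect(stem):
    """Return the regular inflected forms predicted by the toy rules."""
    if stem.endswith("e"):
        # arrive -> arrives, arrived, arriving
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        # carry -> carries, carried, carrying
        return {"3sg": stem[:-1] + "ies", "past": stem[:-1] + "ied", "ing": stem + "ing"}
    if stem.endswith(("s", "sh", "ch", "x", "z")):
        # watch -> watches, watched, watching
        return {"3sg": stem + "es", "past": stem + "ed", "ing": stem + "ing"}
    # default rule: help -> helps, helped, helping
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}

LEXICON = ["help", "arrive", "carry", "watch", "stop"]  # hypothetical stems

for stem in LEXICON:
    print(stem, inflect(stem))
```

Checking the output exposes a gap: the rules produce "stoped" and "stoping", because nothing in them doubles a final consonant. That is the point at which the theory goes back to the linguist for refinement.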
Applications
• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talk to computers
  – speech processing as well
• information extraction
  – finding facts in a database of documents
  – answering questions
How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on predecessors
Tokenization
“identifying the words”
from: He didn't arrive.
to: He | did | n't | arrive | .
Automatic tokenization
• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use a word separator
  – hard
    • like POS-tagging (below)
Why isn't space=separator enough (even for English)?
• What is a space?
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?
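The three problems above can be made concrete with a small regular-expression tokenizer. This is a sketch under stated assumptions, not a production tokenizer: it treats every kind of whitespace as a separator, splits punctuation off as separate tokens, splits the clitic n't from its verb, and keeps hyphenated forms whole (one possible answer to the co-op question, chosen here only for illustration). The names `TOKEN_RE` and `tokenize` are hypothetical.

```python
import re

TOKEN_RE = re.compile(
    r"\w+?(?=n['’]t\b)"   # verb stem before the clitic: did | n't
    r"|n['’]t\b"          # the clitic itself
    r"|\w+(?:-\w+)*"      # a word; hyphenated forms are kept whole
    r"|[^\w\s]"           # any single punctuation character
)

def tokenize(text):
    """Return the tokens of text; spaces, tabs and line breaks are all skipped."""
    return TOKEN_RE.findall(text)

print(tokenize("He didn't arrive."))
print(tokenize("a well-managed\tco-op\n(really)"))
```

Changing the third alternative to a plain `\w+` would instead split co-op into three tokens; which behaviour is right depends on what the corpus is for.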
Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to: He | did | n't | arrive | .
to: <s> He | did | n't | arrive | . </s>
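A minimal sketch of sentence splitting over an already tokenized text, assuming that a sentence ends at every ".", "!" or "?" token. That assumption is too simple for real text (abbreviations such as "Dr." break it), and the function names are hypothetical.

```python
SENTENCE_END = {".", "!", "?"}

def split_sentences(tokens):
    """Group a flat token list into sentences, closing one after . ! or ?"""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences

def mark_up(sentences):
    """Wrap each sentence in <s> ... </s>, one sentence per line."""
    return "\n".join("<s> " + " ".join(s) + " </s>" for s in sentences)

tokens = ["He", "did", "n't", "arrive", ".", "She", "did", "."]
print(mark_up(split_sentences(tokens)))
```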
Lemmatization
Mapping from text-word to lemma
help (verb)

text-word    lemma
help         help (v)
helps        help (v)
helping      help (v)
helped       help (v)
Lemmatization
Mapping from text-word to lemma
help (verb), help (noun), helping (noun)

text-word    lemma
help         help (v), help (n)
helps        help (v), help (n)**
helping      help (v), helping (n)
helped       help (v)
helpings     helping (n)

** help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.
Lemmatization
Dictionary entries are for lemmas
Match between text-word and dictionary-word = lemmatization
Lemmatization
• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000.
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors
Automatic lemmatization
• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is a verb lemma, add it to the list of possible lemmas
• If a detailed grammar is available, use it
• A full lemma list is also required
  – Often available from dictionary companies
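The rule format above (strip a suffix, then check the remainder against the lemma list) can be sketched as follows. The tiny lemma list and the handful of suffix rules are hypothetical stand-ins for the full resources the slide mentions; only the "ing" rule comes from the slide itself.

```python
# Hypothetical lemma list: (lemma, part of speech) pairs.
LEMMAS = {("help", "v"), ("help", "n"), ("helping", "n"), ("arrive", "v")}

# (suffix to strip, text to add back, part of speech of the resulting lemma)
RULES = [
    ("", "", "v"), ("", "", "n"),          # the word may already be a lemma
    ("ing", "", "v"), ("ing", "e", "v"),   # helping -> help, arriving -> arrive
    ("ed", "", "v"), ("d", "", "v"),       # helped -> help, arrived -> arrive
    ("s", "", "v"), ("s", "", "n"),        # helps -> help (v) or help (n)
]

def lemmatize(word):
    """Return every (lemma, pos) that some rule derives and the lemma list confirms."""
    word = word.lower()
    candidates = []
    for suffix, replacement, pos in RULES:
        if word.endswith(suffix):
            remainder = word[: len(word) - len(suffix)] + replacement
            if (remainder, pos) in LEMMAS and (remainder, pos) not in candidates:
                candidates.append((remainder, pos))
    return candidates

for w in ["help", "helps", "helping", "helped", "helpings"]:
    print(w, lemmatize(w))
```

The output is a list of possible lemmas, not a decision: "helping" comes back as both help (v) and helping (n), and choosing between them needs context, which is what POS-tagging supplies.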
Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to:
<s>
He       PNP    pers pronoun
did      VVD    past tense verb
n’t      XNOT   not
arrive   VV     base form of verb
.        C      punctuation
</s>
Tagsets
• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
    • NN2 plural noun
    • VVG -ing form of lexical verb
• Based on linguistics of the language.
POS-tagging: why?
• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of
POS-tagging: how?
• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – Constraint-based: set of rules of the form: if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
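A minimal sketch of the constraint-based method, using the single rule quoted above. The lexicon `POSSIBLE_TAGS` and the coarse tag names are hypothetical; a real tagger has a large lexicon and many such rules. One safeguard is added that the slide does not state: a rule never deletes a word's last remaining tag.

```python
# Hypothetical lexicon: every tag each word can take out of context.
POSSIBLE_TAGS = {
    "the": {"DET"},
    "help": {"NOUN", "VERB"},
    "arrived": {"VERB"},
}

def tag(words):
    """Start from all possible tags, then let constraints delete candidates."""
    readings = [set(POSSIBLE_TAGS.get(w.lower(), {"UNKNOWN"})) for w in words]
    for i in range(1, len(words)):
        # Rule from the slide: if the previous word is "the" and VERB is
        # one of the possibilities, delete VERB.
        # Added assumption: keep VERB if it is the only reading left.
        if (words[i - 1].lower() == "the"
                and "VERB" in readings[i]
                and len(readings[i]) > 1):
            readings[i].discard("VERB")
    return list(zip(words, readings))

print(tag(["The", "help", "arrived"]))
```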
Parsing
• Find the structure:
  – Phrase structure (trees)
    The cat sat on the mat
  – Dependency structure (links)
    The cat sat on the mat
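The two kinds of structure can be written down as data. The sketch below gives one plausible analysis of the example sentence in each style; the category labels and the choice of heads (for instance, the preposition heading its noun) are assumptions made for illustration, since the slide's diagrams are not available here and other conventions exist.

```python
# Phrase structure: nested (label, child, ...) tuples; leaves are strings.
TREE = ("S",
        ("NP", ("DET", "The"), ("N", "cat")),
        ("VP", ("V", "sat"),
               ("PP", ("P", "on"),
                      ("NP", ("DET", "the"), ("N", "mat")))))

# Dependency structure: each word points at the index of its head word.
# The main verb has no head (None).
WORDS = ["The", "cat", "sat", "on", "the", "mat"]
HEADS = [1, 2, None, 2, 5, 3]

def leaves(node):
    """Read the words back off a phrase-structure tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

def links(words, heads):
    """List the (dependent, head) links of a dependency structure."""
    return [(words[i], words[h]) for i, h in enumerate(heads) if h is not None]

print(leaves(TREE))
print(links(WORDS, HEADS))
```

Both structures cover the same six words: the tree groups them into nested phrases, while the dependency analysis links each word directly to one other word.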
Automatic parsing
• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
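The idea of shallow parsing as regular expressions over POS-tags can be illustrated with a toy noun-phrase finder. This is not Sketch Engine's own grammar notation: the `word/TAG` encoding, the coarse tag names and the pattern `NP_RE` are all assumptions made for this sketch.

```python
import re

# Noun phrase = optional determiner, any number of adjectives, then a noun.
# The pattern runs over a string of "word/TAG " units.
NP_RE = re.compile(r"(?<!\S)(?:\S+/DET )?(?:\S+/ADJ )*\S+/N ")

def noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    text = "".join(f"{word}/{tag} " for word, tag in tagged)
    return [
        " ".join(unit.rsplit("/", 1)[0] for unit in match.split())
        for match in NP_RE.findall(text)
    ]

tagged = [("The", "DET"), ("cat", "N"), ("sat", "V"), ("on", "P"),
          ("the", "DET"), ("old", "ADJ"), ("mat", "N")]
print(noun_phrases(tagged))
```

Because the pattern only looks at a flat sequence of tags, it is fast enough for large corpora, at the cost of missing structure that a full parser would find.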
Summary
• What is NLP?
• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing
Ontologies and Terminology
and how they relate to lexicography
Terminology
• Contains terms
  – for the objects and concepts in a domain
  – organized according to relations between objects
  – Different language
    • Same objects, so
    • Same organization
    • Different terms
Ontology
• Artificial Intelligence
• Like terminology with reasoning
  • Tweety is-a swallow
  • A swallow is-a bird
  • Birds fly
  Inference
  -----------------------
  • Tweety flies
the rationalist dream of automated reasoning

Bird (flies)
  swallow, robin, …
    Tweety (under swallow)
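The Tweety inference can be reproduced with two small tables and a walk up the is-a links. This is a minimal sketch, assuming each item has at most one parent and that properties are simply inherited; the names `IS_A`, `PROPERTIES`, `ancestors` and `has_property` are hypothetical.

```python
# is-a links: each item points at its parent class.
IS_A = {"Tweety": "swallow", "swallow": "bird", "robin": "bird"}

# Properties stated directly on a class.
PROPERTIES = {"bird": {"flies"}}

def ancestors(item):
    """Yield every class reachable by following is-a links upward."""
    while item in IS_A:
        item = IS_A[item]
        yield item

def has_property(item, prop):
    """True if the item, or any class above it, is stated to have the property."""
    return any(prop in PROPERTIES.get(node, set())
               for node in [item, *ancestors(item)])

print(list(ancestors("Tweety")))
print(has_property("Tweety", "flies"))
```

Nothing in the data says that Tweety flies; the conclusion follows from the two is-a links and the property stated on bird.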
Ontology
• Chris is-a dentist
• Chris has-practice in Lancing
• Chris works 9am-3pm Mon-Fri
• …
• You live-near Lancing
• You want-to-visit dentist
• You are-available …
Inference
-----------------------
Appointment, you, Chris, Lancing, 10am, Thursday
Items in an ontology
• Defined by relations in ontology
• Labelled (only) by words/phrases in various languages
  X1   EN: bird      FR: oiseau
  X2   EN: swallow   FR: hirondelle
  …
• Ontology/things: language independent
Mismatches and gaps
Y1   EN: body parts   SP: …
  Y2   SP: dedo
    Y3   EN: finger
    Y4   EN: toe
  Y5   EN: arm   SP: brazo
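The items and labels above can be held in a small data structure in which the items are language independent and the words are only labels. The sketch below assumes that finger (Y3) and toe (Y4) sit under the item labelled dedo (Y2); that placement is an interpretation of the diagram, and the names `LABELS`, `PARENT` and `label` are hypothetical.

```python
# Language-independent items, each labelled in zero or more languages.
LABELS = {
    "X1": {"EN": "bird", "FR": "oiseau"},
    "X2": {"EN": "swallow", "FR": "hirondelle"},
    "Y2": {"SP": "dedo"},
    "Y3": {"EN": "finger"},
    "Y4": {"EN": "toe"},
}

# Assumed hierarchy: child item -> parent item.
PARENT = {"X2": "X1", "Y3": "Y2", "Y4": "Y2"}

def label(item, lang):
    """Return (item, word) for the nearest item at or above `item` that has
    a label in `lang`, or None if there is a gap all the way up."""
    while item is not None:
        if lang in LABELS[item]:
            return item, LABELS[item][lang]
        item = PARENT.get(item)
    return None

print(label("X2", "FR"))   # a direct match
print(label("Y3", "SP"))   # mismatch: only the broader item has a label
print(label("Y2", "EN"))   # gap: no label recorded in this table
```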
Thesaurus (eg Roget)
• Looks like a simple ontology
  – hierarchy only
  – supports inference?
    • usually fudged
• Language independent?
WordNet
• Princeton Univ project, from ca 1990
• Thesaurus
  – Synonym sets or synsets
  – Hyponyms/hyperonyms, antonyms, part-of, other lexical relations
• Free, online and available for download
  – Very widely used
  – Replicated for many languages, Global WN Assn
Lexicon/dictionary
• About words
• Organized by words
• Language specific
Rationalists        Empiricists
• Structure         • Data
• Depth             • Breadth
• Logic             • Statistics
• Semantic Web      • Google
• Terminology       • Lexicography
Terminology
• What is the thing called
  – in languages x, y, z
• What kind of thing is it
  – Is-a link
  – Its place in ontology
• Well-structured hierarchy

Lexicography
• How does the word behave?
  – what does it denote?
• Where does it occur?
Synthesis
• Thesis
  – Ontology, terminology, taxonomical lexicography
    • Semantic web, Roget, WordNets
• Antithesis
  – Corpus lexicography
• Synthesis: integrating
    • language-independent structure
    • language-specific word/phrase behaviour
  – Corpus-based terminology
  – FrameNet