
Upload: norah-harmon

Post on 18-Dec-2015


Kivik 2013 NLP. Corpus processing, Ontologies 1

The contribution of NLP

Corpus processing

Ontologies and terminologies


What is NLP?

• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
  – Language Technology


NLP and linguistics

LING: supply ideas, interpret results

NLP: test theories, expose gaps, plus turn into technology


Example: regular morphology

LINGUISTICS:
– Rules: stems -> inflected forms

NLP:
– program the rules
– apply rules to a lexicon of stems
– Is the output correct? Errors?

LINGUISTICS:
– refine the theory

Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
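The cycle above (program the rules, apply them to a lexicon of stems, inspect the errors) can be sketched in a few lines. This is only an illustration: the function `inflect`, the two rules, and the three-stem lexicon are assumptions made for the example, not part of the slides.

```python
# Hypothetical rule set: regular English verb inflection only.
def inflect(stem):
    """Apply regular inflection rules to a verb stem."""
    if stem.endswith("e"):
        # Final "e" is kept before -s, merged with -ed, dropped before -ing.
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}

lexicon = ["help", "arrive", "go"]  # toy lexicon of stems

for stem in lexicon:
    print(stem, inflect(stem))
# "go" comes out as "gos" / "goed": the kind of error that sends
# the linguist back to refine the theory (irregular verbs).
```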


Applications

• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talk to computers
  – speech processing as well
• information extraction
  – finding facts in a database of documents
  – answering questions


How can NLP make better dictionaries?

By pre-processing a corpus:

• tokenization

• sentence splitting

• lemmatization

• POS-tagging

• parsing

Each step builds on its predecessors


Tokenization

“identifying the words”

from: he didn't arrive.

to: He | did | n’t | arrive | .


Automatic tokenization

• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard
    • like POS-tagging (below)


Why isn't space=separator enough (even for English)?

• what is a space
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?
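A minimal sketch of a tokenizer that addresses the three points above, using Python's standard `re` module. The pattern and the name `tokenize` are assumptions for illustration. Two design choices are made here that the slide leaves open: hyphenated forms are kept as one token, and the clitic n't is split off as its own token.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a word stem that is directly followed by the clitic n't
#   2. the clitic n't itself (straight or curly apostrophe)
#   3. a word, keeping internal hyphens (co-op, well-managed) together
#   4. any single punctuation character
TOKEN = re.compile(r"\w+?(?=n['’]t\b)|n['’]t\b|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Return the tokens of text; spaces, tabs and linebreaks all separate."""
    return TOKEN.findall(text)

print(tokenize("He didn't arrive."))
# ['He', 'did', "n't", 'arrive', '.']
```

Treating co-op as two tokens instead would only require dropping the hyphen group from the third alternative.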


Sentence splitting

“identifying the sentences”

from: he didn't arrive.

to: <s> He | did | n’t | arrive | . </s>
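As a rough illustration, sentence splitting over an already tokenized text can be sketched as below. The rule used (a sentence ends at ".", "!" or "?") and the function names are assumptions for the example; real splitters also have to handle abbreviations such as "Dr." and other uses of the full stop.

```python
SENTENCE_END = {".", "!", "?"}  # assumed end-of-sentence tokens

def split_sentences(tokens):
    """Group a flat list of tokens into sentences."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

def mark_up(tokens):
    """Wrap each sentence in <s> ... </s> markers."""
    return " ".join("<s> " + " ".join(s) + " </s>" for s in split_sentences(tokens))

print(mark_up(["He", "did", "n't", "arrive", "."]))
# <s> He did n't arrive . </s>
```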


Lemmatization

Mapping from text-word to lemma: help (verb)

text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)


Lemmatization

Mapping from text-word to lemma: help (verb), help (noun), helping (noun)

text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)

**help (n): usually a mass noun, but part of the compound "home help", which is a count noun taking the "s" ending.


Lemmatization

Dictionary entries are for lemmas

Match between text-word and dictionary-word: lemmatization


Lemmatization

• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000.
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors


Automatic lemmatization

• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is verb lemma, add to list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
  – Often available from dictionary companies
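The rule-plus-lemma-list approach can be sketched as follows. The two tiny lemma lists, the suffix rules, and the name `candidate_lemmas` are assumptions made for this example; a real system would load a full lemma list, for instance one supplied by a dictionary publisher.

```python
# Toy lemma lists (assumed for the example).
VERB_LEMMAS = {"help", "arrive", "sing"}
NOUN_LEMMAS = {"help", "helping", "king"}

def candidate_lemmas(word):
    """Return all (lemma, part of speech) pairs the rules allow for word."""
    word = word.lower()
    candidates = set()
    if word in VERB_LEMMAS:
        candidates.add((word, "v"))
    if word in NOUN_LEMMAS:
        candidates.add((word, "n"))
    for suffix in ("ing", "ed", "s"):
        if not word.endswith(suffix):
            continue
        remainder = word[: -len(suffix)]
        # Accept the remainder only if it is a known verb lemma,
        # also trying a restored final "e" (arriving -> arrive).
        for verb in (remainder, remainder + "e"):
            if verb in VERB_LEMMAS:
                candidates.add((verb, "v"))
        if suffix == "s" and remainder in NOUN_LEMMAS:
            candidates.add((remainder, "n"))
    return sorted(candidates)

print(candidate_lemmas("helping"))  # [('help', 'v'), ('helping', 'n')]
print(candidate_lemmas("king"))     # [('king', 'n')]
```

The second call shows why the lemma list is required: deleting "ing" from "king" leaves "k", which is rejected because it is not a known verb lemma.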


Part-of-speech (POS) tagging

“identifying parts of speech”

from: he didn't arrive.

to:

<s>
He      PNP   pers pronoun
did     VVD   past tense verb
n’t     XNOT  not
arrive  VV    base form of verb
.       C     punctuation
</s>


Tagsets

• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
    • NN2 plural noun
    • VVG -ing form of lexical verb
• Based on linguistics of the language.


POS-tagging: why?

• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of


POS-tagging: how?

• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – Constraint-based: set of rules of the form: if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
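The constraint-based rule quoted above can be sketched like this. The toy lexicon, the coarse tag names, and the function `tag` are assumptions for illustration, as is the guard that never deletes a word's last remaining reading.

```python
# Toy lexicon: every part of speech each word could have (assumed).
LEXICON = {
    "the": {"DET"},
    "they": {"PRON"},
    "buckle": {"NOUN", "VERB"},
    "broke": {"VERB"},
}

def tag(tokens):
    """Look up all possible tags, then prune them with one constraint."""
    readings = [set(LEXICON.get(t.lower(), {"UNKNOWN"})) for t in tokens]
    for i in range(1, len(tokens)):
        # Rule from the slide: after "the", delete the VERB reading
        # (assumed guard: only if another reading remains).
        if tokens[i - 1].lower() == "the" and "VERB" in readings[i] and len(readings[i]) > 1:
            readings[i].discard("VERB")
    return [(token, sorted(tags)) for token, tags in zip(tokens, readings)]

print(tag(["The", "buckle", "broke"]))
# [('The', ['DET']), ('buckle', ['NOUN']), ('broke', ['VERB'])]
```

Without a preceding "the", as in "They buckle", both readings of "buckle" survive, so further rules or a statistical model would be needed to decide.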


Parsing

• Find the structure:
  – Phrase structure (trees)
    The cat sat on the mat
  – Dependency structure (links)
    The cat sat on the mat


Automatic parsing

• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
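To give a feel for shallow parsing as patterns of POS-tags matched with regular expressions, here is a small sketch. It is not Sketch Engine's own grammar formalism: the coarse tags, the one-letter encoding, and the noun-phrase pattern are assumptions made for the example.

```python
import re

# Encode each (assumed, coarse) tag as one letter so that a regular
# expression over the letter string lines up with token positions.
TAG_CODE = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Noun phrase: optional determiner, any adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")

def noun_chunks(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(codes)
    ]

tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_chunks(tagged))
# ['The cat', 'the mat']
```

Because it is a single pass over the tag string, this kind of matching stays fast on large corpora, at the cost of finding only flat chunks and no full tree.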


Summary

• What is NLP?

• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

Ontologies and Terminology

and how they relate to lexicography


Terminology

• Contains terms
  – for the objects and concepts in a domain
  – organized according to relations between objects
  – Different language
    • Same objects, so
    • Same organization
    • Different terms


Ontology

• Artificial Intelligence
• Like terminology with reasoning

• Tweety is-a swallow
• A swallow is-a bird
• Birds fly
Inference
-----------------------
• Tweety flies

the rationalist dream of automated reasoning

Bird
• flies
  swallow, robin, …
    Tweety
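The Tweety inference can be sketched as property inheritance along is-a links. The dictionaries and the function name are assumptions for illustration, and the sketch assumes the is-a links contain no cycles.

```python
# Is-a links from the slide: child -> parent.
IS_A = {"Tweety": "swallow", "swallow": "bird", "robin": "bird"}

# Properties stated directly on a node.
PROPERTIES = {"bird": {"flies"}}

def inferred_properties(item):
    """Collect properties of item and of everything it is-a, transitively."""
    found = set()
    while item is not None:
        found |= PROPERTIES.get(item, set())
        item = IS_A.get(item)  # climb one is-a link; None at the top
    return found

print("flies" in inferred_properties("Tweety"))
# True
```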


Ontology

• Chris is-a dentist
• Chris has-practice in Lancing
• Chris works 9am-3pm Mon-Fri
• …
• You live-near Lancing
• You want-to-visit dentist
• You are-available …
Inference
---------------------------------------------------------
Appointment, you, Chris, Lancing, 10am, Thursday


Items in an ontology

• Defined by relations in ontology

• Labelled (only) by words/phrases in various languages

X1
EN: bird
FR: oiseau

X2
EN: swallow
FR: hirondelle

• Ontology/things: language independent
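One way to picture items that are defined by their relations and merely labelled by words is the sketch below. The class `Concept`, the `parent` field (taken from "a swallow is-a bird"), and the function `translate` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Concept:
    ident: str                       # language-independent identifier
    parent: Optional[str] = None     # is-a link to another concept
    labels: Dict[str, str] = field(default_factory=dict)  # language -> word

ONTOLOGY = {
    "X1": Concept("X1", labels={"EN": "bird", "FR": "oiseau"}),
    "X2": Concept("X2", parent="X1", labels={"EN": "swallow", "FR": "hirondelle"}),
}

def translate(word, source, target):
    """Go from a word to its concept, then to the label in another language."""
    for concept in ONTOLOGY.values():
        if concept.labels.get(source) == word:
            return concept.labels.get(target)  # None if the target has no label
    return None

print(translate("swallow", "EN", "FR"))
# hirondelle
```

A concept with no label in the target language simply yields `None`, which is how a lexical gap would show up in such a structure.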


Mismatches and gaps

Y1 (EN: body parts, SP: …)
  Y2 (SP: dedo)
    Y3 (EN: finger)
    Y4 (EN: toe)
  Y5 (EN: arm, SP: brazo)


Thesaurus (e.g. Roget)

• Looks like a simple ontology
  – hierarchy only
  – supports inference?
    • usually fudged
• Language independent?


WordNet

• Princeton Univ project, from ca 1990

• Thesaurus
  – Synonym sets or synsets
  – Hyponyms/hyperonyms, antonyms, part-of, other lexical relations
• Free, online and available for download
  – Very widely used
  – Replicated for many languages, Global WordNet Association


Lexicon/dictionary

• About words

• Organized by words

• Language specific


Rationalists      Empiricists
• Structure       • Data
• Depth           • Breadth
• Logic           • Statistics
• Semantic Web    • Google
• Terminology     • Lexicography


Terminology

• What is the thing called
  – in languages x, y, z
• What kind of thing is it
  – Is-a link
  – Its place in ontology
• Well-structured hierarchy

Lexicography

• How does the word behave?
  – what does it denote?
• Where does it occur?


Synthesis

• Thesis
  – Ontology, terminology, taxonomical lexicography
    • Semantic web, Roget, WordNets
• Antithesis
  – Corpus lexicography
• Synthesis: integrating
    • language-independent structure
    • language-specific word/phrase behaviour
  – Corpus-based terminology
  – FrameNet


Summary

words

Lexicon

Thesaurus/Terminology

Ontology

things