

I256

Applied Natural Language Processing

Fall 2009

Lecture 2

• Python
• Related fields
• Linguistic essentials

Barbara Rosario

Today

• Announcements
  – I admitted all the students on the waiting list. Tele-Bears should reflect the change by today.
  – Any questions/concerns about the class?
  – Homework due next Tuesday, September 8, at 12:30
    • Make sure you are all set to start with Python & NLTK
  – Office hours (Room 6)
    • Today: Gopal at 2
    • Wednesday 3-4: Gopal (if there is a request, let him know)
    • Thursday: Barbara at 2
  – Some (light) readings for Thursday
• Python
• Related fields
• Linguistic essentials

Python - Simple yet powerful

The Zen of Python: http://www.python.org/dev/peps/pep-0020/

• Very clear, readable syntax
• Strong introspection capabilities
  – http://www.ibm.com/developerworks/library/l-pyint.html (recommended)
• Intuitive object orientation
• Natural expression of procedural code
• Full modularity, supporting hierarchical packages
• Exception-based error handling
• Very high level dynamic data types
• Extensive standard libraries and third party modules for virtually every task
  – Excellent functionality for processing linguistic data.
  – NLTK is one such extensive third party module.

Source : python.org
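As a small illustration of the introspection bullet above, the sketch below asks an ordinary string object about itself using only built-in functions. The variable names are illustrative and do not come from the slides.

```python
# Introspection: asking an object about itself at run time.
word = "linguistics"

type_name = type(word).__name__       # name of the object's type
is_string = isinstance(word, str)     # type check without comparing names

# dir() lists attribute names; keep the ones that do not start with "_".
public_methods = [name for name in dir(word) if not name.startswith("_")]

# Look a method up by its name, then call it.
method = getattr(word, "upper")
shouted = method()

has_split = hasattr(word, "split")    # does the object offer .split?
```

The same calls work on lists, dictionaries, modules, and NLTK objects, which makes them handy when exploring an unfamiliar library in the interpreter.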

Python

• Numeric types
  – plain integers: implemented as long in C, at least 32 bits of precision (try: sys.maxint)
  – long integers: unlimited precision
  – floating point numbers
  – complex numbers
• Sequences
  – Strings (immutable)
  – Lists (mutable)
  – Tuples (immutable)
• Mappings
  – Dictionary

• File objects

• Classes

• Instances

• Exceptions

Source : python.org
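The built-in types listed above can be tried out with a few lines. This is a minimal sketch written for Python 3, which is an assumption: the slides describe Python 2, where plain and long integers were separate types and sys.maxint existed. In Python 3 there is a single unlimited-precision int, and sys.maxsize is the closest counterpart.

```python
import sys

# Integers: Python 3 has one int type with unlimited precision.
big = 2 ** 100
native_max = sys.maxsize            # sys.maxint exists only in Python 2

ratio = 7 / 2                       # true division gives a float in Python 3
z = complex(1, 2) * complex(0, 1)   # (1+2j) * 1j

# Sequences: lists are mutable, tuples (and strings) are not.
letters = ["n", "l", "p"]
letters[0] = "N"                    # allowed: the list changes in place

fixed = ("n", "l", "p")
try:
    fixed[0] = "N"                  # not allowed: raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False
```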

Python (built-in types)

LISTS
• More than an 'array'.
• Hold arbitrary objects and expand/collapse dynamically.

Source : python.org

Python (Lists and tuples)

>>> mylist = ['nlp', 42577, 256, 'applied_nlp']
>>> mylist[3]
'applied_nlp'
>>> mylist[-1]
'applied_nlp'
>>> mylist[1:3]
[42577, 256]

Define using standard array-like syntax

A few methods for a list li:

• len(li)
• li.append('something')
• li.extend([list])
• li.insert(index, 'value')
• li.index("nlp")
• li.remove("nlp")
• li = li + [list]
• ...

TUPLE
• A tuple is an immutable list: it cannot be changed once created.

>>> mytuple = ('nlp', 42577, 256, 'applied_nlp')
>>> mytuple[3]
'applied_nlp'
>>> mytuple[3] = 'blahblah'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
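The list methods named on this slide can be exercised in sequence. The sketch below is illustrative only; the values are made up, and the comments describe what each call does to the list.

```python
li = ["nlp", 42577]

li.append("something")            # add one item at the end
li.extend([256, "applied_nlp"])   # add every item of another list
li.insert(1, "value")             # insert before index 1
position = li.index("nlp")        # index of the first match
li.remove("something")            # delete the first match
li = li + ["last"]                # concatenation builds a new list
length = len(li)

# A tuple built from the list holds the same items but cannot be modified.
frozen = tuple(li)
```

Note that append, extend, insert, and remove change the list in place and return None, whereas li + [...] creates a new list object.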

• Provides many string manipulation methods

• Strings can be subscripted (indexed)
  – Can use some list style methods

• String formatting (the % operator)

Source : python.org

Python (Strings)

A few methods for a string str:

• len(str)
• str.capitalize()
• str.count(sub[, start[, end]])
• str.find(sub[, start[, end]])
• str.replace(old, new[, count])
• str.strip([chars])
• str.split([sep[, maxsplit]])
• ...

>>> mystring = "jolly good"
>>> mystring[1:5]
'olly'

>>> print "this is a %s course" % ("NLP")
this is a NLP course
>>> print "this is a %s course in fall%d" % ("NLP", 9)
this is a NLP course in fall9
>>> print "this is a %(course)s course" % {'course': "NLP"}
this is a NLP course

>>> print "uc" + "berkeley"
ucberkeley
>>> li = ['a', 'b', 'c', 'd']
>>> s = ";".join(li)
>>> s
'a;b;c;d'
>>> s.split(";")
['a', 'b', 'c', 'd']
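The string methods listed for this slide can be tried on a single sentence. This is a small sketch with a made-up example string; it avoids the print statement so that it runs unchanged under both Python 2 and Python 3.

```python
text = "  the cat sat on the mat  "

stripped = text.strip()                       # drop surrounding whitespace
title = stripped.capitalize()                 # upper-case the first character
the_count = stripped.count("the")             # non-overlapping occurrences
first_cat = stripped.find("cat")              # index of first match
missing = stripped.find("dog")                # -1 when there is no match
swapped = stripped.replace("cat", "dog", 1)   # replace at most one occurrence
words = stripped.split()                      # split on runs of whitespace
rejoined = "-".join(words)

# The % operator fills the placeholders from a tuple of values.
label = "%s has %d words" % ("sentence", len(words))
```

Because strings are immutable, every method above returns a new string (or list) and leaves text itself untouched.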

• A mapping object maps hashable values to arbitrary objects.
• Mappings are mutable objects.
• There is currently only one standard mapping type, the dictionary.
• Creating dictionaries

Source : python.org

Python (Mapping objects)

Using a comma-separated list of key: value pairs within braces:

>>> mydict = {'nlp': 42577, 256: 'applied_nlp'}
>>> mydict[256]
'applied_nlp'

Using the constructor of the built-in dict class:

dict(one=2, two=3)
dict({'one': 2, 'two': 3})
dict(zip(('one', 'two'), (2, 3)))
dict([['two', 3], ['one', 2]])

A few methods for a dictionary d:

• len(d)
• d[key]
• d[key] = value
• del d[key]
• key in d
• d.clear()
• d.copy()
• d.get(key[, default])
• d.items()
• d.iteritems()
• ...
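A common NLP use of these dictionary operations is counting word frequencies. The sketch below is illustrative, with a made-up token list; it uses items() rather than iteritems(), since iteritems() exists only in Python 2.

```python
tokens = "the cat sat on the mat the end".split()

counts = {}
for token in tokens:
    # get() returns the default (0) when the key is not present yet
    counts[token] = counts.get(token, 0) + 1

has_cat = "cat" in counts        # membership test on the keys
del counts["end"]                # remove one key
snapshot = counts.copy()         # shallow copy, independent of counts
counts.clear()                   # empty the original dictionary

pairs = sorted(snapshot.items()) # (key, value) pairs in alphabetical order
```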

Submission for assignment 1

For Assignment 1 (see also web site)

• Create a file LastNameFirstName_assignment1.py
• This is the main file where all your code will reside.
• We will evaluate each question/sub-question as

$ python LastNameFirstName_assignment1.py question1
$ python LastNameFirstName_assignment1.py question1.1

• Add logic to your code based on the command line argument (process your command line argument string) and output accordingly. The command line arguments in Python are accessed through the sys.argv list. You can also use the getopt module.
• Make sure you include this header information at the beginning of your code

#! /usr/bin/env python
#author: 'Your name'
#email = 'your email address'
#python_version = 'python version you are using'

For questions on the homework, please email gvaswani@ischool.berkeley.edu

Email your assignment to gvaswani@ischool.berkeley.edu and barbara.rosario@intel.com
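One possible way to dispatch on the command line argument is a dictionary from argument strings to functions. This is only a sketch: the function names, return values, and usage message are hypothetical placeholders, not part of the assignment.

```python
import sys


def question1():
    # Placeholder: the real answer code would go here.
    return "answer to question 1"


def question1_1():
    # Placeholder for sub-question 1.1.
    return "answer to question 1.1"


# Map each accepted command line argument to the function that answers it.
HANDLERS = {
    "question1": question1,
    "question1.1": question1_1,
}


def main(argv):
    """Return the output for the question named in argv[1]."""
    if len(argv) != 2 or argv[1] not in HANDLERS:
        return "usage: %s {%s}" % (argv[0], "|".join(sorted(HANDLERS)))
    return HANDLERS[argv[1]]()


if __name__ == "__main__":
    print(main(sys.argv))
```

Keeping main() free of direct sys.argv access makes it easy to call from the interpreter, for example main(["prog", "question1"]).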

Related Fields

• NLP
• Linguistics
  – All about languages
• Computational Linguistics
  – Using computational methods to learn more about how language works
• Speech Recognition
  – Mapping audio signals to text
  – Two components: acoustic models and language models
  – Language models in the domain of stat NLP
• Cognitive Science
  – Figuring out how the human brain works, including language

Linguistics essentials

• Important distinction:
  – study of language structure (grammar)
  – study of meaning (semantics)
• Grammar
  – Phonology (the study of sound systems and abstract sound units)
  – Morphology (the formation and composition of words)
  – Syntax (the rules that determine how words combine into sentences)
• Semantics
  – The study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences

http://en.wikipedia.org/wiki/Linguistics

Linguistics sub-fields

• Discourse analysis
  – concerned with the structure of texts and conversations
• Pragmatics
  – concerned with how meaning is transmitted based on a combination of linguistic competence, non-linguistic knowledge, and the context of the speech act.

Linguistics sub-fields

• Evolutionary linguistics
  – origins of language
• Historical linguistics
  – explores language change
• Sociolinguistics
  – looks at the relation between linguistic variation and social structures
• Psycholinguistics
  – explores the representation and functioning of language in the mind
• Neurolinguistics
  – looks at the representation of language in the brain
• Language acquisition
  – how children acquire their first language and how children and adults acquire and learn their second and subsequent languages
• And others:
  – for an overview see http://en.wikipedia.org/wiki/Linguistics

Adapted from http://en.wikipedia.org/wiki/Linguistics

Linguistics essentials

• This course:

• Some grammar

• Mostly “semantics”

Grammar: words

• Words of a language are grouped into classes to reflect similar syntactic behaviors

• Syntactic or grammatical categories (aka parts of speech)
  – Nouns (people, animals, concepts)
  – Verbs (actions, states)
  – Adjectives
  – Prepositions
  – Determiners
• Open or lexical categories (nouns, verbs, adjectives)
  – Large number of members, new words are commonly added
• Closed or functional categories (prepositions, determiners)
  – Few members, clear grammatical use

Grammar: words

• Word categories are related by morphological processes
  – -s for plural nouns
  – -ed for verbs' past forms
  – Next class
  – Why important for NLP?
  – More important for some languages
    • English regular verbs have 4 forms (at most 8 for irregular verbs)
    • Finnish verbs have 10,000 forms

Grammatical categories

• Nouns typically refer to entities in the world like people, animals, things, ideas.
• Types of inflection
  – Number
  – Gender
  – Case (nominative, genitive, accusative, dative)

• Pronouns: variables to refer to an entity previously mentioned

Grammatical categories: Verbs

• Usually denote an action (bring, read), an occurrence (decompose, glitter), or a state of being (exist, stand).

• Depending on the language, a verb may vary in form according to many factors, possibly including its tense, aspect, mood and voice.

• It may also agree with the person, gender, and/or number of some of its arguments (subject, object, etc.)

Verbs’ factors

• Tense: time of the action
  – Present, past, future
• Mood: signals modality (possibility and necessity)
  – Realis mood
    • The state is known (John is sick)
  – Irrealis mood
    • Indicates that a certain situation or action is not known to have happened as the speaker is talking.
    • John may/must be sick

Verbs’ factors

• Aspect
  – Defines the temporal flow (or lack thereof) in the event or state.
  – Habitual aspect
    • I eat, I have eaten, I ate, I had eaten
  – Progressive, or continuous, aspect
    • I am eating, I have been eating, I was eating, I had been eating

Verbs’ factors

• Voice
  – Describes the relationship between the action (or state) that the verb expresses and the participants identified by its arguments (subject, object, etc.).
  – Active voice: when the subject is the agent or actor of the verb (the cat ate the mouse)
  – Passive voice: when the subject is the patient, target or undergoer of the action (the mouse was eaten by the cat)

Other grammatical categories

• Adverbs
• Prepositions
  – In, on, over, at
• Coordinating Conjunctions
  – Link 2 sentences
    • and, or, but…
    • She bought or leased the car
• Subordinating Conjunctions
  – That, because, if…
  – She said that she would lease a car

Phrase structure

• Words are organized in phrases

• Phrases: grouping of words that are clumped as a unit

• Syntax: study of the regularities and constraints of word order and phrase structure

Major phrase types

• Sentence (S) (whole grammatical unit). Normally rewrites as a subject noun phrase and a verb phrase

• Noun phrase (NP): phrase whose head is a noun or a pronoun, optionally accompanied by a set of modifiers
  – Head is the word that determines the syntactic type of the phrase
  – The smart student of physics with long hair
    • The: determiner
    • smart: adjective
    • of physics: complement (prepositional phrase)
    • with long hair: (post) modifier (prepositional phrase)

Major phrase types

• Prepositional phrases (PP)
  – Headed by a preposition and containing an NP
    • She is [on the computer]
    • They walked [to their school]
• Verb phrases (VP)
  – Phrase whose head is a verb
    • [Getting to school on time] was a struggle
    • He [was trying to keep his temper]
    • That woman [quickly showed me the way to hide]

Phrase structure grammar

• Syntactic analysis of sentences
  – (Ultimately) to extract meaning:
    • Mary gave Peter a book
    • Peter gave Mary a book
• Rewrite rules
  – Category → category* (i.e. the symbol on the left side can be rewritten as the sequence of symbols on the right side)
  – Start symbol is S (for sentence)

Phrase structure grammar

• S → NP VP
• NP → AT NN
• NP → NP PP
• VP → VP PP
• VP → VP
• PP → IN NP

Lexicon

• AT → the
• NN → child
• NN → cat
• NN → box
• VP → sleep
• VP → eat
• IN → in
• IN → of

The cat sleeps

The cat sleeps in the box

The cat hopes she can sleeps in the box: NO
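To make the rewrite rules concrete, the sketch below checks whether a sentence can be derived from S using a small chart-based (CKY-style) recognizer. It rests on a few assumptions that are not in the slide: the lexicon uses the inflected forms "sleeps" and "eats" so that the example sentences are covered, the unary rule on VP is left out, and the function and variable names are invented for illustration.

```python
from collections import defaultdict

# Binary rewrite rules as (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "AT", "NN"),
    ("NP", "NP", "PP"),
    ("VP", "VP", "PP"),
    ("PP", "IN", "NP"),
]

# Lexicon: word -> categories (inflected verb forms are an assumption).
LEXICON = {
    "the": {"AT"},
    "child": {"NN"}, "cat": {"NN"}, "box": {"NN"},
    "sleeps": {"VP"}, "eats": {"VP"},
    "in": {"IN"}, "of": {"IN"},
}


def recognize(sentence, start="S"):
    """Return True if the grammar can derive the sentence from `start`."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # chart[(i, j)] holds every category that can span words[i:j].
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= LEXICON.get(word, set())
    # Build longer spans from pairs of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        chart[(i, j)].add(parent)
    return start in chart[(0, n)]
```

With this toy grammar, recognize("The cat sleeps in the box") succeeds, while the third sentence fails because "hopes", "she", and "can" have no lexicon entries and no rule covers an embedded clause.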

Context free grammars

• The rewrite rules depend solely on the category and not on any surrounding context: Context Free Grammar

• Main problems:
  – Identify these grammars for natural languages (linguistics)
  – Given the grammar, identify the phrase structures of sentences (NLP, parsing)

Phrase structure parsing

• Parsing: the process of reconstructing the derivation(s) or phrase structure trees that give rise to a particular sequence of words

• A parse is a phrase structure tree
  – New art critics write reviews with computers

Phrase structure parsing & ambiguity

• The children ate the cake with a spoon

• PP Attachment Ambiguity

• Why is it important for NLP?
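The PP attachment ambiguity can be made visible by counting how many phrase structure trees a grammar assigns to the sentence. The sketch below is a toy under stated assumptions: the grammar, the VB tag for "ate", and all names are invented for illustration and are not the grammar used in the course.

```python
from collections import defaultdict

# Toy grammar: the PP can attach to the NP (NP -> NP PP) or the VP (VP -> VP PP).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "AT", "NN"),
    ("NP", "NP", "PP"),
    ("VP", "VB", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "IN", "NP"),
]

# One tag per word (an assumption that keeps the example small).
TAGS = {
    "the": "AT", "a": "AT",
    "children": "NN", "cake": "NN", "spoon": "NN",
    "ate": "VB", "with": "IN",
}


def count_parses(sentence, start="S"):
    """Count the distinct trees rooted in `start` that cover the sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][category] = number of trees of that category over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        if word in TAGS:
            chart[(i, i + 1)][TAGS[word]] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    ways = chart[(i, k)][left] * chart[(k, j)][right]
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)][start]


ambiguous = count_parses("The children ate the cake with a spoon")
unambiguous = count_parses("The children ate the cake")
```

The two analyses correspond to eating with a spoon (PP attached to the VP) and a cake that has a spoon (PP attached to the NP). A parser has to choose between them, which syntax alone cannot do.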

Semantics

• Semantics is the study of the meaning of words, constructions, and utterances

1. Study of the meaning of individual words (lexical semantics)

2. Study of how meanings of individual words are combined into the meaning of sentences (or larger units)

Lexical semantics

• How words are related to each other
• Hyponymy
  – scarlet, vermilion, carmine, and crimson are all hyponyms of red
• Hypernymy
• Antonymy (opposite)
  – Male, female
• Meronymy (part of)
  – Tire is a meronym of car
• Etc.
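These lexical relations can be stored as simple dictionaries. The sketch below is a hand-built toy using the slide's examples; the entry linking "red" to "color" is an added assumption, and a real system would typically query a lexical database instead.

```python
# hyponym -> hypernym (the "red" -> "color" entry is an assumption)
HYPERNYM = {
    "scarlet": "red",
    "vermilion": "red",
    "carmine": "red",
    "crimson": "red",
    "red": "color",
}
PART_OF = {"tire": "car"}                        # meronym -> whole
ANTONYM = {"male": "female", "female": "male"}   # symmetric relation


def hypernym_chain(word):
    """Follow hypernym links upward and return the words visited."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the direct hyponyms of a word, in alphabetical order."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)
```

Hypernymy is the inverse of hyponymy, so one table serves both directions: hypernym_chain walks up, hyponyms looks down.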

Semantics: beyond individual words

• Once we have the meaning of the individual words, we need to assemble them to get the meaning of the whole sentence

• Hard because natural language does not obey the principle of compositionality, by which the meaning of the whole can be predicted from the meanings of the parts

Semantics: beyond individual words: complications

• Collocations
  – White skin, white wine, white hair
• Idioms: meaning is opaque
  – Kick the bucket
• Scope
  – Everyone didn't go to the movie
    1. Everyone's scope is over not (i.e. not one person went to the movie)
    2. Negation not has scope over everyone (at least one person didn't go)
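The two scope readings can be written as two different truth conditions and evaluated against the same situation. The people and attendance data below are hypothetical, chosen so that the readings disagree.

```python
# Hypothetical situation: who went to the movie.
people = ["Ann", "Bob", "Cem"]
went = {"Ann": True, "Bob": False, "Cem": True}

# Reading 1: "everyone" scopes over "not" (for every person, they did not go).
reading_1 = all(not went[p] for p in people)

# Reading 2: "not" scopes over "everyone" (it is not true that all went).
reading_2 = not all(went[p] for p in people)
```

In this situation reading 1 is false and reading 2 is true, so a system that picks the wrong scope draws the wrong conclusion from the same sentence.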

Semantics: beyond individual words

• Discourse

• Anaphoric relations
  – Mary helped Peter get out of the cab. He thanked her. [He and Peter are the same person, her and Mary too]

Next class

• Syntax of words
• Morphology
• Stemming
  – Collapse related morphological forms to the original lexeme
  – Sit, sits, sitting, sat → lexeme: sit
• Tokenization
  – Divide text into units (words, numbers etc.)
• Word segmentation
  – For languages with no spaces between words
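As a preview of tokenization and stemming, here is a deliberately naive sketch. The regular expression, the suffix rules, and the one-entry irregular table are assumptions made for illustration; they handle the "sit" example but are far from a real stemmer.

```python
import re


def tokenize(text):
    """Split text into numbers, words, and single punctuation marks."""
    # Alternatives: a number with optional decimal part, a word with
    # optional internal apostrophes, or one non-space punctuation character.
    return re.findall(r"\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]", text)


# Irregular forms cannot be found by suffix stripping, so look them up.
IRREGULAR = {"sat": "sit"}


def stem(word):
    """Map a word form to a guessed lexeme using a few crude rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]          # undo consonant doubling: sitt -> sit
        return word
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]              # plural or third person -s
    return word


lexemes = [stem(w) for w in ["Sit", "sits", "sitting", "sat"]]
```

Rules this simple over-strip many words (for example "bus" is left alone only because of the length check), which is one reason stemming gets a class of its own.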
