TRANSCRIPT
Python for NLP and the Natural Language Toolkit
CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
Outline
• Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)
• HW 2 discussion
• Tutorials: Basics, Probability
Python and Natural Language Processing
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure:
• Modules
• Object-oriented programming
– Powerful string manipulation
Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
– Similar to a library in C or a package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
1. import
2. from…import
3. reload
Modules and Packages: import
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted names:
# Use the search method from the re module
>>> re.search(r'\w+', some_string)
• To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', …]
Modules and Packages: from…import
• The from…import command loads individual functions and objects from a module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is loaded with from…import, it can be used directly:
# Use the search method from the re module
>>> search(r'\w+', some_string)
Import vs. from…import
import
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.

from…import
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.
Modules and Packages: reload
• If you edit a module, you must use the reload command before the changes become visible in Python:
>>> import mymodule
...
>>> reload(mymodule)
• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
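The edit-then-reload cycle can be seen end to end with a throwaway module. This is a plain-Python sketch for modern Python 3, where reload lives in importlib rather than being a builtin as in the Python of these slides; the module name demo_mod and its VALUE attribute are made up for the demo.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True        # skip .pyc caching for this demo

# Write a throwaway module (hypothetical name: demo_mod) to a temp
# directory and make it importable
tmpdir = tempfile.mkdtemp()
sys.path.insert(0, tmpdir)
with open(os.path.join(tmpdir, 'demo_mod.py'), 'w') as f:
    f.write('VALUE = 1\n')

import demo_mod
print(demo_mod.VALUE)                 # 1

# Edit the module on disk: the running interpreter does not notice...
with open(os.path.join(tmpdir, 'demo_mod.py'), 'w') as f:
    f.write('VALUE = 2\n')
print(demo_mod.VALUE)                 # still 1

# ...until we explicitly reload it
importlib.reload(demo_mod)
print(demo_mod.VALUE)                 # 2
```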
Introduction to NLTK
• The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural language processing.
– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
– Standard implementations of each task, which can be combined to solve complex problems.
NLTK: Example Modules
• nltk.token: processing individual elements of text, such as words or sentences.
• nltk.probability: modeling frequency distributions and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of the parser interface.
• nltk.chunkparser: a regular-expression-based surface parser.
NLTK: Top-Level Organization
• NLTK is organized as a flat hierarchy of packages and modules.
• Each module provides the tools necessary to address a specific task.
• Modules contain two types of classes:
– Data-oriented classes are used to represent information relevant to natural language processing.
– Task-oriented classes encapsulate the resources and methods needed to perform a specific task.
To the First Tutorials
• Tokens and Tokenization
• Frequency Distributions
The Token Module
• It is often useful to think of a text in terms of smaller elements, such as words or sentences.
• The nltk.token module defines classes for representing and processing these smaller elements.
• What might be other useful smaller elements?
Tokens and Types
• The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
• For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
• To avoid confusion, use more precise terminology:
1. Word token: an occurrence of a word
2. Word type: a vocabulary item
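The five-token/four-type count for the example sentence can be checked with plain Python, no NLTK needed:

```python
# Word tokens: one per occurrence; word types: one per vocabulary item
sentence = "my dog likes his dog"
word_tokens = sentence.split()
word_types = set(word_tokens)

print(len(word_tokens))   # 5 word tokens
print(len(word_types))    # 4 word types ('dog' occurs twice)
```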
Tokens and Types (continued)
• In NLTK, tokens are constructed from their types using the Token constructor:
>>> from nltk.token import *
>>> my_word_type = 'dog'
>>> my_word_token = Token(my_word_type)
>>> my_word_token
'dog'@[?]
• Token member functions include type and loc.
Text Locations
• A text location @[s:e] specifies a region of a text:
– s is the start index
– e is the end index
• The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
• This definition is consistent with Python slices.
• Think of indices as appearing between elements:
  I     saw     a     man
0   1         2   3         4
• Shorthand notation when location width = 1.
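Since @[s:e] is defined to match Python slices, the "I saw a man" diagram can be checked directly with list slicing:

```python
# Indices fall between elements, exactly as in a Python slice
words = ['I', 'saw', 'a', 'man']

print(words[1:3])    # ['saw', 'a']  -- like the location @[1:3]
print(words[0:1])    # ['I']         -- width 1, shorthand @[0]

# Adjacent half-open regions tile the text without overlap
assert words[0:2] + words[2:4] == words
```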
Text Locations (continued)
• Indices can be based on different units:
– character
– word
– sentence
• Locations can be tagged with sources (files, other text locations – e.g., the first word of the first sentence in the file)
• Location member functions:
– start
– end
– unit
– source
Tokenization
• The simplest way to represent a text is with a single string.
• Difficult to process text in this format.
• Often, it is more convenient to work with a list of tokens.
• The task of converting a text from a single string to a list of tokens is known as tokenization.
Tokenization (continued)
• Tokenization is harder than it seems:
I’ll see you in New York.
The aluminum-export ban.
• The simplest approach is to use “graphic words” (i.e., separate words using whitespace)
• Another approach is to use regular expressions to specify which substrings are valid words.
• NLTK provides a generic tokenization interface: TokenizerI
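The two approaches can be contrasted in plain Python; the regular expression below is one illustrative definition of a "word" (keeping contractions together, splitting off punctuation), not NLTK's own.

```python
import re

text = "I'll see you in New York."

# Graphic words: split on whitespace -- punctuation sticks to the words
graphic_words = text.split()
print(graphic_words)    # ["I'll", 'see', 'you', 'in', 'New', 'York.']

# Regex words: explicitly define which substrings count as words
regex_words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(regex_words)      # ["I'll", 'see', 'you', 'in', 'New', 'York', '.']
```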
TokenizerI
• Defines a single method, tokenize, which takes a string and returns a list of tokens.
• tokenize is independent of the level of tokenization and the implementation algorithm.
Example
from nltk.token import WSTokenizer
from nltk.draw.plot import Plot

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Count up how many times each word length occurs
wordlen_count_list = []
for token in tokens:
    wordlen = len(token.type())
    # Add zeros until wordlen_count_list is long enough
    while wordlen >= len(wordlen_count_list):
        wordlen_count_list.append(0)
    # Increment the count for this word length
    wordlen_count_list[wordlen] += 1
Plot(wordlen_count_list)
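The same word-length count can be made with the standard library alone; here collections.Counter replaces the hand-grown list, and the corpus string is a made-up stand-in for corpus.txt so the sketch is self-contained:

```python
from collections import Counter

# Stand-in corpus (the slide reads corpus.txt from disk)
corpus = "the quick brown fox jumps over the lazy dog"
tokens = corpus.split()

# Count how many times each word length occurs
wordlen_counts = Counter(len(token) for token in tokens)
print(sorted(wordlen_counts.items()))   # [(3, 4), (4, 2), (5, 3)]
```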
Next Tutorial: Probability
• An experiment is any process which leads to a well-defined outcome
• A sample is any possible outcome of a given experiment
• Rolling a die?
Outline
Review Basics
Probability
Experiments and Samples
Frequency Distributions
Conditional Frequency Distributions
Review: NLTK Goals
• Classes for NLP data
• Interfaces for NLP tasks
• Implementations, easily combined (what is an example?)
Accessing NLTK
• What is the relation to Python?
Words
• Types and Tokens
• Text Locations
• Member Functions
Tokenization
• TokenizerI
• Implementations
>>> tokenizer = WSTokenizer()
>>> tokenizer.tokenize(text_str)
['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]
Word Length Freq. Distribution Example
from nltk.token import WSTokenizer
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()
Frequency Distributions
• A frequency distribution records the number of times each outcome of an experiment has occurred
• >>> freq_dist = FreqDist()
>>> for token in document:
...     freq_dist.inc(token.type())
• Constructor, then initialization by storing experimental outcomes
Methods
• The freq method returns the frequency of a given sample.
• We can find the number of times a given sample occurred with the count method.
• We can find the total number of sample outcomes recorded by a frequency distribution with the N method.
• The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.
• We can find the sample with the greatest number of outcomes with the max method.
Examples of Methods
• >>> freq_dist.count('the')
6
• >>> freq_dist.freq('the')
0.012
• >>> freq_dist.N()
500
• >>> freq_dist.max()
'the'
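The methods above behave like this minimal plain-Python sketch; MiniFreqDist is a made-up stand-in built on collections.Counter, not the NLTK implementation.

```python
from collections import Counter

class MiniFreqDist:
    """Sketch of inc/count/freq/N/samples/max; not the NLTK class."""
    def __init__(self):
        self._counts = Counter()

    def inc(self, sample):            # record one outcome
        self._counts[sample] += 1

    def count(self, sample):          # times this sample occurred
        return self._counts[sample]

    def N(self):                      # total number of outcomes
        return sum(self._counts.values())

    def freq(self, sample):           # relative frequency
        return self._counts[sample] / self.N()

    def samples(self):                # all recorded samples
        return list(self._counts)

    def max(self):                    # most frequent sample
        return self._counts.most_common(1)[0][0]

fd = MiniFreqDist()
for word in 'the cat sat on the mat the end'.split():
    fd.inc(word)
print(fd.count('the'), fd.N(), fd.freq('the'), fd.max())
# 3 8 0.375 the
```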
Simple Word Length Example
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– What is the "outcome" for our experiment?
Simple Word Length Example
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– The word length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
Complex Word Length Example
# define vowels as "a", "e", "i", "o", and "u"
>>> VOWELS = ('a', 'e', 'i', 'o', 'u')
# distribution of lengths for words ending in vowels?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if token.type()[-1].lower() in VOWELS:
...         freq_dist.inc(len(token.type()))
• What is the condition?
More Complex Example
# What is the distribution of word lengths for
# words following words that end in vowels?
>>> ended_in_vowel = 0  # Did the last word end in a vowel?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if ended_in_vowel:
...         freq_dist.inc(len(token.type()))
...     ended_in_vowel = token.type()[-1].lower() in VOWELS
Conditional Frequency Distributions
• A condition specifies the context in which an experiment is performed.
• A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions.
• The individual frequency distributions are indexed by the condition.
• NLTK provides the ConditionalFreqDist class:
>>> cfdist = ConditionalFreqDist()
>>> cfdist
<ConditionalFreqDist with 0 conditions>
Conditional Frequency Distributions (continued)
• To access the frequency distribution for a condition, use the indexing operator:
>>> cfdist['a']
<FreqDist with 0 outcomes>
# Record lengths of some words starting with 'a'
>>> for word in 'apple and arm'.split():
...     cfdist['a'].inc(len(word))
# How many are 3 characters long?
>>> cfdist['a'].freq(3)
0.66667
• To list the conditions that have been accessed, use the conditions method:
>>> cfdist.conditions()
['a']
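A conditional frequency distribution is essentially a dictionary of frequency distributions. Here is a plain-Python sketch of the 'apple and arm' example above, with defaultdict and Counter standing in for NLTK's ConditionalFreqDist:

```python
from collections import Counter, defaultdict

# One frequency distribution (Counter) per condition, created on
# first access -- mirroring the indexing operator above
cfdist = defaultdict(Counter)

# Record lengths of some words starting with 'a'
for word in 'apple and arm'.split():
    cfdist['a'][len(word)] += 1

# How many are 3 characters long? ('and' and 'arm', out of 3 words)
freq3 = cfdist['a'][3] / sum(cfdist['a'].values())
print(round(freq3, 5))     # 0.66667
print(list(cfdist))        # ['a'] -- the conditions seen so far
```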
Example: Conditioning on a Word’s Initial Letter
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.draw.plot import Plot
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()
Example (continued)
# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
Example (continued)
# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
• Condition = the initial letter of the token
• Outcome = its word length
Prediction
• Prediction is the problem of deciding a likely outcome for a given run of an experiment.
• To predict the outcome, we first examine a training corpus.
• Training corpus:
– The context and outcome for each run are known.
– Given a new run, we choose the outcome that occurred most frequently for the context.
– A conditional frequency distribution finds the most frequent occurrence.
Prediction Example: Outline
• Record each outcome in the training corpus, using the context that the experiment was under as the condition
• Access the frequency distribution for a given context with the indexing operator
• Use the max() method to find the most likely outcome
Example: Predicting Words
• Predict word's type, based on preceding word type
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()  # empty
Example (continued)
>>> context = None  # The type of the preceding word
>>> for token in tokens:
...     outcome = token.type()
...     cfdist[context].inc(outcome)
...     context = token.type()
Example (continued)
• >>> cfdist['prediction'].max()
'problems'
>>> cfdist['problems'].max()
'in'
>>> cfdist['in'].max()
'the'
• What are we predicting here?
Example (continued)
• We predict the most likely word for any context.
• Generation application:
>>> word = 'prediction'
>>> for i in range(15):
...     print word,
...     word = cfdist[word].max()
prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
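The whole train-then-generate loop fits in a few lines of plain Python. The toy corpus string below is made up so the sketch is self-contained, and defaultdict(Counter) stands in for NLTK's ConditionalFreqDist; note how the output falls into the same kind of repetitive cycle as the slide's.

```python
from collections import Counter, defaultdict

# Toy training corpus (stand-in for corpus.txt)
corpus = ('the frequency distribution of '
          'the frequency distribution of the experiment')
tokens = corpus.split()

# Train: count each word under the condition "type of the preceding word"
cfdist = defaultdict(Counter)
context = None
for token in tokens:
    cfdist[context][token] += 1
    context = token

# Generate: repeatedly pick the most frequent successor of the current word
word = 'the'
generated = [word]
for _ in range(5):
    word = cfdist[word].most_common(1)[0][0]
    generated.append(word)
print(' '.join(generated))
# the frequency distribution of the frequency
```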
For Next Time
• HW3
• To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path.
• Regular Expressions (J&M handout, NLTK tutorial)