TRANSCRIPT
Python for NLP and the Natural Language Toolkit
CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
Outline
• Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)
• HW 2 discussion
• Tutorials: Basics, Probability
Python and Natural Language Processing
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure:
• Modules
• Object-oriented programming
– Powerful string manipulation
Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
– Similar to a library in C or a package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
1. import
2. from…import
3. reload
Modules and Packages: import
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted names:
# Use the search method from the re module
>>> re.search(r'\w+', some_string)
• To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', …]
Modules and Packages: from…import
• The from…import command loads individual functions and objects from a module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is loaded with from…import, it can be used directly:
# Use the search method from the re module
>>> search(r'\w+', some_string)
Import vs. from…import
import
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.

from…import
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.
Modules and Packages: reload
• If you edit a module, you must use the reload command before the changes become visible in Python:
>>> import mymodule
...
>>> reload(mymodule)
• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
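The edit-then-reload cycle can be seen end to end with a throwaway module. This is a plain-Python sketch for modern Python 3, where reload lives in importlib rather than being a builtin as in the Python of these slides; the module name demo_mod and its VALUE attribute are made up for the demo.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True        # skip .pyc caching for this demo

# Write a throwaway module (hypothetical name: demo_mod) to a temp
# directory and make it importable
tmpdir = tempfile.mkdtemp()
sys.path.insert(0, tmpdir)
with open(os.path.join(tmpdir, 'demo_mod.py'), 'w') as f:
    f.write('VALUE = 1\n')

import demo_mod
print(demo_mod.VALUE)                 # 1

# Edit the module on disk: the running interpreter does not notice...
with open(os.path.join(tmpdir, 'demo_mod.py'), 'w') as f:
    f.write('VALUE = 2\n')
print(demo_mod.VALUE)                 # still 1

# ...until we explicitly reload it
importlib.reload(demo_mod)
print(demo_mod.VALUE)                 # 2
```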
Introduction to NLTK
• The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural language processing.
– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
– Standard implementations of each task, which can be combined to solve complex problems.
NLTK: Example Modules
• nltk.token: processing individual elements of text, such as words or sentences.
• nltk.probability: modeling frequency distributions and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of the parser interface.
• nltk.chunkparser: a regular-expression-based surface parser.
NLTK: Top-Level Organization
• NLTK is organized as a flat hierarchy of packages and modules.
• Each module provides the tools necessary to address a specific task.
• Modules contain two types of classes:
– Data-oriented classes are used to represent information relevant to natural language processing.
– Task-oriented classes encapsulate the resources and methods needed to perform a specific task.
To the First Tutorials
• Tokens and Tokenization
• Frequency Distributions
The Token Module
• It is often useful to think of a text in terms of smaller elements, such as words or sentences.
• The nltk.token module defines classes for representing and processing these smaller elements.
• What might be other useful smaller elements?
Tokens and Types
• The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
• For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
• To avoid confusion, use more precise terminology:
1. Word token: an occurrence of a word
2. Word type: a vocabulary item
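The five-token/four-type count for the example sentence can be checked with plain Python, no NLTK needed:

```python
# Word tokens: one per occurrence; word types: one per vocabulary item
sentence = "my dog likes his dog"
word_tokens = sentence.split()
word_types = set(word_tokens)

print(len(word_tokens))   # 5 word tokens
print(len(word_types))    # 4 word types ('dog' occurs twice)
```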
Tokens and Types (continued)
• In NLTK, tokens are constructed from their types using the Token constructor:
>>> from nltk.token import *
>>> my_word_type = 'dog'
>>> my_word_token = Token(my_word_type)
>>> my_word_token
'dog'@[?]
• Token member functions include type and loc.
Text Locations
• A text location @[s:e] specifies a region of a text:
– s is the start index
– e is the end index
• The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.
• This definition is consistent with Python slices.
• Think of indices as appearing between elements:
  I     saw     a     man
0   1         2   3         4
• Shorthand notation when location width = 1.
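Since @[s:e] is defined to match Python slices, the "I saw a man" diagram can be checked directly with list slicing:

```python
# Indices fall between elements, exactly as in a Python slice
words = ['I', 'saw', 'a', 'man']

print(words[1:3])    # ['saw', 'a']  -- like the location @[1:3]
print(words[0:1])    # ['I']         -- width 1, shorthand @[0]

# Adjacent half-open regions tile the text without overlap
assert words[0:2] + words[2:4] == words
```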
Text Locations (continued)
• Indices can be based on different units:
– character
– word
– sentence
• Locations can be tagged with sources (files, other text locations – e.g., the first word of the first sentence in the file)
• Location member functions:
– start
– end
– unit
– source
Tokenization
• The simplest way to represent a text is with a single string.
• Difficult to process text in this format.
• Often, it is more convenient to work with a list of tokens.
• The task of converting a text from a single string to a list of tokens is known as tokenization.
Tokenization (continued)
• Tokenization is harder than it seems:
I’ll see you in New York.
The aluminum-export ban.
• The simplest approach is to use “graphic words” (i.e., separate words using whitespace)
• Another approach is to use regular expressions to specify which substrings are valid words.
• NLTK provides a generic tokenization interface: TokenizerI
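The two approaches can be contrasted in plain Python; the regular expression below is one illustrative definition of a "word" (keeping contractions together, splitting off punctuation), not NLTK's own.

```python
import re

text = "I'll see you in New York."

# Graphic words: split on whitespace -- punctuation sticks to the words
graphic_words = text.split()
print(graphic_words)    # ["I'll", 'see', 'you', 'in', 'New', 'York.']

# Regex words: explicitly define which substrings count as words
regex_words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(regex_words)      # ["I'll", 'see', 'you', 'in', 'New', 'York', '.']
```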
TokenizerI
• Defines a single method, tokenize, which takes a string and returns a list of tokens.
• tokenize is independent of the level of tokenization and the implementation algorithm.
Example
from nltk.token import WSTokenizer
from nltk.draw.plot import Plot

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Count up how many times each word length occurs
wordlen_count_list = []
for token in tokens:
    wordlen = len(token.type())
    # Add zeros until wordlen_count_list is long enough
    while wordlen >= len(wordlen_count_list):
        wordlen_count_list.append(0)
    # Increment the count for this word length
    wordlen_count_list[wordlen] += 1
Plot(wordlen_count_list)
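The same word-length count can be made with the standard library alone; here collections.Counter replaces the hand-grown list, and the corpus string is a made-up stand-in for corpus.txt so the sketch is self-contained:

```python
from collections import Counter

# Stand-in corpus (the slide reads corpus.txt from disk)
corpus = "the quick brown fox jumps over the lazy dog"
tokens = corpus.split()

# Count how many times each word length occurs
wordlen_counts = Counter(len(token) for token in tokens)
print(sorted(wordlen_counts.items()))   # [(3, 4), (4, 2), (5, 3)]
```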
Next Tutorial: Probability
• An experiment is any process which leads to a well-defined outcome
• A sample is any possible outcome of a given experiment
• Rolling a die?
Outline
Review Basics
Probability
Experiments and Samples
Frequency Distributions
Conditional Frequency Distributions
Review: NLTK Goals
• Classes for NLP data
• Interfaces for NLP tasks
• Implementations, easily combined (what is an example?)
Accessing NLTK
• What is the relation to Python?
Words
• Types and Tokens
• Text Locations
• Member Functions
Tokenization
• TokenizerI
• Implementations
>>> tokenizer = WSTokenizer()
>>> tokenizer.tokenize(text_str)
['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]
Word Length Freq. Distribution Example
from nltk.token import WSTokenizer
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()
Frequency Distributions
• A frequency distribution records the number of times each outcome of an experiment has occurred
• >>> freq_dist = FreqDist()
>>> for token in document:
...     freq_dist.inc(token.type())
• Constructor, then initialization by storing experimental outcomes
Methods
• The freq method returns the frequency of a given sample.
• We can find the number of times a given sample occurred with the count method.
• We can find the total number of sample outcomes recorded by a frequency distribution with the N method.
• The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.
• We can find the sample with the greatest number of outcomes with the max method.
Examples of Methods
• >>> freq_dist.count('the')
6
• >>> freq_dist.freq('the')
0.012
• >>> freq_dist.N()
500
• >>> freq_dist.max()
'the'
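The methods above behave like this minimal plain-Python sketch; MiniFreqDist is a made-up stand-in built on collections.Counter, not the NLTK implementation.

```python
from collections import Counter

class MiniFreqDist:
    """Sketch of inc/count/freq/N/samples/max; not the NLTK class."""
    def __init__(self):
        self._counts = Counter()

    def inc(self, sample):            # record one outcome
        self._counts[sample] += 1

    def count(self, sample):          # times this sample occurred
        return self._counts[sample]

    def N(self):                      # total number of outcomes
        return sum(self._counts.values())

    def freq(self, sample):           # relative frequency
        return self._counts[sample] / self.N()

    def samples(self):                # all recorded samples
        return list(self._counts)

    def max(self):                    # most frequent sample
        return self._counts.most_common(1)[0][0]

fd = MiniFreqDist()
for word in 'the cat sat on the mat the end'.split():
    fd.inc(word)
print(fd.count('the'), fd.N(), fd.freq('the'), fd.max())
# 3 8 0.375 the
```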
Simple Word Length Example
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– What is the "outcome" for our experiment?
Simple Word Length Example
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))
– The word length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.
Complex Word Length Example
# define vowels as "a", "e", "i", "o", and "u"
>>> VOWELS = ('a', 'e', 'i', 'o', 'u')
# distribution of lengths for words ending in vowels?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if token.type()[-1].lower() in VOWELS:
...         freq_dist.inc(len(token.type()))
• What is the condition?
More Complex Example
# What is the distribution of word lengths for
# words following words that end in vowels?
>>> ended_in_vowel = 0  # Did the last word end in a vowel?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if ended_in_vowel:
...         freq_dist.inc(len(token.type()))
...     ended_in_vowel = token.type()[-1].lower() in VOWELS
Conditional Frequency Distributions
• A condition specifies the context in which an experiment is performed.
• A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions.
• The individual frequency distributions are indexed by the condition.
• NLTK provides the ConditionalFreqDist class:
>>> cfdist = ConditionalFreqDist()
>>> cfdist
<ConditionalFreqDist with 0 conditions>
Conditional Frequency Distributions (continued)
• To access the frequency distribution for a condition, use the indexing operator:
>>> cfdist['a']
<FreqDist with 0 outcomes>
# Record lengths of some words starting with 'a'
>>> for word in 'apple and arm'.split():
...     cfdist['a'].inc(len(word))
# How many are 3 characters long?
>>> cfdist['a'].freq(3)
0.66667
• To list the conditions that have been accessed, use the conditions method:
>>> cfdist.conditions()
['a']
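A conditional frequency distribution is essentially a dictionary of frequency distributions. Here is a plain-Python sketch of the 'apple and arm' example above, with defaultdict and Counter standing in for NLTK's ConditionalFreqDist:

```python
from collections import Counter, defaultdict

# One frequency distribution (Counter) per condition, created on
# first access -- mirroring the indexing operator above
cfdist = defaultdict(Counter)

# Record lengths of some words starting with 'a'
for word in 'apple and arm'.split():
    cfdist['a'][len(word)] += 1

# How many are 3 characters long? ('and' and 'arm', out of 3 words)
freq3 = cfdist['a'][3] / sum(cfdist['a'].values())
print(round(freq3, 5))     # 0.66667
print(list(cfdist))        # ['a'] -- the conditions seen so far
```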
Example: Conditioning on a Word’s Initial Letter
>>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.draw.plot import Plot
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()
Example (continued)
# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
Example (continued)
# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)
• What are the condition and the outcome?
• Condition = the initial letter of the token
• Outcome = its word length
Prediction
• Prediction is the problem of deciding a likely outcome for a given run of an experiment.
• To predict the outcome, we first examine a training corpus.
• Training corpus:
– The context and outcome for each run are known.
– Given a new run, we choose the outcome that occurred most frequently for the context.
– A conditional frequency distribution finds the most frequent occurrence.
Prediction Example: Outline
• Record each outcome in the training corpus, using the context that the experiment was under as the condition
• Access the frequency distribution for a given context with the indexing operator
• Use the max() method to find the most likely outcome
Example: Predicting Words
• Predict word's type, based on preceding word type
• >>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()  # empty
Example (continued)
>>> context = None  # The type of the preceding word
>>> for token in tokens:
...     outcome = token.type()
...     cfdist[context].inc(outcome)
...     context = token.type()
Example (continued)
• >>> cfdist['prediction'].max()
'problems'
>>> cfdist['problems'].max()
'in'
>>> cfdist['in'].max()
'the'
• What are we predicting here?
Example (continued)
• We predict the most likely word for any context.
• Generation application:
>>> word = 'prediction'
>>> for i in range(15):
...     print word,
...     word = cfdist[word].max()
prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of
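The whole train-then-generate loop fits in a few lines of plain Python. The toy corpus string below is made up so the sketch is self-contained, and defaultdict(Counter) stands in for NLTK's ConditionalFreqDist; note how the output falls into the same kind of repetitive cycle as the slide's.

```python
from collections import Counter, defaultdict

# Toy training corpus (stand-in for corpus.txt)
corpus = ('the frequency distribution of '
          'the frequency distribution of the experiment')
tokens = corpus.split()

# Train: count each word under the condition "type of the preceding word"
cfdist = defaultdict(Counter)
context = None
for token in tokens:
    cfdist[context][token] += 1
    context = token

# Generate: repeatedly pick the most frequent successor of the current word
word = 'the'
generated = [word]
for _ in range(5):
    word = cfdist[word].most_common(1)[0][0]
    generated.append(word)
print(' '.join(generated))
# the frequency distribution of the frequency
```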
For Next Time
• HW3
• To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path.
• Regular Expressions (J&M handout, NLTK tutorial)