Page 1: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004


SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 22, 2004



Cascaded Chunking
Example of Using Chunking: Word Associations
Evaluating Chunking
Going to the next level: Parsing

Cascaded Chunking

Goal: create chunks that include other chunks
Examples:
  – PP consists of preposition + NP
  – VP consists of verb followed by PPs or NPs
How to make it work in NLTK
  – The tutorial is a bit confusing; I attempt to clarify

Creating Cascaded Chunkers

Start with a sentence token
  – A list of words with parts of speech assigned
  – Create a fresh one or use one from a corpus

Creating Cascaded Chunkers

Create a set of chunk parsers
  – One for each chunk type
  – Each one takes as input some kind of list of tokens, and produces as output a NEW list of tokens
    – You can decide what this new list is called
      Examples: NP-CHUNK, PP-CHUNK, VP-CHUNK
    – You can also decide what to name each occurrence of the chunk type, as it is assigned to a subset of tokens
      Examples: NP, VP, PP
How to match higher-level tags?
  – It just seems to match their string description
  – So best be certain that their names do not overlap with POS tags too
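The cascading idea can be sketched without any toolkit. The block below is a minimal illustration in plain Python, not the NLTK API from the tutorial: the names `label_of`, `apply_rule`, `cascade`, and `RULES` are hypothetical, and the tag patterns are assumptions chosen for a toy sentence. Each rule rewrites the node list into a new list, and later rules match chunk labels purely by their string form, which is why a chunk name that looks like a POS tag would cause accidental matches.

```python
import re

def label_of(node):
    # A leaf is (word, tag); a chunk is (label, [children]).
    return node[0] if isinstance(node[1], list) else node[1]

def apply_rule(nodes, label, pattern):
    """Wrap every run of nodes whose labels match `pattern` in a new chunk."""
    # Encode the node labels as one string such as "<DT><NN><IN>".
    encoded = "".join("<%s>" % label_of(n) for n in nodes)
    out, idx = [], 0
    for m in re.finditer(pattern, encoded):
        if m.start() == m.end():
            continue  # ignore empty matches
        # Convert character offsets back to node indexes by counting "<".
        start = encoded.count("<", 0, m.start())
        end = encoded.count("<", 0, m.end())
        out.extend(nodes[idx:start])
        out.append((label, nodes[start:end]))
        idx = end
    out.extend(nodes[idx:])
    return out

def cascade(nodes, rules):
    # Each rule sees the output of the previous one, so chunks can nest.
    for label, pattern in rules:
        nodes = apply_rule(nodes, label, pattern)
    return nodes

# Illustrative rules only: NP first, then PP over NP, then VP over NP/PP.
RULES = [
    ("NP", r"(<DT>)?(<JJ>)*(<NN[A-Z]*>)+"),
    ("PP", r"<IN><NP>"),
    ("VP", r"<VB[A-Z]*>(<NP>|<PP>)+"),
]

sentence = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
tree = cascade(sentence, RULES)
print(tree)
```

Note that a loose pattern such as `<N.*>` would match both the POS tag `NN` and the chunk label `NP` in this sketch, which is the overlap problem mentioned above.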

Let’s do some text analysis

Let’s try this on more complex sentences
  – First, read in part of a corpus
  – Then, count how often each word occurs with each POS
  – Determine some common verbs, choose one
  – Make a list of sentences containing that verb
  – Test out the chunker on them; examine further
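The first four steps can be sketched in plain Python. This is only an illustration under stated assumptions: `corpus` is a tiny hand-made stand-in for a tagged corpus, the helper `sentences_with_verb` is hypothetical, and verbs are identified by Penn Treebank style tags starting with `VB`.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for "part of a corpus":
# each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("board", "NN"), ("said", "VBD"), ("it", "PRP"),
     ("will", "MD"), ("sell", "VB"), ("shares", "NNS")],
    [("he", "PRP"), ("said", "VBD"), ("the", "DT"), ("plan", "NN"),
     ("works", "VBZ")],
    [("they", "PRP"), ("plan", "VBP"), ("to", "TO"), ("sell", "VB"),
     ("the", "DT"), ("unit", "NN")],
]

# Count how often each word occurs with each POS tag.
word_tag_counts = defaultdict(Counter)
for sent in corpus:
    for word, tag in sent:
        word_tag_counts[word.lower()][tag] += 1

# Determine common verbs: sum the counts of all verb tags per word.
verb_counts = Counter()
for word, tags in word_tag_counts.items():
    n = sum(c for t, c in tags.items() if t.startswith("VB"))
    if n:
        verb_counts[word] = n

def sentences_with_verb(sentences, verb):
    """Return the sentences in which `verb` occurs with a verb tag."""
    return [s for s in sentences
            if any(w.lower() == verb and t.startswith("VB") for w, t in s)]

print(verb_counts.most_common(3))
chosen = sentences_with_verb(corpus, "sell")
print(len(chosen), "sentences contain the chosen verb")
```

The sentences collected this way are what would then be handed to the chunker for closer inspection.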

Why didn’t this parse work?



Corpus Analysis for Discovery of Word Associations

Classic paper by Church & Hanks showed how to use a corpus and a shallow parser to find interesting dependencies between words
  – Word Association Norms, Mutual Information, and Lexicography, Computational Linguistics, 16(1), 1990


Some cognitive evidence:
  – Word association norms: which word do people say most often after hearing another word
    – Given doctor: nurse, sick, health, medicine, hospital…
  – People respond more quickly to a word if they’ve seen an associated word
    – E.g., if you show “bread” they’re faster at recognizing “butter” than “nurse” (vs a nonsense string)

Corpus Analysis for Discovery of Word Associations

Idea: use a corpus to estimate word associations
Association ratio: log ( P(x,y) / (P(x) P(y)) )
  – The probability of seeing x followed by y vs. the probability of seeing x anywhere times the probability of seeing y anywhere
  – P(x) is how often x appears in the corpus
  – P(x,y) is how often y follows x within w words
Interesting associations with “doctor”:
  – X: honorary Y: doctor
  – X: doctors Y: dentists
  – X: doctors Y: nurses
  – X: doctors Y: treating
  – X: examined Y: doctor
  – X: doctors Y: treat
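The association ratio can be estimated from raw counts. The sketch below is a simplified illustration, not the exact procedure from the paper: the function name `association_ratios` is hypothetical, probabilities are plain relative frequencies over the token count, "within w words" is taken to mean the next `window` positions, and log base 2 is assumed.

```python
import math
from collections import Counter

def association_ratios(tokens, window=5):
    """Map each ordered pair (x, y) to log2( P(x,y) / (P(x) P(y)) )."""
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        # Count every y that follows x within the next `window` positions.
        for y in tokens[i + 1 : i + 1 + window]:
            pair[(x, y)] += 1
    ratios = {}
    for (x, y), count in pair.items():
        p_xy = count / n
        p_x = unigram[x] / n
        p_y = unigram[y] / n
        ratios[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return ratios

# Toy input; a real estimate needs a large corpus.
tokens = "the doctors and nurses met the doctors treating patients".split()
ratios = association_ratios(tokens, window=3)
for (x, y), score in sorted(ratios.items(), key=lambda kv: -kv[1])[:5]:
    print(x, y, round(score, 2))
```

A ratio well above zero means the pair occurs together more often than its individual frequencies would predict; on small samples the estimates for rare pairs are unreliable.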

Corpus Analysis for Discovery of Word Associations

Now let’s make use of syntactic information.
  – Look at which words and syntactic forms follow a given verb, to see what kinds of arguments it takes
  – Compute triples of subject-verb-object
Example: nouns that appear as the object of the verb “drink”:
  – martinis, cup_water, champagne, beverage, cup_coffee, cognac, beer, cup, coffee, toast, alcohol…
  – What can we note about many of these words?
Example: verbs that have “telephone” in their object:
  – sit_by, disconnect, answer, hang_up, tap, pick_up, return, be_by, spot, repeat, place, receive, install, be_on
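Collecting the objects of a verb from shallow chunks can be sketched as follows. This is a rough heuristic under stated assumptions, not the method used to produce the lists above: the input format (a flat list of `(label, words)` chunks with labels `NP` and `V`), the function name `verb_objects`, and the rule "the object is the last word of the NP directly after the verb" are all illustrative choices.

```python
from collections import Counter

def verb_objects(chunked_sentences, verb):
    """Count head nouns of NP chunks that directly follow `verb`."""
    counts = Counter()
    for sent in chunked_sentences:
        # Walk over adjacent chunk pairs: (verb chunk, following chunk).
        for (label, words), (next_label, next_words) in zip(sent, sent[1:]):
            if label == "V" and words[-1] == verb and next_label == "NP":
                # Assumption: the last word of the NP is its head noun.
                counts[next_words[-1]] += 1
    return counts

# Hypothetical chunker output for three sentences.
chunked = [
    [("NP", ["she"]), ("V", ["drink"]), ("NP", ["black", "coffee"])],
    [("NP", ["they"]), ("V", ["drink"]), ("NP", ["a", "beer"])],
    [("NP", ["we"]), ("V", ["will", "drink"]), ("NP", ["coffee"])],
]
objects = verb_objects(chunked, "drink")
print(objects.most_common())
```

Counts like these are the raw material for the association scores; better chunking rules give cleaner argument lists.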

Corpus Analysis for Discovery of Word Associations

The approach has become standard
Entire collections available
  – Dekang Lin’s Dependency Database
    – Given a word, retrieve words that had a dependency relationship with the input word
  – Dependency-based Word Similarity
    – Given a word, retrieve the words that are most similar to it, based on dependencies

Example Dependency Database: “sell”

Example Dependency-based Similarity: “sell”

Homework Assignment

Choose a verb of interest
Analyze the context in which the verb appears
  – Can use any corpus you like
    – Can train a tagger and run it on some fresh text
  – Example: What kinds of arguments does it take?
  – Improve on my chunking rules to get better characterizations

Evaluating the Chunker

Why not just use accuracy?
  – Accuracy = #correct / total number
Definitions
  – Total: number of chunks in gold standard
  – Guessed: set of chunks that were labeled
  – Correct: of the guessed, which were correct
  – Missed: how many correct chunks not guessed?
  – Precision: #correct / #guessed
  – Recall: #correct / #total
  – F-measure: 2 * (Prec*Recall) / (Prec + Recall)

Assume the following numbers
  – Total: 100
  – Guessed: 120
  – Correct: 80
  – Missed: 20
  – Precision: 80 / 120 = 0.67
  – Recall: 80 / 100 = 0.80
  – F-measure: 2 * (.67*.80) / (.67 + .80) = 0.73
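The definitions translate directly into a few lines of Python. This is a minimal sketch: the function name `precision_recall_f` is hypothetical, and the chunk identifiers are arbitrary integers chosen only to reproduce the counts above (100 gold chunks, 120 guessed, 80 of them correct).

```python
def precision_recall_f(gold, guessed):
    """gold, guessed: sets of chunk identifiers (any hashable values)."""
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical chunk ids matching the slide's counts.
gold = set(range(100))                            # Total: 100
guessed = set(range(80)) | set(range(100, 140))   # Guessed: 120, 80 correct
p, r, f = precision_recall_f(gold, guessed)
print("precision=%.2f recall=%.2f f=%.2f" % (p, r, f))
```

Accuracy alone is less informative here because it does not separate chunks that were guessed wrongly from chunks that were never guessed at all.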

Evaluating in NLTK

We have some already chunked text from the Treebank

The code below uses the existing parse to compare against, and to generate Tokens of type word/tag to parse with our own chunker.

Have to add location information so the evaluation code can compare which words have been assigned which labels
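Why location information matters can be shown with a small sketch. This is plain Python, not the NLTK evaluation code referred to above; the input format and the function name `chunk_spans` are assumptions. Each chunk is turned into a `(start, end, label)` span so that two chunks with identical words at different positions are not confused.

```python
def chunk_spans(chunked):
    """Return the set of (start, end, label) spans in a chunked sentence.

    `chunked` is a list whose items are either a word string (outside
    any chunk) or a (label, [words]) tuple.
    """
    spans, pos = set(), 0
    for item in chunked:
        if isinstance(item, tuple):
            label, words = item
            spans.add((pos, pos + len(words), label))
            pos += len(words)
        else:
            pos += 1
    return spans

# Gold standard: "the cat" is an NP twice, at different positions.
gold = [("NP", ["the", "cat"]), "saw", ("NP", ["the", "cat"])]
# Our chunker finds the first NP but truncates the second.
guess = [("NP", ["the", "cat"]), "saw", "the", ("NP", ["cat"])]

correct = chunk_spans(gold) & chunk_spans(guess)
print(sorted(correct))
```

Comparing chunks by their words alone would count the first "the cat" as matching both gold chunks; comparing spans credits it only once.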

How to get better accuracy?

Use a full syntactic parser
  – These days the probabilistic ones work surprisingly well
  – They are getting faster too.
  – Prof. Dan Klein’s is very good and easy to run


Next Week

Shallow Parsing Assignment
  – Due on Wed Sept 29
Next week:
  – Read paper on end-of-sentence disambiguation
  – Presley and Barbara lecturing on categorization
  – We will read the categorization tutorial the following week
