
Lecture 7 NLTK POS Tagging


Topics
    Taggers
    Rule Based Taggers
    Probabilistic Taggers
    Transformation Based Taggers - Brill
    Supervised learning

Readings: Chapter 5.4-?

February 3, 2011

CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2011

NLTK tagging

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
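In the output above, `refuse` and `permit` each receive two different tags depending on context. A minimal sketch, using only the standard library and the tagged list copied from the slide, that collects the tags seen per word (the names `tags_by_word` and `ambiguous` are illustrative, not part of NLTK):

```python
from collections import defaultdict

# Tagger output copied from the slide above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Map each word to the set of tags it received in this sentence.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```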

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some other and

Tagged Corpora

By convention in NLTK, a tagged token is a tuple. The function str2tuple() builds one from the standard string representation of a tagged token.

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
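To make the word/TAG convention concrete, here is a standard-library sketch of the same conversion. It is not NLTK's implementation. The function name `str_to_tagged_token` is hypothetical, and splitting at the last separator, uppercasing the tag, and returning `None` for an untagged string are all assumptions of this sketch.

```python
def str_to_tagged_token(s, sep='/'):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator present: treat the string as an untagged word.
        return (s, None)
    return (word, tag.upper())

print(str_to_tagged_token('fly/NN'))
print(str_to_tagged_token('and/or/CC'))
```

Splitting at the last separator keeps words that themselves contain a slash intact.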

Specifying Tags with Strings

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... ...
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Reading Tagged Corpora

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

tagged_words() method

>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

readme() methods

Table 5.1: Simplified Part-of-Speech Tagset

Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
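The frequency distribution above counts how often each tag occurs. The same idea can be sketched with `collections.Counter` on a few Brown-tagged tokens taken from the "Specifying Tags with Strings" slide. The counts here describe only this fragment, not the news category, and `most_common()` is used so that the ordering by frequency is explicit.

```python
from collections import Counter

# A few tagged tokens in the 'word/TAG' form shown on the earlier slide.
sample = ("The/AT grand/JJ jury/NN commented/VBD on/IN a/AT "
          "number/NN of/IN other/AP topics/NNS")
tagged = [tuple(tok.rsplit('/', 1)) for tok in sample.split()]

# Count tags only, ignoring the words.
tag_counts = Counter(tag for (word, tag) in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count)
```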

Nouns

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
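The expression above pairs each tagged token with its successor and counts the tag of the first token whenever the second is a noun. A standard-library sketch of that step follows, assuming the seven Brown-tagged tokens shown earlier as data, so the noun tag is `NN` rather than the simplified `N`.

```python
from collections import Counter

tagged = [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
          ('on', 'IN'), ('a', 'AT'), ('number', 'NN')]

# zip(seq, seq[1:]) yields adjacent pairs, i.e. bigrams of tagged tokens.
pairs = zip(tagged, tagged[1:])

# a is the preceding token, b the following one; keep a's tag when b is a noun.
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == 'NN')
print(before_noun.most_common())
```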

Verbs

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']

>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
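The two distributions differ only in which element of each pair is the condition: conditioning on the word lists its tags, and conditioning on the tag lists its words. A rough standard-library sketch of both directions follows. The pairs come from the verb list on the earlier slide, each appears once, and so the counts are not corpus frequencies. The names `by_word` and `by_tag` are illustrative.

```python
from collections import Counter, defaultdict

# A few (word, tag) pairs taken from the verb list shown earlier.
pairs = [('is', 'V'), ('said', 'VD'), ('was', 'VD'), ('are', 'V'),
         ('been', 'VN'), ('expected', 'VN'), ('made', 'VN')]

by_word = defaultdict(Counter)   # condition = word, sample = tag
by_tag = defaultdict(Counter)    # condition = tag,  sample = word
for word, tag in pairs:
    by_word[word][tag] += 1
    by_tag[tag][word] += 1

print(sorted(by_tag['VN']))      # words seen as past participles
print(sorted(by_word['said']))   # tags seen for 'said'
```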

>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
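The function above relies on the Python 2 era behavior where `keys()` returned a frequency-sorted list that could be sliced. A hedged Python 3 sketch of the same idea follows, using only the standard library. The parameter `n` and the toy input list are assumptions added for illustration.

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, n=5):
    """Return {tag: the n most frequent words} for tags starting with tag_prefix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    # most_common(n) makes the "top n by frequency" step explicit.
    return {tag: [w for w, _ in c.most_common(n)] for tag, c in counts.items()}

# Toy input, not corpus data: 'jury' is repeated so that it ranks first.
tagged = [('jury', 'NN'), ('number', 'NN'), ('jury', 'NN'),
          ('topics', 'NNS'), ('The', 'AT')]
print(findtags('NN', tagged))
```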

Reading URLs

NLTK book 3.1

>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]

http://docs.python.org/2/library/urllib2.html
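The session above is Python 2. In Python 3, `urlopen` lives in `urllib.request` and `read()` returns bytes, so a decoding step is needed before tokenizing. A minimal sketch follows, in which the helper name `fetch_text` and the UTF-8 default are assumptions. A `data:` URL stands in for the Gutenberg address so the example runs without network access.

```python
from urllib.request import urlopen

def fetch_text(url, encoding='utf-8'):
    """Download a URL and decode the response bytes into a str."""
    with urlopen(url) as response:
        return response.read().decode(encoding)

# A data: URL keeps the example offline; a real call would pass an http(s) URL.
raw = fetch_text("data:text/plain;charset=utf-8,Crime%20and%20Punishment")
print(type(raw), len(raw), raw[:5])
```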

>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Dealing with HTML

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
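To my knowledge, `nltk.clean_html` is no longer implemented in NLTK 3 releases, so the tag-stripping step needs another tool. The sketch below uses the standard library's `html.parser` as one possible stand-in. It is not what `clean_html` did internally. The class `TextExtractor`, the function `strip_html`, and the small HTML string are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)

# Hypothetical page fragment, loosely modeled on the tokens shown above.
html = ("<html><head><title>BBC NEWS | Health</title>"
        "<style>p {color: red}</style></head>"
        "<body><p>Blondes 'to die out'</p></body></html>")
print(strip_html(html))
```

The resulting string can then be passed to a tokenizer in the same way as `raw` on the slide.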


Chap 2 Brown corpus

>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Freq Dist

>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
>>>

>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)

Table 2.2: Some of the Corpora and Corpus Samples Distributed with NLTK

Table 2.3 Basic Corpus Functionality

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
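The generator above repeatedly jumps to the most frequent successor of the current word. A standard-library sketch of the same greedy walk follows, assuming a single verse as the text instead of the full Genesis corpus. This version returns a list rather than printing, and it stops at a word with no recorded successor; both choices are additions of the sketch.

```python
from collections import Counter, defaultdict

def generate_model(cfd, word, num=15):
    """Greedily follow the most frequent successor of each word."""
    out = []
    for _ in range(num):
        out.append(word)
        if not cfd[word]:
            # Dead end: this word was never followed by anything.
            break
        word = cfd[word].most_common(1)[0][0]
    return out

text = "In the beginning God created the heaven and the earth .".split()

# Condition on the first word of each bigram and count the second.
cfd = defaultdict(Counter)
for first, second in zip(text, text[1:]):
    cfd[first][second] += 1

print(' '.join(generate_model(cfd, 'beginning', num=4)))
```

Because the walk always takes the single most likely successor, it tends to fall into a loop on larger texts.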

Example 2.5 (code_random_text.py)

Table 2.4

Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
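Several operations in the table can be imitated with a dictionary of counters, which may help show what a conditional frequency distribution stores. This is a sketch under stated assumptions, not NLTK's class. The (genre, word) pairs are toy data, and the printing loop is only a bare-bones stand-in for `tabulate()`.

```python
from collections import Counter, defaultdict

# Toy (condition, sample) pairs; not taken from any corpus.
pairs = [('news', 'can'), ('news', 'will'), ('news', 'will'),
         ('romance', 'could'), ('romance', 'can')]

cfdist = defaultdict(Counter)
for condition, sample in pairs:
    cfdist[condition][sample] += 1

conditions = sorted(cfdist)          # plays the role of cfdist.conditions()
print(conditions)
print(cfdist['news']['will'])        # plays the role of cfdist[condition][sample]

# Print one row per condition and one column per sample.
samples = ['can', 'could', 'will']
print(' ' * 8 + ''.join(f'{s:>6}' for s in samples))
for c in conditions:
    print(f'{c:<8}' + ''.join(f'{cfdist[c][s]:>6}' for s in samples))
```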


Example 5.2 (code_findtags.py)

highly ambiguous words

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...
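The loop above keeps the words that occur with more than three distinct tags. A small Python 3 sketch of the same filter follows, using only the standard library. The tagged pairs are assembled from tags reported on these slides and are toy data, not corpus counts. The name `highly_ambiguous` is illustrative.

```python
from collections import defaultdict

# Toy pairs built from tags shown on the slides; not corpus frequencies.
tagged = [('best', 'ADJ'), ('best', 'ADV'), ('Best', 'NP'), ('best', 'V'),
          ('better', 'ADJ'), ('better', 'ADV'), ('better', 'V'), ('better', 'DET'),
          ('yield', 'V'), ('yield', 'N')]

# Fold case before grouping, as the slide does with word.lower().
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word.lower()].add(tag)

# Keep words seen with more than three distinct tags.
highly_ambiguous = sorted(w for w in tags_for if len(tags_for[w]) > 3)
for word in highly_ambiguous:
    print(word, ' '.join(sorted(tags_for[word])))
```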
