Lecture 7 NLTK POS Tagging
Topics: Taggers; Rule-Based Taggers; Probabilistic Taggers; Transformation-Based Taggers (Brill); Supervised Learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2011
NLTK tagging
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
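Note that pos_tag assigns different tags to the two occurrences of "refuse" and "permit" because it uses context. A context-free baseline, by contrast, always assigns a word its most frequent training tag. Below is a minimal sketch of such a lookup tagger using only the standard library (the training data and tag assignments here are toy examples, not drawn from a real corpus):

```python
from collections import Counter, defaultdict

def train_lookup_tagger(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Toy training data with Penn-style tags (hypothetical, for illustration)
train = [
    [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')],
    [('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')],
]
model = train_lookup_tagger(train)
print(model['the'])  # DT
```

Such a tagger necessarily tags both occurrences of "refuse" the same way, which is exactly the limitation context-aware taggers address.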
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some other and
Tagged Corpora
By convention in NLTK, a tagged token is a tuple; the function str2tuple() builds one from a 'word/TAG' string.
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
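The behavior of str2tuple() can be sketched in plain Python. Splitting at the last separator keeps tokens whose word part itself contains a slash intact (this is an illustrative reimplementation, not NLTK's actual code):

```python
def str2tuple_sketch(s, sep='/'):
    """Split 'word/TAG' at the last separator; return (word, None) if no tag."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag)

print(str2tuple_sketch('fly/NN'))    # ('fly', 'NN')
print(str2tuple_sketch('1/2/NUM'))   # ('1/2', 'NUM')
```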
Specifying Tags with Strings
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... ...
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
Reading Tagged Corpora
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]
tagged_words() method
>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]
readme() methods
Table 5.1: Simplified Part-of-Speech Tagset
Tag Meaning Examples
ADJ adjective new, good, high, special, big, local
ADV adverb really, already, still, early, now
CNJ conjunction and, or, but, if, while, although
DET determiner the, a, some, most, every, no
EX existential there, there's
FW foreign word dolce, ersatz, esprit, quo, maitre
MOD modal verb will, can, would, may, must, should
N noun year, home, costs, time, education
NP proper noun Alison, Africa, April, Washington
NUM number twenty-four, fourth, 1991, 14:24
PRO pronoun he, their, her, its, my, I, us
P preposition on, of, at, with, by, into, under
TO the word to to
UH interjection ah, bang, ha, whee, hmpf, oops
V verb is, has, get, do, make, see, run
VD past tense said, took, told, made, asked
VG present participle making, going, playing, working
VN past participle given, taken, begun, sung
WH wh determiner who, which, when, what, where, how
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
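nltk.FreqDist is essentially a counting dictionary, so the tag-frequency computation above can be mimicked with collections.Counter (toy token list standing in for brown_news_tagged):

```python
from collections import Counter

# A few (word, tag) pairs standing in for the tagged Brown news text
tagged = [('The', 'DET'), ('jury', 'N'), ('said', 'V'),
          ('the', 'DET'), ('report', 'N')]
tag_fd = Counter(tag for (word, tag) in tagged)
print(tag_fd.most_common())  # [('DET', 2), ('N', 2), ('V', 1)]
```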
Nouns
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
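A ConditionalFreqDist keyed on words, as above, is a dictionary from each word to a frequency distribution over its tags. A stdlib sketch with defaultdict(Counter), over toy data standing in for the WSJ sample:

```python
from collections import Counter, defaultdict

wsj_sample = [('yield', 'V'), ('cut', 'V'), ('yield', 'N'),
              ('cut', 'VD'), ('cut', 'N'), ('cut', 'VN')]
cfd = defaultdict(Counter)
for word, tag in wsj_sample:
    cfd[word][tag] += 1
print(sorted(cfd['yield']))  # ['N', 'V']
print(sorted(cfd['cut']))    # ['N', 'V', 'VD', 'VN']
```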
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
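The findtags() above relies on NLTK 2 behavior, where FreqDist.keys() returns samples sorted by decreasing frequency (in Python 3, dict views are also not sliceable). A version that makes the "top n words per tag" step explicit, using only the standard library:

```python
from collections import Counter, defaultdict

def findtags_sketch(tag_prefix, tagged_text, n=5):
    """For each tag starting with tag_prefix, list its n most frequent words."""
    cfd = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: [w for w, _ in fd.most_common(n)] for tag, fd in cfd.items()}

# Toy tagged text (hypothetical tags, for illustration)
sample = [('the', 'DT'), ('a', 'DT'), ('the', 'DT'), ('no', 'DT-NEG'), ('dog', 'NN')]
print(findtags_sketch('DT', sample))  # {'DT': ['the', 'a'], 'DT-NEG': ['no']}
```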
Reading URLs
NLTK book 3.1
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
http://docs.python.org/2/library/urllib2.html
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
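nltk.word_tokenize applies trained tokenization rules, but its core effect on simple text (splitting on whitespace and separating punctuation) can be approximated with one regular expression. A rough stdlib sketch, not a replacement for the real tokenizer:

```python
import re

def tokenize_sketch(text):
    """Crude tokenizer: words (with internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize_sketch("The Project Gutenberg EBook of Crime and Punishment, by"))
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
```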
Dealing with HTML
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
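Newer NLTK releases removed clean_html and recommend a dedicated HTML library such as BeautifulSoup instead. The basic idea (keep text nodes, drop markup and script/style content) can be sketched with the standard library's html.parser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed tags
    return ' '.join(' '.join(parser.parts).split())

print(strip_html('<p>Blondes <b>to die out</b></p><script>x=1</script>'))
# Blondes to die out
```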
Chap 2 Brown corpus
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]
Freq Dist
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
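tabulate() prints one row per condition and one column per sample, each cell holding cfd[condition][sample]. A stdlib sketch of the same layout over toy (genre, word) pairs standing in for the Brown loop above:

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs; the real slide iterates over the Brown corpus
pairs = [('news', 'will'), ('news', 'will'), ('news', 'can'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'will')]
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

def tabulate_sketch(cfd, conditions, samples):
    """Print a count table: one row per condition, one column per sample."""
    print('%10s' % '' + ''.join('%8s' % s for s in samples))
    for c in conditions:
        print('%10s' % c + ''.join('%8d' % cfd[c][s] for s in samples))

tabulate_sketch(cfd, ['news', 'romance'], ['can', 'could', 'will'])
```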
Table 2.2: Some of the Corpora and Corpus Samples Distributed with NLTK
Table 2.3: Basic Corpus Functionality
fileids()                  the files of the corpus
fileids([categories])      the files of the corpus corresponding to these categories
categories()               the categories of the corpus
categories([fileids])      the categories of the corpus corresponding to these files
raw()                      the raw content of the corpus
raw(fileids=[f1,f2,f3])    the raw content of the specified files
raw(categories=[c1,c2])    the raw content of the specified categories
words()                    the words of the whole corpus
words(fileids=[f1,f2,f3])  the words of the specified fileids
words(categories=[c1,c2])  the words of the specified categories
sents()                    the sentences of the whole corpus
sents(fileids=[f1,f2,f3])  the sentences of the specified fileids
sents(categories=[c1,c2])  the sentences of the specified categories
abspath(fileid)            the location of the given file on disk
...
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
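generate_model() is greedy: at each step it emits the single most likely successor of the current word, so it quickly falls into a repeating cycle. A stdlib sketch of the same idea that returns the words instead of printing them, over a toy text standing in for the Genesis corpus:

```python
from collections import Counter, defaultdict

def generate_sketch(cfd, word, num=15):
    """Greedy bigram generation: always pick the most frequent next word."""
    out = []
    for _ in range(num):
        out.append(word)
        if not cfd[word]:
            break
        word = cfd[word].most_common(1)[0][0]
    return out

# Toy text; the slide builds its cfd from the Genesis corpus instead
text = ['the', 'cat', 'the', 'cat', 'the', 'dog']
cfd = defaultdict(Counter)
for w1, w2 in zip(text, text[1:]):
    cfd[w1][w2] += 1
print(generate_sketch(cfd, 'the', 5))  # ['the', 'cat', 'the', 'cat', 'the']
```

The output cycles between "the" and "cat", which illustrates why greedy max-successor generation degenerates on real corpora too.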
Example 2.5 (code_random_text.py)
Table 2.4
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]
Example 5.2 (code_findtags.py)
Highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
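The loop above finds words that occur with more than three distinct simplified tags. The same filter in plain Python over a toy sample (tag assignments hypothetical, for illustration):

```python
from collections import Counter, defaultdict

tagged = [('Best', 'ADJ'), ('best', 'ADV'), ('best', 'V'), ('best', 'NP'),
          ('the', 'DET'), ('the', 'DET')]
data = defaultdict(Counter)
for word, tag in tagged:
    data[word.lower()][tag] += 1

ambiguous = [w for w in data if len(data[w]) > 3]
print(ambiguous)  # ['best']
```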
...