Lecture 7: NLTK POS Tagging. Topics: Taggers, Rule Based Taggers, Probabilistic Taggers, Transformation Based Taggers - Brill, Supervised learning. Readings: Chapter 5.4-? February 3, 2011. CSCE 771 Natural Language Processing.

Upload: benedict-allen

Post on 13-Jan-2016


Page 1: Lecture 7 NLTK POS Tagging Topics Taggers Rule Based Taggers Probabilistic Taggers Transformation Based Taggers - Brill Supervised learning Readings: Chapter

Lecture 7 NLTK POS Tagging

Topics
Taggers
Rule Based Taggers
Probabilistic Taggers
Transformation Based Taggers - Brill
Supervised learning

Readings: Chapter 5.4-?

February 3, 2011

CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2011

NLTK tagging

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]


>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]


>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index... man time day year car moment world family house country child boy state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some other and


Tagged Corpora

By convention in NLTK, a tagged token is a tuple.

function str2tuple()

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'


Specifying Tags with Strings

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... ...
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
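
The word/TAG convention can also be unpacked with the standard library alone. The following is a minimal Python 3 sketch, not NLTK's own implementation: the helper name `split_tagged` is hypothetical, and it assumes the tag is whatever follows the last `/` in a token.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: keep the word and report no tag.
        return (token, None)
    return (word, tag)


# Shortened version of the slide's sentence, used only for illustration.
sent = "The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN ./."
tagged = [split_tagged(t) for t in sent.split()]
print(tagged[:3])
```

Splitting at the last separator rather than the first keeps tokens such as `./.` intact, since both the word and the tag may be punctuation.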


Reading Tagged Corpora

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]


tagged_words() method

>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]


>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]


readme() methods


Table 5.1: Simplified Part-of-Speech Tagset

Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how


>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
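
The `tag_fd.keys()` output above is ordered from most to least frequent tag. A rough stand-in for the same count, assuming Python 3 and only the standard library, uses `collections.Counter`. The tagged sample below is invented for illustration and is not corpus data.

```python
from collections import Counter

# Tiny hand-made sample standing in for brown.tagged_words(...); not real corpus data.
sample_tagged = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("was", "VD"), ("a", "DET"), ("success", "N"),
    ("in", "P"), ("Atlanta", "NP"), (".", "."),
]

# Count tags only, ignoring the words.
tag_counts = Counter(tag for (word, tag) in sample_tagged)

# most_common() lists tags from most to least frequent.
for tag, count in tag_counts.most_common(3):
    print(tag, count)
```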


Nouns

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
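
The query above asks which tags most often come immediately before a noun. A small sketch of the same idea, assuming Python 3 and using `zip` in place of `nltk.bigrams`, follows. The sample sentence is invented.

```python
from collections import Counter

# Invented tagged sentence using the simplified tagset.
sample_tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "N"), ("praised", "VD"),
    ("the", "DET"), ("city", "N"), ("council", "N"), ("and", "CNJ"),
    ("the", "DET"), ("new", "ADJ"), ("mayor", "N"), (".", "."),
]

# Pair each tagged token with its successor to form bigrams.
pairs = zip(sample_tagged, sample_tagged[1:])

# Count the tag of the first item whenever the second item is a noun.
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == "N")
print(before_noun.most_common())
```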


Verbs

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]


>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
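
A conditional frequency distribution maps each condition (here, a word) to a frequency distribution over samples (here, tags). The structure can be approximated in Python 3 with a `defaultdict` of `Counter` objects. This is only a sketch of the idea, not how NLTK implements `ConditionalFreqDist`, and the data is invented.

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs; real counts would come from the treebank sample.
sample_tagged = [
    ("cut", "V"), ("the", "DET"), ("cut", "N"), ("was", "VD"),
    ("cut", "VN"), ("yield", "N"), ("they", "PRO"), ("yield", "V"),
    ("cut", "V"),
]

# Map each word (the condition) to a Counter over its tags (the samples).
tags_by_word = defaultdict(Counter)
for word, tag in sample_tagged:
    tags_by_word[word][tag] += 1

print(sorted(tags_by_word["yield"]))
print(tags_by_word["cut"].most_common(1))
```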


>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]


>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]
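
The two steps above (find words seen as both VD and VN, then look at the tokens just before one occurrence) can be sketched in Python 3 without NLTK. The sequence below is pieced together from the two contexts shown on the slide, with filler tokens invented to join them. The helper name `context` is hypothetical.

```python
from collections import defaultdict

# Slide contexts joined by invented filler tokens; not an actual corpus excerpt.
sample_tagged = [
    ("While", "P"), ("program", "N"), ("trades", "N"), ("swiftly", "ADV"),
    ("kicked", "VD"), ("in", "P"), (",", ","), ("the", "DET"),
    ("head", "N"), ("of", "P"), ("state", "N"), ("has", "V"),
    ("kicked", "VN"), ("off", "P"), (".", "."),
]

# Collect the set of tags observed for each word.
tags_of = defaultdict(set)
for word, tag in sample_tagged:
    tags_of[word].add(tag)

# Words that occur both as past tense (VD) and past participle (VN).
both = sorted(w for w, tags in tags_of.items() if {"VD", "VN"} <= tags)
print(both)


def context(tagged, target, width=4):
    """Return the target's first occurrence plus up to `width` preceding tokens."""
    idx = tagged.index(target)
    return tagged[max(0, idx - width): idx + 1]


print(context(sample_tagged, ("kicked", "VN")))
```

The `max(0, ...)` guard matters when the match falls within the first four tokens: a negative start index would otherwise count from the end of the list and return the wrong slice.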


def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
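
The slide's `findtags` relies on Python 2 behaviour, where `keys()` returns a list that can be sliced. A possible Python 3 rendering of the same idea, using only the standard library, is sketched below. The `top` parameter and the sample data are additions for illustration, and this is not the book's code.

```python
from collections import Counter, defaultdict


def findtags(tag_prefix, tagged_text, top=5):
    """Map each tag starting with tag_prefix to its `top` most frequent words."""
    words_by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            words_by_tag[tag][word] += 1
    # most_common() replaces the frequency-ordered keys()[:5] of the original.
    return {tag: [w for w, _ in counts.most_common(top)]
            for tag, counts in words_by_tag.items()}


# Invented sample using Brown-style tags.
sample_tagged = [
    ("year", "NN"), ("time", "NN"), ("year", "NN"), ("years", "NNS"),
    ("said", "VBD"), ("members", "NNS"), ("years", "NNS"), ("Fulton", "NP"),
]
for tag, words in sorted(findtags("NN", sample_tagged).items()):
    print(tag, words)
```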


Reading URLs

NLTK book 3.1

>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]

http://docs.python.org/2/library/urllib2.html
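
The session above is Python 2. In Python 3, `urlopen` lives in `urllib.request` and `read()` returns bytes, so a decoding step is needed before tokenizing. The following is a minimal sketch under those assumptions. The helper name `fetch_text` and the default encoding are illustrative choices, and a `data:` URL is used so the example runs offline.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download a URL and decode the bytes to str (needed in Python 3)."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# A data: URL keeps the demo offline; swap in an http(s) URL for real use.
raw = fetch_text("data:text/plain;charset=utf-8,Crime%20and%20Punishment")
print(type(raw), len(raw))
```

For a real page the correct encoding depends on the server; UTF-8 is only an assumed default here.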


>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']


Dealing with HTML

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
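
`nltk.clean_html` belongs to the NLTK 2 era used in this lecture. To our knowledge it is no longer functional in NLTK 3, which points users to an external HTML parser instead. As a rough standard-library substitute, the sketch below strips markup with Python 3's `html.parser`. The class and function names are hypothetical, the HTML snippet is invented to echo the slide, and real pages usually need a more robust parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Invented snippet, not the actual BBC page.
html = ("<html><head><title>BBC NEWS | Health</title><style>p{}</style></head>"
        "<body><p>Blondes 'to die out'</p></body></html>")
print(strip_html(html).split())
```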


Chap 2 Brown corpus

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]


Freq Dist

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389


>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906


>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
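
The slide omits the table that `tabulate` prints. To show the shape of the result without the Brown corpus, here is a small Python 3 sketch that builds genre-conditioned counts with `Counter` and prints a genre-by-modal grid. The two "genres" are invented one-line texts, so the numbers carry no linguistic meaning.

```python
from collections import Counter, defaultdict

# Invented mini-corpus keyed by genre; real counts come from the Brown corpus.
texts = {
    "news": "the board will meet and may vote but will not adjourn".split(),
    "romance": "she could not say what might happen if he could stay".split(),
}
modals = ["can", "could", "may", "might", "must", "will"]

# Condition on genre, count words within each genre.
counts = defaultdict(Counter)
for genre, words in texts.items():
    for word in words:
        counts[genre][word] += 1

# Print a genre-by-modal table; a word absent from a Counter counts as 0.
print(" " * 8 + "".join(f"{m:>6}" for m in modals))
for genre in texts:
    print(f"{genre:<8}" + "".join(f"{counts[genre][m]:>6}" for m in modals))
```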


Table 2.2: Some of the Corpora and Corpus Samples Distributed with NLTK


Table 2.3 Basic Corpus Functionality

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
...


def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
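
`generate_model` walks a bigram table greedily: print the current word, then move to its most frequent successor. The same technique can be sketched in Python 3 with the standard library. The function names and the short word list below are illustrative assumptions, not the Genesis corpus or the book's code.

```python
from collections import Counter, defaultdict


def build_bigram_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model


def generate(model, word, num=15):
    """Greedily follow the most frequent successor, as generate_model does."""
    output = []
    for _ in range(num):
        output.append(word)
        if not model[word]:
            break  # dead end: this word was never followed by anything
        word = model[word].most_common(1)[0][0]
    return output


# Invented word sequence, chosen so that no successor counts are tied.
words = "in the beginning god created the heaven and the earth and the earth".split()
model = build_bigram_model(words)
print(" ".join(generate(model, "god", num=6)))
```

Because the choice is always the single most frequent successor, the output tends to fall into a repeating cycle after a few words, which is visible even on this tiny input.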


Example 2.5 (code_random_text.py)


Table 2.4

Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2


>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]


Example 5.2 (code_findtags.py)


highly ambiguous words

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
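
The loop above flags words that occur with more than three distinct tags. A compact Python 3 sketch of that filter, assuming only the standard library and an invented tagged sample, could look like this. Tags are printed alphabetically here, which is a simplification compared with the frequency order of the original.

```python
from collections import defaultdict

# Invented sample; real data would be brown_news_tagged.
sample_tagged = [
    ("best", "ADJ"), ("Best", "NP"), ("best", "ADV"), ("best", "V"),
    ("the", "DET"), ("close", "ADJ"), ("close", "V"), ("The", "DET"),
]

# Lowercase the word so "Best" and "best" share one entry.
tags_of = defaultdict(set)
for word, tag in sample_tagged:
    tags_of[word.lower()].add(tag)

# Keep words seen with more than three distinct tags, as on the slide.
for word in sorted(tags_of):
    if len(tags_of[word]) > 3:
        print(word, " ".join(sorted(tags_of[word])))
```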

……..