Transcript
Page 1: Word Bi-grams and PoS Tags

School of Computing
FACULTY OF ENGINEERING

Word Bi-grams and PoS Tags

COMP3310 Natural Language Processing

Eric Atwell, Language Research Group

(with thanks to Katja Markert, Marti Hearst, and other contributors)

Page 2: Word Bi-grams and PoS Tags

Reminder

FreqDist counts of tokens and their distribution can be useful

Eg find main characters in Gutenberg texts

Eg compare word-lengths in different languages

Humans can predict the next word …

N-gram models are based on counts in a large corpus

Auto-generate a story ... (but gets stuck in a local maximum)

Grammatical trends: modal verb distribution predicts genre

Page 3: Word Bi-grams and PoS Tags

Why do puns make us groan?

He drove his expensive car into a tree and found

out how the Mercedes bends.

Isn't the Grand Canyon just gorges?

Time flies like an arrow. Fruit flies like a banana.

Page 4: Word Bi-grams and PoS Tags

Predicting Next Words

One reason puns make us groan is that they play on our assumptions about what the next word will be: human language processing involves predicting the most probable next word

They also exploit

• homonymy – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)

• polysemy – same spelling, different meaning

NLP programs can also make use of word-sequence modeling
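
As a concrete illustration (not from the slides), word-sequence modelling can be done in NLTK by counting word bigrams with ConditionalFreqDist and asking for the most probable successor of a given word. The corpus (genesis, which must be downloaded) and the example word 'living' are my own choices here:

import nltk
from nltk.corpus import genesis   # assumes nltk.download('genesis') has been run

words = genesis.words('english-kjv.txt')

# Count (current word, next word) pairs: a bigram model as a table of counts
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

# The word most often observed after 'living' in this corpus
print(cfd['living'].max())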

Page 5: Word Bi-grams and PoS Tags

Auto-generate a Story

How to fix this? Use a random number generator.

Page 6: Word Bi-grams and PoS Tags

Auto-generate a Story

The choice() method chooses one item randomly from a list (from random import *)
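
Below is a minimal sketch of this idea, reconstructed rather than copied from the slide: instead of always taking the most frequent successor (which loops on the same few words), pick the next word at random from the observed successors. The helper name generate_story, the genesis corpus, and the seed word 'living' are my own choices:

import nltk
from random import choice
from nltk.corpus import genesis   # assumes nltk.download('genesis') has been run

cfd = nltk.ConditionalFreqDist(nltk.bigrams(genesis.words('english-kjv.txt')))

def generate_story(cfd, word, length=20):
    words = []
    for _ in range(length):
        words.append(word)
        # choice() picks the next word at random from the observed successors,
        # so generation no longer gets stuck on the single most likely word
        word = choice(list(cfd[word].keys()))
    return ' '.join(words)

print(generate_story(cfd, 'living'))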

Page 7: Word Bi-grams and PoS Tags

Part-of-Speech Tagging: Terminology

Tagging

• The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg

Hand-coded rules

Statistical taggers

Brill (transformation-based) tagger

Hybrid tagger: combination, eg by “vote”

Tags

• The labels

Tag Set

• The collection of tags used for a particular task, eg Brown or LOB tagset
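
For a quick illustration of tagging in practice (not part of the slide): NLTK bundles a statistical tagger that assigns Penn Treebank tags to a tokenized sentence; it requires the punkt and averaged_perceptron_tagger data to be downloaded:

import nltk

tokens = nltk.word_tokenize("Time flies like an arrow.")
print(nltk.pos_tag(tokens))
# prints a list of (word, tag) pairs using the Penn Treebank tagset,
# e.g. ('like', 'IN'), ('arrow', 'NN')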

Page 8: Word Bi-grams and PoS Tags

Example from the GENIA corpus

Typically a tagged text is a sequence of whitespace-separated word/tag tokens:

These/DT

findings/NNS

should/MD

be/VB

useful/JJ

for/IN

therapeutic/JJ

strategies/NNS

and/CC

the/DT

development/NN

of/IN

immunosuppressants/NNS

targeting/VBG

the/DT

CD28/NN

costimulatory/NN

pathway/NN

./.
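
A small sketch (my own addition) of reading this word/tag format with NLTK: str2tuple splits a 'word/TAG' string on its rightmost slash into a (word, tag) pair. The example line is a shortened version of the GENIA sentence above:

from nltk.tag import str2tuple

line = "These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS ./."
tagged = [str2tuple(t) for t in line.split()]
print(tagged)
# [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD'), ...]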

Page 9: Word Bi-grams and PoS Tags

What does Tagging do?

Collapses Distinctions

• Lexical identity may be discarded

• e.g., all personal pronouns tagged with PRP

Introduces Distinctions

• Ambiguities may be resolved

• e.g. deal tagged with NN or VB

Helps in classification and prediction

Page 10: Word Bi-grams and PoS Tags

Significance of Parts of Speech

A word’s POS tells us a lot about the word and its neighbors:

• Limits the range of meanings (deal), pronunciation (OBject vs obJECT), or both (wind)

• Helps in stemming

• Limits the range of following words

• Can help select nouns from a document for summarization

• Basis for partial parsing (chunked parsing)

• Parsers can build trees directly on the POS tags instead of maintaining a lexicon

Page 11: Word Bi-grams and PoS Tags

Choosing a tagset

The choice of tagset greatly affects the difficulty of the problem

Need to strike a balance between

• Getting better information about context

• Making it possible for classifiers to do their job

Page 12: Word Bi-grams and PoS Tags

Some of the best-known Tagsets

Brown corpus: 87 tags

• (more when tags are combined, eg isn’t)

LOB corpus: 132 tags

Penn Treebank: 45 tags

Lancaster UCREL C5 (used to tag the BNC): 61 tags

Lancaster C7: 145 tags

Page 13: Word Bi-grams and PoS Tags

The Brown Corpus

An early digital corpus (1961)

• Francis and Kucera, Brown University

Contents: 500 texts, each 2000 words long

• From American books, newspapers, magazines

• Representing genres:

• Science fiction, romance fiction, press reportage, scientific writing, popular lore

Page 14: Word Bi-grams and PoS Tags

help(nltk.corpus.brown)

>>> help(nltk.corpus.brown)

| paras(self, fileids=None, categories=None)

|

| raw(self, fileids=None, categories=None)

|

| sents(self, fileids=None, categories=None)

|

| tagged_paras(self, fileids=None, categories=None, simplify_tags=False)

|

| tagged_sents(self, fileids=None, categories=None, simplify_tags=False)

|

| tagged_words(self, fileids=None, categories=None, simplify_tags=False)

|

| words(self, fileids=None, categories=None)

|

Page 15: Word Bi-grams and PoS Tags

nltk.corpus.brown

>>> nltk.corpus.brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

>>> nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

>>> nltk.corpus.brown.tagged_sents()

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
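
As an illustrative follow-on (not on the slide), the tagged words can be fed into a FreqDist to see which Brown tags are most common overall:

import nltk
from nltk.corpus import brown

# Count how often each Brown tag occurs across the whole tagged corpus
tag_fd = nltk.FreqDist(tag for (word, tag) in brown.tagged_words())
print(tag_fd.most_common(10))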

Page 16: Word Bi-grams and PoS Tags

Penn Treebank

First large syntactically annotated corpus

1 million words from Wall Street Journal

Part-of-speech tags and syntax trees

Page 17: Word Bi-grams and PoS Tags

help(nltk.corpus.treebank)

| parsed(*args, **kwargs)

| @deprecated: Use .parsed_sents() instead.

|

| parsed_sents(self, files=None)

|

| raw(self, files=None)

|

| read(*args, **kwargs)

| @deprecated: Use .raw() or .sents() or .tagged_sents() or

| .parsed_sents() instead.

|

| sents(self, files=None)

|

| tagged(*args, **kwargs)

| @deprecated: Use .tagged_sents() instead.

|

| tagged_sents(self, files=None)

|

| tagged_words(self, files=None)

Page 18: Word Bi-grams and PoS Tags

How hard is POS tagging?

Number of tags:        1      2     3    4   5   6  7
Number of word types:  35340  3760  264  61  12  2  1

In the Brown corpus, about 12% of word types are ambiguous (take two or more tags), but these account for about 40% of word tokens.
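
A sketch of how figures like these can be reproduced from the Brown corpus with NLTK (my own code; exact numbers depend on case-folding and the corpus version): for each word type, count how many distinct tags it receives.

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (word.lower(), tag) for (word, tag) in brown.tagged_words())

# For each word type, how many distinct tags does it take?
tag_counts = nltk.FreqDist(len(cfd[w]) for w in cfd.conditions())
print(sorted(tag_counts.items()))   # pairs of (number of tags, number of word types)

ambiguous_types = sum(n for k, n in tag_counts.items() if k > 1)
print(ambiguous_types / tag_counts.N())   # proportion of ambiguous word types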

Page 19: Word Bi-grams and PoS Tags

Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race given its lexical frequency

Solution: we choose the tag that has the greater probability

• P(race|VB)

• P(race|NN)

Actual estimate from the Switchboard corpus:

• P(race|NN) = .00041

• P(race|VB) = .00003

This suggests we should always tag race as NN, which would be correct 41/44 ≈ 93% of the time
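
A minimal sketch of such a lexical-frequency ("most likely tag") baseline in NLTK, assuming we train on the Brown news section rather than Switchboard: a UnigramTagger assigns each word the tag it was most often seen with in training.

import nltk
from nltk.corpus import brown

train = brown.tagged_sents(categories='news')
baseline = nltk.UnigramTagger(train)

print(baseline.tag(['The', 'race', 'is', 'on']))
# 'race' always receives its single most frequent tag, right or wrong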

Page 20: Word Bi-grams and PoS Tags

Reminder

Puns play on our assumptions of the next word…

… eg they present us with an unexpected homonym (bends)

ConditionalFreqDist() counts word-pairs: word bigrams

Used for story generation, speech recognition, …

Parts of Speech: groups words into grammatical categories

… and separates different functions of a word

In English, many words are ambiguous: 2 or more PoS-tags

Very simple tagger: choose by lexical probability (only)

Better PoS-taggers: to come…

