Tokenization & POS-Tagging
Presented by: Yajing Zhang, Saarland University, [email protected]
Outline
Tokenization: importance, problems & solutions
POS tagging: the HMM tagger and the TnT statistical tagger
Why Tokenization?
Tokenization: the isolation of word-like units from a text.
Tokens are the building blocks of other text processing.
The accuracy of tokenization affects the results of higher-level processing, e.g. parsing.
Problems of Tokenization
Definition of a token: United States, AT&T, 3-year-old
Ambiguity of punctuation as a sentence boundary: Prof. Dr. J.M.
Ambiguity in numbers: 123,456.78
Some Solutions
Using regular expressions to match numbers and abbreviations (see the sketch after this list), e.g.
  ([0-9]+,)*[0-9]+(\.[0-9]+)?  for numbers such as 123,456.78
  [A-Z][bcdfghj-np-tvxz]+\.    for vowel-free abbreviations such as Dr. or St.
Using a corpus as a filter to identify abbreviations
Using a lexical list (the most important abbreviations are listed)
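A minimal Python sketch of this regex approach, using the corrected number pattern from above; the WORD and PUNCT patterns are assumptions added here only to make the example self-contained:

import re

# ABBREV and NUMBER follow the patterns on the slide; WORD and PUNCT are
# added assumptions so the tokenizer covers ordinary text as well.
ABBREV = r"[A-Z][bcdfghj-np-tvxz]+\."        # vowel-free abbreviations: Dr., St.
NUMBER = r"(?:[0-9]+,)*[0-9]+(?:\.[0-9]+)?"  # numbers: 123,456.78
WORD = r"\w+(?:[-&]\w+)*"                    # words, including forms like AT&T
PUNCT = r"[^\w\s]"                           # any other single punctuation mark

TOKEN = re.compile("|".join([ABBREV, NUMBER, WORD, PUNCT]))

def tokenize(text):
    # Return word-like units; matching abbreviations keep their period.
    return TOKEN.findall(text)

print(tokenize("Dr. Smith of AT&T paid 123,456.78 dollars."))
# -> ['Dr.', 'Smith', 'of', 'AT&T', 'paid', '123,456.78', 'dollars', '.']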
POS Tagging
Labeling each word in a sentence with its appropriate part of speech
Information sources in tagging: the tags of other words in the context, and the word itself
Different approaches: rule-based taggers vs. stochastic POS taggers
Two stochastic taggers discussed here: the simplest stochastic tagger and the HMM tagger
Simplest Stochastic Tagger
Each word is assigned its most frequent tag (most frequently encountered in the training set)
Problem: it may assign each word a valid tag yet produce an unacceptable tag sequence, e.g.
  Time flies like an arrow
  NN   VBZ   VB   DT  NN
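A minimal sketch of this tagger, assuming training data given as sentences of (word, tag) pairs; the fallback tag for unknown words is an assumption of the sketch:

from collections import Counter, defaultdict

# The simplest stochastic tagger: each word gets the tag it received
# most often in the training set, regardless of context.
def train(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NN"):
    # The default tag for unknown words is an assumption of this sketch.
    return [(w, model.get(w, default)) for w in words]

model = train([[("time", "NN"), ("flies", "VBZ"), ("like", "VB"),
                ("an", "DT"), ("arrow", "NN")]])
print(tag(["time", "flies", "like", "an", "arrow"], model))

With this toy training data the tagger reproduces exactly the unacceptable sequence above, since it never looks at the surrounding tags.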
Markov Models (MM)
In a Markov chain, the next element of the sequence depends only on the current element, not on earlier elements
$X = (X_1, \ldots, X_T)$ is a sequence of random variables, $S = \{s_1, \ldots, s_N\}$ is the state space, and the transition probabilities are

$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$$

with $a_{ij} \ge 0$ for all $i, j$, and $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$.
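A small numeric sketch of these constraints, with a made-up transition matrix for a 3-state chain:

import numpy as np

# A made-up transition matrix; A[i, j] is a_ij = P(X_{t+1} = s_j | X_t = s_i).
A = np.array([[0.3, 0.6, 0.1],
              [0.0, 0.5, 0.5],
              [0.4, 0.4, 0.2]])

assert (A >= 0).all()                   # a_ij >= 0 for all i, j
assert np.allclose(A.sum(axis=1), 1.0)  # each row sums to 1

# Probability of the state sequence s1 -> s2 -> s3 -> s1:
print(A[0, 1] * A[1, 2] * A[2, 0])  # 0.6 * 0.5 * 0.4 = 0.12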
Example of Markov Models (MM)
[Figure omitted; cf. Manning & Schütze, 1999, p. 319]
Hidden Markov Model
In a (visible) MM, we know which state sequence the model passes through, so the state sequence itself is regarded as the output
In an HMM, we do not observe the state sequence, only some probabilistic function of it
Markov models can be used wherever one wants to model the probability of a linear sequence of events
HMMs can be trained from unannotated text
HMM Tagger
Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time
The HMM tagger uses states to represent POS tags and outputs (symbol emissions) to represent the words.
The tagging task is to find the most probable tag sequence for a sequence of words.
Finding the most probable sequence
[Figure omitted; cf. Erhard Hinrichs & Sandra Kübler]

HMM tagging – an example
[Figures omitted, spanning two slides; cf. Erhard Hinrichs & Sandra Kübler]
Calculating the most likely sequence
[Figure omitted; green: transition probabilities, blue: emission probabilities]
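The most likely sequence is typically computed with the Viterbi algorithm. A compact sketch follows; the tagset and all probabilities are toy values invented for illustration (not the slides' example), and a production tagger would work with log probabilities to avoid numerical underflow:

# Toy HMM: states are tags, emissions are words.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            prob, prev = max((best[s][0] * trans_p[s][t] * emit_p[t].get(w, 0.0), s)
                             for s in tags)
            new_best[t] = (prob, best[prev][1] + [t])
        best = new_best
    return max(best.values())  # (probability, most likely tag sequence)

tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
trans_p = {"DT": {"DT": 0.0, "NN": 0.9, "VBZ": 0.1},
           "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
           "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1}}
emit_p = {"DT": {"the": 1.0},
          "NN": {"dog": 0.7, "barks": 0.3},
          "VBZ": {"barks": 0.9, "dog": 0.1}}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> (0.20412..., ['DT', 'NN', 'VBZ'])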
Dealing with unknown words
The simplest models assume that an unknown word can have any POS tag, or assign it the most frequent tag in the tagset
In practice, morphological information such as the suffix is used as a hint
TnT (Trigrams’n’Tags)
A statistical tagger using Markov Models: states represent tags and outputs represent words
To tag a sequence of words $w_1 \ldots w_T$, compute

$$\operatorname*{arg\,max}_{t_1 \ldots t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)$$
Transition and emission probabilities
Transition and output probabilities are estimated from a tagged corpus by maximum likelihood:

Bigrams: $\hat{P}(t_3 \mid t_2) = \dfrac{f(t_2, t_3)}{f(t_2)}$

Trigrams: $\hat{P}(t_3 \mid t_1, t_2) = \dfrac{f(t_1, t_2, t_3)}{f(t_1, t_2)}$

Lexical: $\hat{P}(w_3 \mid t_3) = \dfrac{f(w_3, t_3)}{f(t_3)}$
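A sketch of these maximum-likelihood estimates in Python, computed over a tiny invented corpus of (word, tag) pairs:

from collections import Counter

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

# Count tag unigrams, tag bigrams, tag trigrams, and (word, tag) pairs.
uni, bi, tri, lex = Counter(), Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        uni[t] += 1
        lex[(w, t)] += 1
    for a, b in zip(tags, tags[1:]):
        bi[(a, b)] += 1
    for a, b, c in zip(tags, tags[1:], tags[2:]):
        tri[(a, b, c)] += 1

def p_bigram(t3, t2):       # P(t3 | t2) = f(t2, t3) / f(t2)
    return bi[(t2, t3)] / uni[t2]

def p_trigram(t3, t1, t2):  # P(t3 | t1, t2) = f(t1, t2, t3) / f(t1, t2)
    return tri[(t1, t2, t3)] / bi[(t1, t2)]

def p_lexical(w, t):        # P(w | t) = f(w, t) / f(t)
    return lex[(w, t)] / uni[t]

print(p_bigram("NN", "DT"), p_trigram("VBZ", "DT", "NN"), p_lexical("dog", "NN"))
# -> 1.0 1.0 0.5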
Smoothing Technique
Needed due to the sparse-data problem: in a limited corpus, many trigram counts are zero, and without smoothing a single zero trigram probability makes the probability of the complete sequence zero.

Smoothing by linear interpolation:

$$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
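Continuing the counting sketch above, the interpolation can be written as follows; the λ values here are placeholders (TnT actually estimates them from the corpus by deleted interpolation):

def p_smoothed(t3, t1, t2, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas  # placeholder weights; must satisfy l1 + l2 + l3 = 1
    p_uni = uni[t3] / sum(uni.values())                       # P(t3)
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0         # P(t3 | t2)
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_smoothed("VBZ", "DT", "NN"))  # nonzero even if the trigram were unseen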
Other techniques
Handling unknown words: the longest suffix (the final sequence of characters of a word) is used as a strong predictor of the word class. The probability of a tag t is computed given the last m letters of an n-letter word, where m depends on the specific word (a sketch follows below).
Capitalization: works better for English than for German.
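A sketch of the suffix idea, assuming a fixed maximum suffix length and a most-frequent-tag decision; TnT's actual suffix model interpolates probabilities over suffix lengths, which this sketch simplifies away:

from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=5):
    # Count tags for every suffix up to max_len characters.
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(max_len, len(word)) + 1):
            counts[word[-k:]][tag] += 1
    return counts

def guess_tag(word, counts, max_len=5):
    # Try the longest known suffix first, backing off to shorter ones.
    for k in range(min(max_len, len(word)), 0, -1):
        if word[-k:] in counts:
            return counts[word[-k:]].most_common(1)[0][0]
    return "NN"  # fallback assumption of this sketch

model = train_suffix_model([("running", "VBG"), ("walking", "VBG"),
                            ("quickly", "RB")])
print(guess_tag("jumping", model))  # suffix "ing" -> VBG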
Evaluation
Corpora: the German NEGRA corpus (around 355,000 tokens) and the WSJ (Wall Street Journal) part of the Penn Treebank (around 1.2 million tokens)
10-fold cross-validation
The tagger assigns tags as well as probabilities to words, which can be used to rank different assignments
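The 10-fold setup can be sketched as follows; train_and_eval is a hypothetical callback (not part of TnT) that trains a tagger on the training folds and returns its accuracy on the held-out fold:

def cross_validate(sentences, train_and_eval, k=10):
    # Split the corpus into k folds; train on k-1, evaluate on the rest.
    fold_size = len(sentences) // k
    accuracies = []
    for i in range(k):
        held_out = sentences[i * fold_size:(i + 1) * fold_size]
        training = sentences[:i * fold_size] + sentences[(i + 1) * fold_size:]
        accuracies.append(train_and_eval(training, held_out))
    return sum(accuracies) / k  # average accuracy over the k folds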
Results for German and English
[Figure omitted]

POS Learning Curve for NEGRA
[Figure omitted]

Learning Curve for Penn Treebank
[Figure omitted]
Conclusion
Good results for both the German and the English corpora
The average accuracy TnT achieves is between 96% and 97%
The accuracy for known tokens is significantly higher than for unknown tokens
References:
Grefenstette (1994). What is a Word, What's a Sentence?
Abney (1996). Part-of-Speech Tagging and Partial Parsing.
Brants (2000). TnT – A Statistical Part-of-Speech Tagger.
Manning & Schütze (1999). Foundations of Statistical Natural Language Processing.