Tokenization & POS-Tagging
presented by: Yajing Zhang, Saarland University, yazhang@coli.uni-sb.de


Page 1: Tokenization & POS-Tagging

Page 2:

winter semester 05/06 2

Outline

Tokenization: importance, problems & solutions

POS tagging: HMM tagger, the TnT statistical tagger

Page 3:

Why Tokenization?

Tokenization: the isolation of word-like units from a text.

Tokens are the building blocks for other text processing.

The accuracy of tokenization affects the results of higher-level processing, e.g. parsing.

Page 4:

Problems of tokenization

Definition of a token: United States, AT&T, 3-year-old

Ambiguity of punctuation as a sentence boundary: Prof. Dr. J.M.

Ambiguity in numbers: 123,456.78

Page 5:

Some Solutions

Using regular expressions to match numbers and abbreviations, e.g. ([0-9]+[,])*[0-9]([.][0-9]+)? and [A-Z][bcdfghj-np-tvxz]+\.

Using corpus as a filter to identify abbreviations

Using a lexical list (most important abbreviations are listed)
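The regex-based approach above can be sketched as a small tokenizer. This is an illustrative toy, not the slide's actual system: the number pattern is slightly generalized from the slide's, and the word and punctuation patterns are added assumptions.

```python
import re

# Patterns adapted from the slides (NUMBER slightly generalized; WORD and
# PUNCT are illustrative additions, not from the slides).
NUMBER = r"(?:[0-9]+,)*[0-9]+(?:\.[0-9]+)?"   # e.g. 123,456.78
ABBREV = r"[A-Z][bcdfghj-np-tvxz]+\."          # consonant abbreviations, e.g. Dr.
WORD   = r"[A-Za-z]+(?:[-&'][A-Za-z0-9]+)*"    # covers forms like AT&T, 3-year-old tails
PUNCT  = r"[.,;:!?]"

# Alternation order matters: try ABBREV before WORD so "Dr." keeps its period.
TOKEN = re.compile(f"{NUMBER}|{ABBREV}|{WORD}|{PUNCT}")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Dr. Smith paid 123,456.78 at AT&T."))
# → ['Dr.', 'Smith', 'paid', '123,456.78', 'at', 'AT&T', '.']
```

Note how "Dr." survives as one token while the sentence-final period is split off: exactly the punctuation ambiguity the slides describe.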

Page 6:

POS Tagging

Labeling each word in a sentence with its appropriate part of speech

Information sources in tagging: the tags of other words in the context, and the word itself

Different approaches: rule-based taggers and stochastic POS taggers

Covered here: the simplest stochastic tagger and the HMM tagger

Note: the ultimate goal of NLP is to parse and understand natural languages; POS tagging is one of the intermediate tasks toward that goal.
Page 7:

Simplest Stochastic Tagger

Each word is assigned its most frequent tag (most frequently encountered in the training set)

Problem: it may assign each word a valid tag yet produce an unacceptable tag sequence:

Time flies like an arrow
NN   VBZ   VB   DT  NN
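The most-frequent-tag baseline can be sketched in a few lines. The training data here is a toy corpus invented for illustration; a real tagger would train on a corpus such as the WSJ.

```python
from collections import Counter, defaultdict

# Baseline tagger sketch: assign each word the tag it received most often
# in the training data, ignoring all context.
def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

train = [
    [("time", "NN"), ("flies", "VBZ"), ("quickly", "RB")],
    [("fruit", "NN"), ("flies", "NNS"), ("like", "VB"), ("bananas", "NNS")],
    [("flies", "VBZ")],
]
model = train_baseline(train)
# "flies" was seen twice as VBZ and once as NNS, so the baseline always
# outputs VBZ for it, even where NNS would be the correct reading.
print([model[w] for w in ["time", "flies"]])
# → ['NN', 'VBZ']
```

This is exactly the weakness the slide notes: each individual tag is plausible, but the tagger has no notion of whether the resulting tag *sequence* is acceptable.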

Page 8:

Markov Models (MM)

In a Markov chain, the next element of the sequence depends only on the current element, not on the past elements

X = (X1, …, XT) is a sequence of random variables, S = {s1, …, sN} is the state space

with transition probabilities

a_ij = P(X_{t+1} = s_j | X_t = s_i)

where a_ij ≥ 0 for all i, j and Σ_{j=1}^{N} a_ij = 1 for all i.
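The definition above can be made concrete with a tiny two-state chain. The states and probability values are toy numbers chosen for illustration.

```python
# Sketch: a two-state Markov chain, checking the constraints a_ij >= 0 and
# sum_j a_ij = 1, then scoring a state sequence under the Markov assumption.
states = ["DT", "NN"]
A = {  # a_ij = P(next = j | current = i); toy values
    "DT": {"DT": 0.0, "NN": 1.0},
    "NN": {"DT": 0.4, "NN": 0.6},
}

for i in states:
    assert all(p >= 0 for p in A[i].values())          # a_ij >= 0
    assert abs(sum(A[i].values()) - 1.0) < 1e-9        # each row sums to 1

def sequence_prob(seq, start_prob):
    """P(X_1, ..., X_T) factored as P(X_1) * prod_t P(X_{t+1} | X_t)."""
    p = start_prob[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_prob(["DT", "NN", "NN"], {"DT": 0.7, "NN": 0.3}))
# 0.7 * 1.0 * 0.6 = 0.42
```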

Page 9:

Example of Markov Models (MM)

Cf. Manning & Schütze, 1999, page 319

Page 10:

Hidden Markov Model

In a (visible) MM we know the state sequence the model passes through, so the state sequence itself is regarded as the output

In an HMM we do not know the state sequence, only some probabilistic function of it

Markov models can be used wherever one wants to model the probability of a linear sequence of events

HMM can be trained from unannotated text

Page 11:

HMM Tagger

Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time

HMM tagger uses states to represent POS tags and outputs (symbol emission) to represent the words.

Tagging task is to find the most probable tag sequence for a sequence of words.
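The most probable tag sequence is standardly found with the Viterbi algorithm, which the following compact sketch implements. All probability tables are toy values for illustration, not taken from the slides.

```python
# Viterbi sketch for an HMM tagger: states are tags, emissions are words.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][s] * trans_p[s][t] * emit_p[t].get(words[i], 0.0), s)
                for s in tags
            )
            best[i][t], back[i][t] = prob, prev
    # Follow back-pointers from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.1, "VBZ": 0.1}
trans_p = {
    "DT":  {"DT": 0.0, "NN": 0.9, "VBZ": 0.1},
    "NN":  {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.3, "VBZ": 0.2},
}
emit_p = {
    "DT":  {"the": 1.0},
    "NN":  {"dog": 0.9, "barks": 0.1},
    "VBZ": {"barks": 1.0},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# → ['DT', 'NN', 'VBZ']
```

Because the best path is built left to right over whole sequences, the transition probabilities can veto tag combinations that a word-by-word tagger would happily produce.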

Page 12:

Finding the most probable sequence

Cf. Erhard Hinrichs & Sandra Kübler

Page 13:

HMM tagging – an example

Cf. Erhard Hinrichs & Sandra Kübler

Page 14:

HMM tagging – an example

Cf. Erhard Hinrichs & Sandra Kübler

Page 15:

Calculating the most likely sequence

Green: transition probabilities

Blue: emission probabilities

Page 16:

Dealing with unknown words

The simplest model: assume that unknown words can take any POS tag, or assign them the most frequent tag in the tagset

In practice, morphological information such as suffixes is used as a hint

Page 17:

TnT (Trigrams’n’Tags)

A statistical tagger using Markov Models: states represent tags and outputs represent words

To find the best tag sequence is to calculate:

argmax over t_1 … t_T of [ Π_{i=1}^{T} P(t_i | t_{i-1}, t_{i-2}) · P(w_i | t_i) ] · P(t_{T+1} | t_T)

Page 18:

Transition and emission probabilities

Transition and output probabilities are estimated from a tagged corpus:

Bigrams: P^(t_3 | t_2) = f(t_2, t_3) / f(t_2)

Trigrams: P^(t_3 | t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)

Lexical: P^(w_3 | t_3) = f(w_3, t_3) / f(t_3)
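These maximum-likelihood estimates are just frequency ratios, which the following sketch computes over a toy tagged corpus (TnT itself trains on corpora like NEGRA or the WSJ; the data and names here are illustrative).

```python
from collections import Counter

# Toy tagged corpus: a flat list of (word, tag) pairs.
corpus = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

tags = [t for _, t in corpus]
tag_f     = Counter(tags)                        # f(t)
bigram_f  = Counter(zip(tags, tags[1:]))         # f(t1, t2)
trigram_f = Counter(zip(tags, tags[1:], tags[2:]))  # f(t1, t2, t3)
lex_f     = Counter(corpus)                      # f(w, t)

def p_bigram(t3, t2):        # P^(t3 | t2) = f(t2, t3) / f(t2)
    return bigram_f[(t2, t3)] / tag_f[t2]

def p_trigram(t3, t1, t2):   # P^(t3 | t1, t2) = f(t1, t2, t3) / f(t1, t2)
    return trigram_f[(t1, t2, t3)] / bigram_f[(t1, t2)]

def p_lexical(w3, t3):       # P^(w3 | t3) = f(w3, t3) / f(t3)
    return lex_f[(w3, t3)] / tag_f[t3]

print(p_bigram("NN", "DT"), p_trigram("VBZ", "DT", "NN"), p_lexical("dog", "NN"))
# → 1.0 1.0 0.5
```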

Page 19:

Smoothing Technique

Needed due to the sparse-data problem: in a limited corpus, a given trigram is quite likely never to occur, and without smoothing the complete probability then becomes zero.

Smoothing by linear interpolation:

P(t_3 | t_1, t_2) = λ_1 P^(t_3) + λ_2 P^(t_3 | t_2) + λ_3 P^(t_3 | t_1, t_2)

where λ_1 + λ_2 + λ_3 = 1.
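The interpolation formula is a weighted sum of the three ML estimates. In this sketch the λ weights are toy values (TnT estimates them from the corpus by deleted interpolation).

```python
# TnT-style linear interpolation sketch; lambda weights are illustrative.
l1, l2, l3 = 0.1, 0.3, 0.6
assert abs((l1 + l2 + l3) - 1.0) < 1e-9  # constraint: lambdas sum to 1

def smoothed(p_uni, p_bi, p_tri):
    """P(t3 | t1, t2) = l1*P^(t3) + l2*P^(t3 | t2) + l3*P^(t3 | t1, t2)."""
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even when the trigram estimate is zero (sparse data), the smoothed
# probability stays nonzero thanks to the unigram and bigram terms:
print(smoothed(0.05, 0.2, 0.0))
# 0.1*0.05 + 0.3*0.2 + 0.6*0.0 = 0.065
```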

Page 20:

Other techniques

Handling unknown words: the longest suffix (the final sequence of characters of a word) is a strong predictor of word class. Calculate the probability of a tag t given the last m letters l_i of an n-letter word; m depends on the specific word.

Capitalization: works better for English than for German.
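The suffix idea can be sketched as follows. This is a simplification of TnT's actual suffix model (no probability interpolation across suffix lengths, toy training data, and a hypothetical NN fallback), intended only to show why the longest matching suffix is a useful predictor.

```python
from collections import Counter, defaultdict

# Count which tags each word-final suffix takes in the training data.
def train_suffixes(tagged_words, max_len=5):
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        for m in range(1, min(max_len, len(word)) + 1):
            suffix_tags[word[-m:]][tag] += 1
    return suffix_tags

# Tag an unseen word by its longest suffix seen in training.
def guess_tag(word, suffix_tags, max_len=5):
    for m in range(min(max_len, len(word)), 0, -1):
        suffix = word[-m:]
        if suffix in suffix_tags:
            return suffix_tags[suffix].most_common(1)[0][0]
    return "NN"  # arbitrary fallback for this sketch

train = [("running", "VBG"), ("walking", "VBG"), ("quickly", "RB"), ("happily", "RB")]
model = train_suffixes(train)
print(guess_tag("sleeping", model), guess_tag("slowly", model))
# → VBG RB  ("-ing" suggests VBG, "-ly" suggests RB)
```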

Page 21:

Evaluation

Corpora: the German NEGRA corpus (around 355,000 tokens) and the WSJ (Wall Street Journal) portion of the Penn Treebank (around 1.2 million tokens)

10-fold cross validation

The tagger assigns tags as well as probabilities to words, so different assignments can be ranked

Page 22:

Results for German and English

Page 23:

POS Learning Curve for NEGRA

Page 24:

Learning Curve for Penn Treebank

Page 25:

Conclusion

Good results for both the German and the English corpus

The average accuracy TnT achieves is between 96% and 97%

The accuracy for known tokens is significantly higher than for unknown tokens

Page 26:

References:

What is a Word, What is a Sentence? Problems of Tokenization (Grefenstette 1994)

Part-of-Speech Tagging and Partial Parsing (Abney 1996)

TnT: A Statistical Part-of-Speech Tagger (Brants 2000)

Foundations of Statistical Natural Language Processing (Manning & Schütze 1999)