evalita 2007 frascati, september 10th 2007 emanuele pianta and roberto zanoli fbk-irst, trento

EVALITA 2007

Frascati, September 10th 2007

TagPro: a system for Italian PoS tagging based on SVM

Emanuele Pianta and Roberto Zanoli

FBK-irst, Trento

TextPro

A suite of modular NLP tools developed at FBK-irst:

• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting

• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface


TagPro’s architecture

To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes

[Architecture diagram. Components shown: training data and test data; the TagPro controller with feature extraction (ortho, prefix, suffix, dictionary, morpho analysis from MorphoPro) and feature selection; YamCha with learning, models and classification.]

YamCha

• Created as a generic, customizable, open source text chunker

• Can be adapted to many other tag-oriented NLP tasks

• Uses a state-of-the-art machine learning algorithm (SVM)

• Can redefine:
    context (window-size)
    parsing-direction (forward/backward)
    algorithms for the multi-class problem (pair wise / one vs rest)

• Practical chunking time (1 or 2 sec./sentence)

• Available as C/C++ library

Support Vector Machines

Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)

• SVMs map input vectors to a higher dimensional space where a maximal-margin separating hyperplane is constructed.
• Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
• The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
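To make the geometry concrete, here is a minimal sketch for a linear classifier. The weight vector `w` and bias `b` are hand-picked, illustrative values (an assumption, not parameters learned by TagPro). If the two parallel hyperplanes are written as w·x + b = +1 and w·x + b = -1, the distance between them is 2/‖w‖, which is the quantity an SVM maximizes.

```python
import math

def decision(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Illustrative values only (assumed, not learned from data).
w, b = [3.0, 4.0], -1.0
print(decision(w, b, [1.0, 1.0]))  # score of one sample point
print(margin_width(w))             # width of the margin for this w
```

A smaller ‖w‖ gives a wider margin, which is why training minimizes the norm of `w` subject to the data being separated.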

YamCha: Setting Window Size

Default setting is "F:-2..2:0.. T:-2..-1".

The window setting can be customized
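The default string can be read as follows (our reading of the notation, sketched in Python): `F:-2..2:0..` selects every feature column, from column 0 to the last one, for the tokens at relative positions -2 to +2, and `T:-2..-1` selects the tags already assigned to the two preceding tokens. The helper names `parse_range` and `expand_template` are hypothetical; YamCha parses the string internally.

```python
def parse_range(spec, last=None):
    """Expand 'a..b', open-ended 'a..' or a single 'a' into a list of ints."""
    start, sep, end = spec.partition("..")
    lo = int(start)
    if not sep:
        return [lo]
    if end == "":
        if last is None:
            raise ValueError("open-ended range needs the last index")
        return list(range(lo, last + 1))
    return list(range(lo, int(end) + 1))

def expand_template(template, n_columns):
    """Return (static, dynamic): (row offset, column) pairs and tag offsets."""
    static, dynamic = [], []
    for part in template.split():
        kind, _, rest = part.partition(":")
        if kind == "F":
            rows, _, cols = rest.partition(":")
            for r in parse_range(rows):
                for c in parse_range(cols, last=n_columns - 1):
                    static.append((r, c))
        elif kind == "T":
            dynamic.extend(parse_range(rest))
        else:
            raise ValueError("unknown template part: " + part)
    return static, dynamic

# Assumes a data file with 3 feature columns per token.
static, dynamic = expand_template("F:-2..2:0.. T:-2..-1", n_columns=3)
print(len(static), dynamic)
```

Narrowing the window to one token on each side would correspond to a template such as `F:-1..1:0.. T:-2..-1` under the same reading.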

Training and Tuning Set

The Evalita development set was randomly split into 2 parts

Training: 89,170 tokens

Tuning: 44,586 tokens
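The token counts correspond to a split of roughly two thirds for training (89,170 of 133,756 tokens) and one third for tuning. A minimal sketch of such a random split follows; splitting by sentence, the fixed seed and the function name `split_sentences` are assumptions made for illustration, since the slides do not say how the split was drawn.

```python
import random

def split_sentences(sentences, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the sentences and cut it into (train, tune)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: each sentence is a list of (token, tag) pairs.
toy = [[("w%d" % i, "NN")] for i in range(9)]
train, tune = split_sentences(toy)
print(len(train), len(tune))
```

Keeping whole sentences together preserves the context that the window features rely on.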

FEATURES

For each running word a rich set of features is extracted:

• WORD: the word itself (both unchanged and lower-cased)
    e.g. Autore autore

• MORPHO: the morphological analysis (produced by MorphoPro)
    e.g. Autore autore+n+m+sing
         Calcio calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing

• AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)
    e.g. libro {li, lib, libr, libro, ro, bro, ibro, libro}

• ORTHOgraphic information (e.g. capitalization, hyphenation)
    e.g. Oggi C (capitalized)
         oggi L (lowercased)

• GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
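The WORD, AFFIX and ORTHO features are simple string operations. The sketch below reproduces the `libro` example and pads affixes that are longer than the word with `__nil__`, as in the feature rows shown later in the deck. The function names are hypothetical, and `ortho` covers only the C/L capitalization flag, not the other orthographic cues TagPro may use.

```python
NIL = "__nil__"

def affixes(word, sizes=(2, 3, 4, 5)):
    """Prefixes, then suffixes, of the lower-cased word; NIL if the word is too short."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in sizes]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in sizes]
    return prefixes + suffixes

def ortho(word):
    """Coarse capitalization flag: C (capitalized) or L (lowercased)."""
    return "C" if word[:1].isupper() else "L"

def static_features(word):
    """WORD (unchanged, lower-cased), AFFIX and ORTHO features for one token."""
    return [word, word.lower()] + affixes(word) + [ortho(word)]

print(affixes("libro"))
print(static_features("Bettino"))
```

MORPHO and GAZETT features would be looked up in external resources (MorphoPro and the gazetteers) and appended to the same row.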

Static vs Dynamic Features

STATIC FEATURES
• extracted for the current, previous and following word
• WORD, MORPHO, AFFIXes, ORTHO, GAZET

DYNAMIC FEATURES
• decided dynamically during tagging
• tag of the two tokens preceding the current token
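The difference shows up at tagging time: static features are known before tagging starts, while dynamic features are the tagger's own earlier decisions. A minimal left-to-right sketch follows; `toy_classifier` is a hypothetical rule-based stand-in for the SVM, used only to make the example runnable.

```python
NIL = "__nil__"

def at(seq, i):
    """Element i of seq, or NIL when i falls outside the sequence."""
    return seq[i] if 0 <= i < len(seq) else NIL

def tag_sequence(tokens, classify):
    """Tag left to right; each decision sees the two previously predicted tags."""
    tags = []
    for i, word in enumerate(tokens):
        # Static: previous, current and following word (window +1,-1).
        static = {"prev": at(tokens, i - 1), "word": word, "next": at(tokens, i + 1)}
        # Dynamic: tags already assigned to the two preceding tokens.
        dynamic = (at(tags, i - 2), at(tags, i - 1))
        tags.append(classify(static, dynamic))
    return tags

def toy_classifier(static, dynamic):
    """Stand-in for the SVM: articles by lookup, a noun after an article, else ADJ."""
    if static["word"].lower() in {"il", "la", "l'"}:
        return "ART"
    if dynamic[1] == "ART":
        return "NN"
    return "ADJ"

print(tag_sequence(["il", "libro", "rosso"], toy_classifier))
```

Because each decision feeds the next one, an early tagging error can propagate through the dynamic features.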

An Example of Feature Extraction

Tagged input (token, PoS tag):

l' ART
ex ADJ
leader NN
socialista ADJ
Bettino NN_P
Craxi NN_P

Extracted feature rows (one token per line, PoS tag last):

l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

Finding the best features

EAGLES TagSet                      Accuracy   UTAccuracy

baseline                            86.70      59.95
+AFFIX +ORTHO                       +8.56     +25.56
+AFFIX +ORTHO +MORPHO              +10.69     +33.18
+AFFIX +ORTHO +MORPHO +GAZETT      +10.72     +33.13

Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1
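The signed figures read most naturally as gains over the baseline row. Under that assumption, the full feature set reaches 97.42, which is the accuracy quoted as the starting point of the window-size experiments. The short sketch below does the arithmetic with `Decimal` to avoid floating-point noise; the variable names are illustrative.

```python
from decimal import Decimal

baseline = {"Accuracy": Decimal("86.70"), "UTAccuracy": Decimal("59.95")}

# Gains over the baseline as reported on the slide (Accuracy, UTAccuracy).
gains = {
    "+AFFIX +ORTHO": (Decimal("8.56"), Decimal("25.56")),
    "+AFFIX +ORTHO +MORPHO": (Decimal("10.69"), Decimal("33.18")),
    "+AFFIX +ORTHO +MORPHO +GAZETT": (Decimal("10.72"), Decimal("33.13")),
}

# Absolute scores, assuming every gain is measured against the baseline row.
absolute = {
    name: (baseline["Accuracy"] + acc, baseline["UTAccuracy"] + ut)
    for name, (acc, ut) in gains.items()
}

for name, (acc, ut) in absolute.items():
    print(f"{name:32s} {acc}  {ut}")
```

The same subtraction gives the contribution of MORPHO alone: 10.69 - 8.56 = 2.13 on all tokens and 33.18 - 25.56 = 7.62 on the UTAccuracy column.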

Finding the best window-size

EAGLES TagSet

STAT     DYN    Accuracy

+1,-1    -1      97.42
+2,-2    -2      -0.34
+1,-1    -2      +0.23
+1,-1    -3      +0.22

Given the best set of features (Accuracy = 97.42), we tried to improve accuracy by changing the window-size.

Multi-class problem: pair-wise / one vs rest

one vs rest:
• fewer, bigger classifiers

pairwise:
• a classifier for each possible pair of classes
• choose the classifier with best confidence
• many relatively small classifiers
• faster, less memory

EAGLES TagSet

method         Accuracy

pairwise        97.65
one vs rest     97.78
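The trade-off in the number of classifiers is easy to quantify: with K classes, one vs rest trains K binary classifiers and pairwise trains K(K-1)/2. The sketch below only enumerates the binary tasks; the four-tag list is a toy subset chosen for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, "rest") for c in classes]

def pairwise_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

tags = ["ART", "ADJ", "NN", "NN_P"]   # toy subset of a tagset
print(len(one_vs_rest_tasks(tags)))   # K tasks
print(len(pairwise_tasks(tags)))      # K*(K-1)/2 tasks
```

For K = 32, the number of labels in the confusion matrix at the end of the deck, this gives 32 classifiers against 496, although each pairwise classifier is trained only on the examples of its two classes.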

Evaluating the best algorithm: PKI vs. PKE

EAGLES TagSet    Accuracy

PKI               97.78
PKE               97.64

YamCha uses two implementations of SVMs: PKI and PKE.

• Both are faster than the original SVMs.

• PKI (3-12x faster) produces the same accuracy as the original SVMs.

• PKE (10-300x faster) approximates the original SVM: slightly less accurate but much faster.

Results on the development set

                                                               EAGLES   DISTRIB

Accuracy (overall)                                              97.78    97.52

Known Words: 40,320 (MorphoPro coverage: 96.20%), Accuracy      98.29    97.95

Unknown Words: 4,396 (MorphoPro coverage: 84.41%), Accuracy     93.03    93.56
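As a consistency check, the overall accuracy should be close to the token-weighted average of the known-word and unknown-word accuracies. The sketch below assumes the two groups together cover all evaluated tokens; the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(parts):
    """Token-weighted mean of per-group accuracies; parts = [(n_tokens, accuracy), ...]."""
    total = sum(n for n, _ in parts)
    return sum(n * acc for n, acc in parts) / total

# (number of tokens, accuracy) for known and unknown words, from the table.
eagles = overall_accuracy([(40320, 98.29), (4396, 93.03)])
distrib = overall_accuracy([(40320, 97.95), (4396, 93.56)])
print(round(eagles, 3), round(distrib, 3))
```

Both values agree with the reported overall figures (97.78 and 97.52) to within 0.01, the residual being attributable to rounding of the per-group accuracies.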

Test Results

TagSet Accuracy UTAccuracy

EAGLES 98.04 95.02

DISTRIB 97.68 94.65

Conclusions

A statistical approach to PoS-Tagging for Italian based on YamCha / SVMs.

Results confirm that SVMs can deal with a large number of features without overfitting.

We used the same best configuration for both tagsets.

No specific method was applied for classifying unknown words.

Features:
• AFFIX+ORTHO: +8.56 over baseline
• MORPHO: +2.13 improvement over AFFIX+ORTHO
• GAZETTeers do not contribute any further significant improvement

Features for unknown words:
• AFFIX+ORTHO: +25.56
• MORPHO: +7.62

No benefit from a larger context (e.g. window-size +2,-2 and more)

TagPro

TagPro is a system for PoS-tagging based on YamCha.

YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo):
• is a generic, customizable, and open source text chunker
• is based on Support Vector Machines (SVMs)

TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.

The system is part of TextPro, a suite of NLP tools developed at FBK-irst.


PRON_PER V_AVERE V_PP ADV ART ADJ NN NN_P P_OTH PREP_A P_EOS V_GVRB PREP CONJ_S ADJ_DIM V_MOD CONJ_C V_ESSERE PRON_REL PRON_DIM PRON_IND ADJ_IND PRON_IES V_CLIT ADJ_NUM C_NUM ADJ_POS ADJ_IES PRON_POS NULL P_APO INT

PRON_PER 972 0 0 20 12 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_AVERE 0 521 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_PP 0 0 1188 2 0 97 19 0 0 0 0 2 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

ADV 9 0 2 2248 0 12 10 3 0 0 0 0 20 14 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 0

ART 6 0 0 0 3780 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0

ADJ 0 0 97 14 0 2454 120 3 0 0 0 6 2 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0

NN 0 0 21 23 0 114 8673 43 0 0 0 24 0 1 0 0 0 1 0 0 1 0 2 0 6 2 0 0 0 1 0 0

NN_P 0 0 0 1 1 3 15 1847 0 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0

P_OTH 0 0 0 0 0 0 0 0 3476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

PREP_A 0 0 0 0 0 2 0 1 0 2655 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

P_EOS 0 0 0 0 0 0 0 0 2 0 1786 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_GVRB 0 0 7 2 0 15 27 0 0 0 0 2754 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

PREP 0 0 0 5 0 0 1 0 0 1 0 0 4230 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

CONJ_S 1 0 0 7 0 0 1 0 0 0 0 0 4 598 0 0 3 0 70 0 0 0 0 0 0 0 0 0 0 0 0 0

ADJ_DIM 0 0 0 0 0 2 0 0 0 0 0 0 0 0 261 0 0 0 0 8 0 7 0 0 0 0 1 0 0 0 0 0

V_MOD 0 0 1 0 0 3 0 0 0 0 0 4 0 0 0 285 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

CONJ_C 0 0 0 2 0 0 3 0 0 1 0 0 1 10 0 0 1759 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0

V_ESSERE 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0 1 1285 0 0 0 0 0 0 1 0 0 0 0 0 0 0

PRON_REL 0 0 0 1 0 0 0 0 0 0 0 0 0 24 0 0 0 0 664 0 0 2 2 0 0 0 0 0 0 0 0 0

PRON_DIM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 191 0 0 0 0 0 0 0 0 0 0 0 0

PRON_IND 0 0 0 7 4 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 214 16 0 0 0 0 0 0 0 0 0 0

ADJ_IND 0 0 0 6 0 4 1 0 0 0 0 0 1 0 0 0 0 0 0 0 5 395 0 0 0 0 0 0 0 0 0 0

PRON_IES 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 19 0 0 0 0 0 0 0 0 0

V_CLIT 0 3 7 0 0 0 8 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 232 0 0 0 0 0 0 0 0

ADJ_NUM 0 0 0 0 2 4 3 2 0 0 0 0 3 0 0 0 0 0 0 0 2 3 0 0 314 0 0 0 0 0 0 0

C_NUM 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 348 0 0 0 0 0 0

ADJ_POS 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 351 0 0 0 0 0

ADJ_IES 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 3 0 0 2 0 0 0 0 0 8 0 0 0 0

PRON_POS 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0

NULL 0 0 0 0 0 1 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 0 0

P_APO 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0

INT 0 0 0 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2

Confusion matrix
