EVALITA 2007, Frascati, September 10th 2007. Emanuele Pianta and Roberto Zanoli, FBK-irst, Trento
EVALITA 2007
Frascati, September 10th 2007
TAGPRO
A system for ITALIAN POS TAGGING based on SVM
Emanuele Pianta and Roberto Zanoli
FBK-irst, Trento
TextPro
A suite of modular NLP tools developed at FBK-irst:
• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting
Architecture designed to be efficient, scalable and robust.
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface
TagPro’s architecture
To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
[Figure: TagPro architecture diagram. Components shown: training data, test data, controller, feature extraction (ortho, prefix, suffix, dictionary, morpho analysis), feature selection, MorphoPro, dictionary, and YamCha (learning, models, classification).]
YamCha
• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Can redefine:
  - context (window-size)
  - parsing-direction (forward/backward)
  - algorithms for the multi-class problem (pair-wise / one vs rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library
Support Vector Machines
Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)
• SVMs map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed.
• Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
• The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
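The geometry described above can be illustrated with a small sketch. It assumes the standard linear formulation, where the two parallel hyperplanes are w·x + b = +1 and w·x + b = -1, so that the distance between them is 2/||w||. The weight vector and bias below are hypothetical values chosen for illustration, and the kernel mapping to a higher dimensional space is left out.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters, not taken from any trained model.
w, b = [3.0, 4.0], -1.0
print(margin_width(w))
print(decision_value(w, b, [1.0, 1.0]))
```

Training an SVM amounts to choosing w and b so that this width is as large as possible while the training points stay on the correct side.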
YamCha: Setting Window Size
Default setting is "F:-2..2:0.. T:-2..-1".
The window setting can be customized.
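To make the default setting easier to read, here is a minimal sketch that unpacks the template string. It assumes that `F:rows:cols` selects static feature columns at the given relative token positions, that `0..` means every column from index 0 onward, and that `T:rows` selects the tags of previously tagged tokens. The function names are hypothetical and this is not YamCha's own parser.

```python
def parse_span(span):
    """Turn '-2..2' into [-2, -1, 0, 1, 2]; a single number 'k' into [k]."""
    if ".." in span:
        low, high = span.split("..")
        return list(range(int(low), int(high) + 1))
    return [int(span)]

def parse_template(template):
    """Return (static positions, first static column, tag positions)."""
    static_rows, first_column, tag_rows = [], None, []
    for part in template.split():
        fields = part.split(":")
        if fields[0] == "F":
            static_rows = parse_span(fields[1])
            # Assumption: an open-ended '0..' means columns 0 and onward.
            first_column = int(fields[2].rstrip("."))
        elif fields[0] == "T":
            tag_rows = parse_span(fields[1])
    return static_rows, first_column, tag_rows

print(parse_template("F:-2..2:0.. T:-2..-1"))
```

Under these assumptions the default uses the static features of two tokens on each side of the current one, plus the tags already assigned to the two preceding tokens.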
Training and Tuning Set
The Evalita development set was randomly split into 2 parts
Training: 89,170 tokens
Tuning: 44,586 tokens
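A random split of this kind can be sketched as follows. The slides do not say how the split was drawn. The sketch assumes it was done at sentence level with roughly two thirds of the data used for training, which is what the token counts suggest. The function name, the seed and the toy data are hypothetical.

```python
import random

def split_sentences(sentences, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the sentences and cut it into (training, tuning)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: nine placeholder sentences, each a list of tokens.
sentences = [["token%d" % i] for i in range(9)]
train, tune = split_sentences(sentences)
print(len(train), len(tune))
```

Fixing the seed keeps the split reproducible, so that tuning results obtained with different feature sets stay comparable.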
FEATURES
For each running word a rich set of features is extracted:
• WORD: the word itself (both unchanged and lower-cased)
  e.g. Autore autore
• MORPHO: the morphological analysis (produced by MorphoPro)
  e.g. Autore autore+n+m+sing
       Calcio calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
• AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)
  e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}
• ORTHOgraphic information (e.g. capitalization, hyphenation)
  e.g. Oggi C (capitalized), oggi L (lowercased)
• GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
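The AFFIX and ORTHO features listed above can be sketched in a few lines. This is an illustrative reconstruction and not TagPro's code. It assumes affixes are taken from the lower-cased word and that a missing affix (word shorter than the affix length) is filled with `__nil__`, as the feature extraction example in this deck suggests. The ORTHO function covers only the two values named on the slide, C and L.

```python
NIL = "__nil__"

def affix_features(word, sizes=(2, 3, 4, 5)):
    """Prefixes then suffixes of 2 to 5 characters, padded with NIL for short words."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in sizes]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in sizes]
    return prefixes + suffixes

def ortho_feature(word):
    """'C' for a capitalized word, 'L' otherwise (other values are not modelled here)."""
    return "C" if word[:1].isupper() else "L"

print(affix_features("libro"))
print(ortho_feature("Oggi"), ortho_feature("oggi"))
```

For "libro" this reproduces the set shown on the slide: li, lib, libr, libro, ro, bro, ibro, libro.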
Static vs Dynamic Features
STATIC FEATURES
• extracted for the current, previous and following word
• WORD, MORPHO, AFFIXes, ORTHO, GAZET
DYNAMIC FEATURES
• decided dynamically during tagging
• tag of the two tokens preceding the current token.
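The difference between the two kinds of features shows up in the tagging loop: static features are known before tagging starts, while dynamic features only exist once earlier tokens have been tagged. The sketch below assumes left-to-right tagging. `toy_classify` is a hypothetical stand-in for the trained model and is not part of TagPro or YamCha.

```python
def tag_sentence(static_rows, classify):
    """Tag left to right; static_rows[i] holds the static features of token i."""
    tags = []
    n = len(static_rows)
    for i in range(n):
        # Static context: previous, current and following word (None outside the sentence).
        window = [static_rows[j] if 0 <= j < n else None for j in (i - 1, i, i + 1)]
        # Dynamic context: tags already assigned to the two preceding tokens.
        previous_tags = [tags[j] if j >= 0 else None for j in (i - 2, i - 1)]
        tags.append(classify(window, previous_tags))
    return tags

def toy_classify(window, previous_tags):
    """Hypothetical rule: tag a word NN right after an ART, otherwise ART."""
    return "NN" if previous_tags[-1] == "ART" else "ART"

print(tag_sentence([{"word": "il"}, {"word": "libro"}], toy_classify))
```

Because each decision feeds into the next one, an early tagging error can propagate through the dynamic features of the following tokens.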
An Example of Feature Extraction
Input tokens and their tags:
l' ART
ex ADJ
leader NN
socialista ADJ
Bettino NN_P
Craxi NN_P
Extracted feature rows (one token per line, tag in the last column):
l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P
Finding the best features
EAGLES TagSet                      Accuracy   UTAccuracy
baseline                           86.70      59.95
+AFFIX +ORTHO                      +8.56      +25.56
+AFFIX +ORTHO +MORPHO              +10.69     +33.18
+AFFIX +ORTHO +MORPHO +GAZETT      +10.72     +33.13
Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1
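The table reports gains rather than absolute scores. The sketch below converts them, assuming each gain is expressed in percentage points relative to the baseline row (not relative to the row above it). The variable and function names are hypothetical.

```python
BASELINE = (86.70, 59.95)  # (Accuracy, UTAccuracy) of the WORD-only baseline

GAINS = [
    ("+AFFIX +ORTHO", 8.56, 25.56),
    ("+AFFIX +ORTHO +MORPHO", 10.69, 33.18),
    ("+AFFIX +ORTHO +MORPHO +GAZETT", 10.72, 33.13),
]

def absolute_scores(baseline, gains):
    """Add each gain to the baseline and round to two decimals."""
    base_acc, base_ut = baseline
    return [(name, round(base_acc + acc, 2), round(base_ut + ut, 2))
            for name, acc, ut in gains]

for name, acc, ut in absolute_scores(BASELINE, GAINS):
    print(name, acc, ut)
```

Under this reading the full feature set reaches 97.42 accuracy, which is the starting point used in the window-size experiments.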
Finding the best window-size
EAGLES TagSet
STAT     DYN    Accuracy
+1,-1    -1     97.42
+2,-2    -2     -0.34
+1,-1    -2     +0.23
+1,-1    -3     +0.22
Given the best set of features (accuracy 97.42), we tried to improve accuracy by changing the window-size.
Multi-class problem: pair-wise / one vs rest
one vs rest:
• fewer, bigger classifiers
pairwise:
• a classifier for each possible pair of classes
• choose the classifier with best confidence
• many relatively small classifiers
• faster, less memory
EAGLES TagSet
method         Accuracy
pairwise       97.65
one vs rest    97.78
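The "fewer bigger" versus "many relatively small" contrast follows from how many binary classifiers each scheme trains: one per class for one vs rest, one per pair of classes for pairwise. A short sketch of the count, where the class count of 32 is only an illustrative value and the function name is hypothetical:

```python
def n_binary_classifiers(n_classes, method):
    """Number of binary SVMs needed to cover n_classes with the given scheme."""
    if method == "one-vs-rest":
        return n_classes
    if method == "pairwise":
        return n_classes * (n_classes - 1) // 2
    raise ValueError("unknown method: %s" % method)

n_tags = 32  # illustrative number of tags
print(n_binary_classifiers(n_tags, "one-vs-rest"))
print(n_binary_classifiers(n_tags, "pairwise"))
```

Each pairwise classifier is trained only on the examples of its two classes, which is why the individual models stay small even though there are many of them.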
Evaluating the best algorithm: PKI vs. PKE
EAGLES TagSet   Accuracy
PKI             97.78
PKE             97.64
YamCha uses two implementations of SVMs, PKI and PKE:
• both are faster than the original SVMs
• PKI (3-12 x faster) produces the same accuracy as the original SVMs
• PKE (10-300 x faster) approximates the original SVM: slightly less accurate but much faster
Results on the development set
                                                    EAGLES   DISTRIB
Accuracy                                            97.78    97.52
Known Words: 40,320 (MorphoPro coverage: 96.20%)
  Accuracy                                          98.29    97.95
Unknown Words: 4,396 (MorphoPro coverage: 84.41%)
  Accuracy                                          93.03    93.56
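The overall accuracy can be cross-checked against the known-word and unknown-word figures as a token-weighted average. The sketch below uses only the numbers from this slide. The function name is hypothetical, and small deviations from the reported totals are expected because the per-group accuracies are themselves rounded.

```python
def combined_accuracy(groups):
    """Token-weighted average of (token_count, accuracy) pairs."""
    total_tokens = sum(count for count, _ in groups)
    return sum(count * accuracy for count, accuracy in groups) / total_tokens

# (token count, accuracy) for known and unknown words, per tagset.
eagles = combined_accuracy([(40320, 98.29), (4396, 93.03)])
distrib = combined_accuracy([(40320, 97.95), (4396, 93.56)])
print(round(eagles, 2), round(distrib, 2))
```

Both values come out close to the reported 97.78 and 97.52, which suggests the overall figures are computed over known and unknown words together.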
Test Results
TagSet    Accuracy   UTAccuracy
EAGLES    98.04      95.02
DISTRIB   97.68      94.65
Conclusions
• A statistical approach to PoS-tagging for Italian based on YamCha / SVMs.
• Results confirm that SVMs can deal with a large number of features without overfitting.
• We used the same best configuration for both tagsets.
• No specific method was applied for classifying unknown words.
• Features: AFFIX+ORTHO: +8.56 over baseline; MORPHO: +2.13 improvement over AFFIX+ORTHO; GAZETTeers do not contribute any further significant improvement.
• Features for unknown words: AFFIX+ORTHO: +25.56; MORPHO: +7.62
• No benefit from a larger context (e.g. window-size +2,-2 and more)
TagPro
TagPro is a system for PoS-tagging based on YamCha.
YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo):
• is a generic, customizable, and open source text chunker
• is based on Support Vector Machines (SVMs)
TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
The system is part of TextPro, a suite of NLP tools developed at FBK-irst.
PRON_PER V_AVERE V_PP ADV ART ADJ NN NN_P P_OTH PREP_A P_EOS V_GVRB PREP CONJ_S ADJ_DIM V_MOD CONJ_C V_ESSERE PRON_REL PRON_DIM PRON_IND ADJ_IND PRON_IES V_CLIT ADJ_NUM C_NUM ADJ_POS ADJ_IES PRON_POS NULL P_APO INT
PRON_PER 972 0 0 20 12 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V_AVERE 0 521 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V_PP 0 0 1188 2 0 97 19 0 0 0 0 2 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
ADV 9 0 2 2248 0 12 10 3 0 0 0 0 20 14 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 0
ART 6 0 0 0 3780 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0
ADJ 0 0 97 14 0 2454 120 3 0 0 0 6 2 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0
NN 0 0 21 23 0 114 8673 43 0 0 0 24 0 1 0 0 0 1 0 0 1 0 2 0 6 2 0 0 0 1 0 0
NN_P 0 0 0 1 1 3 15 1847 0 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0
P_OTH 0 0 0 0 0 0 0 0 3476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
PREP_A 0 0 0 0 0 2 0 1 0 2655 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P_EOS 0 0 0 0 0 0 0 0 2 0 1786 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V_GVRB 0 0 7 2 0 15 27 0 0 0 0 2754 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
PREP 0 0 0 5 0 0 1 0 0 1 0 0 4230 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CONJ_S 1 0 0 7 0 0 1 0 0 0 0 0 4 598 0 0 3 0 70 0 0 0 0 0 0 0 0 0 0 0 0 0
ADJ_DIM 0 0 0 0 0 2 0 0 0 0 0 0 0 0 261 0 0 0 0 8 0 7 0 0 0 0 1 0 0 0 0 0
V_MOD 0 0 1 0 0 3 0 0 0 0 0 4 0 0 0 285 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CONJ_C 0 0 0 2 0 0 3 0 0 1 0 0 1 10 0 0 1759 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
V_ESSERE 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0 1 1285 0 0 0 0 0 0 1 0 0 0 0 0 0 0
PRON_REL 0 0 0 1 0 0 0 0 0 0 0 0 0 24 0 0 0 0 664 0 0 2 2 0 0 0 0 0 0 0 0 0
PRON_DIM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 191 0 0 0 0 0 0 0 0 0 0 0 0
PRON_IND 0 0 0 7 4 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 214 16 0 0 0 0 0 0 0 0 0 0
ADJ_IND 0 0 0 6 0 4 1 0 0 0 0 0 1 0 0 0 0 0 0 0 5 395 0 0 0 0 0 0 0 0 0 0
PRON_IES 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 19 0 0 0 0 0 0 0 0 0
V_CLIT 0 3 7 0 0 0 8 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 232 0 0 0 0 0 0 0 0
ADJ_NUM 0 0 0 0 2 4 3 2 0 0 0 0 3 0 0 0 0 0 0 0 2 3 0 0 314 0 0 0 0 0 0 0
C_NUM 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 348 0 0 0 0 0 0
ADJ_POS 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 351 0 0 0 0 0
ADJ_IES 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 3 0 0 2 0 0 0 0 0 8 0 0 0 0
PRON_POS 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0
NULL 0 0 0 0 0 1 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 0 0
P_APO 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0
INT 0 0 0 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
Confusion matrix
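Per-tag figures can be read off a confusion matrix like this one. The sketch below uses a 3x3 excerpt (V_PP, ADJ, NN) copied from the matrix above. The slides do not state which axis holds the gold tags, so the code assumes rows are gold tags and columns are predicted tags. Because the other 29 tags are left out, the resulting values only describe confusions among these three tags and are not the system's full per-tag scores.

```python
LABELS = ["V_PP", "ADJ", "NN"]

# Excerpt of the confusion matrix; assumption: rows = gold tag, columns = predicted tag.
MATRIX = [
    [1188, 97, 19],    # V_PP
    [97, 2454, 120],   # ADJ
    [21, 114, 8673],   # NN
]

def recall(matrix, i):
    """Share of row i that falls on the diagonal."""
    return matrix[i][i] / sum(matrix[i])

def precision(matrix, i):
    """Share of column i that falls on the diagonal."""
    return matrix[i][i] / sum(row[i] for row in matrix)

for i, label in enumerate(LABELS):
    print(label, round(precision(MATRIX, i), 4), round(recall(MATRIX, i), 4))
```

The excerpt already shows the main confusions in the matrix: past participles (V_PP) versus adjectives, and adjectives versus nouns.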