totale: Multilingual Tokenisation, Tagging and Lemmatisation
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
JRC Workshop, 26-27 September 2005
Overview of the talk
1. Introduction
2. The totale pipeline
3. Training totale
4. Annotating JRC-ACQUIS-sl
5. Conclusions
Introduction
Hypothesis: to efficiently exploit the JRC-ACQUIS, its texts need to be linguistically pre-processed
This normalizes (reduces) the data and gives other tools more features to work with
Example
2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

TOKEN      TYPE      LEMMA      MSD
----------------------------------------
2.         TOK_ENUM  2.         Rmp
(a)        TOK_ENUM  (a)        Rmp
Where      TOK       where      Cs
an         TOK       a          Di
exporter   TOK       exporter   Ncns
has        TOK       have       Vaip3s
declared   TOK       declare    Vmps
goods      TOK       good       Ncnp
packaged   TOK       package    Vmis
using      TOK       use        Vmpp
automatic  TOK       automatic  Afp
systems    TOK       system     Ncnp
for        TOK       for        Sp
bagging    TOK       bag        Vmpp
,          PUN
canning    TOK       can        Vmpp
,          PUN
bottling   TOK       bottle     Vmpp
,          PUN
etc.       TOK_ABBR  etc.       Rmp

MSD and LEMMA are context dependent
MSD is useful for any syntactically oriented further processing (PoS filtering)
LEMMA is useful for reducing the lexical space (easier searches)
Task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!
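The two uses named on this slide, PoS filtering by MSD and reduction of the lexical space by lemma, can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the example table, while the function name `lemmas_with_pos` and the idea of filtering on the first MSD letter are assumptions of the sketch, not part of totale.

```python
from collections import Counter

# A few rows from the example slide: (token, type, lemma, MSD).
ROWS = [
    ("Where", "TOK", "where", "Cs"),
    ("an", "TOK", "a", "Di"),
    ("exporter", "TOK", "exporter", "Ncns"),
    ("has", "TOK", "have", "Vaip3s"),
    ("declared", "TOK", "declare", "Vmps"),
    ("goods", "TOK", "good", "Ncnp"),
    ("packaged", "TOK", "package", "Vmis"),
    ("systems", "TOK", "system", "Ncnp"),
]


def lemmas_with_pos(rows, pos_prefix):
    """Return the lemmas whose MSD starts with the given prefix (hypothetical helper)."""
    return [lemma for _tok, _typ, lemma, msd in rows if msd.startswith(pos_prefix)]


nouns = lemmas_with_pos(ROWS, "N")   # PoS filtering: keep nouns only
verbs = lemmas_with_pos(ROWS, "V")   # PoS filtering: keep verbs only

# Lexical-space reduction: count lemmas instead of surface forms.
lemma_counts = Counter(lemma for _tok, _typ, lemma, _msd in ROWS)
```

Searching on `lemma_counts` rather than on word forms means that "goods" and "good" fall into the same bucket, which is the "easier searches" effect the slide refers to.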
Nagging doubts
Normalization loses information
Annotation introduces errors and bias
Evaluation for IE non-conclusive
Unsupervised methods!
Still…
Wanted
A tool that would take text in any language and
  tokenise,
  PoS tag and
  lemmatise it.
Should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models
(and work under Linux!)
What is out there
Component software: tokenisers, taggers, (stemmers)
FS/RE environments: INTEX, CLARK
Various LT workbenches, most famous GATE
Alas: Java, time investment, history
Linguistic annotation with totale
Multilingual tokenisation, tagging and lemmatisation (to-ta-le)
Perl program with a simple pipeline architecture
Input is plain UTF-8 text
Output is a list of annotated tokens
Several output formats (tabular, XML)
Example use
$ totale -l en
Doctor, can you help?
^D
<TEXT>
Doctor  TOK       doctor  Ncfs
,       PUN
can     TOK       can     Voip
you     TOK       you     Pp2
help    TOK       help    Vmn
?       PUN_TERM
<S/>
</TEXT>
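A downstream tool would typically read this tabular output back into sentences. The Python sketch below shows one way to do that. It assumes that columns are separated by whitespace, that `<S/>` closes a sentence, and that punctuation rows carry only a token and a type, which is how the example above looks; the function name `parse_tabular` is hypothetical, and tokens containing spaces would need a stricter (e.g. tab-based) split.

```python
def parse_tabular(output):
    """Group token rows into sentences; a <S/> line closes the current sentence."""
    sentences, current = [], []
    for line in output.splitlines():
        line = line.strip()
        if not line or line in ("<TEXT>", "</TEXT>"):
            continue                      # skip blank lines and the text wrapper
        if line == "<S/>":
            if current:
                sentences.append(current)
            current = []
            continue
        fields = line.split()
        # Punctuation rows have no lemma or MSD: pad them with None.
        fields += [None] * (4 - len(fields))
        current.append(tuple(fields[:4]))
    if current:                           # text that ends without a final <S/>
        sentences.append(current)
    return sentences


SAMPLE = """<TEXT>
Doctor TOK doctor Ncfs
, PUN
can TOK can Voip
you TOK you Pp2
help TOK help Vmn
? PUN_TERM
<S/>
</TEXT>"""

sentences = parse_tabular(SAMPLE)
```

Each sentence is a list of `(token, type, lemma, msd)` tuples, so later steps can work on whichever column they need.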
Totale building blocks
(Diagram: three components, mlToken, TnT and CLOG, each with its multilingual resources, combined in Perl)
Tokenisation in totale
Perl module mlToken.pm (Camelia Ignat, JRC)
Multilingual, with resource files for supported languages (also default rules)
Splits text into tokens, marks token type
Marks paragraph and sentence boundaries
Modelled on mtSeg
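To make the idea of a resource-driven tokeniser concrete, here is a deliberately small Python sketch. It is not mlToken: the abbreviation table, the punctuation sets and the function `tokenise` are all assumptions made for illustration, and only the type labels (TOK, TOK_ABBR, PUN, PUN_TERM) are taken from the example outputs in this talk. Leading punctuation such as brackets, enumerators and sentence splitting are left out.

```python
# Hypothetical per-language resource: known abbreviations. A language without
# an entry falls back to the default rules (no abbreviations).
ABBREVIATIONS = {"en": {"etc.", "e.g.", "i.e."}}

TERMINATORS = ".?!"   # sentence-final punctuation -> PUN_TERM
OTHER_PUNCT = ",;:"   # other punctuation -> PUN


def tokenise(text, lang="en"):
    """Split text into (token, type) pairs using whitespace and trailing punctuation."""
    abbrs = ABBREVIATIONS.get(lang, set())
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel punctuation off the end, but stop once the rest is a known abbreviation.
        while chunk and chunk not in abbrs and chunk[-1] in TERMINATORS + OTHER_PUNCT:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append((chunk, "TOK_ABBR" if chunk in abbrs else "TOK"))
        for mark in reversed(trailing):
            tokens.append((mark, "PUN_TERM" if mark in TERMINATORS else "PUN"))
    return tokens


example = tokenise("Doctor, can you help?")
```

Keeping the language-specific knowledge in a table rather than in the code is what makes adding a new language a matter of adding a resource file.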
Tagging in totale
Annotating words in the text with their context-disambiguated morphosyntactic annotations (MSDs)
Used the tri-gram tagger TnT
Trainable, fast, unknown-word guessing module, able to accommodate the large morphosyntactic tagsets of various EU languages
Uses (and induces from annotated corpus) a lexicon with ambiguity classes and tri-gram file
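The last point, inducing a lexicon with ambiguity classes and tri-gram statistics from an annotated corpus, can be illustrated with the Python sketch below. It only shows the counting step under simplifying assumptions: the data structures, the `<s>`/`</s>` padding symbols and the toy corpus are inventions of this sketch and do not reproduce TnT's file formats, smoothing or unknown-word handling.

```python
from collections import Counter, defaultdict


def induce_model(tagged_sentences):
    """Return (lexicon, trigrams) counted from sentences of (word, tag) pairs."""
    lexicon = defaultdict(Counter)   # word form -> Counter of tags seen with it
    trigrams = Counter()             # (tag1, tag2, tag3) -> frequency
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word][tag] += 1
        # Pad the tag sequence so that sentence-initial context is counted too.
        tags = ["<s>", "<s>"] + [tag for _word, tag in sentence] + ["</s>"]
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams


def ambiguity_class(lexicon, word):
    """The sorted list of tags a word form was observed with (empty if unseen)."""
    return sorted(lexicon.get(word, {}))


# Toy corpus (hypothetical); the tags are MSDs that appear elsewhere in the talk.
CORPUS = [
    [("can", "Voip"), ("you", "Pp2"), ("help", "Vmn")],
    [("a", "Di"), ("can", "Ncns")],
]
lexicon, trigrams = induce_model(CORPUS)
```

Here `ambiguity_class(lexicon, "can")` contains both a verbal and a nominal MSD; choosing between them in context is the job of the tri-gram statistics.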
Lemmatisation in totale
Used CLOG, which learns first-order decision lists (+ list of exceptions)
Learns lemmatisation rules for each MSD
CLOG produces Prolog programs, but these are converted into Perl

Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.
Example CLOG rule
sub SUB_afcfda {
  my $w = $_[0]; my $lem;
  # Ordered decision list: the first matching suffix pattern wins.
  if    ($w=~/^(.*)svetlejši$/) {$lem=$1."svetel"}
  elsif ($w=~/^(.*)polnejši$/)  {$lem=$1."poln"}
  elsif ($w=~/^(.*)bši$/)       {$lem=$1."b"}
  elsif ($w=~/^(.*)elejši$/)    {$lem=$1."el"}
  elsif ($w=~/^(.*)ivejši$/)    {$lem=$1."iv"}
  elsif ($w=~/^(.*)anejši$/)    {$lem=$1."an"}
  elsif ($w=~/^(.*)kejši$/)     {$lem=$1."ek"}
  elsif ($w=~/^(.*)tejši$/)     {$lem=$1."t"}
  elsif ($w=~/^(.*)ižji$/)      {$lem=$1."izek"}
  elsif ($w=~/^(.*)enejši$/)    {$lem=$1."en"}
  elsif ($w=~/^(.*)rejši$/)     {$lem=$1."er"}
  elsif ($w=~/^(.*)nejši$/)     {$lem=$1."en"}
  else  {$lem="???"}   # no rule applies
  return $lem;
}
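The same rule can be read as data: an ordered list of suffix rewrites that is tried top to bottom, with an optional exception list consulted first. The Python sketch below restates the slide's rule in that form. It assumes that the `#353` and `#382` sequences in the extracted slide stand for š and ž; the `lemmatise` function and the exception entry in the usage line are hypothetical and only illustrate the "+ list of exceptions" idea.

```python
# Ordered (suffix, replacement) pairs transcribed from the Perl rule above.
RULES_AFCFDA = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]


def lemmatise(word, rules, exceptions=None):
    """Apply the first rule whose suffix matches; exceptions take precedence."""
    if exceptions and word in exceptions:
        return exceptions[word]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return "???"   # same fallback marker as the Perl rule


lemma = lemmatise("svetlejši", RULES_AFCFDA)
with_exception = lemmatise("boljši", RULES_AFCFDA, {"boljši": "dober"})
```

Order matters: "enejši" has to be tried before the more general "nejši", otherwise "zelenejši" would come out with a doubled "en".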
Training totale with MULTEXT-East resources
Learning totale tagging and lemmatisation models
MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
Covers mainly Central and Eastern European languages
Freely available for research use from http://nl.ijs.si/ME/V3/
Used the MSD-tagged “1984” corpus (100k words) for tagger training
Used MSD lexica (15k lemmas) for lemmatiser training
Currently supported languages
English
Slovene
Czech
Romanian
Serbian
Estonian
Hungarian
Processing JRC’s ACQUIS-sl with totale
sl.tar.gz 03-Sep-2005 03:51 34.4M
sl/slcelex_*.xml = 144M, 7772 files
Wrapper perl program, for each file:
1. extract text (all <P>s except first)
2. | totale -l sl -f XML |
3. substitute contents of original <P>s with annotated ones
4. validate against DTD
72 hrs on asterix, but the 10 s startup time per file alone = 77,720 s, about 21.6 hrs
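The wrapper loop and the startup-time arithmetic can be sketched as follows. This is a Python illustration rather than the original Perl wrapper. The annotator is a stand-in (`stub_annotate`) because calling the real `totale -l sl -f XML` needs the tool installed; the element name `w` is borrowed from the results slide, and the sample document is hypothetical. DTD validation (step 4) is omitted because Python's standard library has no validating parser.

```python
import xml.etree.ElementTree as ET


def stub_annotate(text):
    """Stand-in for piping text through totale: wraps each whitespace token in <w>."""
    words = []
    for token in text.split():
        w = ET.Element("w")
        w.text = token
        words.append(w)
    return words


def annotate_document(xml_text, annotate):
    """Steps 1-3: replace the content of every <P> except the first with annotated tokens."""
    root = ET.fromstring(xml_text)
    paragraphs = list(root.iter("P"))
    for p in paragraphs[1:]:                 # the first <P> is left untouched
        original = "".join(p.itertext())     # step 1: extract the plain text
        for child in list(p):
            p.remove(child)
        p.text = None
        p.extend(annotate(original))         # steps 2-3: annotate and substitute
    return ET.tostring(root, encoding="unicode")


# Startup overhead from the slide: one 10 s start per file.
FILES, STARTUP_SECONDS = 7772, 10
startup_hours = FILES * STARTUP_SECONDS / 3600

annotated = annotate_document("<text><P>Title</P><P>Prva poved.</P></text>", stub_annotate)
```

The arithmetic suggests why a long-running annotator process, fed one file after another, would be attractive: with one process per file, the startup overhead alone is more than a quarter of the 72 hours.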
The problem of titles
Dual role of titles: as text and name of document
Should they contain P at all?
Many titles untranslated; experiment with TextCat:
  4,964  sl
  1,663  en “Ni na razpolago v slovenskem jeziku” (“Not available in the Slovene language”)
  1,074  en
     59  sl or en
     12  en or sl
Also cases like “ODLOCBA t. 1346/2001/ES …”
So, did not process them.
Quantitative results: elements
<document> / <text>        7,771
<signature>                7,683
<annex>                    3,658
<P>                    1,063,577
<c>                    2,865,307
    #IMPLIED           2,452,541
    TERM                 412,766
<w>                   15,934,003
    #IMPLIED          14,393,953
    DIG                1,036,076
    ENUM                 331,426
    ABBR                 159,022
    MW                    11,048
    TAG                    2,234
    URL                      108
    EMAIL                     47
Lexical analysis
Extracted the MULTEXT lexicon from corpus:
  …
   8 rafinacija    rafinacija    Ncfsn
   2 rafinacije    rafinacija    Ncfpa
  40 rafinacije    rafinacija    Ncfsg
   2 rafinacije15  rafinacije15  Mc---d
  26 rafinaciji    rafinacij     Npmpn
   9 rafinaciji    rafinacija    Ncfsl
  17 rafinacijo    rafinacija    Ncfsa
  …
Number of lexical entries: 381,068
Different word-forms: 221,876
Different lemmas: 154,241
Different MSDs: 970
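Extracting such a lexicon amounts to counting distinct (word-form, lemma, MSD) triples over the annotated tokens and then counting the distinct values in each column. A minimal Python sketch follows. The function name `extract_lexicon` is hypothetical, and the input is rebuilt from five rows of the excerpt above rather than read from the corpus.

```python
from collections import Counter


def extract_lexicon(tokens):
    """Count (word-form, lemma, MSD) triples and summarise the resulting lexicon."""
    entries = Counter(tokens)
    summary = {
        "entries": len(entries),
        "word_forms": len({form for form, _lemma, _msd in entries}),
        "lemmas": len({lemma for _form, lemma, _msd in entries}),
        "msds": len({msd for _form, _lemma, msd in entries}),
    }
    return entries, summary


# Token stream rebuilt from five rows of the excerpt (frequency x triple).
TOKENS = (
    [("rafinacija", "rafinacija", "Ncfsn")] * 8
    + [("rafinacije", "rafinacija", "Ncfpa")] * 2
    + [("rafinacije", "rafinacija", "Ncfsg")] * 40
    + [("rafinaciji", "rafinacij", "Npmpn")] * 26
    + [("rafinaciji", "rafinacija", "Ncfsl")] * 9
)
entries, summary = extract_lexicon(TOKENS)
```

On the full corpus the same counts correspond to the four figures on the slide: lexical entries, different word-forms, different lemmas and different MSDs.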
Some problems
Complex tokenisation, over 15% “weird” words:
  priloge.opomba     priloge.opomba     Ncfsn
  who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
  zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi
Weak tagging model (likes verbs!):
  3 anion   anion     Ncmsa--n
  4 anion   anion     Ncmsn
  1 anion   anion     Npmsn
  3 anion   anion     Vmp--smp
  6 aniona  anion     Ncmsg
  8 anione  anion     Ncmpa
  1 anioni  anioen    Afpmsny
  1 anioni  anion     Ncmpn
  1 anioni  anioni    Vmp--pmp
  1 anioni  anioniti  Vmip3s--n
Conclusions
Presented processing with totale on ACQUIS-sl and a quick evaluation
Further work:
  - methodology of semi-manual annotation (model tweaking)
  - “lexical priming” in totale
Translations and collocates