
totale: Multilingual Tokenisation, Tagging and Lemmatisation

Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute
Ljubljana, Slovenia

JRC Workshop, 26-27 September 2005

Overview of the talk

1. Introduction
2. The totale pipeline
3. Training totale
4. Annotating JRC-ACQUIS-sl
5. Conclusions

Introduction

Hypothesis: to exploit JRC-ACQUIS efficiently, its texts need to be linguistically pre-processed

This normalizes (reduces) the data and gives other tools more features to work with

Example

2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

TOKEN      TYPE      LEMMA      MSD
---------------------------------------
2.         TOK_ENUM  2.         Rmp
(a)        TOK_ENUM  (a)        Rmp
Where      TOK       where      Cs
an         TOK       a          Di
exporter   TOK       exporter   Ncns
has        TOK       have       Vaip3s
declared   TOK       declare    Vmps
goods      TOK       good       Ncnp
packaged   TOK       package    Vmis
using      TOK       use        Vmpp
automatic  TOK       automatic  Afp
systems    TOK       system     Ncnp
for        TOK       for        Sp
bagging    TOK       bag        Vmpp
,          PUN
canning    TOK       can        Vmpp
,          PUN
bottling   TOK       bottle     Vmpp
,          PUN
etc.       TOK_ABBR  etc.       Rmp

MSD and LEMMA are context dependent
MSD useful for any syntactically oriented further processing (PoS filtering)
LEMMA useful for reducing the lexical space (easier searches)
Task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!
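The two uses named on this slide, PoS filtering on the MSD and reducing the lexical space via the lemma, can be sketched as follows. This is an illustration in Python rather than the tool's Perl. It relies on the MULTEXT convention that the first character of an MSD encodes the part of speech (N for noun, V for verb); the `Token` record and the helper names are hypothetical and not part of totale. The rows are a subset of the example table.

```python
from collections import namedtuple

# Hypothetical record type for one annotated token (not a totale structure).
Token = namedtuple("Token", "form type lemma msd")

# A few rows copied from the example; punctuation carries no lemma or MSD.
ROWS = [
    Token("exporter", "TOK", "exporter", "Ncns"),
    Token("has", "TOK", "have", "Vaip3s"),
    Token("declared", "TOK", "declare", "Vmps"),
    Token("goods", "TOK", "good", "Ncnp"),
    Token(",", "PUN", None, None),
    Token("systems", "TOK", "system", "Ncnp"),
]


def filter_by_pos(tokens, category):
    """Keep tokens whose MSD starts with the given category letter."""
    return [t for t in tokens if t.msd and t.msd.startswith(category)]


def lemma_vocabulary(tokens):
    """Return the set of distinct lemmas, ignoring tokens without one."""
    return {t.lemma for t in tokens if t.lemma}


nouns = filter_by_pos(ROWS, "N")
vocabulary = lemma_vocabulary(ROWS)
```

Searching the lemma set finds "good" for the word-form "goods", which is the kind of reduction the slide refers to.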

Nagging doubts

Normalization loses information
Annotation introduces errors and bias
Evaluation for IE non-conclusive
Unsupervised methods!

Still…

Wanted

A tool that would take text in any language and tokenise, PoS tag and lemmatise it.

Should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models

(and work under Linux!)

What is out there

Component software: tokenisers, taggers, (stemmers)
FS/RE environments: INTEX, CLARK
Various LT workbenches, most famous GATE
Alas: Java, time investment, history

Linguistic annotation with totale

Multilingual tokenisation, tagging and lemmatisation
Perl program with a simple pipeline architecture
Input is plain UTF-8 text
Output is a list of annotated tokens
Several output formats (tabular, XML)

Example use

    $ totale -l en
    Doctor, can you help?
    ^D

    <TEXT>
    Doctor  TOK       doctor  Ncfs
    ,       PUN
    can     TOK       can     Voip
    you     TOK       you     Pp2
    help    TOK       help    Vmn
    ?       PUN_TERM
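A downstream script might read the tabular output into records along these lines. This is a sketch only: it assumes the columns are separated by whitespace and that lines starting with `<` (such as `<TEXT>`) are markup rather than tokens, neither of which the slides state. `parse_tabular` is a hypothetical helper, not part of totale.

```python
SAMPLE = """<TEXT>
Doctor TOK doctor Ncfs
, PUN
can TOK can Voip
you TOK you Pp2
help TOK help Vmn
? PUN_TERM
"""


def parse_tabular(output):
    """Turn tabular output into (form, type, lemma, msd) tuples.

    Assumption: whitespace-separated columns; punctuation rows have only
    the first two columns, so lemma and msd are None for them.
    """
    records = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("<"):
            continue  # skip blank lines and markup lines such as <TEXT>
        lemma = fields[2] if len(fields) > 2 else None
        msd = fields[3] if len(fields) > 3 else None
        records.append((fields[0], fields[1], lemma, msd))
    return records


records = parse_tabular(SAMPLE)
```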

Totale building blocks

[Figure: the three components mlToken, TnT and CLOG, each with its multilingual resources, combined in Perl]

Tokenisation in totale

Perl module mlToken.pm (Camelia Ignat, JRC)
Multilingual, with resource files for supported languages (also default rules)
Splits text into tokens, marks token type
Marks paragraph and sentence boundaries
Modelled on mtSeg
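To make "splits text into tokens, marks token type" concrete, here is a deliberately small sketch in Python. It is not mlToken: the abbreviation set stands in for a per-language resource file, the rule for `PUN_TERM` is a guess, and only the type names `TOK`, `TOK_ABBR`, `PUN` and `PUN_TERM` are taken from the example output in this talk.

```python
# Hypothetical stand-in for a per-language abbreviation resource.
ABBREVIATIONS = {"etc."}


def punct_type(char):
    """Sentence-final punctuation gets PUN_TERM, everything else PUN (assumed rule)."""
    return "PUN_TERM" if char in ".?!" else "PUN"


def tokenise(text):
    """Split on whitespace, then peel punctuation off both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk and not chunk[0].isalnum():
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Stop peeling as soon as what remains is a known abbreviation.
        while chunk and chunk not in ABBREVIATIONS and not chunk[-1].isalnum():
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend((p, punct_type(p)) for p in leading)
        if chunk:
            tokens.append((chunk, "TOK_ABBR" if chunk in ABBREVIATIONS else "TOK"))
        tokens.extend((p, punct_type(p)) for p in trailing)
    return tokens


tokens = tokenise("Doctor, can you help?")
```

A real tokeniser also has to handle enumerators, digits, URLs and multi-word units, which this sketch leaves out.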

Tagging in totale

Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
Used the tri-gram tagger TnT
Trainable, fast, unknown-word guessing module, able to accommodate the large morphosyntactic tagsets of various EU languages
Uses (and induces from annotated corpus) a lexicon with ambiguity classes and tri-gram file
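The two resources induced from an annotated corpus, a lexicon listing the possible tags of each word (its ambiguity class) and a table of tag tri-gram counts, can be illustrated with a toy sketch. This shows the idea only; it does not reproduce TnT's training procedure, smoothing or file formats. The second training sentence and its tag `Ncns` for "can" are invented for the illustration, and the `<s>`/`</s>` padding symbols are an assumption.

```python
from collections import Counter, defaultdict


def induce_model(tagged_sentences):
    """Collect a word -> tag-count lexicon and tag tri-gram counts."""
    lexicon = defaultdict(Counter)  # word form -> Counter of tags seen with it
    trigrams = Counter()            # (tag1, tag2, tag3) -> frequency
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word][tag] += 1
        # Pad with boundary symbols so sentence-initial tags get a context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence] + ["</s>"]
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams


def ambiguity_class(lexicon, word):
    """Sorted list of tags the word was seen with (empty if unknown)."""
    return sorted(lexicon.get(word, {}))


CORPUS = [
    [("can", "Voip"), ("you", "Pp2"), ("help", "Vmn")],
    [("a", "Di"), ("can", "Ncns")],  # invented sentence, for illustration only
]
lexicon, trigrams = induce_model(CORPUS)
```

A tagger then has to choose, for each ambiguous word, the tag that best fits the surrounding tag sequence; unknown words need a separate guessing component.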

Lemmatisation in totale

Used CLOG, which learns first-order decision lists (+ list of exceptions)
Learns lemmatisation rules for each MSD
CLOG produces Prolog programs, but these are converted into Perl

Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.

Example CLOG rule

    # Lemmatisation rule for one MSD: the first matching suffix pattern wins.
    sub SUB_afcfda {
      my $w = $_[0]; my $lem;
      if    ($w=~/^(.*)svetlejši$/) {$lem=$1."svetel"}
      elsif ($w=~/^(.*)polnejši$/)  {$lem=$1."poln"}
      elsif ($w=~/^(.*)bši$/)       {$lem=$1."b"}
      elsif ($w=~/^(.*)elejši$/)    {$lem=$1."el"}
      elsif ($w=~/^(.*)ivejši$/)    {$lem=$1."iv"}
      elsif ($w=~/^(.*)anejši$/)    {$lem=$1."an"}
      elsif ($w=~/^(.*)kejši$/)     {$lem=$1."ek"}
      elsif ($w=~/^(.*)tejši$/)     {$lem=$1."t"}
      elsif ($w=~/^(.*)ižji$/)      {$lem=$1."izek"}
      elsif ($w=~/^(.*)enejši$/)    {$lem=$1."en"}
      elsif ($w=~/^(.*)rejši$/)     {$lem=$1."er"}
      elsif ($w=~/^(.*)nejši$/)     {$lem=$1."en"}
      else                          {$lem="???"}
      return $lem;
    }
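The rule is an ordered decision list: patterns are tried top to bottom and the first match decides the lemma, so more specific suffixes must come before more general ones (`enejši` before `nejši`). The same list can be held as data, sketched here in Python. The suffix pairs are the ones from the rule with š and ž written out; `apply_decision_list` is a hypothetical helper, and `zelenejši` is an input chosen only to show the ordering.

```python
# (suffix to strip, string to append), in the order of the rule.
RULES_AFCFDA = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]


def apply_decision_list(word, rules):
    """Return the lemma from the first rule whose suffix matches, else '???'."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return "???"


lemma = apply_decision_list("nižji", RULES_AFCFDA)
```

Keeping the rules as data makes the order explicit: swapping the `enejši` and `nejši` entries would change the lemma produced for `zelenejši`.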

Training totale with MULTEXT-East resources

Learning totale tagging and lemmatisation models
MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
Covers mainly Central and Eastern European languages
Freely available for research use from http://nl.ijs.si/ME/V3/
Used the MSD-tagged “1984” corpus (100k words) for tagger training
Used MSD lexica (15k lemmas) for lemmatiser training

Currently supported languages

English
Slovene
Czech
Romanian
Serbian
Estonian
Hungarian

Processing JRC’s ACQUIS-sl with totale

sl.tar.gz 03-Sep-2005 03:51 34.4M
sl/slcelex_*.xml = 144M, 7772 files
Wrapper perl program, for each file:
1. extract text (all P elements except the first)
2. | totale -l sl -f XML |
3. substitute contents of the original P elements with the annotated ones
4. validate against DTD

72 hrs on asterix, but the 10 s startup time per file alone accounts for 7772 × 10 s = 77,720 s ≈ 21.6 hrs
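The startup figure is plain arithmetic over the numbers on this slide, worked through below. The `annotate` function shows how step 2 of the wrapper might be called from a script; it is a hypothetical sketch in Python (the actual wrapper was a Perl program) and assumes totale reads text on standard input and writes to standard output, as the pipe notation suggests. Only the flags `-l sl -f XML` come from the slide.

```python
import subprocess

FILES = 7772          # number of sl/slcelex_*.xml files
STARTUP_SECONDS = 10  # startup time per totale invocation
TOTAL_HOURS = 72      # reported wall-clock time


def startup_overhead_hours(n_files, startup_seconds):
    """Hours spent on startup alone when the tool is launched once per file."""
    return n_files * startup_seconds / 3600


def annotate(text):
    """Hypothetical step 2: pipe one document's text through totale."""
    result = subprocess.run(
        ["totale", "-l", "sl", "-f", "XML"],
        input=text, capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


overhead = startup_overhead_hours(FILES, STARTUP_SECONDS)
share = overhead / TOTAL_HOURS  # fraction of the run spent starting up
```

Roughly 30% of the run is startup, so an obvious saving would be to launch the tool once and feed it many documents, provided document boundaries can be recovered from its output.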

The problem of titles

Dual role of titles: as text and name of document
Should they contain P at all?
Many titles untranslated – experiment with TextCat:

4,964  sl
1,663  en  “Ni na razpolago v slovenskem jeziku” (“Not available in the Slovene language”)
1,074  en
   59  sl or en
   12  en or sl

Also cases like “ODLOCBA t. 1346/2001/ES …”

So, did not process them.

Quantitative results: elements

Element / type         Count
<document> / <text>        7,771
<signature>                7,683
<annex>                    3,658
[element name missing] 1,063,577
<c>                    2,865,307
    #IMPLIED           2,452,541
    TERM                 412,766
<w>                   15,934,003
    #IMPLIED          14,393,953
    DIG                1,036,076
    ENUM                 331,426
    ABBR                 159,022
    MW                    11,048
    TAG                    2,234
    URL                      108
    EMAIL                     47

Lexical analysis

Extracted the MULTEXT lexicon from corpus:

    …
     8  rafinacija    rafinacija    Ncfsn
     2  rafinacije    rafinacija    Ncfpa
    40  rafinacije    rafinacija    Ncfsg
     2  rafinacije15  rafinacije15  Mc---d
    26  rafinaciji    rafinacij     Npmpn
     9  rafinaciji    rafinacija    Ncfsl
    17  rafinacijo    rafinacija    Ncfsa
    …

Number of lexical entries: 381,068
Different word-forms: 221,876
Different lemmas: 154,241
Different MSDs: 970
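A frequency lexicon of this shape can be derived by counting distinct (word-form, lemma, MSD) triples over the annotated tokens, and the four summary figures are then set sizes. The sketch below uses only the seven rows shown above, expanded by their frequencies; the function names are hypothetical and the script actually used for the extraction is not described in the talk.

```python
from collections import Counter

# (frequency, word-form, lemma, MSD) rows as listed in the excerpt.
ROWS = [
    (8, "rafinacija", "rafinacija", "Ncfsn"),
    (2, "rafinacije", "rafinacija", "Ncfpa"),
    (40, "rafinacije", "rafinacija", "Ncfsg"),
    (2, "rafinacije15", "rafinacije15", "Mc---d"),
    (26, "rafinaciji", "rafinacij", "Npmpn"),
    (9, "rafinaciji", "rafinacija", "Ncfsl"),
    (17, "rafinacijo", "rafinacija", "Ncfsa"),
]


def extract_lexicon(tokens):
    """Count each distinct (word_form, lemma, msd) triple."""
    return Counter(tokens)


def summarise(lexicon):
    """Number of entries and of distinct word-forms, lemmas and MSDs."""
    return {
        "entries": len(lexicon),
        "word_forms": len({form for form, _, _ in lexicon}),
        "lemmas": len({lemma for _, lemma, _ in lexicon}),
        "msds": len({msd for _, _, msd in lexicon}),
    }


# Rebuild a token stream from the frequencies, then count it again.
tokens = [(form, lemma, msd) for n, form, lemma, msd in ROWS for _ in range(n)]
lexicon = extract_lexicon(tokens)
summary = summarise(lexicon)
```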

Some problems

Complex tokenisation – over 15% “weird” words:

    priloge.opomba     priloge.opomba     Ncfsn
    who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
    zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi

Weak tagging model (likes verbs!):

    3  anion   anion     Ncmsa--n
    4  anion   anion     Ncmsn
    1  anion   anion     Npmsn
    3  anion   anion     Vmp--smp
    6  aniona  anion     Ncmsg
    8  anione  anion     Ncmpa
    1  anioni  anioen    Afpmsny
    1  anioni  anion     Ncmpn
    1  anioni  anioni    Vmp--pmp
    1  anioni  anioniti  Vmip3s--n
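The talk does not say how "weird" words were identified. One possible heuristic, offered purely as an assumption, is to flag any word-form containing characters other than letters and hyphens and to report the flagged share:

```python
def is_weird(word_form):
    """True if the form contains anything besides letters and hyphens (assumed criterion)."""
    return not all(ch.isalpha() or ch == "-" for ch in word_form)


def weird_share(word_forms):
    """Fraction of the given word-forms flagged by is_weird."""
    forms = list(word_forms)
    if not forms:
        return 0.0
    return sum(is_weird(f) for f in forms) / len(forms)


FORMS = ["priloge.opomba", "who/fsf/fos/97.7", "zavarovalnica(-e)", "anion"]
share = weird_share(FORMS)
```

All three examples from the slide are flagged, while an ordinary form such as `anion` is not.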

Conclusions

Presented processing with totale on ACQUIS-sl and a quick evaluation

Further work:
– methodology of semi-manual annotation (model tweaking)
– “lexical priming” in totale

Translations and collocates
