Technology is an effective tool to promote use of Basque
Strategies to develop HLT for minority languages
IXA Research Group on NLP, University of the Basque Country
Dublin 2006
01/04/06 2
Outline
• Basque language
• Ixa Group
• Strategy to develop HLT
• Applications
• Tools
• Linguistic resources
History of Basque
Pre-Roman languages in Spain
Basque in the 7th, 12th and 19th centuries
Basque nowadays
Six different dialects!
1,033,900 speakers (first language: 700,000)
Non-homogeneous distribution!
Main reasons for the regression of Basque
• Not an official language
• Out of the education system
• 6 dialects!
• Out of the media
• Out of industry
But since 1980...
• Co-official language
• Integrated in education (even at university)
• Unified Basque (1966)
• TV, newspapers...
• Out of new ICTs???
Basque. Linguistic features
Case suffixes and free order of sentence components
"The dog brought the newspaper in his mouth"
Txakur-rak     egunkari-a       aho-an         zekarren.
The-dog        the-newspaper    in-his-mouth   brought
ergative-3-s   absolutive-3-s   inessive-3-s
Subject        Object           Modifier       Verb
Alternative possible orders:
Txakur-rak aho-an egunkari-a zekarren.
Txakur-rak aho-an zekarren egunkari-a.
Egunkari-a txakur-rak zekarren aho-an.
...
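The point of the slide above is that the case suffix, not the position, tells us each component's role. A minimal sketch of that idea follows. It assumes the toy hyphenated segmentation used on the slide (real Basque text is not hyphenated and needs a morphological analyser), and the names `SUFFIX_TO_ROLE` and `assign_roles` are hypothetical, introduced only for illustration.

```python
# Toy mapping from the suffixes, as hyphenated on the slide, to case and role.
# Simplification: absolutive is treated as "object", which holds for this
# transitive sentence only (in intransitive sentences it marks the subject).
SUFFIX_TO_ROLE = {
    "rak": ("ergative", "subject"),
    "a": ("absolutive", "object"),
    "an": ("inessive", "modifier"),
}


def assign_roles(sentence):
    """Assign a role to each component from its case suffix, ignoring order."""
    roles = {}
    for token in sentence.rstrip(".").split():
        stem, sep, suffix = token.partition("-")
        if sep and suffix in SUFFIX_TO_ROLE:
            _case, role = SUFFIX_TO_ROLE[suffix]
            roles[role] = stem.lower()
        else:
            # The only unsuffixed token in these examples is the verb.
            roles["verb"] = token.lower()
    return roles


ORDERS = [
    "Txakur-rak egunkari-a aho-an zekarren.",
    "Txakur-rak aho-an egunkari-a zekarren.",
    "Txakur-rak aho-an zekarren egunkari-a.",
    "Egunkari-a txakur-rak zekarren aho-an.",
]

analyses = [assign_roles(s) for s in ORDERS]
for sentence, analysis in zip(ORDERS, analyses):
    print(sentence, "->", analysis)
```

All four orders yield the same role assignment, which is what "free order of sentence components" means in practice.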
Basque. Linguistic features
Ergative case: subject of transitive verbs
– I am: Ni naiz (absolutive)
– I saw the cat: Nik katua ikusi nuen (ergative)
Agreement in number and person between the verb and its subject, object and indirect object
– I saw the cat: Nik katua ikusi nuen
– I saw the cats: Nik katuak ikusi nituen
– I saw you: Nik zu ikusi zintudan
IXA Research Group on NLP (UPV/EHU) (I)
Main research fields: NLP, computational linguistics, language engineering.
Goal: to collaborate on
– laying foundations for research;
– the development of language processing software.
Application language: mainly Basque.
http://ixa.si.ehu.es
IXA Research Group on NLP (UPV/EHU) (II)
1986/1987: 4-5 university lecturers (computer science)
2005/2006: interdisciplinary team
– 33 computer scientists
  • 19 lecturers (11 doctors)
  • 13 PhD students (research grants)
– 17 linguists
  • 6 lecturers (4 doctors)
  • 11 PhD students (research grants)
– 2 research assistants assigned to projects
IXA Group. Milestones
Timeline: 1987, 1990, 1995, 2000, 2006
Projects: Province Gov.; Madrid Cicyt; Basque Gov.; Europa (Meaning); Basque G. Industry; MT Madrid-Ind.
Companies (Basque C.): UZEI, Plazagune, Eusenor, Elhuyar, ASP, Diana
Companies abroad: Lexiquest, Microsoft, Eatoni, Google, Scansoft
Spin-off companies: Eleka
Products: Spelling checker; Lemmatizer; EDBL lexical DB; Parser; Basque WordNet; ArgazkiPress; Irion; MT system
Underlying strategy
Need for standardization of resources, so that they are useful in different
– research lines
– tools
– applications
Need for incremental design and development of language foundations, tools, and applications
– in a parallel and coordinated way
– in order to get the most benefit from them
Strategic priorities: from basic research to application development
Research & development: end-user applications, language tools
Basic & applied research: linguistic foundations, linguistic resources
Linguistic foundations & resources, tools and applications
Linguistic foundations and resources: the necessary infrastructure for the automatic processing of a language.
Tools: mainly intended for application developers.
Applications: commercial or non-commercial, for non-specialised end-users.
Phase I: laying foundations
Areas: Phonetics, Lexicon, Morphology, Syntax, Semantics
• Basic Lexical Database
• MRDs
• Computational description of morphology
• Raw corpus (written texts & speech recordings)
Phase II: first basic tools and applications
Areas: Phonetics, Lexicon, Morphology, Syntax, Semantics
• Lemmatiser/Tagger
• Morphological analyser
• Statistical tools for the treatment of corpora
• Computational description of morphology
• MRDs
• Morphologically annotated corpus
• Enriched Lexical Database
• Xuxen: spelling checker/corrector
Phase III: more advanced tools and applications
Areas: Phonetics, Lexicon, Morphology, Syntax, Semantics
• Morphological analyser
• Lemmatiser/Tagger
• Statistical tools for the treatment of corpora
• Morphologically and syntactically annotated corpus
• Lexical Database
• Environment for linguistic tools integration
• Basic CALL
• MRDs
• Computational description of morphology
• Xuxen: spelling checker/corrector
• Computational grammar
• Lexical-semantic KB
• Surface syntax analyser
• WSD
• Web crawler
• Grammar checker
• Electronic dictionaries
Phase IV: multilinguality and general applications
Areas: Phonetics, Lexicon, Morphology, Syntax, Semantics
• MRDs
• Lexical Database
• Computational description of morphology
• Morphological analyser
• Lemmatiser/Tagger
• Statistical tools for the treatment of corpora
• Computational grammar
• Environment for linguistic tools integration
• Electronic dictionaries
• Web crawler
• Grammar checker
• Information retrieval and extraction
• Translation aids, dialog systems, ...
• Syntax analyser
• WSD
• Xuxen: spelling checker/corrector
• Advanced CALL
• Morphologically, syntactically, and semantically annotated multilingual corpus
• Multilingual lexical-semantic KB
Commercial applications
• Spelling checker/corrector
• 3 lemmatization-based on-line bilingual/monolingual dictionaries
• Lemmatization-based on-line dictionary of synonyms
• Lemmatization-based search engine
• The Basque component of a generator of weather reports
• Spanish-Basque transfer-based MT system
Spelling checker/corrector
Late standardization (Unified Basque)
– Morphology and verbs, 1966
– Lexical standardization process is still going on
Adult speakers did not learn Basque at school
=> Many doubts in writing
   'tree': zuhaitz? zugaitz? zuhaitx? zuhaitsa? sugatza?
=> Give up! Do it in Spanish or French!??
The spell-checker is an EFFECTIVE TOOL in the ongoing standardization of Basque
Basque spelling checker/corrector
More than 20,000 downloads
Versions: Office, OpenOffice, PC, Mac, web service...
Not just a list of possible word-forms!
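Why "not just a list of possible word-forms"? In an agglutinative language each stem combines with many suffixes, so listing every form is impractical. A minimal sketch of the alternative is to accept a word if it splits into a known stem plus a known suffix. This is a toy illustration, not how Xuxen is implemented: the tiny lexicon, the suffix set, and the function `is_valid` are assumptions, built only from word forms that appear elsewhere in these slides.

```python
# Hypothetical toy lexicon: a few stems and a few suffixes seen in the slides.
STEMS = {"sagu", "katu", "zuhaitz"}
SUFFIXES = {"", "a", "ak", "aren", "arekin", "en", "etan"}


def is_valid(word):
    """Accept a word if some split gives a known stem followed by a known suffix."""
    word = word.lower()
    return any(
        word[:i] in STEMS and word[i:] in SUFFIXES
        for i in range(1, len(word) + 1)
    )


for candidate in ["zuhaitz", "zuhaitzak", "saguarekin", "zugaitz", "zuhaitx"]:
    print(candidate, "->", "ok" if is_valid(candidate) else "unknown")
```

With 3 stems and 7 suffixes the sketch already covers 21 forms; a real analyser also has to handle suffix chains and the sound changes at morpheme boundaries, which this sketch ignores.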
Spanish-Basque transfer MT
- Open source
- No lexical disambiguation, but idioms are handled
- No use of corpora
Future work:
- Hybrid SMT, EBMT & RBMT
- Lexical disambiguation
- Verb subcategorization
- English
Search engine (based on lemmas)
• Searches not for "saguarekin" but for its lemma, "sagu"
• Non-relevant similar words are removed: those beginning with "sagu" but with a different lemma, e.g. "saguzar"
• Word forms with other suffixes are found: "saguen", "saguaren", "sagua", "saguetan"
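The difference between lemma-based search and plain prefix matching can be sketched as follows. The `LEMMA_OF` table stands in for a real lemmatizer and is an assumption, as are the function names and the four one-line "documents", which contain only word forms quoted on the slide.

```python
# Hypothetical lemmatizer output for the word forms mentioned on the slide.
LEMMA_OF = {
    "sagua": "sagu",
    "saguaren": "sagu",
    "saguen": "sagu",
    "saguetan": "sagu",
    "saguarekin": "sagu",
    "saguzar": "saguzar",
}

# Toy document collection (assumption): each "document" is just a few word forms.
DOCUMENTS = {
    "d1": "saguen",
    "d2": "saguetan sagua",
    "d3": "saguzar",
    "d4": "saguaren",
}


def lemma_search(query, documents):
    """Return ids of documents containing any word form with the query's lemma."""
    target = LEMMA_OF.get(query.lower(), query.lower())
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(LEMMA_OF.get(w, w) == target for w in text.lower().split())
    ]


def prefix_search(prefix, documents):
    """Naive alternative: match every word that merely starts with the prefix."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(w.startswith(prefix) for w in text.lower().split())
    ]


print(lemma_search("saguarekin", DOCUMENTS))  # d3 ("saguzar") is not returned
print(prefix_search("sagu", DOCUMENTS))       # d3 is wrongly included
```

The lemma-based query finds forms with other suffixes and skips "saguzar", whereas the prefix match cannot tell the two lemmas apart.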
Lemmatization-based on-line Basque-Spanish bilingual dictionary
Figure 5. On-line bilingual dictionary.
Thesaurus (lemmatizer inside)
Basque monolingual dictionary.
Advanced electronic version
Diccionario Básico Escolar Cubano
Integrated lookup across different dictionaries
Basque WordNet. Ontology of word synsets
[Screenshot labels: meaning number, synset]
Second language learning system
(learner and error corpora based)
Methodology for stand-off corpus tagging (TEI, feature structures and XML)

Original text (testua.xml)

<!-- XML prolog -->
<TEI.2>
  <teiHeader> ... </teiHeader>
  <text id='TDoc0007' lang=''>
    <body>
      <p id='p1'>Hala ere, Matijose ere kalera dijoa.</p>
    </body>
  </text>
</TEI.2>

Tokenized text (testua.w.xml)

<!ENTITY TDoc03 SYSTEM 'testua.xml' NDATA tDoc>
<!-- ... -->
<text id='WDoc0001'>
  <body>
    <p id='xptr'>
      <xptr id='Xw1' doc='TDoc03' from='id(p1) strLoc(1)' to='id(p1) strLoc(4)'/>
      <xptr id='Xw2' doc='TDoc03' from='id(p1) strLoc(6)' to='id(p1) strLoc(8)'/>
      <xptr id='Xw6' doc='TDoc06' from='id(p1) strLoc(21)' to='id(p1) strLoc(24)'/>
    </p>
    <p id='w'>
      <w id='w1' sameAs='Xw1' type='HAS_MAI'>Hala</w>
      <w id='w2' sameAs='Xw2'>ere</w>
      <w id='w6' sameAs='Xw6'>ere</w>
      <!-- ... -->

Structure of multiword lexical units (HAUL) (testua.mwjoin.xml)

<!ENTITY WDoc02 SYSTEM 'testua.w.xml' NDATA wDoc>
<!-- ... -->
<p id='xptr'>
  <xptr id='Xw2' doc='WDoc02' from='id(w2)'/>
  <xptr id='Xw1' doc='WDoc02' from='id(w1)'/>
</p>
<p id='joinGrp'>
  <joinGrp>
    <join id='mw01' targets='Xw1 Xw2'/>
  </joinGrp>
</p>

Lemmatizations

<text id='LemDoc002'>
  <!-- ... -->
  <fs id='L-LOT-LOK-3' type='Lemmatization'>
    <f name='Form'><str>ere</str></f>
    <f name='Lemma'><str>ere</str></f>
    <f name='Morphological-Features'>
      <fs type='Top-Features-List'>
        <f name='POS'><sym value='LOT'/></f>
        <f name='SUBCAT'><sym value='LOK'/></f>
        <f name='SFL' org='list'><sym value='@LOK'/></f>
      </fs>
    </f>
  </fs>
  <fs id='L-LOT-LOK-7' type='Lemmatization'>
    <f name='Form'><str>hala ere</str></f>
    <f name='Lemma'><str>hala ere</str></f>
    <f name='Morphological-Features'>
      <fs type='Top-Features-List'>
        <f name='POS'><sym value='LOT'/></f>
        <f name='SUBCAT'><sym value='LOK'/></f>
      </fs>
    </f>
  </fs>
  <fs id='L-IZE-IZB-3' type='Lemmatization'>
    <f name='Form'><str>Marijose</str></f>
    <!-- ... -->
      </fs>
    </f>
  </fs>
</text>

Links (testua.lemlnk.xml)

<!ENTITY WDoc01 SYSTEM 'testua.w.xml' NDATA wDoc>
<!ENTITY MWDoc01 SYSTEM 'testua.mwlnk.xml' NDATA mwDoc>
<!ENTITY LemDoc01 SYSTEM 'testua.lem.xml' NDATA fsDoc>
<!-- ... -->
<body>
  <p id='xptr'>
    <xptr id='Xmw1' doc='MwDoc01' from='ID(mw1)'/>
    <xptr id='Xw6' doc='WDoc01' from='ID(w6)'/>
    <xptr id='XA-LOT-LOK-7' doc='LemDoc01' from='ID(A-LOT-LOK-7)'/>
    <xptr id='XL-LOT-LOK-3' doc='LemDoc01' from='ID(L-LOT-LOK-3)'/>
    <!-- ... -->
  </p>
  <p id='linkGrp'>
    <linkGrp type='w-lem' tagOrder='y'>
      <link targets='Xw6 XL-LOT-LOK-3'/>
      <!-- remaining links -->
    </linkGrp>
    <linkGrp type='me-lem' tagOrder='y'>
      <link targets='Xmw1 XL-LOT-LOK-7'/>
      <!-- remaining links -->
    </linkGrp>
  </p>
</body>
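In stand-off tagging the original text is never modified: each annotation layer points back into it. A minimal sketch of resolving such pointers follows. It assumes that `strLoc` offsets are 1-based and inclusive, which matches the pointers for w1 and w2 in the tokenized-text example; only those two pointers are reproduced, and the function names and the `JOINS` table are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Fragments modelled on the slide's examples (original text and token pointers).
SOURCE_XML = "<body><p id='p1'>Hala ere, Matijose ere kalera dijoa.</p></body>"
TOKENS_XML = """
<p id='xptr'>
  <xptr id='Xw1' doc='TDoc03' from='id(p1) strLoc(1)' to='id(p1) strLoc(4)'/>
  <xptr id='Xw2' doc='TDoc03' from='id(p1) strLoc(6)' to='id(p1) strLoc(8)'/>
</p>
"""
# Multiword unit from the join example: mw01 groups the tokens Xw1 and Xw2.
JOINS = {"mw01": ["Xw1", "Xw2"]}

LOC = re.compile(r"id\((\w+)\) strLoc\((\d+)\)")


def load_paragraphs(xml_text):
    """Map paragraph ids to their text in the original document."""
    root = ET.fromstring(xml_text.strip())
    return {p.get("id"): p.text for p in root.iter("p")}


def resolve_pointers(tokens_xml, paragraphs):
    """Resolve each xptr to the substring it points at (1-based, inclusive)."""
    spans = {}
    for xptr in ET.fromstring(tokens_xml.strip()).iter("xptr"):
        par_id, start = LOC.fullmatch(xptr.get("from")).groups()
        _, end = LOC.fullmatch(xptr.get("to")).groups()
        spans[xptr.get("id")] = paragraphs[par_id][int(start) - 1:int(end)]
    return spans


paragraphs = load_paragraphs(SOURCE_XML)
spans = resolve_pointers(TOKENS_XML, paragraphs)
multiword = " ".join(spans[t] for t in JOINS["mw01"])
print(spans)
print(multiword)
```

Because every layer refers to the layer below by id or offset, new annotation levels can be added as separate files without touching the existing ones.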
[Placeholder: HDBL screenshot]
EULIA: tool for corpus tagging
CORPUSGILE: tool for consulting corpora
ERAUZTERM: terminology extraction from corpora
EDBL lexical database
Lexical basis for the automatic processing of Basque
80,000 entries:
• Dictionary entries
• Verb forms
• Affixes
Updated and consistent. Built with ORACLE V7 and UNIX.
[Figure: pipeline diagram from plain text to parsed text. Labels: Morphological analyzer, xfst, EUSTAGGER, Linguistic disambiguation, Statistical disambiguation, CG, Morphosyntax, Parsing, Chunker, Dependencies, Surface syntax, EIHERA: entities, Postpositions, NP/PP/VP, Syntactic dependencies, Deep syntax]
/<Noizean_behin>/<HAUL_EDBL>/
    ("noizean_behin" ADB ADOARR @ADLG)
/<,>/<PUNT_KOMA>/
/<Informatika>/<HAS_MAI>/
    ("informatika" IZE ARR @KM>)
/<Fakultatearen>/<HAS_MAI>/
    ("fakultate" IZE ARR DEK GEN NUMS MUGM @IZLG> @<IZLG)
/<aurreko>/
    ("aurre" IZE ARR DEK NUMS MUGM DEK GEL @IZLG> @<IZLG)
/<zuhaitzak>/
    ("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<inausten>/
    ("inausi" ADI SIN AMM ADOIN ASP EZBU @-JADNAG)
/<dira>/
    ("izan" ADL A1 NR_HK @+JADLAG)
/<.>/<PUNT_PUNT>/
Format of the lemmatizer output
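A minimal sketch of reading this output format follows, assuming that a line of the form `/<form>/` (optionally followed by `<TAG>/`) opens a token, that each parenthesised line is one reading with a quoted lemma, and that tags starting with `@` are syntactic function tags. The function `parse_lemmatizer_output` and the dictionary keys are hypothetical, and the tags themselves are not interpreted.

```python
import re

SAMPLE = '''/<zuhaitzak>/
("zuhaitz" IZE ARR DEK ABS NUMP MUGM @OBJ @SUBJ @PRED)
/<inausten>/
("inausi" ADI SIN AMM ADOIN ASP EZBU @-JADNAG)
/<dira>/
("izan" ADL A1 NR_HK @+JADLAG)
/<.>/<PUNT_PUNT>/
'''

TOKEN = re.compile(r"^/<(.+?)>/(?:<(.+?)>/)?$")
READING = re.compile(r'^\("(.+?)"\s*(.*)\)$')


def parse_lemmatizer_output(output):
    """Group readings under the token line that precedes them."""
    tokens = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        token = TOKEN.match(line)
        if token:
            tokens.append(
                {"form": token.group(1), "token_tag": token.group(2), "readings": []}
            )
            continue
        reading = READING.match(line)
        if reading and tokens:
            tags = reading.group(2).split()
            tokens[-1]["readings"].append({
                "lemma": reading.group(1),
                # Tags without "@" are kept as an opaque list.
                "tags": [t for t in tags if not t.startswith("@")],
                # "@"-prefixed tags are assumed to be syntactic functions.
                "functions": [t for t in tags if t.startswith("@")],
            })
    return tokens


parsed = parse_lemmatizer_output(SAMPLE)
for tok in parsed:
    print(tok["form"], [r["lemma"] for r in tok["readings"]])
```

A token with more than one remaining analysis would simply carry several entries in its `readings` list.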
Architecture of the lemmatizer/parser
Conclusion
A language that seeks to survive in the modern information society requires language technology products.
"Minority" languages have to make a great effort to face this challenge:
– Need for a high degree of standardization
– Reuse of language foundations, tools, and applications
– Incremental design and development of them