
Page 1: PaNoLa: Parsing Nordic Languages Eckhard Bick

PaNoLa: Parsing Nordic Languages

Eckhard Bick

http://beta.visl.sdu.dk

Page 2: PaNoLa: Parsing Nordic Languages Eckhard Bick

PaNoLa Goals

● 1. Integrate existing, and stimulate new, Constraint Grammar research in the Nordic countries

● 2. Internet-based grammar teaching, applying the VISL model to different Nordic languages

● 3. Morphologically and syntactically annotated corpus data

Page 3: PaNoLa: Parsing Nordic Languages Eckhard Bick

Participants

● University of Southern Denmark (Eckhard Bick, Anette Wulff): Danish CG as well as CGs for 6 other languages

● Oslo University (Janne Bondi Johannessen, Kristin Hagen): Bokmål and Nynorsk CGs

● Helsinki University (Fred Karlsson): Finnish and Swedish CGs

● Göteborg University (Torbjörn Lager): µTBL-system (corpus trained automatic CG)

● Tartu University (Heli Uibo, Kaili Müürisep): Estonian CG

● Tromsø University (Trond Trosterud): Sami CG

● The Greenlandic Language Secretariat Oqaasileriffik (Per Langgård)

● Iceland University of Education (Jóhanna Karlsdottir)

● University of the Faroe Islands (Zakaris Hansen)

Page 4: PaNoLa: Parsing Nordic Languages Eckhard Bick

Project framework

● Funding: Nordic Council of Ministers

● Funded project period:

PaNoLa: January 2002 – December 2003: da, no, sv, fi

PaNoLa-addon: 2004: is, fo, smi, kl

PaNoLa-plus: 2005 (- 2006): is, fo, smi, kl

planned: PaNoLa-neighbour: 2005/6 (- 2007): lit, lav, ru

● Historical basis and ongoing cooperation

Page 5: PaNoLa: Parsing Nordic Languages Eckhard Bick

Project framework

● Network aspect: 4 workshops in Denmark, Norway, Iceland and Sweden

Odense, 19-21 May 2002

Ustaoset, 25-27 October 2002

Reykjavik, 1-2 June 2003

Göteborg, 24-25 October 2003

Odense, 23-26 October 2004

Fefor, 11-13 March 2005

(Tallinn, 1-3 April 2005)

planned: Thorshavn, 16-19 September 2005

● Administration, web server, data integration: VISL/ISK, University of Southern Denmark

● Satellite projects: e.g. Arboretum, GREI, Arborest

Page 6: PaNoLa: Parsing Nordic Languages Eckhard Bick

Constraint Grammar

● Rule- and lexicon-based robust parsing (Karlsson et al. 1995), methodological paradigm

● Shared conceptual and notational conventions, allowing productive research transfer

● Language-dependent differences: lexicon, rules (inter-Scandinavian comparative payoff?)

● Compiler and rule type differences

● Focus differences: Tagging? Parsing? Semantics? Teaching? Corpus annotation? QA? NER? ...
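To make the rule-based idea concrete, here is a minimal Python sketch of a single REMOVE-style disambiguation rule. It is not the syntax or behaviour of any of the compilers named in this talk; the `Cohort` class, the function name, the tag names and the English toy sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """One word form with all readings (tags) proposed by the lexicon."""
    wordform: str
    readings: set[str] = field(default_factory=set)


def remove_if_previous(cohorts, target, previous):
    """Hypothetical REMOVE-style rule: discard reading `target` wherever the
    preceding cohort is unambiguously `previous`. The last remaining reading
    is never removed, so every word always keeps at least one analysis."""
    for i in range(1, len(cohorts)):
        left, current = cohorts[i - 1], cohorts[i]
        if left.readings == {previous} and target in current.readings:
            if len(current.readings) > 1:
                current.readings.discard(target)
    return cohorts


# Toy lexicon lookup for "the walk": "walk" is ambiguous between noun and verb.
sentence = [Cohort("the", {"DET"}), Cohort("walk", {"N", "V"})]
remove_if_previous(sentence, target="V", previous="DET")
print([(c.wordform, sorted(c.readings)) for c in sentence])
```

The guard on `len(current.readings) > 1` is the point of the sketch: a rule may prune ambiguity but never leaves a word without a reading, which is one way to read "robust" in the first bullet.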

Page 7: PaNoLa: Parsing Nordic Languages Eckhard Bick

Rule formalism and architecture

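[Diagram: CG compilers, the grammars built on them, and their distinguishing features]

● cg1-compiler: Lingsoft-compatible; needs more rules than cg2

● cg2-compiler: sets as targets, barrier conditions

● visl-cg-compiler: “cg2-like”, plus substitute operator for correcting hybrid input

● µ-TBL: automatic learning, local context, rule ordering; Swedish or language-independent trained CG

● cgx-compiler

● Grammars and systems shown: SweCG, FinCG, Oslo-Bergen tagger, DanGram, Sami and other VISL languages, EstCG

● Annotation levels shown: PoS, syntax, case roles

● Languages shown: da, smi, no, est, sv, fi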

Page 8: PaNoLa: Parsing Nordic Languages Eckhard Bick

The Lexical Base

[Diagram: lexical resources used by the individual systems]

● TWOL

● Core lexicon + morphological analyser

● Corpus dependent

● Valency potential (especially for verbs)

● Semantic sets, NER

● Full semantic prototype lexicon

● Systems shown: SweCG, FinCG, Oslo-Bergen tagger, DanGram, µ-TBL, SamicCG, EstCG

Page 9: PaNoLa: Parsing Nordic Languages Eckhard Bick

Theoretical Framework (Syntax)

[Diagram: from flat CG annotation to treebank formats]

● Traditional CG: flat dependency, word based form and function tags

● Tools shown: Cg2tree (MC) (visl-psg), Dependency filter (SH), Visl2penn (EB), Visl2tiger (LN, EB, ..)

● Treebank format: TIGER format, PENN format

● PSG-Grammar: Danish, Norwegian

● Editing tools, search interfaces

● Corpora and treebanks shown: Korpus90/2000, Oslo-Bergen Corpus, Arboretum, Redwood

Page 10: PaNoLa: Parsing Nordic Languages Eckhard Bick

Treebank data compatibility

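[Table: conversion tools between annotation formats. Formats covered: CG, CG-dep, VISL, VISL-dep, TIGER, TIGER-dep, MALT-dep, DTAG-dep. Tools are listed per source format, in the order given on the slide.]

● From CG: cg2dep; depsplicator; cg2visl (visl-psg + grammar); depsplicator; cg2visl | visl2tiger.pl; cg2visl | visl2tiger.pl | tiger2dep.pl; cg2dep | visldep2malt; depsplicator

● From CG-dep: visldep2malt

● From VISL: tree2cg; visl2tiger.pl; visl2tiger.pl | tiger2dep.pl; visl2tiger.pl | tiger2dep.pl | tigerdep2malt

● From VISL-dep: (none shown)

● From TIGER: tiger2dep.pl

● From TIGER-dep: tigerdep2malt, (NTN tools); (NTN tools)

● From MALT: (NTN tools)

● From DTAG: (NTN tools)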
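The piped entries on this slide can be read as paths through a graph of format converters. The following Python sketch finds such a path by breadth-first search. The tool names come from the slide, but the source and target format assigned to each tool is an assumption inferred from its name, and `find_pipeline` is a hypothetical helper, not part of the VISL tool set.

```python
from collections import deque

# Converter edges read off the slide; the source/target format of each tool
# is an assumption inferred from the tool's name.
CONVERTERS = {
    ("CG", "CG-dep"): "cg2dep",
    ("CG", "VISL"): "cg2visl",
    ("VISL", "CG"): "tree2cg",
    ("VISL", "TIGER"): "visl2tiger.pl",
    ("TIGER", "TIGER-dep"): "tiger2dep.pl",
    ("TIGER-dep", "MALT-dep"): "tigerdep2malt",
    ("CG-dep", "MALT-dep"): "visldep2malt",
}


def find_pipeline(source, target):
    """Breadth-first search for the shortest converter chain, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, tools = queue.popleft()
        if fmt == target:
            return " | ".join(tools)
        for (src, dst), tool in CONVERTERS.items():
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, tools + [tool]))
    return None


print(find_pipeline("CG", "TIGER-dep"))
print(find_pipeline("CG", "MALT-dep"))
```

Under these assumed edges, the two calls reproduce the chains `cg2visl | visl2tiger.pl | tiger2dep.pl` and `cg2dep | visldep2malt` that the slide lists for CG input.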

Page 11: PaNoLa: Parsing Nordic Languages Eckhard Bick

Accessibility

● Strong focus on making tools and corpora freely accessible on the internet

● Provide notational and complexity filters to bridge differences between different research and teaching traditions

● VISL's open source philosophy for reconciling academic and commercial use: free compilers and corpora, while allowing for the protection (i.e. commercializability) of grammars, lexica and end-user applications

Page 12: PaNoLa: Parsing Nordic Languages Eckhard Bick

Related applicative CG-projects

● CG spell/grammar checking (No, Da): Lingsoft / Microsoft

● Named Entity Recognition (Da, No): Nomen Nescio (Nordic Network) 2001-2003

● Treebanks (Da Arboretum, Norwegian plans): Nordic Treebank Network 2003-2004

● Question Answering systems (Da): Aminova Dialogue Systems

● Teaching (e.g. VISL-GYM, VISL-HHX, GREI)

Page 13: PaNoLa: Parsing Nordic Languages Eckhard Bick

PaNoLa's other leg: CALL

Integrating and strengthening Nordic languages in the VISL grammar teaching system

● A unified system of grammatical categories and structural analysis for 22 languages (Dienhart 2000 and Bick 2001)

● Color codes and symbolic notation

● Systematic focus on form & function

● Preexisting server and programming infrastructure

● School and university teaching contacts at all levels

● Internet-based games and exercises

● Graded complexity filters

Page 14: PaNoLa: Parsing Nordic Languages Eckhard Bick

Notational harmonization vs. linguistic differences: the Greenlandic example

KAL22a) Suumuna naasut qorsuttaat kiilorpassuakkaarlugu nunamut uumassuseqanngitsumut siaruartilertaraa apullu aanniariaraangat siaruaatipallatsittarlugu? (What was it that made kilo after kilo of green plant matter pour forth from the lifeless earth as soon as the weather got warm enough and the last remnants of snow were gone?)

QUE:par
CJT:cl
=S:pron Suumuna #'Which/What'
=fA:icl
==Od:g
===D:n naasut #'of the plants'
===H:n qorsuttaat #'their green (matter)'
==P:v-pcp1 kiilorpassuakkaarlugu #doing it by the kilo
=A:g
==H:n nunamut #'the earth'
==D:n uumassuseqanngitsumut #'on the lifeless'
=P:v siaruartilertaraa #makes it spread
CJT:cl-
=fA:cl-
==S:n apullu #and the snow
CO:conj _lu
-CJT:cl
=-fA:cl
==P:v aanniariaraangat #whenever it begins to melt
=P:v siaruaatipallatsittarlugu #makes it pour forth
?

Word-internal analysis of four of the words:

==H:n nunamut #on the earth
===R:n('nuna') nuna-
===D:in('mut',fleksiver) -mut

==D:n uumassuseqanngitsumut
===R:v('uuma') uuma-
===D:in('ssusiq') -ssuse-
===D:iv('qar') -qa-
===D:iv('ngngit') -nngit-
===D:in('Tuq') -su-
===D:in('mut',fleksiver) -mut

==P:v aanniariaraangat
===R:v('aak') aan-
===D:iv('niar') -nia-
===D:iv('riar') -riar-
===D:iv('gaangat',fleksiver) -aangat

=P:v siaruaatipallatsittarlugu
==R:v('siaruar') siarua-
==D:iv('ute') -ati-
==D:iv('pallak') -pallat-
==D:iv('tit') -sit-
==D:iv('Tar') -tar-
==D:iv('lugu',fleksiver) -lugu
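The indented notation above can be read into a tree with a few lines of Python. This is a minimal sketch under assumed conventions: the number of leading `=` signs gives the depth, each tag has the shape FUNCTION:form, and anything after `#` is a gloss. The `Node` class and `parse_tree` are hypothetical names, not part of the VISL software, and the example group is re-rooted at depth 0 for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    function: str          # e.g. "H" (head), "D" (dependent), "P" (predicator)
    form: str              # e.g. "n", "v", "g"
    token: str = ""        # word or morpheme; empty for non-terminal nodes
    children: list["Node"] = field(default_factory=list)


def parse_tree(lines):
    """Build a tree from lines such as '==H:n nunamut #gloss'.

    Assumed conventions: leading '=' signs give the depth, the tag is
    FUNCTION:form, and everything after '#' is a gloss that is ignored.
    """
    root = Node("ROOT", "")
    stack = [(-1, root)]                      # (depth, node) of open ancestors
    for raw in lines:
        text = raw.split("#", 1)[0].strip()
        if not text:
            continue
        body = text.lstrip("=")
        depth = len(text) - len(body)
        tag, _, token = body.partition(" ")
        function, _, form = tag.partition(":")
        node = Node(function, form, token.strip())
        while stack[-1][0] >= depth:          # close siblings and deeper nodes
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


example = [
    "A:g",
    "=H:n nunamut #'the earth'",
    "=D:n uumassuseqanngitsumut #'on the lifeless'",
]
tree = parse_tree(example)
group = tree.children[0]
print(group.function, group.form, [child.token for child in group.children])
```

Because the same depth-plus-tag convention is used for syntactic constituents and for morphemes inside a word, one parser of this kind can serve both levels, which is the harmonization the slide points at.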

Page 15: PaNoLa: Parsing Nordic Languages Eckhard Bick

Greenlandic word-internal tree structures

Page 16: PaNoLa: Parsing Nordic Languages Eckhard Bick

Teaching corpora

Language      Sentences   Words   Words per sentence
Danish        1121+       12029   10.1
Bokmål        766         5629    7.3
Nynorsk       766         5888    7.7
Icelandic     212         1394    6.6
Faroese       178         1609    9.0
Sami          155+        603     3.9
Swedish       106         1153    10.9
Finnish       102         545     5.3
Estonian      100+        596     6.0
Greenlandic   100 ?       ?       ?

● Pedagogically structured

● XML markup for teaching topic and didactic progression

● Finnish and Swedish modelled on Danish and Norwegian example files (comparative possibilities)

● Compatibility with, and importability into, research treebanks (e.g. Sofie)

Page 17: PaNoLa: Parsing Nordic Languages Eckhard Bick

Interactive teaching trees

Page 18: PaNoLa: Parsing Nordic Languages Eckhard Bick

Grammar games: Labyrinth

Page 19: PaNoLa: Parsing Nordic Languages Eckhard Bick

Grammar Games: Word Fall

Page 20: PaNoLa: Parsing Nordic Languages Eckhard Bick

Integrating the CG and CALL legs

● Nordic CG expertise is used to provide live analyses as input for the teaching modules, if necessary by CGI-communication between university servers, e.g. Oslo-SDU

● Descriptional harmonization issues (e.g. word class)

● Determine matching complexity (e.g. subclause analysis?)

Page 21: PaNoLa: Parsing Nordic Languages Eckhard Bick

CG leg evaluation

● CG-grammars improve incrementally, so evaluation is less definite than for probabilistic systems, and can change over time.

● Results depend on tag granularity and test genre

● Some numbers:

-- DanGram: F-score 98.65 for PoS, 94.9 for function (Bick 2003)

-- DanGram NER: 5% typing errors, 2% chunking errors

-- Bokmål CG: 97.2% lexical F-score (Hagen & Johannessen 2003)

-- Nynorsk CG: 96.2% lexical F-score

-- SWECG 1.0: recall 99.7% at a precision of 95% (pre-PaNoLa)

-- µ-TBL CG for Swedish: 98.1% lexical accuracy when allowing for 1.04 tags per word (Lager 1999)
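The figures above mix F-scores with recall/precision pairs. For a rough comparison, a recall/precision pair can be folded into a balanced F-score (the harmonic mean of the two). The sketch below does this for the SWECG 1.0 figures; whether the F-scores cited for the other systems were computed with this balanced formula is an assumption, and `f_score` is a hypothetical helper.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SWECG 1.0 figures from the list above: recall 99.7% at a precision of 95%.
print(round(f_score(95.0, 99.7), 2))
```

For these inputs the result is about 97.3, which puts the pre-PaNoLa SWECG figure in the same range as the lexical F-scores quoted for Bokmål and Nynorsk, with the caveat from the previous bullet that tag granularity and test genre differ.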

Page 22: PaNoLa: Parsing Nordic Languages Eckhard Bick

Teaching leg evaluation

● GREI evaluation: improvement of grammatical skills after using VISL tools (104 children, 7th and 8th grade)

● Same-level tests before and after using VISL/GREI, with test and control groups

● Subjective results: all users thought VISL was more fun (games more than trees), and that their grammatical skills had improved

● Objective results: the test group performed 14.5% better than the control group in 7th grade, 7% better in 8th grade, and 12% better at the secondary level

● Differences were positive for both PoS and sentence analysis, but more marked for the latter

Page 23: PaNoLa: Parsing Nordic Languages Eckhard Bick

Teaching corpora differences across PaNoLa languages

● Preposition frequency: 11% (Bokmål), 11.4% (Danish), 13.4% (Nynorsk), 0.5% (Finnish)

● PoS: the particles in “klappe i”, “tage på”, “skrive noget om” are tagged as ADV in the Danish samples, as PRP in the Norwegian samples

● Danish infinitive markers ('at') tagged as CONJ in Norwegian

● Subclass solutions: e.g. Da/Fi distinction between adjunct and argument adverbials, not made by No/Se (fA/As/Ao vs. A)

● Tradition interference: Swedish analysis had zero constituents, because it was annotated according to the English VISL model

Page 24: PaNoLa: Parsing Nordic Languages Eckhard Bick

Outlook

● Continued development of Nordic Constraint Grammars and CG applications

● Ongoing CALL service for schools

● Presence of the CG paradigm in other Nordic networks

● “Post-PaNoLa”: VISL adaptations for other minor Nordic languages (Faroese, Icelandic, Sami, Estonian ...)