fieldwork and grammaticography in a digital world · 2019-04-15 · corpus building/extension using...
TRANSCRIPT
Fieldwork and Grammaticography in a Digital World
Joshua Wilbur
Freiburg Research Group in Saami Studies • Universität Freiburg
Descriptive Grammars and Typology • University of Helsinki, 28 March 2019

Overview
• background
• fieldwork
• grammaticography
• other advances
• outlook

BACKGROUND (aka: contextualization)

Pite Saami
• Uralic > Finno-Ugric > Saamic
• spoken by ~40 individuals from Arjeplog/Árjepluovve in Swedish Lapland
• aka: Arjeplog-Saami, bidumsámegiella
• nearly all speakers are at least 50
• all speakers are bilingual (Pite Saami and Swedish/Arjeplogsmål)
• no official orthography (yet...), but a working standard
• no media
• Swedish dominates everyday life
• hardly being passed on to younger generations

Pite Saami

larger linguistic studies:
• Halász 1893 (in Hungarian)
• Lagercrantz 1926 (in German)
• Ruong 1943 (in German)
• Lehtiranta 1992 (in Finnish)
• Wilbur 2014 (in English)
• Sjaggo 2015 (in Swedish)

other materials:
• extensive collection of heritage materials (ISOF, Uppsala)
• dictionary (Pite Saami -> Swedish/English) and proposed orthographic rules (2016)
• online lexical database
• online orthographic rules (including spell checker (in beta))
• smartphone app in the works

recent linguistics projects:
• Documentation (2008-2015; materials archived at ELAR and TLA)
• Lexicography (2016)
• Syntactic structures (2016-present)

-> each fieldwork situation is unique!
• significant aspects of mine include:
  – an accessible modern technological infrastructure on-site
  – a previous history of linguistics work
  – extensive language technology tools for closely-related languages
  – messy but extant orthographic “tradition” when I started

FIELDWORK in a digital world

tools for fieldwork
• in the old days: notebook and pencil
• nowadays:
  – recording equipment
  – laptop
  – digital backup capacity (even in the cloud)
  – transcription software (ELAN)
  – mobile phones
  – social media (e.g. for staying in contact, or as a data source)
  – grammaticography software (e.g. FLEx for interlinearization)
  – language technology …

• modern, affordable digital recording technologies (especially video) allow fieldworkers to capture not just language, but the entire human event
  – more complete documentation, potentially useful beyond linguistics*
• why not use:
  – body cameras
  – drones
  – surround-sound microphones
  – 360° cameras
  – 3-D cameras ...
*cf. Rießler & Wilbur 2017

data collection and fieldwork

(re-)collecting old data (heritage harvesting)
• OCR (optical character recognition)
  – embedded text (more than just scanning!)
  – can be exported (e.g. to ELAN)
  – can be part of a corpus*
*cf. Partanen & Rießler 2019

(re-)collecting old data (heritage harvesting)
• HTR (handwritten text recognition)
  – embedded text (more than just scanning!)
  – can be exported (e.g. to ELAN)
  – can be part of a corpus*
  – much more complex than OCR, thus it currently requires much more training data before it’s useful
*cf. Transkribus project (Kahle et al. 2017); also Blokland et al. 2019 for a brief discussion

GRAMMATICOGRAPHY in a digital world

brief history of grammaticography
• 1/3 of the Boasian trilogy …
• Payne 1997, Mosel 2006, Aikhenvald 2015, etc.
• Nordhoff 2008, “Electronic Reference Grammars for Typology: Challenges and Solutions”
• Implemented grammars (incorporation in corpus and computational linguistics)

digital tools for grammaticography
• Toolbox, FLEx
  – good for concatenative morphology: play, play-s, play-ed, play-er, play-er-s
  – not so good for non-linear morphology: sing, sing-s, sang, sung
• other, digital approaches...

What do you do when non-linear morphology is the default in your language?

juällge ‘foot/leg’

         SG           PL
NOM      juällge      juolge
GEN      juolge       julgij
ACC      juolgev      julgijt
ILL      juallgáj     julgijda
INESS    juolgen      julgijn
ELAT     juolgest     julgijst
COM      julgijna     julgij
ABESS    juolgedak    juolgedaga
ESS      juallgen

4 stem allomorphs: juällg-, juolg-, juallg-, julg-
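To make the stem-allomorphy problem concrete, here is a minimal Python sketch that splits each form of the juällge paradigm into a stem and a suffix. The forms and the four stem allomorphs are taken from the slide; the function names and the longest-match strategy are assumptions made for illustration only, and the resulting suffix boundaries are a simplification, not an analysis proposed in the talk.

```python
# Toy illustration: forms and stem allomorphs are copied from the paradigm above.
# The segmentation strategy (longest matching stem) is a hypothetical simplification.
STEM_ALLOMORPHS = ["juällg", "juolg", "juallg", "julg"]

PARADIGM = {
    ("NOM", "SG"): "juällge",    ("NOM", "PL"): "juolge",
    ("GEN", "SG"): "juolge",     ("GEN", "PL"): "julgij",
    ("ACC", "SG"): "juolgev",    ("ACC", "PL"): "julgijt",
    ("ILL", "SG"): "juallgáj",   ("ILL", "PL"): "julgijda",
    ("INESS", "SG"): "juolgen",  ("INESS", "PL"): "julgijn",
    ("ELAT", "SG"): "juolgest",  ("ELAT", "PL"): "julgijst",
    ("COM", "SG"): "julgijna",   ("COM", "PL"): "julgij",
    ("ABESS", "SG"): "juolgedak", ("ABESS", "PL"): "juolgedaga",
    ("ESS", None): "juallgen",   # the slide lists a single essive form
}


def segment(form, stems=STEM_ALLOMORPHS):
    """Split a word form into (stem allomorph, suffix) using the longest matching stem."""
    for stem in sorted(stems, key=len, reverse=True):
        if form.startswith(stem):
            return stem, form[len(stem):]
    raise ValueError(f"no known stem allomorph matches {form!r}")


def stems_used(paradigm):
    """Collect the set of stem allomorphs that occur across a paradigm."""
    return {segment(form)[0] for form in paradigm.values()}


if __name__ == "__main__":
    for cell, form in PARADIGM.items():
        print(cell, form, segment(form))
    print(sorted(stems_used(PARADIGM)))
```

A purely concatenative tool would need one lexical entry per stem here, which is the situation the question on the slide is pointing at.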

implemented grammars
• aka “precise” grammars
  – self-validating
• computer-processable
  – but only borderline human-readable (at least from a traditionalist perspective)
  – computational linguists, typically HPSG
• analyze linguistic structures
• implementation --> parse and tag a corpus
cf. new Language Science Press series “Implemented Grammars”; Siegel et al. 2016

• Giellatekno infrastructure (the research group for Saami language technology at the University of Tromsø):
  – FST: Finite State Transducer (1)
  – CG: Constraint Grammar (2)
• automatic annotations in ELAN …
(1) Beesley & Karttunen 2003; (2) Didriksen 2007–2018, Karlsson 1990; Karlsson et al. 1995

implemented grammar (FST/CG) for Pite Saami

infrastructure:
• Finite State Transducer (FST) → for analyzing word forms
• Constraint Grammar (CG) → for removing ambiguities in FST output

formalism:
• lexc (lexicon, PoS, linear morphology)
• twolc (non-linear morphology)
• cg3 (syntax)

Uses orthographic standard!

FST output analyses:
input: word form
output: word form, lemma+PoS+Morphology

juällge → juällge   juällge+N+Sg+Nom
julgijd → julgijd   juällge+N+Pl+Acc

The FST can also be used for generating word forms (it works in both directions).
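The bidirectional behaviour can be illustrated with a small Python sketch. This is only a lookup table standing in for a transducer, not an FST and not the Giellatekno tooling; the two form/analysis pairs are the ones shown on the slides, and the function names `analyze` and `generate` are hypothetical.

```python
# Toy stand-in for a transducer: a plain lookup table, NOT a real FST.
# The two entries are the analyses shown on the slides.
ANALYSES = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}


def build_generator(analyses):
    """Invert the form -> analyses table into an analysis -> forms table."""
    generator = {}
    for form, tag_strings in analyses.items():
        for tag_string in tag_strings:
            generator.setdefault(tag_string, []).append(form)
    return generator


GENERATOR = build_generator(ANALYSES)


def analyze(word_form):
    """Analysis direction: word form -> list of lemma+PoS+morphology strings."""
    return ANALYSES.get(word_form, [])


def generate(analysis):
    """Generation direction: lemma+PoS+morphology string -> list of word forms."""
    return GENERATOR.get(analysis, [])


if __name__ == "__main__":
    print(analyze("julgijd"))
    print(generate("juällge+N+Sg+Nom"))
```

A real transducer encodes the same relation compactly through lexicon and alternation rules instead of listing every pair, which is what makes it usable for a whole lexicon.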

BUT: how to deal with morphologically ambiguous word forms? (disambiguation)
• Constraint Grammar (CG) → for removing ambiguities in FST output
• formalism: cg3 (syntax)
• example: rules describing dependency between adpositions and genitive case

disambiguation example

daj tjurvij nala gähttjat
‘to look at those antlers’ [pit100405b.011]

FST output (all readings):
daj        DET+GEN+PL, DET+COM+PL, PRON+GEN+PL, PRON+COM+PL
tjurvij    antler+GEN+PL, antler+COM+PL
nala       onto
gähttjat   look+INF

CG syntactic disambiguation:
• postpositions govern genitive NPs
  SELECT Gen IF (*1C Po BARRIER NoNP);
• pronouns are not embedded in an NP
  REMOVE Pron IF (*1C N BARRIER NPNH);

result:
daj          tjurvij         nala   gähttjat
da-j         tjurvi-j        nala   gähttja-t
DET-GEN.PL   antler-GEN.PL   onto   look-INF
‘to look at those antlers’ [pit100405b.011]
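The effect of the two rules on this example can be traced with a toy Python re-implementation. It is not the cg3 engine and covers only what this sentence needs: the tag labels Det, Com, V and Inf are assumed, and because the slides do not define the barrier sets NoNP and NPNH, the sketch substitutes one guessed barrier (a word with no nominal reading).

```python
# Toy re-implementation of the two rules above; this is NOT the cg3 engine.
# Tag names follow the rules (Gen, Pron, Po, N); "Det", "Com", "V", "Inf" are
# assumed labels, and NOMINAL is a guessed stand-in for the NoNP / NPNH sets.
SENTENCE = [
    ("daj", [("Det", "Gen", "Pl"), ("Det", "Com", "Pl"),
             ("Pron", "Gen", "Pl"), ("Pron", "Com", "Pl")]),
    ("tjurvij", [("N", "Gen", "Pl"), ("N", "Com", "Pl")]),
    ("nala", [("Po",)]),
    ("gähttjat", [("V", "Inf")]),
]
NOMINAL = {"N", "Det", "Pron"}


def is_barrier(readings):
    """Assumed barrier: a word that has no nominal reading at all."""
    return not any(NOMINAL & set(reading) for reading in readings)


def careful_right(cohorts, i, tag):
    """Mimic (*1C tag BARRIER ...): scan rightwards from position i for a word
    whose readings ALL carry `tag`; give up when a barrier word comes first."""
    for _, readings in cohorts[i + 1:]:
        if all(tag in reading for reading in readings):
            return True
        if is_barrier(readings):
            return False
    return False


def apply_rule(cohorts, keep, context_tag):
    """Keep only the readings passing `keep` wherever the context test holds.
    A word never loses its last remaining reading."""
    result = []
    for i, (word, readings) in enumerate(cohorts):
        kept = [reading for reading in readings if keep(reading)]
        if kept and len(kept) < len(readings) and careful_right(cohorts, i, context_tag):
            readings = kept
        result.append((word, readings))
    return result


def disambiguate(cohorts):
    # SELECT Gen IF (*1C Po ...): postpositions govern genitive NPs
    cohorts = apply_rule(cohorts, lambda reading: "Gen" in reading, "Po")
    # REMOVE Pron IF (*1C N ...): pronouns are not embedded in an NP
    cohorts = apply_rule(cohorts, lambda reading: "Pron" not in reading, "N")
    return cohorts


if __name__ == "__main__":
    for word, readings in disambiguate(SENTENCE):
        print(word, readings)
```

Under these assumptions, the first rule leaves only genitive readings for daj and tjurvij, and the second removes the pronoun reading of daj, matching the glossed result.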

implemented grammars

pros:
• entirely digital (easy copying, versioning, etc.)
• computer-processable
• can analyze AND generate (useful for practical tools, e.g. teaching apps)
• accuracy can be tested on real empirical data
• prose can be included (as <!-- comments -->)
• further use in other, digital applications...

cons:
• requires significant technical knowhow to learn and to implement
• not very human-readable, especially for non-specialists
  – prose is only included as <!-- comments -->
  – not ideal for standard average typologists
  – not even close to ideal for most non-linguists

further use in other, digital applications...
• spell-checkers
• grammar-checkers
• teaching materials (e.g. apps)
• …
• in documentary linguistics / endangered language descriptions
  – automatic tokenization and annotation for corpora

• tier structure in ELAN corpora (Freiburg-style), including annotations for:
  – Lemma
  – Part of speech
  – Morphological categories
  – Gloss

corpus building/extension using a script (1) that:
1. tokenizes the orthographic representation
2. sends each token through FST
3. removes ambiguities using CG
4. adds an English gloss
5. inserts this output into ELAN

benefits:
• saves time
• avoids inconsistencies
• can be updated automatically

(1) cf. Blokland et al. 2015; Gerstenberger et al. 2016; Gerstenberger et al. 2017
More details in talk at 11:30 in room 13 by Blokland, Partanen and Rießler
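The five steps of such a script can be sketched in Python as follows. This is not the script cited above: the FST is replaced by a lookup table holding the two analyses from the slides, the CG step is a placeholder that takes the first reading, the gloss lexicon has a single entry, and the ELAN step only builds tier-like lists instead of writing an actual ELAN file. All function and variable names are hypothetical.

```python
import re

# Hypothetical stand-ins for the real tools: a lookup table instead of the FST,
# a placeholder instead of CG, and a one-entry gloss lexicon.
TOY_FST = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}
TOY_GLOSSES = {"juällge": "foot/leg"}
TIERS = ("word", "lemma", "pos", "morph", "gloss")


def tokenize(text):
    """Step 1: split the orthographic representation into word tokens."""
    return re.findall(r"\w+", text)


def pick_reading(readings):
    """Step 3 placeholder: a real pipeline would run CG rules here."""
    return readings[0] if readings else None


def annotate(text):
    """Steps 1-4: tokenize, look up, disambiguate, and add an English gloss."""
    rows = []
    for token in tokenize(text):
        reading = pick_reading(TOY_FST.get(token, []))  # step 2 + step 3
        if reading is None:
            # Unknown token: keep the word, leave the annotations empty.
            rows.append({"word": token, "lemma": None, "pos": None,
                         "morph": None, "gloss": None})
            continue
        lemma, pos, *morph = reading.split("+")
        rows.append({"word": token, "lemma": lemma, "pos": pos,
                     "morph": ".".join(morph),
                     "gloss": TOY_GLOSSES.get(lemma)})  # step 4
    return rows


def to_tiers(rows):
    """Step 5 stand-in: arrange annotations as parallel tiers (no ELAN file is written)."""
    return {tier: [row[tier] for row in rows] for tier in TIERS}


if __name__ == "__main__":
    for tier, values in to_tiers(annotate("juällge, julgijd")).items():
        print(tier, values)
```

Because every tier is derived from the same analysis string, rerunning the script after the grammar improves updates all tiers consistently, which is the benefit listed above.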

summary of digital grammaticography

requires:
• time to learn the formalism and set up the infrastructure
• understanding of grammatical structures
• string-based representation of language

main benefits:
• can be freely accessible online
• possibility to publish (hopefully getting academic recognition, cf. LangSci Press series)
• export data for use in other tools and disciplines
• spell-checker
• lexicographic materials (including smartphone apps)
• corpus building
• teaching materials
• increased status for the language
• more accessible to other disciplines, e.g. via text search

main drawbacks:
• not terribly human-accessible
• not taught traditionally in General Linguistics programs

OTHER ADVANCES in digital technologies

new language technologies
• automatic segmentation, e.g.:
  – Autosegmenteerija 2.0
  – Estonian autosegmentation (forced alignment) tested on Pite Saami with surprisingly accurate results
• speech recognition, e.g.:
  – Common Voice (moz://a), in community development for a number of smaller languages (e.g. Erzya, Komi-Zyrian, ...)
• automatic implemented grammar production
  – LinGO Grammar Matrix
    http://matrix.ling.washington.edu/customize/matrix.cgi

new speech technologies
• relevant technologies being developed continuously
• leading to a significant increase in efficiency for corpus building
-> better grammatical descriptions

OUTLOOK
• digital tools can provide powerful advantages for both fieldwork and (especially) grammaticography and documentation
• but: they require knowhow that goes beyond a typical linguist’s training
• I’m not saying this is for everyone, and realistically only parts will be relevant for a few. The point is: digital technologies should be considered, too!

References
Aikhenvald, Alexandra Y. (2015). The art of grammar. A practical guide. Oxford: Oxford University Press.
Beesley, Kenneth R. & Lauri Karttunen (2003). Finite State Morphology. Stanford: Center for the Study of Language and Information.
Blokland, Rogier, Ciprian Gerstenberger, Marina Fedina, Niko Partanen, Michael Rießler, & Joshua Wilbur (2015). “Language documentation meets language technology”. In: First International Workshop on Computational Linguistics for Uralic Languages, 16th January, 2015, Tromsø, Norway. Proceedings of the workshop. Ed. by Tommi A. Pirinen, Francis M. Tyers, & Trond Trosterud. Septentrio Conference Series 2015:2. Tromsø: The University Library of Tromsø, pp. 8–18.
Blokland, Rogier, Niko Partanen, Michael Rießler, & Joshua Wilbur (2019). “Using computational approaches to integrate endangered language legacy data into documentation corpora. Past experiences and challenges ahead”. In: Proceedings of the Workshop on Computational Methods for Endangered Languages. Vol. 2. Honolulu: Association for Computational Linguistics, pp. 24–30.
Didriksen, Tino (2007–2018). Constraint grammar manual. 3rd version of the CG formalism variant. GrammarSoft ApS.
Gerstenberger, Ciprian, Niko Partanen, Michael Rießler, & Joshua Wilbur (2016). “Utilizing language technology in the documentation of endangered Uralic languages”. In: Northern European Journal of Language Technology 4, pp. 29–47.
Gerstenberger, Ciprian, Niko Partanen, Michael Rießler, & Joshua Wilbur (2017). “Instant annotations. Applying NLP methods to the annotation of spoken language documentation corpora”. In: International Workshop on Computational Linguistics for Uralic languages (IWCLUL 2017). Ed. by Tommi A. Pirinen, Michael Rießler, Trond Trosterud, & Francis M. Tyers. St. Petersburg: Association for Computational Linguistics, pp. 25–36.
Halász, Ignácz (1893). Népköltési gyűjtemény. A Pite Lappmark arjepluogi egyházkerületéből. Vol. 5. Svéd-Lapp Nyelv. Budapest: Magyar tudományos akadémia.
Kahle, Philip, Sebastian Colutto, Günter Hackl, & Günter Mühlberger (2017). “Transkribus. A Service Platform for Transcription, Recognition and Retrieval of Historical Documents”. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 04, pp. 19–24.
Karlsson, Fred (1990). “Constraint Grammar as a framework for parsing unrestricted text”. In: Proceedings of the 13th International Conference of Computational Linguistics. Ed. by Hans Karlgren. Vol. 3. Helsinki, pp. 168–173.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, & Arto Anttila, eds. (1995). Constraint Grammar. A language-independent system for parsing unrestricted text. Natural Language Processing 4. Berlin: Mouton de Gruyter.
Lagercrantz, Eliel (1926). Sprachlehre des Westlappischen nach der Mundart von Arjeplog. Suomalais-ugrilaisen Seuran Toimituksia 55. Helsinki: Suomalais-Ugrilainen Seura.
Lehtiranta, Juhani (1992). Arjeplogin saamen äänne- ja taivutusopin pääpiirteet. Suomalais-ugrilaisen Seuran toimituksia 212. Helsinki: Suomalais-Ugrilainen Seura.
Mosel, Ulrike (2006). “Grammaticography. The art and craft of writing grammars”. In: Catching language. The standing challenge of grammar writing. Ed. by Felix Ameka, Alan Dench, & Nicholas Evans. Trends in linguistics: studies and monographs 167. Berlin: Mouton de Gruyter, pp. 41–68.
Nordhoff, Sebastian (2008). “Electronic Reference Grammars for Typology: Challenges and Solutions”. In: Language Documentation and Conservation 2.2, pp. 296–324.
Partanen, Niko & Michael Rießler (2019). “An OCR system for the Unified Northern Alphabet”. In: International Workshop on Computational Linguistics for Uralic languages (IWCLUL 2019). Tartu: Association for Computational Linguistics, pp. 77–89.
Payne, Thomas E. (1997). Describing morphosyntax. A guide for field linguists. Cambridge: Cambridge University Press.
Rießler, Michael & Joshua Wilbur (2017). “Documenting endangered oral histories of the Arctic. A proposed symbiosis for language documentation and oral history research, illustrated by Saami and Komi examples”. In: Oral history meets linguistics. Ed. by Erich Kasten, Katja Roller, & Joshua Wilbur. Exhibitions and Symposia. Fürstenberg: Kulturstiftung Sibirien, pp. 31–64.
Ruong, Israel (1943). Lappische Verbalableitung dargestellt auf Grundlage des Pitelappischen. Uppsala: Almqvist och Wiksell.
Siegel, Melanie, Emily M. Bender, & Francis Bond (2016). Jacy. An Implemented Grammar of Japanese. CSLI Studies in Computational Linguistics. Stanford: CSLI Publications.
Sjaggo, Ann-Charlotte (2015). Pitesamisk grammatik. En jämförande studie med lulesamiska. Senter for samiske studiers skriftserie 20. Tromsø: Septentrio Academic Publishing.
Wilbur, Joshua (2014). A grammar of Pite Saami. Studies in Diversity Linguistics 5. Berlin: Language Science Press.
Wilbur, Joshua, ed. (2016). Pitesamisk ordbok samt stavningsregler. Samica 2. Freiburg: Albert-Ludwigs-Universität Freiburg.

Gijtov adnet!
gijtov         adnet
gijto-v        adne-t
thank-ACC.SG   have-PL.IMP

Joshua Wilbur
Pite Saami Syntax Project
Freiburg Research Group in Saami Studies
joshua.wilbur@skandinavistik.uni-freiburg.de

with special thanks to Michael Rießler, Niko Partanen, Rogier Blokland and Ciprian Gerstenberger for ideas, collaboration and inspiration