
Language Archiving at the MPI


Peter Wittenburg

MPI for Psycholinguistics

DOBES Archive (DOkumentation BEdrohter Sprachen: Documentation of Endangered Languages)

(funded by the Volkswagen Foundation)

Nijmegen

[Map labels: NL, G, Rhein]

Still a large variety of languages

• currently 6500 languages world-wide

• Distribution
  • Africa 1995
  • S/SE Asia 1400
  • New Guinea 1109
  • South America 419
  • North Asia 380
  • Central America 300
  • Pacific Area 250
  • Australia 250
  • North America 209
  • Europe 209

Language endangerment

• 97% of the people use 4% of the languages
• 96% of the languages are spoken by 3% of the people
• approx. 6000 of the languages are spoken by about 200 million people
  • on average: 30,000 speakers per language
  • for 50% fewer than 10,000, for 25% fewer than 1000
• for 50% the number of speakers is decreasing dramatically
• pessimistic view (according to Crystal):
  • 90% of the languages will be extinct around 2100!
  • i.e. every second week a language becomes extinct!
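
A quick arithmetic check of the average quoted above, using only the figures from this slide (the variable names are illustrative): 200 million speakers spread over 6000 languages gives roughly 33,000 per language, which the slide rounds to 30,000.

```python
# Figures from the slide; the variable names are illustrative.
speakers = 200_000_000  # "about 200 Mio people"
languages = 6_000       # "approx 6000 of the languages"

average_speakers = speakers / languages
print(f"average speakers per language: {average_speakers:,.0f}")
```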


What can we do?

Documentation + Revitalization

• 2000: DOBES Programme of the Volkswagen Foundation
• many other initiatives and institutions, all meant to be complementary
• the Volkswagen Foundation is devoted primarily to supporting research
• teams get funds for documentation (in general 3 years or more)
• had a very intensive pilot phase full of useful discussions
• it was obvious that all teams felt the need to help the language communities (including the archiving team)

How to do a language documentation?

• based on N. Himmelmann “Documentary and Descriptive Linguistics”

• Documentation: primary focus is on collection, transcription and translation of primary data (observations, elicitations, ...)

• Description: primary focus is on linguistic analysis and special phenomena

• the methods and the results are different

 | Collection | Analysis
Result | Corpus of utterances, notes on observations, comments of involved persons | Descriptive statements illustrated by a few examples
Procedures | Observation, elicitation, recording, transcription, translation | Phonetic, phonological, morphosyntactic, semantic analysis
Methodological issues | Sampling, reliability, naturalness | Definition of terms and levels, justification, adequacy of analysis

How to do a language documentation?

• there is an overlap between the two poles, documentation and description: no interlinear description without a morphological analysis

• documentation has to
  • deliver a comprehensive representation of the “linguistic habitudes and traditions”
  • document spoken language in its communicative and cultural background
    • observed linguistic habitudes and meta-knowledge
    • a holistic view of language is important
  • be interesting for other disciplines, in particular the primary data
  • help the language community

• therefore a natural focus on audio & video recordings

DOBES language documentation

• language in its cultural background
• “theory-neutral” representation
• lots of multimedia (audio, video) recordings as basis
• where possible base everything on primary data

• linguistic goals
  • annotations (orthographic transcription, translation, ...)
  • a morphological/syntactic analysis only for a small part
  • sketch grammar, limited topic-oriented lexicon

• ethnologists, musicologists and ethnobiologists are also involved
• in total about 3 years

• idea: later generations should be able to reproduce the language
• material could later be extended

Traditional annotation

Text Annotation

Modern annotation

Multimedia Annotation

DOBES Map

[Map of the DOBES documentation projects and the archive. Projects: Aweti, Kuikuro, Trumai, Salar/Monguor, Teop, Wichita, Lacandon, Chipaya, Chaco Languages, Waima’a, Svan/Udi/Tsova-Tush, Tofa, Hocank, Marquesan, Chintang/Puma, !Xoo, Akhoe Hai//om, Mawe/Bakairi/Katxuyana, Tsafiki, Iwaidja, Chontal, Bora/Ocaina, Saliba, Beaver, Sri Lanka Malay, Sami, Nenets, Jaminjung, Semang, Totoli]

• 30 documentation teams (at the MPI also 30 expeditions per year)
• 1 archiving team

Waima’a (East Timor)

Mauricio Belo, Caisido village
John Bowden, Australian National University
John Hajek, University of Melbourne
Nikolaus Himmelmann, Ruhr-Universität Bochum

la enen iat
before PTL
Once upon a time

bu taha k’omu ruo bu wai-dura loo ligasaun ini
HON mud ball and HON cricket make closeness RCP
A ball of mud and a cricket were friends

sire ruo laka khuu rahmhutu busa
3p two go clean together garden
The two of them went to clean the gardens together

 | | Labial | (Post-)alveolar | Velar | Glottal
Stops | voiceless unaspirated | (p) | t | k | '
 | voiceless aspirated | (ph) | th | kh |
 | voiceless ejective | p' | t' | k' |
 | voiced | b | d | g |
Fricatives | plain | (f) | s | | h
 | ??? | | s' | |
Nasals | plain | m | n | |
 | ??? | mh | nh | |
 | ??? | m' | n' | |
Laterals | plain | | l | |
 | ??? | | lh | |
 | ??? | | l' | |
Tap / trill | | | r | |
 | | | rr | |
Glides | plain | w | | |
 | ??? | wh | | |
 | ??? | w' | | |

Trumai (Amazonas)

Stephen Levinson, MPI Nijmegen
Raquel Guirardello-Damian, Museu Paraense Emílio Goeldi

• about 100 people
• about 51 speaking Trumai

Salar/Monguor (China)

Salar villages along the Yellow River

Salar children above Dashyinix village

Shaman in Huzhu Mongghul county

Drummers in the Nadun festival, Minhe county

Painting the faces of possessed Wutu, Niandehu township

Tofa, Tozhu, Tsengel Tuva, Tuha (Siberia)

• David Harrison (Yale)
• Brian Donahoe (Manchester)
• Sven Grawunder (Halle)

• Language: its structure and sounds.
• Oral folklore: texts, narratives and personal stories, belief systems, naming systems.
• Music: singing and sound mimesis.
• Traditional ecology: nomadism, pastoralism, hunting and reindeer herding.

Shaman Ceremony

Language documentation for whom?

• for interested researchers

• for students and schools

• for journalists

• for the interested public

• for the language communities

• for future generations


For language communities

• language maintenance or even revitalization
  • maintenance of the language, identity, self-consciousness
  • creation of school and other educational material
  • support for local/regional centers (create and dl complete copies)
  • improve access to archives

• in communities big interest in recordings, in particular video

For future generations

• in a future world of monocultures it will become important to know about earlier diversity

• as now, it will be important to know one's own roots

• it may be relevant to point to the different types of languages

• let’s be honest: we don’t know what future generations will do with the material

Why archives?

• many reasons
• Dietrich Schüller: 80% of our recordings about culture and languages are endangered!
  • storage is inadequate (media, formats, PC, ...)

• selection of suitable technologies requires expert knowledge

• creation of redundant storage and migration is important
  • requires discipline and has to be independent of individual persons

• migration to new technologies can be very expensive

• only centres can do this

• AND: requires explicitness, so that at the end there is a viewable corpus

• international trend: DOBES, AILLA, ELAR, PARADISEC, LACITO, ...

What is a “modern” digital archive?

• traditional archives
  • focus on preserving physical content
  • access not permitted

• digital archives
  • the physical object is almost irrelevant (tape, CD-ROM)
  • the content has to be preserved
  • why this revolutionary change?
    • copies can be made lossless (let’s be careful with compression)
    • copies can be created at low cost

• modern digital archive
  • long-term preservation of the content (migration, distribution)
  • access to the content
  • enrichment without affecting the content
  • sensitive management of access

• DOBES has to be a living archive (interactive, expandable)

Long-term preservation

• can we guarantee survival of bit-streams? NO
• can we increase the chances of survival? YES
• our storage media are not adequate

• how to do it
  • continuous migration (copies to new generation)
  • world-wide distribution (now within Germany/NL)
  • problem of interpretability not solvable
  • have to take care of ethical/legal aspects
  • maintenance costs are crucial for survival

• all MPI material is available in 7 copies at different locations

[Figure: lifetime of storage media on a time axis marked 0, 250, 500, 1000 and 2000 years; labels: clay tablets, various e-media]
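
Keeping several copies only increases the chances of survival if damaged copies are detected and replaced during migration. The following is a minimal sketch of such a fixity check, assuming each copy is reachable as a local file path and that the majority checksum is the correct one; the function names are illustrative and not part of any MPI tool.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file without loading it whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose checksum differs from the majority checksum.

    Assumption: most copies are intact, so the most common checksum wins.
    """
    checksums = {copy: sha256_of(copy) for copy in copies}
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return [copy for copy, value in checksums.items() if value != majority]
```

A copy reported by `find_damaged_copies` would then be overwritten from one of the intact locations before the next migration step.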


Pillars of Digital Archives I

• strict separation of physical and logical access layers

• the physical domain is for system managers and archive managers, and it changes

• the logical domain (created by linguists) remains and is stable
• metadata is the glue and has to be maintained

[Diagram labels: domain of physical resources, conceptual domain of resources, corpus manager, user, creator, system manager]

Pillars of Digital Archives II

[Diagram labels: archive organization, layer of language, layer of sessions; lexicon, intro, films, notes, songbook, video recording, sound recording, annotations]

Pillars of Digital Archives III

• separation between object and instance

• need Unique Resource IDs (URIDs)
• and a robust “resolving” mechanism

[Diagram labels: MPI Repository, GWDG Repository, URID Resolver, MPI Portal with metadata, XYZ Portal with metadata; mappings between them]
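
The separation between object and instance can be sketched as a lookup table from a stable URID to the current locations of its copies. This is only an illustration: the class name, the URID syntax and the example URLs are hypothetical and do not describe the actual MPI resolver.

```python
class URIDResolver:
    """Maps a stable URID to the current locations of its copies."""

    def __init__(self) -> None:
        self._locations: dict[str, list[str]] = {}

    def register(self, urid: str, location: str) -> None:
        """Record one more physical copy (instance) of the object."""
        copies = self._locations.setdefault(urid, [])
        if location not in copies:
            copies.append(location)

    def relocate(self, urid: str, old: str, new: str) -> None:
        """Update a location after a migration; the URID stays the same."""
        copies = self._locations[urid]
        copies[copies.index(old)] = new

    def resolve(self, urid: str) -> list[str]:
        """Return all known locations; raises KeyError for unknown URIDs."""
        return list(self._locations[urid])


# Hypothetical URID and URLs, for illustration only.
resolver = URIDResolver()
resolver.register("urid:example:0001", "https://mpi.example.org/data/0001.wav")
resolver.register("urid:example:0001", "https://gwdg.example.org/copy/0001.wav")
resolver.relocate(
    "urid:example:0001",
    "https://mpi.example.org/data/0001.wav",
    "https://mpi.example.org/store2/0001.wav",
)
print(resolver.resolve("urid:example:0001"))
```

Portals and metadata records would then store only the URID, so a move of the physical file requires one update in the resolver and none in the metadata.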


Pillars of Digital Archives IV

• need versioning

• nothing may be deleted, but annotations will be changed!
• the research world is dynamic: we want enrichment/extension

[Diagram labels: URID Resolver; access lists "userx=read, usery=read, etc." and "userx=write, usery=read, etc."]
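
"Nothing may be deleted, but annotations will be changed" suggests an append-only version history per object. A minimal sketch under that assumption follows; the class and the URID are hypothetical, and the slides state that versioning was still being worked on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionedObject:
    """Append-only history of one archived object (illustrative only)."""

    urid: str
    versions: list[bytes] = field(default_factory=list)

    def add_version(self, content: bytes) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(content)
        return len(self.versions)

    def get(self, version: Optional[int] = None) -> bytes:
        """Return a given version, or the latest one if none is given."""
        number = len(self.versions) if version is None else version
        if not 1 <= number <= len(self.versions):
            raise LookupError(f"{self.urid} has no version {version}")
        return self.versions[number - 1]


annotation = VersionedObject("urid:example:0002")
annotation.add_version(b"first transcription")
annotation.add_version(b"corrected transcription")
print(annotation.get())    # latest version
print(annotation.get(1))   # the earlier version is still available
```

There is deliberately no delete method: a change always adds a version, so earlier states remain citable.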


Principles V – Authentication&Authorization

• authentication and authorization have to be separated

• URIDs are the central link to authorization information

• need to have space for policies, procedures, declarations etc.
• but administrative effort has to be minimized!

[Diagram labels: URID Resolver; access lists "userx=read, usery=read, etc." and "userx=write, usery=read, etc."]
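
The separation can be sketched with two independent functions: one that only establishes who the user is, and one that looks up what that user may do with a given URID. The session table, the rules and all names below are hypothetical and only mirror the "userx=read / userx=write" labels of the diagram.

```python
from typing import Optional

# Hypothetical session table: authentication only answers "who is this?".
SESSIONS = {"token-abc": "userx", "token-def": "usery"}

# Hypothetical access rules keyed by URID: authorization answers "may they?".
ACCESS_RULES = {
    "urid:example:0001": {"userx": {"read"}, "usery": {"read"}},
    "urid:example:0002": {"userx": {"read", "write"}, "usery": {"read"}},
}


def authenticate(token: str) -> Optional[str]:
    """Return the user name for a valid session token, else None."""
    return SESSIONS.get(token)


def is_authorized(user: Optional[str], urid: str, action: str) -> bool:
    """Check the rules attached to the URID; unknown users get nothing."""
    if user is None:
        return False
    return action in ACCESS_RULES.get(urid, {}).get(user, set())


user = authenticate("token-abc")
print(is_authorized(user, "urid:example:0002", "write"))
```

Because the rules hang off the URID rather than a file path, they survive a migration of the physical resource unchanged.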


Principles VI – Formats

• only open, well-documented and widely used formats (encoding standards) should be used in the archive

• where possible generic schemas should be the basis

• in DOBES strong recommendations for a few archival formats
  • JPEG/TIFF/PNG, MPEG2, Linear PCM, Unicode, XML
  • Plain Text, HTML, (PDF) possible

• at the MPI less restrictive (therefore great danger with some types)
  • for presentation purposes also MPEG1/4, MP3, HTML
  • a large variety of import formats (Shoebox, CHAT, Word, ...)
  • conversion as much as possible towards generic files (LMF, EAF, ?)

• archived objects have to be stored in a neutral way and be accessible as individual objects

• no encapsulation for primary objects

• nevertheless: the MPI archive takes almost all data (even 16mm films)
• but conversion can be very costly
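
A format policy like this can be enforced at ingest with a simple check. The sketch below is an assumption-laden illustration: it identifies formats by file extension only, and the mapping from the recommended formats to extensions (for example `.wav` for Linear PCM, `.mpg` for MPEG2) is my own choice, not a DOBES rule.

```python
from pathlib import PurePath

# Assumed extension mapping for the recommended archival formats.
ARCHIVAL_EXTENSIONS = {
    ".jpg", ".jpeg", ".tif", ".tiff", ".png",  # JPEG/TIFF/PNG
    ".mpg", ".mpeg",                           # MPEG2
    ".wav",                                    # Linear PCM
    ".xml", ".txt", ".html", ".htm", ".pdf",   # XML, plain text, HTML, (PDF)
}


def is_archival_format(filename: str) -> bool:
    """Return True if the file extension is on the assumed archival list."""
    return PurePath(filename).suffix.lower() in ARCHIVAL_EXTENSIONS


for name in ["session01.WAV", "notes.doc", "lexicon.xml"]:
    status = "accept" if is_archival_format(name) else "needs conversion"
    print(f"{name}: {status}")
```

A real check would inspect the file content as well, since an extension says nothing about the codec or character encoding inside.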


MPI Archive – state

• more than 150,000 objects (in the online archive, ~1/3 of the data)

• in total more than 15 TB
• per year about 4 TB in addition

• several sub-archives (EL, SL, ESF, CGN, ...)

• MPI archive ingest is open for other people!

• completely structured by open XML files based on the IMDI schema

• a complete machinery is available
• URIDs & versioning are being worked on at this moment

MPI Archive – Access

[Diagram: The Archive and its access paths. Labels: user; archive utility layer (metadata tools, data ingestion & management, user authentication, access rights); the archive (domain of descriptive metadata; domain of registered primary and secondary resources; primary resources: texts, images, sound, movies); archive access (annotation exploitation, lexicon exploitation, text exploitation); ontological knowledge; archive enrichment (media annotation, lexical encoding, web commentary)]

MPI Archive – Metadata and Simple Access

• metadata is open!
• what is minimal metadata? (ongoing discussion)

• IMDI Editor
• BatchModifier (to change lots of IMDI files)
• IMDI XML Browser (operates in distributed XML domain)
• IMDI HTML Browsing (on-the-fly transformation of XML)
• structured search in XML and HTML domain
• unstructured search in XML and HTML domain
• searchable via Google
• geographic browsing via Google Earth (work in progress)
• DC/OLAC bridge via OAI port (all IMDI stuff can be harvested)

• manuals and training courses

• direct access to simple objects via plug-ins
• complete sub-tree download
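
Harvesting through an OAI port follows the OAI-PMH protocol, in which a harvester sends a `verb` and a `metadataPrefix` as query parameters. The sketch below only builds such a request URL; the base URL is a placeholder, not the address of the MPI archive, and which prefixes the archive actually exposes is an assumption.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL.

    "oai_dc" (unqualified Dublin Core) is the prefix every OAI-PMH
    repository must support; an OLAC data provider would also offer "olac".
    """
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Placeholder endpoint, for illustration only.
print(list_records_url("https://archive.example.org/oai"))
print(list_records_url("https://archive.example.org/oai", "olac"))
```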


Geographic Browsing


MPI Archive – Upload Access

• two options
  • manual integration: exceptions are easy, but too many teams (~60)
  • LAMUS-controlled integration: exceptions are difficult, users do it themselves (?)

• LAMUS features
  - web-based operation
  - request of a work space
  - specification of an accepted upload node (archive anchor)
  - extend and manipulate the corpus structure
  - upload metadata descriptions
  - upload any type of resources (configurable format control)
  - create a linked sub-archive in the workspace and integrate this into the archive
  - checks to guarantee consistency and format compliance

MPI Archive – Utilization Access

tool is ANNEX


MPI Archive – Utilization Access

tool is LEXUS


MPI Archive – Utilization Access

Problem

• different structures and formats

• different terminologies

tools are ANNEX/LEXUS


MPI Archive – State of Access

• at this moment almost everything from DOBES is closed

• lots of requests by journalists

• the first 15 teams have to finish in the coming months
  • working hard
  • changing a lot until the last minute, of course

• expect some material to become open
• but much will have to be handled on request

End

Mark Abley (Canadian)

Each time we lose a language
the ghosts who made use of it
cast a new bell.
The voices magnify. Soon,
listen, they’ll outpeal the tongues of earth.

Thanks for your attention.


Lots of differences

Differences at all linguistic layers

• Phonemic
• Prosody
• Phonology
• Morphology
• Syntax
• Semantics
• Pragmatics

Reduced Languages

• Whistling of Gomera fishermen
• Sign Language of Plains Indians
• “Computer” Languages
• ...

Sound Systems

[Figures: spectra and formants (F1 to F5); vowel distribution (28 languages); formants over time]

• Rotokas (Papua New Guinea)
  • vowels a/e/i/o/u
  • 6 consonants p/t/k/v/r/g

• !Xoo (southern Africa)
  • 141 sounds incl. click sounds

Tone Systems

• modulation of segmental information by prosody
  • stretches across phrases and sentences

• tones: distinguish the meaning of words
  • Swedish: 2 tones (anden vs. ándén)
  • German: aufbäumen vs. aufBäumen

• Mandarin Chinese: 4 tones
• Cantonese: 9 tones
• Vietnamese: 8 tones
• some South-East Asian languages: up to 15 tones

[Figures: intonation contour ("dr ai st"); the 4 tones of Mandarin Chinese on the syllable "i", glossed as stuff, to suppose, chair, meaning]

Morphosyntax

• rules for the generation of words and grammatical structures

• strictly isolating languages: one morpheme, one word
  • Chinese is an isolating language

• the other extreme are the polysynthetic languages

• example from Yup’ik (Inuit):
  tuntussurqatarniksaitengqiggtuq
  tuntu ssur qatar ni ksaite ngqiggt uq
  reindeer hunt FUT say NEG again 3SG:IND
  “he had not yet said that he wanted to hunt reindeer”

• basic principle: the stem is inflected by many affixes

• unusual for us: isolated core morphemes cannot be interpreted; “ssur” (the verb stem) uttered in isolation does not make sense

Dialog style

• norms for expressing things/activities differ

• example from Kilivila (Trobriand Islands, New Guinea)

Person: Ambeya
Where do you go to?

Gunter: (wants to say: I will wash myself)
Bala bakakaya
I will go I will take a bath

Host: Bila bikakaya bike’ita bisisu bipaisewa
3.Fut-go 3.Fut-bathe 3.Fut-come.back 3.Fut-be 3.Fut-work
He will go, he will take a bath, he will come back, he will stay, he will work.
He will take a bath, come back again and work with us

Pronouns

• in Kilivila
  • the inclusive and exclusive dual (“we two”; “myself and the others except you”)

• in Paamese (Vanuatu archipelago)
  • in addition the paucal: “a few”

Spatial orientation

[Diagram: egocentric system (above, below, behind, right) versus absolute system (north, south, west, east)]

• Herberger would use the egocentric system to describe the scene
• Aborigines would choose the absolute system, which is hardly possible for us: “the ball lies east of the player”

Awareness

• since 1866 efforts to preserve diversity in nature

• 1991 problem in the focus of the Linguistic Society of America
• 1992 discussion at the International Congress of Linguists
• 1992 working group for endangered languages in the German linguistic society
• 1993 UNESCO project to create the red list

• 2000 DOBES programme of the Volkswagen Foundation
• within 2 decades broad awareness amongst linguists

• David Crystal, amongst first-semester students:
  • 75% don’t know anything about the problem
  • most don’t see a problem

• how come there is attention for tigers etc. but not for languages?

Factors are known

• external factors
  • military suppression
  • religious conversion
  • economic dominance
  • cultural dominance
  • educational suppression

• internal factors
  • negative attitude towards own language
  • avoidance of discrimination
  • hope to earn (more) money
  • improvement of mobility
  • youngsters are trend followers
  • ...

MPI Archive – Content Overview

 | MD sessions | video files | audio files | photos | other media | textual files | sub-types
MPI | 18524 | 14085 | 5131 | 7774 | 1315 | 13979 | 365 EAF, 2377 CHAT, 5580 MediaTagger, 3568 PlainText/Shoebox, 1589 others
DOBES | 1396 | 1043 | 1250 | 63 | 20 | 205 | 46 EAF, 85 Shoebox, 72 others
Dutch Spoken Corpus | 12767 | | 12767 | | | 41832 | to be converted to EAF
Dutch Bilingual Database | 874 | | 191 | | | 714 | CHAT, EAF
ECHO Sign Language | 168 | 296 | | | | 181 | EAF
ESF corpus | 994 | 546 | | | | 1775 | in CHAT
Total | 34723 | 15970 | 19339 | 7837 | | 58686 | 136,555 objects
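
The stated totals can be cross-checked with a few sums. The assignment of each number to a column is my reading of the flattened table, chosen because it reproduces every stated total; the "other media" column (1315 and 20) has no stated total and is apparently not part of the 136,555 objects.

```python
# (values per corpus, total stated on the slide); column assignment is an
# interpretation of the flattened table, not given explicitly in the source.
columns = {
    "MD sessions":   ([18524, 1396, 12767, 874, 168, 994], 34723),
    "video files":   ([14085, 1043, 296, 546], 15970),
    "audio files":   ([5131, 1250, 12767, 191], 19339),
    "photos":        ([7774, 63], 7837),
    "textual files": ([13979, 205, 41832, 714, 181, 1775], 58686),
}

for name, (values, stated_total) in columns.items():
    assert sum(values) == stated_total, name

grand_total = sum(stated_total for _, stated_total in columns.values())
print(grand_total)  # 136555, matching the "136.555 objects" on the slide
```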