language technologies institute school of computer science carnegie mellon university nsf august 6,...
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
NICE: Native language Interpretation and Communication Environment
Jaime Carbonell, Lori Levin, Alon Lavie
Language Technologies Institute, Carnegie Mellon University
{jgc, lsl, alavie}@cs.cmu.edu
Machine Translation of Indigenous Languages
• Policy makers have access to information about indigenous people.
  – Epidemics, crop failures, etc.
• Indigenous people can participate in
  – Health care
  – Education
  – Government
  – Internet
  without giving up their languages.
History of NICE
• Arose from a series of joint workshops of NSF and OAS.
• Workshop recommendations:
  – Create multinational projects using information technology to:
    • provide immediate benefits to governments and citizens
    • develop critical infrastructure for communication and collaborative research
  – training researchers and engineers
  – advancing science and technology
Architecture Diagram
Components shown in the diagram:
• User
• Learning Module
  – Elicitation Process
  – Learning Process
  – Transfer Rules
• Run-Time Module
  – SL Input
  – SL Parser
  – Transfer Engine
  – TL Generator
  – EBMT Engine
  – Unifier Module
  – TL Output
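One plausible reading of the run-time components can be sketched as follows, assuming the transfer path (SL Parser, Transfer Engine, TL Generator) and the EBMT Engine each propose a translation and the Unifier Module chooses among them. The function names, the stub behavior, and the selection rule are all hypothetical; none of them comes from the NICE system itself.

```python
from typing import Callable, List, Optional, Sequence

# An engine maps source-language (SL) text to a target-language (TL)
# candidate, or None when it has nothing to offer. Hypothetical interface.
Engine = Callable[[str], Optional[str]]


def transfer_path(sl_input: str) -> Optional[str]:
    """Stand-in for SL Parser -> Transfer Engine -> TL Generator.

    Hypothetical stub: this sketch has no learned transfer rules,
    so it never produces a candidate.
    """
    return None


def ebmt_engine(sl_input: str) -> Optional[str]:
    """Stand-in for the EBMT Engine, holding one stored example pair."""
    examples = {"I would like to meet her.": "Ayükefun trawüael fey engu."}
    return examples.get(sl_input)


def unifier(candidates: List[Optional[str]]) -> Optional[str]:
    """Hypothetical selection rule: return the first available candidate."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def run_time_module(
    sl_input: str,
    engines: Sequence[Engine] = (transfer_path, ebmt_engine),
) -> Optional[str]:
    """Send the SL input to every engine and let the unifier pick the TL output."""
    return unifier([engine(sl_input) for engine in engines])
```

Keeping each engine behind the same call signature means a learned transfer component could later replace the stub without changing the unifier.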
EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
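The third example shows an EBMT engine reusing fragments of stored translations, and why naive recombination yields the "new" output and not the correct one. The following is a minimal sketch, assuming a hand-built fragment table. The alignments in `FRAGMENTS` and the function `ebmt_translate` are hypothetical, inferred from the two stored pairs above; they are not the NICE system's actual data or algorithm.

```python
# Hypothetical fragment alignments inferred from the two stored example pairs.
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}


def ebmt_translate(sentence: str, fragments: dict = FRAGMENTS) -> str:
    """Translate by greedy longest-match lookup of stored fragments.

    Matched fragments are simply concatenated; no agreement or word-order
    adjustment is attempted. Unknown words are passed through in <...>.
    """
    words = sentence.lower().rstrip(".").split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                output.append(fragments[chunk])
                i = j
                break
        else:
            # No stored fragment covers this word.
            output.append(f"<{words[i]}>")
            i += 1
    return " ".join(output)


print(ebmt_translate("I would like to meet the tallest man"))
# prints: Ayükefun trawüael Chi doy fütra chi wentru
```

The sketch reproduces the "new" output exactly because it only concatenates fragments. It has no way to make the adjustments seen in the correct translation, which is the gap the rest of the architecture has to close.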
NICE Partners
Language (status) | Country | Institutions
Mapudungun (in place) | Chile | Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
Iñupiaq (advanced discussion) | US (Alaska) | Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
Siona (discussion) | Colombia | OAS-CICAD, Plante, Department of the Interior
Agreement Between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
• Contributions of IEI
  – Native language knowledge and linguistic expertise in Mapudungun
  – Experience in bicultural, bilingual education
  – Data collection: recording, transcribing, translating
  – Orthographic normalization of Mapudungun
Agreement Between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
• Contributions of LTI
  – Develop MT technology for indigenous languages
  – Training for data collection and transcription
  – Partial support for data collection effort pending funding from Chilean Ministry of Education
  – International coordination, technical and project management
LTI/IEI Agreement
• Continue collaboration on data collection and machine translation technology.
• Pursue focused areas of mutual interest, such as bilingual education.
• Seek additional funding sources in Chile and the US.
The IEI Team
• Coordinator (leader of a bilingual and multicultural education project):
– Eliseo Canulef
• Distinguished native speaker:
– Rosendo Huisca
• Linguists (one native speaker, one near-native)
– Juan Hector Painequeo
– Hugo Carrasco
• Typists/Transcribers
• Recording assistants
• Translators
• Native speaker linguistic informants
MINEDUC/IEI Agreement Highlights:
Based on the LTI/IEI agreement, the Chilean Ministry of Education agreed to fund the data collection and processing team for the year 2001. This agreement will be renewed each year, as needed.
MINEDUC/IEI Agreement: Objectives
To evaluate the NICE/Mapudungun proposal for orthography and spelling.
To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health, both traditional and Western.
MINEDUC/IEI Agreement: Deliverables
An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect
120 hours transcribed and translated from Mapudungun to Spanish
A refined proposal for writing Mapudungun
NICE/Mapudungun: Database
• Writing conventions (Grafemario)
• Glossary Mapudungun/Spanish
• Bilingual newspaper, 4 issues
• Ultimas Familias (memoirs)
• Memorias de Pascual Coña
  – Publishable product with new Spanish translation
• 35 hours transcribed speech
• 80 hours recorded speech
NICE/Mapudungun: Other Products
• Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
• Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.
Underfunded Activities
• Data collection
– Colombia (unfunded)
– Chile (partially funded)
• Travel
– More contact between CMU and Chile (UFRO) and Colombia.
• Training
– Train Mapuche linguists in language technologies at CMU.
– Extend training to Colombia
• Refine MT system for Mapudungun and Siona
– Current funding covers research on the MT engine and data collection, but not detailed linguistic analysis