Isalin: Translate
eWika: Towards the Digitalization of Philippine Languages
Charibeth K. Cheng ([email protected])
DLSU, College of Computer Studies
Natural Language Processing Research Lab
MT Research in RP
• started in 1993 at UP-Los Baños
• Dr. Rachel Roxas and Allan Borra
– grammar-based
• started at DLSU in 2004
– hybrid approach
ENG-FIL MT System Project
• 3-year project
• started 2005
• funded by DOST-PCASTRD
• composition:
– 6 faculty members of the College of Computer Studies
– 15 computer science majors
– assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M
Architectural Design of the Program
Language Resources:
• Lexicon (electronic dictionary)
• Morphological Analyzer & Generator
• Part-of-Speech tagger
• Grammar
• Corpus (tagged)
Other components shown in the diagram:
• Translator Engine
• MT: Example-based
• MT: Rule-based
• User Interface
• Output Modeller
• Source Text, Target Text
Rule-Based approach
Apply translation rules
The boy ate apples.
Kumain ng mga mansanas ang batang lalaki.
Where do we get the translation rules?
Example-Based
• Learn the rules from examples
The boy ate apples.
Kumain ng mga mansanas ang batang lalaki.
A B C D (labels for the source words: The, boy, ate, apples)
Rule Learned:
A B C D → C ng D A B
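The rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's engine: the toy lexicon, the function names, and the assumption that the target sentence is already split into aligned chunks are all hypothetical.

```python
from string import ascii_uppercase

# Hypothetical toy lexicon; the project's real lexicon is not shown in the slides.
TOY_LEXICON = {
    "the": "ang",
    "boy": "batang lalaki",
    "ate": "kumain",
    "apples": "mga mansanas",
}


def learn_rule(source_words, target_chunks, lexicon=TOY_LEXICON):
    """Derive a reordering rule from one aligned example.

    Assumes the target sentence is already chunked so that each chunk is
    either the lexicon translation of one source word or a literal word.
    """
    labels = dict(zip((w.lower() for w in source_words), ascii_uppercase))
    target_to_label = {lexicon[w]: label for w, label in labels.items()}
    source_pattern = [labels[w.lower()] for w in source_words]
    # Chunks with no aligned source word (such as "ng") stay as literals.
    target_pattern = [target_to_label.get(c.lower(), c) for c in target_chunks]
    return source_pattern, target_pattern


def apply_rule(sentence, source_pattern, target_pattern, lexicon=TOY_LEXICON):
    """Translate a sentence whose words match the rule's source pattern."""
    words = sentence.rstrip(".").split()
    if len(words) != len(source_pattern):
        raise ValueError("sentence does not match the rule's source pattern")
    bindings = dict(zip(source_pattern, words))
    out = []
    for symbol in target_pattern:
        if symbol in bindings:
            out.append(lexicon[bindings[symbol].lower()])  # translate the slot
        else:
            out.append(symbol)  # literal target word, copied as-is
    text = " ".join(out)
    return text[0].upper() + text[1:] + "."


source, target = learn_rule(
    ["The", "boy", "ate", "apples"],
    ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"],
)
print(source, "->", target)
print(apply_rule("The boy ate apples.", source, target))
```

The learned pattern keeps "ng" as a literal because no source word aligns to it, which matches the rule on the slide.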
Results of the MT Engine
• Qualities of a Good Translation
– Clarity: 3.3
– Accuracy: 3.2
– Naturalness: 2.8
• highest possible score is 5
• 100 respondents (5 linguists)
Challenge!
• Language resources
– Quality of translation depends on them.
– Built from almost non-existent digital forms
– manual vs. automatic construction
Lexicon
• Diksyunaryo ng Wikang Filipino
• automatic construction (AeFLEX):
– accuracy rate: 57%
• Currently contains more than 30,000 entries
• Challenge: Lexical resources
– translation documents
– part-of-speech tagger
Morphological Analyzer and Generator
• Dictionary is incomplete
• Create software that:
– analyzes: determines the root word
– generates: produces the inflected word
Given: eating -> eat -> kain -> kumakain
• Challenge: Lexical resources
– lexicon
– part-of-speech tagger
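The chain eating -> eat -> kain -> kumakain can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's analyzer: the one-entry lexicon and all function names are hypothetical, the English analysis only strips "-ing", and the generator covers only the progressive form of um- verbs (reduplicate the first syllable, then insert "um" before the first vowel).

```python
VOWELS = "aeiou"

# Hypothetical one-entry lexicon (English root -> Filipino root).
TOY_LEXICON = {"eat": "kain"}


def analyze_english(word):
    """Toy analyzer: strip a trailing -ing to recover the root."""
    return word[:-3] if word.endswith("ing") else word


def generate_um_progressive(root):
    """Toy generator for the progressive form of an um- verb."""
    first_vowel = next(i for i, ch in enumerate(root) if ch in VOWELS)
    # Reduplicate the first syllable: kain -> kakain
    reduplicated = root[: first_vowel + 1] + root
    # Insert "um" before the first vowel: kakain -> kumakain
    return reduplicated[:first_vowel] + "um" + reduplicated[first_vowel:]


def translate_inflected(word):
    """Analyze the English word, look up the root, generate the Filipino form."""
    return generate_um_progressive(TOY_LEXICON[analyze_english(word)])


print(translate_inflected("eating"))
```

The lookup step is where the incomplete dictionary matters: a root missing from the lexicon stops the chain, which is why the lexicon is listed as a dependency.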
Part-Of-Speech Tagger
• automatically assigns parts of speech to words in a document
– Can? kaya (to be able) vs. lata (tin can)
– Baba? chin or go down
• Challenge: Lexical resources
– corpora
– lexicon
– morphological analyzer
– grammar
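The ambiguity in the two examples can be sketched with a small lookup table. The lexicon, the tag names, and the single context rule below are assumptions made for illustration; a real tagger would rely on the corpora, lexicon, morphological analyzer, and grammar listed above.

```python
# Hypothetical mini-lexicon: each word maps part-of-speech tags to a gloss.
TOY_LEXICON = {
    "can": {"MODAL": "kaya", "NOUN": "lata"},     # English word, Filipino glosses
    "baba": {"NOUN": "chin", "VERB": "go down"},  # Filipino word, English glosses
}

# Assumed context cue: a determiner before "can" signals the noun reading.
DETERMINERS = {"the", "a", "an", "this", "that"}


def candidate_tags(word):
    """Return every tag the toy lexicon allows for a word."""
    return sorted(TOY_LEXICON.get(word.lower(), {}))


def tag_english_can(previous_word):
    """Pick a tag and gloss for "can" using only the preceding word."""
    tag = "NOUN" if previous_word.lower() in DETERMINERS else "MODAL"
    return tag, TOY_LEXICON["can"][tag]


print(candidate_tags("Baba"))
print(tag_english_can("the"))
print(tag_english_can("I"))
```

A word with more than one candidate tag cannot be translated correctly until the tagger picks one, which is why the choice of tag changes the gloss.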
Corpora
• collection of translation-pair documents
• used by the lexicon extractor, the part-of-speech tagger, and example-based MT
• drawn from translations by DLSU English majors, verified by linguists
• consists of 207,000 words
Bringing it home …
• 171 Philippine Languages (SIL)
• No Philippine Corpora
• Unfortunately, the Philippines today has one of the highest rates of dying languages (Solfed Foundation Inc)
• “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages
• Build the Philippine Corpus
• Build software tools to study or use the corpus
– Across Regions
– Across Forms and Genres
– Across Languages
Across Regions
• Web-based application: GLOBALIZATION
– upload, download, tools
• Contributors (Main players)
• Verifiers
• Server: DLSU-M commits to host the server for the next three years.
• Terms of Use: Research purposes.
Across Languages
• 171 Philippine Languages (SIL List)
• start with 8 major languages
– Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano
• Filipino Sign Language
Across Forms and Genres
• In various forms:
– Text
– Speech
– Video: Filipino Sign Language
• In various genres:
– Text: literary & creative, essays, news articles, religious, etc.
– Speech: scripted, conversations, etc.
– Video: common signs, regional signs, signs for specific purposes (legal, IT, etc.)