
Language and Speech Translation Activities in Thailand

Chai Wutiwiwatchai
National Electronics and Computer Technology Center
National Science and Technology Development Agency
THAILAND

ASEAN-NICT Round Table – Feb 2015


Outline

● U-STAR Speech Translation (since 2007)
- Brief history
- System architecture
- Current status
- Future plan

● ASEAN Machine Translation (since 2012)
- Project overview
- System architecture
- Current status
- Future plan


Outline

● U-STAR Speech Translation (since 2007)
- Brief history
- System architecture
- Current status
- Future plan


U-STAR History [1]

- Collaboration of 30 Asian and European countries
- Modality Conversion Markup Language (MCML), registered as an ITU-T recommendation standard


U-STAR History [1]

● 2009: A-STAR S2ST Live Demo
- Network-based Multilingual S2ST
- 8 Asian languages and English
- Peer-to-peer and Multi-party clients
- Portable devices (UMPC)


U-STAR History [1]

● 2012: U-STAR S2ST Public Service
- Network-based Multilingual S2ST in the travel and sport domain, launched in Jun 2012
- 23 Asian and European languages supported
- VoiceTra4U-M, an iPhone app available for free on the App Store


System Architecture [2]

● ITU-T H.625 – Architecture for network-based speech-to-speech translation services

● ITU-T F.745 – Functional requirements for network-based speech-to-speech translation services


Current Status

● Language Resources
- The Basic Travel Expression Corpus (BTEC) has been translated into member languages since A-STAR
- To extend the service for users during the London Olympic Games, an Olympic expression corpus by Harbin Institute of Technology (HIT) has been acquired and distributed for translation
- A Named Entity (NE) list of words related to Olympic expressions has also been collected from member countries


Current Status

● Examples of Engines [3]

Language         ASR          MT               TTS
English (En)     HMnet (SSS)  -                Concatenative
Hindi (Hi)       -            SMT (Cleopatra)  HMM
Indonesian (Id)  HMnet (SSS)  SMT (Moses)      HMM
Japanese (Ja)    HMnet (SSS)  SMT (Cleopatra)  Concatenative
Korean (Ko)      FST          RBMT (Parser)    HMM
Malay (Ms)       HMM          RBMT (Piramid)   HMM
Thai (Th)        HMM          SMT (Moses)      HMM
Vietnamese (Vi)  HMM          SMT (Moses)      HMM
Chinese (Zh)     HMnet (SSS)  SMT (Cleopatra)  Concatenative
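
The three engine columns above (ASR, MT, TTS) are the usual stages of a speech-to-speech translation chain. The following is a minimal sketch of how such a chain could be wired together, not the U-STAR implementation: the class name S2STPipeline, the per-language dictionaries, and the toy engines are all hypothetical, and "audio" is just UTF-8 text so that the example runs without any models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for real engines; each stage is a plain function.
Asr = Callable[[bytes], str]   # audio -> source-language text
Mt = Callable[[str], str]      # source-language text -> target-language text
Tts = Callable[[str], bytes]   # target-language text -> audio


@dataclass
class S2STPipeline:
    asr: Dict[str, Asr]              # keyed by source language code
    mt: Dict[Tuple[str, str], Mt]    # keyed by (source, target) pair
    tts: Dict[str, Tts]              # keyed by target language code

    def translate(self, audio: bytes, src: str, tgt: str) -> bytes:
        text = self.asr[src](audio)              # 1. recognise speech
        translated = self.mt[(src, tgt)](text)   # 2. translate text
        return self.tts[tgt](translated)         # 3. synthesise speech


# Toy engines (assumptions for illustration only).
toy = S2STPipeline(
    asr={"th": lambda audio: audio.decode("utf-8")},
    mt={("th", "en"): lambda text: {"สวัสดี": "hello"}.get(text, text)},
    tts={"en": lambda text: text.encode("utf-8")},
)

result = toy.translate("สวัสดี".encode("utf-8"), "th", "en")
print(result)
```

Keeping each stage behind a per-language lookup mirrors the table: a member could swap one engine (for example an HMM TTS for a concatenative one) without touching the other stages.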


Current Status

● No. of App Downloads
- 15,645 downloads (until 2012)
- 5,179 Thai downloads (until 2012), 2nd rank
- 65,346 downloads (until 2014)

● No. of Service Transactions
- 26,882 transactions (until 2012)
- 5,179 transactions for Thai (until 2012), 2nd rank
- 514,552 transactions (until 2014)
- 45,056 translations for Thai (until 2014)
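
To put the usage figures above in proportion, the short sketch below computes the Thai share and the 2012-to-2014 growth factors. The numbers are copied from the slide; the helper names share and growth are illustrative only, and the last Thai figure is treated as a transaction count, which is an assumption since the slide labels it "translations".

```python
# Figures copied from the slide above.
downloads_2012 = 15_645
downloads_2014 = 65_346
thai_downloads_2012 = 5_179

transactions_2012 = 26_882
transactions_2014 = 514_552
thai_transactions_2012 = 5_179
thai_transactions_2014 = 45_056  # labelled "translations" on the slide


def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


def growth(later: int, earlier: int) -> float:
    """Return the growth factor later/earlier, rounded to one decimal place."""
    return round(later / earlier, 1)


print("Thai share of downloads, 2012 (%):", share(thai_downloads_2012, downloads_2012))
print("Thai share of transactions, 2012 (%):", share(thai_transactions_2012, transactions_2012))
print("Thai share of transactions, 2014 (%):", share(thai_transactions_2014, transactions_2014))
print("Download growth 2012 to 2014 (x):", growth(downloads_2014, downloads_2012))
print("Transaction growth 2012 to 2014 (x):", growth(transactions_2014, transactions_2012))
```

On these figures, Thai accounts for roughly a third of downloads in 2012, while total transactions grow by about a factor of 19 between the two reporting points.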


Future Plan

● Named-Entities (NE)
- NE words are often language specific
- Using a descriptive or transliterated word for unknown NE words

● Scalability and Extensibility
- Encouraging service maintenance from members
- Improving service performance by using real data
- Extending to new domains and languages

● Service Latency
- The condition of the network is the key
- Setting up communication mirror servers
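
The named-entity plan above (a descriptive or transliterated word for an unknown NE) could be sketched as a simple fallback chain. This is a minimal sketch under assumptions: the function name render_named_entity, the lookup tables, and the fallback order (known translation, then description, then transliteration) are hypothetical and are not taken from the U-STAR system.

```python
from typing import Callable, Dict

# Hypothetical lookup tables; a real system would load these from NE lists.
LEXICON: Dict[str, str] = {"กรุงเทพ": "Bangkok"}
DESCRIPTIONS: Dict[str, str] = {"ต้มยำ": "spicy and sour Thai soup"}
TRANSLITERATIONS: Dict[str, str] = {"สมชาย": "Somchai"}


def toy_transliterate(word: str) -> str:
    """Toy transliterator: table lookup, else return the word unchanged."""
    return TRANSLITERATIONS.get(word, word)


def render_named_entity(
    ne: str,
    lexicon: Dict[str, str],
    descriptions: Dict[str, str],
    transliterate: Callable[[str], str],
) -> str:
    """Pick a target-language rendering for a source-language named entity."""
    if ne in lexicon:            # 1. a known translation exists
        return lexicon[ne]
    if ne in descriptions:       # 2. fall back to a descriptive phrase
        return descriptions[ne]
    return transliterate(ne)     # 3. last resort: transliterate


for word in ["กรุงเทพ", "ต้มยำ", "สมชาย"]:
    print(word, "->", render_named_entity(word, LEXICON, DESCRIPTIONS, toy_transliterate))
```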


Outline

● ASEAN Machine Translation (since 2012)
- Project overview
- System architecture
- Current status
- Future plan


ASEAN-MT Project (Since 2012)

[Figure labels: NECTEC, UCSY, MOST, NIDA, IOIT, LINTON, BPPT, DLSU, UBD, I2R]

● Translation among ASEAN languages is increasingly important to support the coming AEC 2015

● Endorsed by ASEAN SCMIT

● Approved for ASF partial support (2012-2014)


System Architecture


Statistical MT Approach


Current Status

● Kick-off & the 1st Working Committee Meeting
- July 2012
- Pattaya, Thailand

● The 1st Technology Workshop
- January 2013
- Pathumthani, Thailand

● Progress Demonstration at ASEAN COST Meeting
- May 2013
- Tagaytay City, Philippines

● The 2nd Working Committee Meeting
- December 2013
- Ayutthaya, Thailand


Current Status

Country      Language    Translation  SMT     NE Tag
Brunei       Malay       -            -       -
Cambodia     Cambodian   20,000       20,000  20,000
Indonesia    Indonesian  20,000       20,000  20,000
Laos         Lao         20,000       20,000  20,000
Malaysia     Malay       20,000       20,000  20,000
Myanmar      Myanmar     10,000       10,000  20,000
Philippines  Filipino    20,000       20,000  20,000
Singapore    Chinese     20,000       20,000  20,000
Thailand     Thai        20,000       20,000  20,000
Vietnam      Vietnamese  20,000       20,000  20,000

Parallel text corpus: 20,000 sentences in travel domain
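
As a quick check on overall progress, the per-country figures above can be summed per column. The sketch below is illustrative only: the figures are copied from the table, Brunei's row has no numbers in the source and is recorded as None, and the names progress and column_totals are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (translation, smt, ne_tag) sentence counts per country, copied from the table.
Row = Optional[Tuple[int, int, int]]
progress: Dict[str, Row] = {
    "Brunei": None,  # no figures given in the source
    "Cambodia": (20_000, 20_000, 20_000),
    "Indonesia": (20_000, 20_000, 20_000),
    "Laos": (20_000, 20_000, 20_000),
    "Malaysia": (20_000, 20_000, 20_000),
    "Myanmar": (10_000, 10_000, 20_000),
    "Philippines": (20_000, 20_000, 20_000),
    "Singapore": (20_000, 20_000, 20_000),
    "Thailand": (20_000, 20_000, 20_000),
    "Vietnam": (20_000, 20_000, 20_000),
}


def column_totals(rows: Dict[str, Row]) -> Tuple[int, ...]:
    """Sum each column over the rows that actually have figures."""
    filled = [row for row in rows.values() if row is not None]
    return tuple(sum(column) for column in zip(*filled))


totals = column_totals(progress)
print(dict(zip(("translation", "smt", "ne_tag"), totals)))
```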


Current Status

Size      20,000 sentences
Domain    Travel
  People         Greeting, Introduction, Communication
  Survival       Transportation, Accommodation, Finance
  Food           Food, Beverage, Restaurant
  Fun            Recreation, Traveling, Shopping, Nightlife
  Resource       Number, Time, Currency
  Special Needs  Emergency, Health
NE types  17 types


Demonstration in ASEAN COST Meeting, May 2014
http://www.aseanmt.org/demo [4]


Future Plan

● Evaluation
- Overall system evaluation

● Post-Editing Module
- R&D on a post-editing module

● 2015 Activities
- The 2nd technology workshop
- The 3rd working committee meeting
- Final demonstration at ASEAN COST meeting


● Future of S2ST [5]


References

[1] U-STAR consortium, http://ustar-consortium.com/

[2] ITU-T standard, http://www.itu-t.int/

[3] S. Sakti, M. Paul, A. Finch, S. Sakai, T. T. Vu, N. Kimura, C. Hori, E. Sumita, S. Nakamura, J. Park, C. Wutiwiwatchai, B. Xu, H. Riza, K. Arora, C. M. Luong, H. Li, 2011. Toward translating Asian spoken languages. Computer Speech and Language (2011), doi:10.1016/j.csl.2011.07.001.

[4] ASEAN-MT project, http://www.aseanmt.org/

[5] S. Nakamura, 2009. Overcoming the language barrier with speech translation technology. Science and Technology Trends – Quarterly Review No. 31, Apr 2009, pp. 35-48.