CRL Speech Team
© 2003 IBM Corporation
Chinese Romanization for Chinese Voice Browsing
IBM China Research Lab
Index
• Motivations & Proposals
• IPA vs. Chinese Romanization
• Chinese Romanization Standards
• Implementations of Chinese Romanization in SSML
• Extensions for other languages
IBM Speech Synthesis System
• The IBM speech synthesis system supports about 20 languages.
• For Asian languages, we cover:
– Mandarin
– Cantonese
– Korean
– Japanese
– Thai
Pronunciation annotations are important for Chinese
• A Chinese character represents a meaning more than a pronunciation.
• The homograph phenomenon is very common for Chinese characters.
• So it will be very helpful if the pronunciation can be given explicitly.
Proposals
• We propose using Chinese Romanization to annotate Chinese pronunciation in the “phoneme” element.
• We also propose that SSML allow the diverse predefined and widely used pronunciation annotation standards of different languages.
• SSML could then be accepted and used more easily around the world.
• Note: Chinese Romanization = Hanyu Pinyin in this PPT.
Comparison Rule: Goal of SSML
• The goal of SSML is to “provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications”.
• To reach this goal, SSML needs more and more users, such as ordinary Web application developers, who can learn and use it easily.
• So SSML should be defined on the basis of ordinary people’s knowledge and skills rather than professional linguists’ knowledge.
– Otherwise, it will take a long time for SSML to be widely accepted and used around the world.
IPA is not a good fit for Chinese
• IPA tries to collect an exhaustive set of pronunciations for all kinds of languages.
– It has become very complicated and difficult to input.
• A well-educated Chinese adult cannot annotate Chinese pronunciation in IPA without special training.
– IPA is not very popular in China.
• Special linguistic phenomena in Chinese, such as tone and retroflexion, cannot be conveniently described in IPA.
Chinese Romanization is a good fit for Chinese
• Chinese Romanization is designed specifically for Chinese rather than for all languages.
– An ‘r’ is added at the end to describe a “retroflex” syllable.
– A ‘tone’ attribute is added to describe the tone.
• Chinese Romanization is widely used and learned.
– Chinese people learn Chinese Romanization in primary school.
– Many foreigners begin learning Chinese through Chinese Romanization.
– Chinese Romanization is widely used to input Chinese characters on computers.
• The Chinese government has put a standard for Chinese Romanization into effect.
– It applies to education, publishing, information processing and other related industries in China.
Chinese Romanization Standard
• The writing rules of Chinese Romanization conform to the P.R.C. state standard “Basic rules for Hanyu Pinyin Orthography” [1], published by CSBQTS* in 1996.
• This orthography is based on the “Hanyu Pinyin Schema” published in 1958.
• Following the naming convention for alphabets, we propose “x-CSBQTS-96” to represent the Chinese Romanization alphabet. We also propose “x-Pinyin-96” as an alternative that is easier to remember.
* CSBQTS: China State Bureau of Quality and Technical Supervision
Hanyu Pinyin Schema (published in 1958)
• Character set:
– 25 of the letters ‘a’ to ‘z’, plus ‘ü’.
– (For ease of input on a computer, ü is replaced by v.)
• Initial set:
– b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s
• Final set:
– i, u, ü, a, ia, ua, o, uo, e, ie, üe, ai, uai, ei, uei,
– ao, iao, ou, iou, an, ian, uan, üan, en, in, uen, ün,
– ang, iang, uang, eng, ing, ueng, ong, iong
• Tone annotation:
– mā, má, mǎ, mà, ma
• Separator: ’ (apostrophe)
– pi’ao
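The tone annotation above can be read mechanically. The following is a minimal Python sketch, not part of the original slides, that turns one tone-marked syllable into plain letters plus a tone digit. The function name `syllable_to_numbered` is hypothetical, and using 0 for the unmarked (neutral) tone is an assumption; ü is written as v, following the input convention in the character set.

```python
import unicodedata

# Combining marks that carry the four tones once a syllable is in NFD form.
TONE_MARKS = {
    "\u0304": 1,  # macron, as in mā
    "\u0301": 2,  # acute, as in má
    "\u030c": 3,  # caron, as in mǎ
    "\u0300": 4,  # grave, as in mà
}
DIAERESIS = "\u0308"  # the two dots of ü


def syllable_to_numbered(syllable: str) -> str:
    """Convert one tone-marked syllable to ASCII letters plus a tone digit."""
    tone = 0  # assumption: 0 denotes the neutral (unmarked) tone
    letters = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        elif ch == DIAERESIS and letters:
            # The previous letter was the 'u' of 'ü'; write it as 'v'.
            letters[-1] = "v"
        else:
            letters.append(ch)
    return "".join(letters) + str(tone)


if __name__ == "__main__":
    for s in ["mā", "má", "mǎ", "mà", "ma"]:
        print(s, syllable_to_numbered(s))
```

The sketch works on a single syllable; splitting a whole word into syllables is a separate problem that it does not attempt.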
Basic rules for Hanyu Pinyin Orthography (published in 1996)
1. Words are the basic units for spelling the Chinese common language. (A space is used to separate words.)
– rén (person/people), péngyou (friend[s]), túshūguǎn (library/libraries)
– gōngrén hé nóngmín (workers and farmers)
2. Structures of two or three syllables that express a complete concept are linked:
– quánguó (the whole nation), duìbuqǐ (sorry)
3. Terms of four or more syllables are separated if they can be divided into words; otherwise all the syllables are linked:
– wúfèng gāngbǐ (seamless pen), Hóngshízìhuì (Red Cross)
Basic rules for Hanyu Pinyin Orthography (published in 1996)
4. Reduplicated monosyllabic words are linked, but reduplicated disyllabic words are separated:
– rénrén (everybody), chángshi chángshi (give it a try)
5. In certain situations, a hyphen can be added to make words easier to read and understand:
– huán-bǎo (environmental protection), shíqī-bā suì (17 or 18 years old)
Implementation 1
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
  <!-- This is an example of Chinese Romanization with standard tone annotation -->
</speak>
Implementation 2
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
  <!-- This is an example of Chinese Romanization using numbers to describe tone -->
</speak>
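Documents like the ones above can also be generated rather than typed by hand. The following Python sketch, which is not part of the original slides, builds a minimal SSML fragment with the standard library's `xml.etree.ElementTree`. The helper name `build_ssml` and its parameters are hypothetical, the default language tag "zh-CN" is an assumption, and the schema-location attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, ph: str,
               alphabet: str = "x-CSBQTS-96", lang: str = "zh-CN") -> str:
    """Return a <speak> document with one <phoneme> annotation (hypothetical helper)."""
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    phoneme = ET.SubElement(
        speak, f"{{{SSML_NS}}}phoneme", {"alphabet": alphabet, "ph": ph}
    )
    phoneme.text = text  # the Chinese characters being annotated
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("对不起", "duìbuqǐ"))     # tone marks
    print(build_ssml("对不起", "dui4bu0qi3"))  # tone numbers
```

Building the markup through an XML library keeps attribute quoting and character escaping correct whichever tone notation is placed in `ph`.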
Comparison between Two implementations
Implementation 1:
<phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
Implementation 2:
<phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
Note: "x-CSBQTS-96" may be replaced by "x-Pinyin-96".
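The two notations carry the same information, so a processor could accept the numbered form and derive the marked form. The following Python sketch, not part of the original slides, converts a numbered string such as "dui4bu0qi3" into tone-marked Pinyin. The function names are hypothetical; it assumes lowercase input, digits 0 to 4 after every syllable, and v as a stand-in for ü. The mark is placed by the commonly taught convention: on a or e if present, otherwise on the o of ou, otherwise on the last vowel.

```python
import re
import unicodedata

U_UMLAUT = "\u00fc"  # ü
# Combining marks for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
# A run of letters followed by a tone digit is treated as one syllable.
SYLLABLE = re.compile(f"([a-z{U_UMLAUT}]+)([0-4])")


def mark_syllable(letters: str, tone: int) -> str:
    """Place the tone mark for `tone` on the appropriate vowel of one syllable."""
    letters = letters.replace("v", U_UMLAUT)
    if tone == 0:
        return letters  # neutral tone: no mark
    if "a" in letters:
        pos = letters.index("a")
    elif "e" in letters:
        pos = letters.index("e")
    elif "ou" in letters:
        pos = letters.index("ou")
    else:
        pos = max(letters.rfind(v) for v in "iou" + U_UMLAUT)
    if pos < 0:
        return letters  # no vowel found; leave the syllable unchanged
    marked = unicodedata.normalize("NFC", letters[pos] + TONE_MARKS[tone])
    return letters[:pos] + marked + letters[pos + 1:]


def numbered_to_marked(ph: str) -> str:
    """Convert every numbered syllable in `ph`; other characters pass through."""
    return SYLLABLE.sub(lambda m: mark_syllable(m.group(1), int(m.group(2))), ph)


if __name__ == "__main__":
    print(numbered_to_marked("dui4bu0qi3"))
```

Going in this direction needs no syllable segmentation, because each tone digit already marks the end of a syllable.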
Extension for Cantonese
• The Linguistic Society of Hong Kong published the simple, easy-to-learn and easy-to-use “LSHK Cantonese Romanization Scheme” in 1993.
• This scheme is widely adopted in various areas, such as education, Cantonese information processing and computer input methods.
• So we also propose using the “LSHK Cantonese Romanization Scheme” to annotate Cantonese pronunciation.
Extension for more languages
• Although it is possible to devise a general standard for annotating the pronunciation of all languages, such a standard may become very complex to use.
• Another way is to use the predefined and widely accepted pronunciation annotation standards of the different languages.
• At the very least, these diverse standards should be an important complement to the general standard.
Korean Romanization
Korean    Korean Romanization    Meaning in English
밥        pap                    rice
불고기    pul go gi              broiled beef
갈비찜    kal bi jjim            beef rib stew
만두      man tu                 dumplings
홍차      hong ch'a              tea
콜라      k'ol la                cola
우유      u yu                   milk
It is used in our Korean speech synthesis system.
Japanese Romanization
• Japanese:
– まだ覚えているでしょう 波音に包まれて
• Japanese Romanization:
– mada oboeteiru deshou nami oto ni tsutsumarete
• English meaning:
– Do you remember being surrounded by the sound of the tide?
Discussion of “Word”
• What is the definition of “Word” in Chinese?
– Prosodic word or grammatical word?
• 你来还是不来? nǐ lái háishi bù lái?
• Is “不来” a word?
• What is the difference between ‘Word’ and ‘break’?
– The misunderstanding problem can be solved by adding ‘break’.
• Can word information be handled by ‘Hanyu Pinyin Orthography’?
– In ‘Hanyu Pinyin Orthography’, a space is used to separate words.