company confidential 1 © 2005 nokia ssml extensions aimed to improve asian language tts rendering...

1 © 2005 Nokia

Company Confidential

SSML Extensions Aimed To Improve Asian Language TTS Rendering

--W3C Workshop on Internationalizing the Speech Synthesis Markup Language, Beijing, China, November 2-3 2005

Jilei Tian*, Xia Wang+, Jani Nurminen*

Multimedia Technologies Laboratory *Nokia Research Center, Finland

+Nokia Research Center, China

2 © 2005 Nokia


1. SSML Overview And Status

2. Peculiarities In Asian Languages

3. Proposal To SSML Extension

4. Examples With Proposed SSML

Outline

3 © 2005 Nokia


• Speech Synthesis Markup Language (SSML) is a standard way of producing content to be spoken by a speech synthesis system

• Current SSML mainly designed/focused on English and European languages

• SSML should be improved and extended for rendering Asian languages including Chinese language, etc.

• Position on TTS development

• Core engine: formant based and concatenative acoustic unit based TTS systems;

• Structure: text processing, prosody processing and acoustic processing modules

• Languages: Mandarin Chinese, English, ……

• Interface between modules:

• Input to text processing module: SSML

• Output from acoustic module: waveform

• Rest of interface: revised ECESS XML

SSML Overview And Status

4 © 2005 Nokia


• Asian language• Tonal;• Syllabic;• No word marker/break;

• Features• Tone: supra-segmental features and controlled in the supra-segmental

layer, or within the basic unit, e.g. syllable or final level;• Tone Sandhi: tone might change inside a word due to the context,

differed from lexical tone. Word could be perceived as a totally differerent word if tone sandhi is not considered.

• Acoustic unit: syllables or sub-syllabic structures like initials and finals, rather than phonemes for western languages. More flexibility in SSML to better support different languages.

• Word segmentation: no explicit marker, essential part of the text processing.

• Multi-linguility in SSML: due to loan words, URLs, email, etc.• Written language vs. spoken language; dialect vs. accent

(personalization)

Peculiarities in Asian Languages

5 © 2005 Nokia


• SYLLABLE element <syllable>• Natural to use syllables or tonal syllables as the basic units of a TTS

system for the language, such as Mandarin Chinese;• Orthorgraphic plus romanized form or transliteration would be a better

representation for SSML. For example: HanYu pinyin in Mandarin Chinese; Jyutping for Cantonese; other transliteration system for Thai, Vietnamese, Hindi, Urdu, etc.

• WORD element <word>• Important in languages that don't have word boundaries (e.g. Thai,

Chinese, Vietnamese);• Crucial for tone sandhi since many tone changes happen within a word;

• WORD element to enhance the word segmentation and tone sandhi, in case that automatic word segmentation does not work or user forces the system to take a certain word segment. Its attribute is the segmented word.

• <pos> could be optional and used for determining the pronunciation, tones and stress level of given word.

• <break> can be used for defining the break strength at the boundary (character boundary, word boundary, prosodic word boundary, prosodic phrase boundary, sentence boundary, etc).

Proposal to SSML Extension (1)

6 © 2005 Nokia


• PROSODY element <prosody> update• Pitch contours play a very important role for tonal languages. Different tone leads

to completely different meanings.• Propose to enhance prosodic features, particularly on pitch. Prosody features are

given in (time, value) format. This approach gives the possiblity to cover any prosodic needs.

• As shown in example SSML for Mandarin Chinese, <syllable> used to define the given character. <frequency> and <energy> are introduced to describe prosodic features, pitch and volume, in the (time, value) format in order to have a better representation capability for prosodic features.

• Practical issues• Absolute vs. relative frequency in pitch;• define the tone contours out of the word level and to use those definitions in the

word level; no need to specify the frequencies for each word;• For parameter-based synthesis, tones are defined as another feature in the

suprasegmental level, like duration and stress;• Multilingual extensions to SSML

• Mixed multilingual texts (IBM中国研究中心 , Mp3 播放器 ); pronunciations for loan words; URLs, email address, etc.

• Abbreviations: asap, cu• Script language - Spoken language – spoken area (e.g. English – English – US, UK,

Australia, etc., or Chinese – Cantonese – HK; or Spanish – Spanish – US)

Proposal to SSML Extension (2)

7 © 2005 Nokia


<?xml version="1.0" encoding="UTF-8" ?> <speak version=“1.0” xml:lang="cn"> <s> <TOKEN token="下午我还有事 "> <WORD word="下午 "> <POS> <NOM /> </POS> <SYLLABLE syl="下 "> <PHONETIC>xia4</PHONETIC> <duration value="267" /> <frequency> <pair time="0" value="380" /> <pair time="80" value="363" /> <pair time="160" value="340" /> <pair time="240" value="301" /> </frequency> <energy> <pair time="267" value="74" /> </energy> <syllable> <break strength="none" time="0" /> </syllable> </SYLLABLE>

<SYLLABLE syl="午 "> <PHONETIC>wu32</PHONETIC> <duration value="181" /> <frequency> <pair time="0" value="290" /> <pair time="54" value="285" /> <pair time="108" value="285" /> <pair time="162" value="290" /> </frequency> <energy> <pair time="181" value="71" /> </energy> <syllable> <break strength="weak" time="0" /> </syllable> </SYLLABLE> </WORD>……………………………………………………………… </TOKEN> </s></speak>

Examples With Proposed SSML

company confidential 1 © 2005 nokia ssml extensions aimed to improve asian language tts rendering...

Documents