making full use of chinese speech corpora thomas fang zheng center of speech technology state key...

Making Full Use of Chinese Speech

CorporaThomas Fang Zheng

Center of Speech TechnologyState Key Laboratory of Intelligent Technology and Systems

Tsinghua Universityhttp://sp.cs.tsinghua.edu.cn/

Beijing d-Ear Technologies Co., Ltd.http://www.d-Ear.com

Oct. 2, 2003

O-COCOSDA, Oct. 1-3, 2003

Sentosa, Singapore

http://sp.cs.tsinghua.edu.cn/

http://www.d-ear.com/

Your Partner in the Century of Speech

2

Outline

Purpose of speech corpora

Factors to be considered in data creation

Data creation

Data transcription

Learning from corpora

Chinese Corpus Consortium (CCC)


3

Purpose of Speech Corpora

Item Description Percentage

1. Speech/ speaker recognition

system development, evaluation, sentence comprehension and summarization, speech recognition, speaker recognition

73%

2. Speech synthesis

system development, prosodic analysis 11%

3. Acoustic analysis

acoustic analysis, speech coding 9%

4. Sentence analysis

syntactic and semantic analysis 5%

5. Speech/ language education

speech and language education 2%


4

Outline



Data creation

Data transcription




5

Factors to be considered in data creation (1)

The language. Language: e.g., Chinese or English Dialectal background (e.g., for Chinese) :-

Putonghua or standard Chinese ( 普通话 ); Mandarin ( 官话， Northern China); Wu ( 吴语， Southern Jiangsu, Zhejiang, and Shanghai); Yue ( 粤语， Guangdong, Hong Kong, Nanning Guangxi); Min ( 闽南话， Fujian, Shantou Guangdong, Haikou Hainan, Taipei

Taiwan); Hakka ( 客家话， Meixian Guangdong, Hsin-Chu Taiwan); Xiang ( 湘， Hunan); Gan ( 赣， Jiangxi); Hui ( 徽， Anhui); and Jin ( 晋， Shanxi).

Special for Chinese :- Simplified Chinese Traditional Chinese


6


7

太湖片

台州片

瓯江片

？

处衢片

苏沪嘉小片江淮官话

徽语

宣州片杭州小片

林绍小片


8


Speaking style :- Read: for ASR in earlier research, or for TTS Spontaneous/conversational: for ASR nowadays

Recording channel Depending on goal of task or application, or the

application environment Close-talk microphones: for personal computers (PCs) Telephone, and/or cellular phone: for telephony applications Specific channel: for embedded applications (PDA, digital

recorder, ...), or broadcast news, TV news. Normally mono channel instead of stereo channel. However, microphone array may be used for some

research purpose


9


Sampling rate :- 8 kHz: for the telephone/mobile-phone channel where the

bandwidth is about 3.4 kHz 16 kHz: for the close-talk microphone PC channel though the

bandwidth is higher than 8 kHz. Sampling precision :-

16 bits, normally. 8-bit A-law or Miu-law (13-bit wide after decompression).

Signal-to-Noise Ratio (SNR) level: Was/is often collected in a good environment (clean speech

database). For noise-related research, noisy data obtained via :-

Noises (NOISEX 92) mixed with clean speech; Collected in real-world noisy environments.


10


Number of speakers and Speaker balance: The more, the better: with a good speaker diversity,

according to :- Gender; Age; Education; Birthplace (or dialectal background); Occupation; and so on.

Corpus size: Measured by either the number of speakers or the

length of valid speech in hour, or both.


11

Outline



Data creation

Data transcription




12

Data Creation

Purposes for ASR corpus :- Acoustic training - Training Set Speech recognition evaluation (testing) - Testing Set

Categories of ASR corpus :- Read speech; Spontaneous/conversational speech.

Design of a speech database before creation :- Aspects as mentioned above:

language, speaking style, recording channel, sampling rate and precision, and corpus size: according to the application/task background;

SNR levels, number of speakers and the speaker balance: for diversity consideration.

Speaking content balance - for content diversity consideration, to provide a good training set,

For read speech, the balance could be on a basis of– phone, di-phone, tri-phone, and so on;– IF, di-IF, tri-IF, syllable, di-syllable, tri-syllable for Chinese

For spontaneous speech, topics design.


13

Data Creation – Read Speech (1)

Though spontaneous ASR is becoming one of the research focuses, the read speech database collection is still necessary.

A high quality read speech corpus is helpful to train a good initial acoustic model, and then

In spontaneous ASR, pronunciation modelling techniques as well as pronunciation lexicons are adopted to get a practically good acoustic model finally.


14


Goal of read speech corpus design is often to balance the speech recognition units (modelling units), so as to cover as many co-articulations as possible using a set of sentences as small as possible.

Such a minimal sentence set can be used for not only the training of acoustic models but also the speaker adaptation.


15


Sentence design example. Goal is to choose 6,000 sentences (about 0.75%)

from 800,000 sentences taken from the People's Daily with a balanced di-IF distribution.

Several criteria: Natural selection - randomly. Almost natural di-IF

distribution. Restraining high-frequency di-IFs (RHF). To restrain

those high-frequency di-IFs from occurring more frequently so that each di-IF occurs almost equally to well train the acoustic model.

Encouraging low-frequency di-IFs (ELF). As an alternative, to encourage those low-frequency di-IFs as frequently as possible. # of occurring times of any di-IF should be greater than a doable pre-defined threshold.


16


0

100

200

300

400

500

600

Average Natual RHF ELF

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Average Natural RHF ELF

(a) (b)

In this example, all seen di-IFs are sorted in an ascending order of their occurring counts in the mother set. The occurring count as a function of the di-IF order is shown in the above Figure: (a), the first 90% di-IFs, and (b), the next 10% di-IFs.


17


RHF: helpful to make distribution as close to the average curve as

possible It cannot guarantee a minimal occurring count for those less

frequently seen di-IFs. ELF:

Guarantee a minimal occurring count so that data is enough to train the models for those less frequently seen units

Will lead to relatively big occurring counts for those most frequently seen di-IFs though the distribution is close to natural.

It is expected that a good result can be achieved by combining both the RHF and the ELF strategies.


18


Quality control is very important, for read speech database, there could be three steps: Self-check Cross-check Final-review


19Data Creation – Spontaneous/Conversational Speech (1)

In spontaneous (casual) speech or conversational (dialogue) speech

More spontaneous speech phenomena (phonetically and acoustically)

Richer spoken language phenomena (linguistically)

Man-Man dialogue different from Man-Machine dialogue



Three kinds of scenarios should be considered Spontaneous (Casual):

Giving speech or talking about something Man-Machine conversation:

Leaving a voice message: leaving a voice message is not a true conversation mode, but it is important in telephony speech.

Man-Machine dialogue: given a simulated service provider, the speaker can inquire about whatever he/she wants in the following domains:

– Hotel/ticket reservations– Telephone number inquiries– Stock prices inquiries– Weather inquiries– City route map inquiries– Telephone-based shopping/ordering– ...



Three kinds of scenarios should be considered Spontaneous (Casual): Man-Machine conversation: Man-Man conversation:

Inquiring: given a service provider, the speaker can inquire about whatever he/she wants [recording "by stealth", legal issue very important]

Chatting: chatting could be happening between two recruited speakers given a topic (example topics in the following table)



Topic Sub-topics Topic Sub-Topics

Sports

generalbasketballfootballthe game of gotable tennistennisOlympic games...

Lifestyle

familywork and studyeducationtraffictelecommunicationgoing abroaddrinking and smokinghabits and customslanguage and dialectsfriendship and loveshopping for and buying houses...

Politics&

Economy

generalinternational relationsterrorismeconomic globalization...

Entertainment

generalfilm reviewmovie and teleplaystarsChinese Spring Festival get-togetherscomic dialoguetravel and tourism...

Technologies

computernetworkingcompanycloning...


23

Data Creation - Recording

Example for collecting chatting speech data

PBX - Private Branch (telephone) eXchangePSTN - Public Switched Telephone Network

PSTN

PBX

Server for multi-channel Recording

+ ... Server for multi-channel Recording

+


24

Outline



Data creation

Data transcription




25

Data Transcription – Symbol (1)

Two major groups of information in need of transcription phonetic information

base form - giving canonical pronunciation; surface form - giving the observed (actual) pronunciation.

linguistic information Symbol Set

International Phonetic Alphabet (IPA) - can be used to accurately represent the actual pronunciation of the utterance, but not machine readable;

Speech Assessment Methods Phonetic Alphabet (SAMPA) – machine readable;

For Chinese, machine-readable IPA symbols adapted for Chinese languages from SAMPA, and named as SAMPA-C.

In SAMPA-C, there are :- 23 phonologic consonants, 9 phonologic vowels, and 10 kinds of sound change marks.


26Data Transcription - Symbol (2) - Chinese C/V in SAMPA-C

Consonants Vowels

Pinyin [IPA] SAMPA-C Pinyin [IPA] SAMPA-C Pinyin [IPA] SAMPA-C

b [p] p z [ts] ts a [ ] A

p [pH] p_h c [tsH] ts_h o [o] o

m [m] m s [s] s e [ ] 7

f [f] f zh [t©] ts` i [i] i

d [t] t ch [t©H] ts_h` u [u] u

t [tH] t_h sh [©] s` ü [y] y

n [n] n r [¸] z` ii [¡] i\

(a)n [n] _n j [t»] ts\ iii [Ÿ] i`

l [l] l q [t»H] ts\_h er [Ä] @`

g [k] k x [»]

k [kH] k_h ? ?

h [x] X

ng [Ð] N

§

°


27Data Transcription - Symbol (3) - Phonological Vowels in SAMPA-C

PhonemeAllophone

(IPA)SAMPA-C

PhonemeAllophone

(IPA)SAMPA-CPinyin IPA Pinyin IPA

a A a a i I I i

A a_" I I

A J j

Q E ï Ÿ i`

P { ¡ i\

o o O o u u U U

U U u u

U u W w

È @ Ʋ v\

e 7 ü Y Y y

e e á H

E E_r er Ä Ä @`

È @

° °

§ ï


28

Data Transcription - Symbol (4) - Sound Changes in SAMPA-C

Classifications SAMPA-CExamples

IPA SAMPA-C Explanation

鼻化 Nasalized ~ a‹ A~ 'a' is nasalized

央化 Centralized _" Eƒ e_" 'e' is centeralized

清化 Voiceless _u n% n_u 'n' is voiceless

浊化 Voiced _v dŒ d_v 'd' is voiced

圆唇化 More Rounded _O f_O 'f' is more rounded

成音节 Syllabic = M= 'M' is syllabic

喉化 Pharyngealized _?\ t/ A_?\ 'A' is pharyngealized

送气 Aspirated/Breathy _h a! a_h 'a' is more breathy

增音 Inserted (+) (+N) 'N' is inserted

减音 Deleted (-) (i-) 'i' is deleted.

1m

wf


29

Data Transcription – Transcription Layer (1)

Should cover :- the base form and the surface form, as well as some linguistic and non-linguistic information.


30


Transcription Layers include :- Chinese character layer: sentences in Chinese

character (without word boundary information), or in Chinese word (with word boundary information). The Chinese character could be either traditional or simplified.

Non-Chinese words should also be labeled, for example, they can be enclosed in {}.

Additionally, paralinguistic and non-linguistic phenomena could also be transcribed (See Table 7 for more details).

Here is an example.– {hi} 你怎么也 LA< 也来了 LA> (w/o word boundary info)or– {hi} 你 / 怎么 / 也 / LA< 也 / 来 / 了 LA> (w word boundary info)


31


Transcription Layers include :- Chinese character layer Canonical Chinese Pinyin layer: pronunciation is

given in canonical Chinese Pinyin (or Chinese syllable). Alternatively in Chinese IF. The Pinyin (or Final) here can be either toned or toneless. Similarly, non-Chinese words, paralinguistic phenomena and

non-linguistic phenomena could also be transcribed. The example is as follows:

– {hi} ni3 zen3 me0 ye2 LA< ye2 lai2 la0 LA> (toned Pinyins)or– {hi} n i3 z en3 m e0 ie2 LA< ie2 l ai2 l a0 LA> (toned IFs)


32


Transcription Layers include :- Chinese character layer Canonical Chinese Pinyin layer Surface form Chinese IF layer: pronunciation in Chinese IF

roughly. To be more accurate, a super set of the canonical Chinese IF set should be defined, for example the generalized IF (GIF) set. For Chinese, the tones are often changed in spoken language,

known as tone-sandhi. There are some rules for tone-sandhi.– For example, 33 => 23

Depending on tone of following character, tone for character " 一 /yi1/", " 七 /qi1/", " 八 /ba1/", or " 不 /bu4/" will also change, which is different from the ordinary tone change (quasi-canonical tone).– For example " 一个 /yi2 ge4/".

To reflect such tone-sandhi rules, following the Pinyin of such a character is a two-digit tone string where the first digit is the original canonical tone while the second the tone-sandhi.– For example, " 一个 /yi12 ge4/".


33


Transcription Layers include :- Chinese character layer Canonical Chinese Pinyin layer Surface form Chinese IF layer Surface form SAMPA-C layer: Alternatively, the

SAMPA-C symbols can be used to provide more accurate surface form transcription (the observed tone is mostly attached to the SAMPA-C sequence of each Final).


34


An example on how to transcribe in the Semi-Syllable layer for Wu-Dialect Chinese corpus

The Surface Form Semi-Syllable Tier. The initials and finals are in either the Putonghua IF set or in the Wu Dialect IF set. The observed tone is attached to the Final. Special symbols are:

Prefix “+”: an insertion of the following IF; Prefix “-”: an deletion of the following IF; Prefix “#”: indicating a change of the following IF caused by Wu dialect; Prefix “*”: indicating a change of the following IF caused by

mispronunciation (mistake); Postfix “_v”: indicating a voiced consonant due to the co-articulation,

which is different from Wu Dialect specific voiced consonants. Retroflexed “_r”: indicating the tongue with retroflexed gesture,

which is transcribed as “hua_r1”. Paralinguistic phenomena and non-linguistic phenomena. Tone-sandhi.


35


Transcription Layers include :- Chinese character layer Canonical Chinese Pinyin layer Surface form Chinese IF layer Surface form SAMPA-C layer Miscellaneous layer :-

Non-speech information is provided in this layer. Those spontaneous phonetic phenomena and spoken

language phenomena (paralinguistic or non-linguistic) are also given in this layer.


36


1

Para-linguistic

Lengthening ( 拉长 ) LE< LE>

2 Breathing ( 吸气、喘气 ) BR< BR>

3 Laughing ( 笑 ) LA< LA>

4 Crying ( 哭 ) CR< CR>

5 Coughing ( 咳嗽 ) CO< CO>

6 Disfluency ( 不联贯 ) DS< DS>

7 Error ( 口误 ) ER< ER>

8 Silence (long) ( 长时间静音 ) SI< SI>

9 Murmur/Uncertain segment ( 不清发音 ) UC< UC>

10 Modal/Exclamation ( 语其次 / 感叹词 ) MO< MO>

11 Smack ( 咂嘴音 ) SM< SM>

12 Non-Chinese ( 非汉语 ) NC< NC>

13 Sniffle ( 吸鼻子 ) SN< SN>

14 Yawn ( 打哈欠 ) YA< YA>

15 Overlap ( 重叠发音 ) OV< OV>

16 Interjection ( 插话 ) IN< IN>

17 Deglutition ( 吞咽音 ) DE< DE>

18 Hawk ( 清嗓子 ) HA< HA>

19 Sneezing ( 打喷嚏 ) SE< SE>

20 Filled Pause ( 填充停顿 ) FP< FP>

21 Trill ( 颤音 ) TR< TR>

22 Whisper ( 耳语 ) WH< WH>

23

Non-linguistics Noise ( 噪音 ) NS< NS>

24 Steady Noise ( 平稳噪音 ) TN< TN>

25 Beep ( 电话忙音 ) BP< BP>

XX<: start point

XX>: end point


37

Data Transcription – Transcription Tools (1)

Tool NameItem Praat SFS X-Waves+

URL http://www.fon.hum.uva.nl/praat/

http://www.phon.ucl.ac.uk/resource/sfs/

Platform PC, Mac, WS PC PC, WS

OS Win, MacOS, Unix, Linux Win, Unix, Linux Unix, Linux (X-win)

Price Free Free

Supported file formats

WAV, RIFF, AU, Binary raw, ASCII, etc

WAV, AU, AIFF, ILS, HTK, etc

AU, Binary Raw, ASCII, etc

Analysis Pitch track, Epoch, Pitch modification, Spectrum, Formant, LPC, PSOLA LPC, Wavelet, Cepstrum, Excitation, Cochlea-gram, Vocal track analysis, ...

Re-sampling and speed/pitch changing, Fundamental frequency estimation (from SP or from LX), Spectrographic analysis, Formant frequency estimation & formant synthesis, ...

Pitch track, Epoch, Pitch modification, Spectrum, Format, LPC, etc

Labelling layers Multiple Single Multiple

http://www.fon.hum.uva.nl/praat/








38

Data Transcription – Transcription Tools (2)


39

Outline



Data creation

Data transcription




40

Learning from Corpora

Databases exist not only for acoustic model training and system assessment.

Knowledge can be learned from the speech corpus, especially for the spontaneous speech corpus. What can be learned from the corpus includes the

phonetic knowledge that can be used to improve the acoustic model, and

The linguistic knowledge that can be used for the natural language processing and understanding.


41

Learning from Corpora - Knowledge for PM (1)

In spontaneous speech, two kinds of differences between canonical IFs and their surface forms if the deletion and insertion are not considered. Sound change from one IF to a SAMPA-C sequence close to its

canonical IF (forming a part of the GIF set); Phone change directly from one IF to another quite different IF

or GIF. Definition of the GIF set: find all possible SAMPA-C

sequences of all IFs in the database. Example:

By searching in the CASS corpus, we initially obtain a GIF set containing over 140 possible SAMPA-C sequences (pronunciations) of IFs;

Some of them are least frequently observed sound variability forms therefore they are merged into the most similar canonical IFs;

Finally we have 86 GIFs.


42


IF (Pinyin)

SAMPA-C Comments

z /ts/ Canonical

z /ts_v/ Voiced

z /ts`/ Changed to 'zh'

z /ts`_v/ Changed to voiced 'zh'

e /7/ Canonical

e /7`/ Retroflexed, or changed to 'er'

e /@/ Changed to /@/ (a GIF)


43


Probabilistic GIFs output probability given an IF, i.e., P(GIF | IF)

GIF N-Grams, including unigram P(GIF), bigram P(GIF2 | GIF1) and trigram P(GIF3|GIF1,GIF2),

For spontaneous ASR, a multi-pronunciation lexicon is necessary to reflect the pronunciation variation. Can be learned form the database: P(GIF1, GIF2 | Syllable).

Such a lexicon is called a probabilistic multi-entry syllable-to-GIF lexicon.


44


Syllable (Pinyin)

Initial (SAMPA-C)

Final (SAMPA-C)

Output Probability

chang ts`_h AN 0.7850

chang ts`_h_v AN 0.1215

chang ts`_v AN 0.0280

chang <del> AN 0.0187

chang z` AN 0.0187

chang <del> iAN 0.0093

chang ts_h AN 0.0093

chang ts`_h UN 0.0093


45

Probabilistic Pronunciation Modeling Theory

Recognizer goal K*=argmaxK P(K|A) = argmaxK P(A|K) P(K)

Applying independent assumption P(A|K) = n P(an|kn)

Pronunciation modeling part – via introducing surface s P(a|k) = s P(a|k,s) P(s|k)

Symbols a: Acoustic signal, k: IF, s: GIF A, K, S: corresponding string

AM

LM

Refined AMOutput Prob.



47

Learning from Corpora - Knowledge for NLP (1)

Spoken Language Phenomena :- The courtesy items / sentences inessential for semantic analysis.

C: 喂，你好，请问是中关村航空客运代理处么 ? Hi, hello, could you tell me ...?

Simple repetitions because of the pondering or thinking when speaking.

C: 我问一下那个四月三十呃四月三十号北京到 ... ... 30th April ... 30th April ...

Semantic repetitions for emphasis.C: 请问那个周四就是四月三十号北京到 ... ... Thursday ... 30th April ...

Speech corrections or repairs.C: 呃，那个什么星期三，呃，星期四的去南京的机票还有吗？ ... Wednesday ... Thursday ...


48


Spoken Language Phenomena (2) :- Ellipsis in the context.

C: 我问一下那个四月三十呃四月三十号到福州的机票最后一班还有吗 ? (Asking if there are tickets available for the last flight from the departure city to the arrival city, on the departure date.)

O: 只有一班有。 ("Only one flight with ticket available")C: 那个那五月一号的下午三点有么？ (How about the flight at a departure

time on another departure date?) Constituents appearing in any order (as long as sufficient

information is given) C: ... 五点二十五国航飞深圳的 ... (Time, airline code, location and some

other items can appear in any order) Constituents appearing in reverse order

C: ... 的机票多少钱得 ? How much cost (Normal order: “ 得” “多少钱” )


49


Spoken Language Phenomena (3) :- Parol (verbal idioms) or unnecessary terms.

C: 那，那个八点二十那个是去什么机场的呀？ (“ 那” /” 那个” is somewhat similar to “uhm”)

And long sentences with all required information. Or additional explanation following the previous sentence.

C: 哎，您好，这样那个我订一张 (one) 那个明天 (tomorrow) 下午(afternoon) 五点 (5 o’clock) 四十五 (45) 去北京 (from Beijing) 到上海 (to Shanghai) 的那个机票 (ticket) 的。

O: ...C: 您给我看一下有没有 (is there) 北京到湛江 (from Beijing to Zhanjiang)

的 (ticket) ？二十九 (29th) 、三十号 (30th)?


50


Based on the above information, 4 types of parsing rules are proposed for robust spoken language understanding.


51


Robust rule type 0: some crucial information are not allowed to be

mixed/inserted by other terms, e.g. personal ID no. ( 身份证号码 )

up-tying ( 苛刻型 ) rules

sub_id_card_head * ato_0to9_yao ato_0to9_yao ... ato_0to9_yao

(15 identical terms)

id_card_no sub_id_card_head

id_card_no * sub_id_card_head ato_0to9_yao ato_0to9_yao ato_0to9_yao


52


Robust rule type 1: contrarily, some utterances are allowed to be

inserted with recognized garbage/fillers, or some other meaningless parts, e.g. “ 星期啊三嗯星期四”

by-passing ( 跳跃型 ) rules

sub_week_day ato_week ato_1to6

sub_week_day_list sub_week_day

sub_week_day_list sub_week_day sub_week_day_list


53


Robust rule type 2: some information, such as time, city names, plane

types etc., can appear in any order up-messing ( 无序型 ) rules

timeloc_info_cond @ info_date_time_cond info_fromto

plane_info @ mat_airline_code mat_aircraft_type

flight_info_cond @ timeloc_info_cond plane_info


54


Robust rule type 3: some phrases/constituents, such as “是…吗” , may

contain other constituents in between, e.g. “是到北京吗” , “ 是两张吗”

over-crossing ( 交叉型 ) rules

mark_q_is tag_is_or_not

mark_q_is tag_is tag_question_mark

mark_q_is tag_is_q

confirm_request # mark_q_is confirm_c


55Learning from Corpora - Knowledge for dialectal ASR (1)

Dialect-related Chinese Syllable Mapping (CSM).Word-independent CSM: e.g. in Southern

Chinese, Initial mappings include zhz, chc, shs, nl, and so on, and Final mappings include engen, ingin, and so on;

Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in word ' 中国 (China)' but only the tone is changed in word ' 过去 (past)'.



The CSM could be N→1, 1→N, or crossed.

kei kuo kui...

Standard Chinese Syllabe Set

Chuan Dialect

ke

上[课] [克]服

kuo kui

[扩]大 [魁]梧



The CSM is not exact. For any mapping AB, it is mostly that the resulted pronunciation is not B exactly, but something quite similar to B, more similar to B than to any other syllable.

A

B

B1

B3

B4

B2

Bi is a variation of B, such as :-

nasalization, centralization, voiced,voiceless, rounding, syllabic, pharyngrealization, aspiration



Lexical: a dialect-related lexicon containing two parts :- A common part shared by standard Chinese and most

dialectal Chinese languages (over 50k words), and A dialect-related part (several hundreds).

And in this lexicon: each word has one pinyin string for standard Chinese

pronunciation and a kind of representation for dialectal Chinese pronunciation, and

each of those dialect-related words is corresponding to a word in the common part with the same meaning



For each word in the lexicon Word-dependent CSM syllable

– Add a parallel pinyin string, with word-independent CSM pinyin unchanged

• 中国： Putonghua: zhong guo• added: zhong gui

For each pronunciation representation of a word in the lexicon

Word-independent CSM syllable– Add a parallel initial/final arc

• 中国： { (zh | z) (ong) (g) (uo) } | { (zh | z) (ong) (g) (ui) }

• 精神： (j) (ing | in) (sh | s) (en | eng)



zh

ong

… x

in

…

中心忠心

…

zh

ong

… x

in

…

中心忠心

… z

ing {I*(zh)} {I*(z)}

{*F(in)} {*F(ing)}



Dialectal Chinese SpeechRecognition Framework

Standard ChineseSpeech Recognizer

+

Dialectal ChineseSpeech Recognizer

Dialectal Chinese RelatedKnowledge & Resources



Dialectal Chinese SpeechRecognition Framework

AM AdapterAcoustic Regulator LM Adapter

Pronunciation Modeling(PM) Techniques:

Accents & Spontaneous Speech

Toned-Syllable Mappings:Word-Independent/-Dependent

Dialect-RelatedLexical Entry Replacement Rules

Language Post-ProcessingAlgorithms

Standard ChineseSpeech Recognizer


63

Outline



Data creation

Data transcription




64

Chinese Corpus Consortium (CCC) (1)

Speech and text databases are very important to the research and development in the areas of automatic speech recognition, text-to-speech, natural language processing and etc.

And there have been several consortiums or associations devoting themselves in collecting and distributing such data resources, such as the Linguistic Data Consortium (LDC 1992), the European Language Resources Association (ELRA 1995), and The Gengo Shigen Kyouyuukikou Language Resource

Consortium (GSK 1999). We also need such a consortium or association for the

Chinese language.


65


After communications and discussions for a long time, now CCC is coming out soon.

CCC will provide corpora for, but not limited to :- Chinese ASR, TTS, NLP, perception analysis, phonetics analysis, linguistic analysis, and so on.

Corpora could be :- speech and text, read and spontaneous, wideband and narrowband, Putonghua, dialectal Chinese and Chinese dialects, clean and with noise, and of any other kinds which are helpful to the foresaid purposes.


66


CCC in more international: Mainland China, Hong Kong, Singapore, Japan, USA, [Taiwan], [Europe].

CCC is open to the speech and language research and development society.

More details can be found in the CCC web site (http://www.d-Ear.com/CCC/).

http://www.d-ear.com/CCC/

http://www.d-ear.com/CCC/

Thomas Fang Zheng

[email protected]

[email protected]

Q/A

mailto:[email protected]

mailto:[email protected]

making full use of chinese speech corpora thomas fang zheng center of speech technology state key...

Documents