specification_of_text_and_speech_corpus_for_indonesian_lvcsr

8/14/2019 Specification_of_Text_and_Speech_Corpus_for_Indonesian_LVCSR


Text and Speech Corpus for Indonesian LVCSR

1. Background

From August 2005 to April 2006, a joint research team consisting of TELKOM R&D Center (TELKOMRisTI) Indonesia as the leader, Telkom School of Engineering (STT Telkom) Indonesia, and Advanced Telecommunication Research (ATR) Japan conducted research on the development of a text and speech corpus for Indonesian Large Vocabulary Continuous Speech Recognition (LVCSR). The project was funded by the 2005 round of the APT HRD Program for Exchange of ICT Researchers and Engineers. The results of the project are: a text source of 500 news domain sentences; a lexicon dictionary of [?] words; two sentence sets, consisting of 2,500 application domain sentences and [?],186 tri-phone balanced news domain sentences; and 84,000 spoken sentences (utterances) for clean and telephone speech. TELKOM R&D Center (TELKOMRisTI) Indonesia, as the project leader, owns all of the results.

2. Text Corpus

The text corpus covers two domains: 1) application domain (five existing, running applications in TELKOMRisTI: Directory Services for Hearing and Speaking impaired telecommunication service, Tele-home security, Billing information Services, Reservation services, and Status tracking feature of e-[...] place names), where the lexicon was developed by an Indonesian language expert. The tri-phone balanced sentence set ([?],186 sentences) is extracted from the 500 news domain text source. This set covers [?],668 distinct tri-phones.

Afterward, the sentence sets from both domains are combined and distributed into 100 sentence lists. The 2,500 application domain sentences are divided into 100 sets of 100 sentences each with an overlap ratio of [?]5%, while the [?],186 news domain sentences are divided into 100 sets of 110 sentences each with an overlap ratio of [?]0%. Thus, each sentence list contains 210 sentences (100 application domain sentences and 110 news domain sentences).

These sentence lists will be read by 400 speakers.
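The arithmetic behind the application-domain lists can be sketched as follows. This is only an illustration, reading the figures above as 2,500 sentences spread over 100 lists of 100 sentences each. The round-robin dealing scheme and the names `build_lists` and `repeated_share` are assumptions made for this sketch; the document does not say how the lists were actually assembled or how "overlap ratio" was defined.

```python
from collections import Counter


def build_lists(num_sentences, num_lists, list_size):
    """Deal sentence indices into lists in order, wrapping around the pool.

    Hypothetical scheme for illustration only.
    """
    lists = []
    cursor = 0
    for _ in range(num_lists):
        lists.append([(cursor + k) % num_sentences for k in range(list_size)])
        cursor += list_size
    return lists


def repeated_share(num_sentences, num_lists, list_size):
    """Share of list slots that repeat a sentence already used in some list."""
    slots = num_lists * list_size
    return 1 - num_sentences / slots


# Application domain: 2,500 sentences, 100 lists, 100 sentences per list.
app_lists = build_lists(2500, 100, 100)
appearances = Counter(i for lst in app_lists for i in lst)

print(min(appearances.values()), max(appearances.values()))  # 4 4
print(repeated_share(2500, 100, 100))                        # 0.75

# Each combined list holds 100 application + 110 news sentences.
sentences_per_list = 100 + 110
print(sentences_per_list)  # 210
```

Under this scheme every application sentence lands in exactly four lists, so three quarters of the 10,000 list slots are repeats. That is one possible reading of the overlap ratio, not a confirmed one.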

3. Speech Corpus

3.1 Speakers

There are 400 speakers distributed by gender (201 males and 199 females), age (20% for 18-2[?] years old, [?]0% for 2[?]-[?]5 years old, [?]0% for [?]6-50 years old, and 10% for 51-60 years old), and four major western Indonesia accents (16.5% Batak, 28.5% Jakarta, 21% Javanese, and [?].5% Sundanese).

3.2 Soundproof room

The specifications of the soundproof room and recording equipment are as follows:

1. Soundproof parameter:

a) Sound insulation level: [?]0 dB

b) Background noise level: 22 dB

c) Reverberation time: 0.15 second

2. Soundproof design

a) Length: 280 cm

b) Width: 220 cm

c) Height: 2[?]0 cm

d) Thickness: 26.5 [?] 2.5 cm

3. Recording equipment

The recording equipment was configured to enable the recording of both clean speech (microphone source) and telephone speech. By following strict ATR requirements, it is expected that noise, mainly generated by electricity, will be reduced as much as possible so that the recordings are low-noise.

3.3 Speech Corpus Size

Each speaker uttered 210 sentences. Thus, there are 400 x 210 = 84,000 utterances. Recorded as mono-channel WAV files at a 48 kHz sampling frequency and 16-bit quantization, the corpus is around 26 gigabytes with a duration of around 75 hours.
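The size and duration figures can be cross-checked with a short calculation. The sketch below assumes uncompressed PCM audio at 48 kHz, 16 bits, one channel, reads "26 Giga Bytes" as 26 x 10^9 bytes, and ignores WAV header overhead. The constant and function names are illustrative only and do not come from the project.

```python
SAMPLE_RATE_HZ = 48_000        # sampling frequency
BITS_PER_SAMPLE = 16           # quantization level
CHANNELS = 1                   # mono
CORPUS_BYTES = 26 * 10**9      # assumption: decimal gigabytes
UTTERANCES = 400 * 210         # speakers x sentences per speaker


def bytes_per_second(rate_hz, bits, channels):
    """Data rate of uncompressed PCM audio."""
    return rate_hz * (bits // 8) * channels


def duration_hours(total_bytes, rate_hz, bits, channels):
    """Audio duration implied by a payload size, header overhead ignored."""
    return total_bytes / bytes_per_second(rate_hz, bits, channels) / 3600


rate = bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
hours = duration_hours(CORPUS_BYTES, SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
avg_seconds = hours * 3600 / UTTERANCES

print(rate)                   # 96000 bytes per second
print(round(hours, 1))        # roughly 75 hours
print(round(avg_seconds, 1))  # average utterance length in seconds
```

Under these assumptions, 26 gigabytes corresponds to about 75 hours of audio, or a little over three seconds per utterance on average.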
