Language Identification through GMMs and Language Modelling

Upload: harsh-hemani

Post on 03-Jun-2018


  • 8/12/2019 Language Identification through GMMs and language modelling

    1/30

Automatic Language Identification of Telephone Speech

Marc A. Zissman

Lincoln Laboratory has investigated the development of a system that can automatically identify the language of a speech utterance. To perform the task of automatic language identification, we have experimented with four approaches: Gaussian mixture model classification; single-language phone recognition followed by language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition. These four approaches, which span a wide range of training requirements and levels of recognition complexity, were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. Our results show that the three systems with phone recognizers achieved higher performance than the simpler Gaussian mixture classifier. The top-performing system was parallel PRLM, which performed two-language, closed-set, forced-choice classification with a 2% error rate for 45-sec utterances and a 5% error rate for 10-sec utterances. For eleven-language classification, parallel PRLM exhibited an 11% error rate for 45-sec utterances and a 21% error rate for 10-sec utterances.

SPEECH IS ONE OF THE MOST natural and efficient means for communicating information among a group of people. Because speech communication is ubiquitous, researchers have made significant efforts to create methods for automatically extracting the fundamental information that a speech utterance conveys. Figure 1 illustrates how a set of four prototypical speech-information extraction modules might be configured. These modules include speech recognition (or word transcription), topic identification, language identification, and speaker recognition. What is usually termed automatic speech recognition, i.e., computer analysis of the speech waveform for the purpose of producing a written transcription, has received significant attention over the past three decades. As previously reported in this journal, the Speech Systems Technology group at Lincoln Laboratory has developed techniques for obtaining such word-by-word transcriptions of speech utterances [1]. Often, however, we would like to consider the word-by-word transcription as merely an intermediate representation in a more elaborate computer-based speech-understanding system, a system that understands the semantic meaning of what is being spoken. While wide-domain, general-purpose speech-understanding systems do not yet exist, we have begun to address simpler, more narrowly defined speech-understanding problems. For example, we have reported on research and development of systems that can automatically determine the topic of a speech conversation [2].

In addition to the word transcription and the semantic meaning of a speech utterance, there are other pieces of information present in an utterance that we might like to extract automatically. For example, we have built systems that can recognize and verify the

VOLUME 8, NUMBER 2, 1995 · THE LINCOLN LABORATORY JOURNAL · 115


identity of the speaker from his or her voice. Such a system is discussed in a companion article by Douglas A. Reynolds in this issue [3]. A related task, identifying the language of a speech utterance, has been under investigation for the past four years. It is automatic language identification (language ID) that is the primary focus of this paper.

FIGURE 1. A set of four speech-information extraction modules applied to a speech utterance spoken over a telephone channel (speech recognition: word transcription "Meet me at noon"; topic identification: appointment scheduling; language identification: English; speaker recognition: speaker identity James Wilson). Speech utterances convey many different types of information that can be detected automatically, including the words that were spoken, the topic of the utterance, the language that was spoken, and the identity of the speaker. This article describes the module for automatic language identification.

Language-ID applications fall into two main categories: preprocessing for machine systems and preprocessing for human listeners. In an example of preprocessing for machine systems suggested by T.J. Hazen and V.W. Zue, the hotel lobby or international airport of the future might employ a multilingual voice-controlled travel-information retrieval system [4], as shown in Figure 2. If no mode of input other than speech is used, then the system must be capable of determining the language of the speech commands either while it is recognizing the commands or before it has recognized the commands. Determining the language during recognition would require many speech recognizers, one for each language, running in parallel. Because tens or even hundreds of input languages would need to be supported, the cost of the required real-time hardware might prove prohibitive. Alternatively, a language-ID system could be run in advance of the speech recognizer. In this case, the language-ID system would quickly list the most likely languages of the speech commands, after which the few most appropriate language-dependent speech-recognition models could be loaded and run on the available hardware. A final language-ID determination would be made only after speech recognition was complete.

Figure 3 illustrates an example of the second category of language-ID applications: preprocessing for human listeners. In this case, language ID is used to route an incoming telephone call to a human switchboard operator fluent in the corresponding language. Such scenarios are already occurring today: for example, AT&T offers the Language Line interpreter service to, among others, police departments handling emergency calls. When a caller to Language Line does not speak English, a human operator must attempt to route the call to the appropriate interpreter. Much of the process is trial and error (for example, recordings of greetings in various languages can be used) and can require connections to several human interpreters before the appropriate person is found. As recently reported by Y.K. Muthusamy [5], when callers to Language Line do not speak English, the delay in finding a suitable interpreter can be on the order of minutes, which could prove devastating in an emergency. Thus a language-ID system that can quickly determine the most likely languages of the incoming speech might be used to reduce the time required to find an appropriate interpreter by one or two orders of magnitude.

FIGURE 2. A language-identification (ID) system as a front end to a set of real-time speech recognizers. The language-ID system outputs its three best guesses of the language of the spoken message (in this case, German, Dutch, and English). Real-time speech recognizers are loaded with models for these three languages and make the final language-ID decision (in this case, Dutch) after decoding the speech utterance.

FIGURE 3. A language-ID system as a front end to a multilingual group of directory-assistance or emergency operators. The language-ID system routes an incoming call to a switchboard operator fluent in the corresponding language.

Although research and development of automatic language-ID systems have been in progress for the past twenty years, publications have been sparse. The next section, entitled "Historical Background," begins with a brief discussion of previous work. This discussion does not provide a quantitative report on the performance of each of these systems because, until recently, a standard multi-language evaluation corpus that could allow a fair comparison among the systems did not exist. The section entitled "Language-ID Cues" describes several cues that humans and machines use for identifying languages. Knowledge of these cues, which are key elements that distinguish one language from another, is useful in developing specific automatic algorithms. Next, the section entitled "Algorithms" describes each of the four language-ID approaches that are the main focus of this work: Gaussian mixture model (GMM) classification [6-8], single-language phone recognition followed by language modeling (PRLM) [9-11], parallel PRLM [9], and language-dependent parallel phone recognition (PPR) [12-13]. (A phone is the realization of an acoustic-phonetic unit or segment; a phoneme is an underlying mental representation of a phonological unit in a language [14]. See the glossary on the following page.) Because the four approaches have different levels of computational complexity and training-data requirements, our goal was to study the performance of the systems while considering the ease with which they could be trained and run.

The section entitled "Speech Corpus" reviews the organization of the Oregon Graduate Institute Multi-Language Telephone Speech (OGI-TS) Corpus [15],


which has become a standard for evaluating language-ID systems. We used the OGI-TS corpus to evaluate our four processing systems. At the start of our work, the corpus comprised speech from approximately ninety speakers in each of ten languages. Since then, the numbers of speakers and languages have both grown. The section entitled "Experiments and Results" reports the language-ID performance of the four systems that we tested on the OGI-TS corpus, and the section entitled "Additional Experiments" presents results of subsequent work that sought to improve the best system of the four approaches. Finally, the section entitled "Summary" reviews the results of our work, discusses the implications of this work, and suggests future research directions.

Historical Background

Research in automatic language ID from speech has a history extending back at least twenty years. Until recently, a comparison of the performance of these systems was difficult because few of the algorithms had been evaluated on common corpora. Thus what follows is a brief description of some representative systems without much indication of their quantitative performance. (For further detail, see Muthusamy's recent review of language-ID systems [5].)

Most language-ID systems operate in two phases: training and recognition. During the training phase, the typical system is presented with examples of speech from a variety of languages. Some systems require only the digitized speech utterances and the corresponding true identities of the languages being spoken. More complicated language-ID systems may require either (1) a phonetic transcription (sequence of symbols representing the sounds spoken), or (2) an orthographic transcription (the text of the words spoken) along with a phonemic transcription dictionary (mapping of words to prototypical pronunciations) for each training utterance. Figure 4 shows an example of a transcribed utterance from the English speech corpus developed by Texas Instruments, Inc. and MIT (TIMIT) [16]. Producing these transcriptions and dictionaries is an expensive and time-consuming process that usually requires a skilled linguist fluent in the language of interest.

    For each language, the training speech is analyzed


    GLOSSARY

cepstrum: the inverse Fourier transform of the log-magnitude spectrum. Feature vectors of cepstral coefficients are used for many speech-processing applications.

decode: convert speech into a phonetic, phonemic, or orthographic transcription.

GMM: Gaussian mixture model; a parameterized probability density function comprising multiple underlying weighted Gaussian densities in which the weights sum to one.

HMM: hidden Markov model; the most common approach for modeling speech production for the purposes of speech recognition.

language ID: the automatic identification (by a machine) of the language of a speech utterance.

mel scale: a scale that has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. This scaling is motivated by the construction and function of the periphery of the human auditory system.

n-gram: a sequence of n symbols.

orthographic transcription: a word-by-word transcript of speech.

phone: a realization of acoustic-phonetic units or segments. A phone is the actual sound produced when a speaker means to produce a phoneme. For example, the phones that comprise the word celebrate might be /s eh l ax bcl b r ey q/.

phoneme: an underlying mental representation of a phonological unit in a language. For example, the phonemes that comprise the word celebrate are /s eh l ix b r ey t/. (Note: There are about forty phonemes in the English language.)

phonemic transcription (or expansion): a mapping of a word to its prototypical pronunciation or pronunciations, as would be found in a dictionary. Unfortunately, because of factors such as the speaker's accent, educational level, and age, as well as the context of the communication, words are not always pronounced according to their dictionary expansions.

phonetic transcription: the sequence of phones spoken in a speech utterance.

phonotactics: the language-specific rules that govern which phonemes may follow other phonemes.

spectrum: the representation of a signal in the frequency domain. In speech processing, spectra are often calculated with a short-time Fourier transform every 10 msec. A 20-msec window is used so that the speech signal is relatively stationary within the 20-msec window.

tokenize: convert an input waveform into a sequence of phone symbols.

Viterbi decoding: a procedure for finding the hidden sequence of phones (and states within a phone) most likely to have produced the observed sequence of feature vectors.

mu-law: the North American telephone standard for representing a speech waveform with 8-bit codes at an 8-kHz rate. The coding is logarithmic rather than linear to minimize the effect of the quantization error.


Orthographic transcription: "Help celebrate your brother's success."
Phonetic transcription for celebrate: S | EH | L | AX | BCL | B | R | EY | Q
Phonemic (dictionary) transcription for celebrate: S | EH | L | IX | B | R | EY | T

FIGURE 4. An example of a transcribed utterance from the English speech corpus developed by Texas Instruments, Inc. and MIT (TIMIT). The example speech utterance is the sentence "Help celebrate your brother's success." Included are the orthographic transcription of the sentence and phonetic and phonemic transcriptions of the word celebrate. The phonetic codes are self-explanatory except for AX, which is a schwa; BCL, which is the closure silence preceding the B sound; and Q, which is a glottal stop.

and one or more models are produced. These models, which are intended to represent some set of language-dependent fundamental characteristics of the training speech, can then be used in the recognition phase of the language-ID process. During recognition, a new utterance is compared to each of the language-dependent models. The likelihood that the new utterance

was spoken in the same language as the speech used to train each model is computed, and the maximum-likelihood model is found. The language of the speech that was used to train the maximum-likelihood model is then hypothesized as the language of the utterance.

The earliest automatic language-ID systems used


the following procedure: examine training speech (either manually or automatically); extract and store a set of prototypical spectra (each computed from about 10 msec of the training speech) for each language; analyze and compare test speech to the sets of prototypical spectra; and classify the test speech on the basis of the results of the comparison. For example, in a system proposed by R.G. Leonard and G.R. Doddington [17-20], spectral feature vectors extracted from training messages were scanned by the researchers for regions of stability and for regions of very rapid change. Such regions, thought to be indicative of a specific language, were used as exemplars for template-matching the test data. After this initial work, researchers tended to focus on automatic spectral feature extraction, unsupervised training, and maximum-likelihood recognition. D. Cimarusti [21], J.T. Foil [22], F.J. Goodman [23], and M. Sugiyama [24] all built systems that automatically extracted exemplar spectra from the training speech and classified test speech on the basis of its similarity to the training exemplars. In a related approach, L. Riek [6], S. Nakagawa [7], and M.A. Zissman [8] applied Gaussian mixture model classifiers to language identification. Described more fully in the section entitled "Algorithms," Gaussian mixture model classification is (in some sense) a generalization of exemplar extraction and matching.

In an effort to move beyond low-level spectral analysis, Muthusamy [25] built a neural-network-based multi-language segmentation system that was capable of partitioning a speech signal into sequences of seven broad phonetic categories. For each utterance, the category sequences were then converted to 194 features that could be used for language identification.

Whereas the language-ID systems described above perform primarily static classification, other systems have used hidden Markov models (HMMs) [26] to model sequential characteristics of speech production. HMM-based language ID was first proposed by A.S. House and E.P. Neuburg [27], who created a discrete-observation ergodic HMM that took sequences of speech symbols as input and produced a source-language hypothesis as output. The system derived training and test symbol sequences from published

phonetic transcriptions of text. M. Savic [28], Riek [6], Nakagawa [7], and Zissman [8] all applied HMMs to feature vectors derived automatically from the speech signal. In these systems, HMM training was performed on unlabeled training speech, i.e., speech that was not labeled either orthographically or phonetically. Riek and Zissman found that HMM systems trained in this unsupervised manner did not perform as well as some of the static classifiers they had been testing. Nakagawa, however, eventually obtained better performance from his HMM approach than from his static approaches [29]. In related research, K.P. Li and T.J. Edwards [30] segmented incoming speech into six broad acoustic-phonetic classes. Finite-state models were then used to model class-transition probabilities as a function of language. Li has also developed a new language-ID system based on the examination and coding of spectral syllabic features [31].

Recently, language-ID researchers have proposed systems that are trained with multi-language, phonetically labeled corpora. L.F. Lamel and J.-L. Gauvain have found that likelihood scores from language-dependent phone recognizers are capable of discriminating between speech read from English and French texts [32], as did Muthusamy on English-versus-Japanese spontaneous telephone speech [13]. This type of system is covered in the section entitled "Algorithms." In other work, O. Andersen [33] and K.M. Berkling [34] have explored the possibility of finding and using only those phones which best discriminate between language pairs. Although initially such systems were constrained to operate only when phonetically transcribed training speech was available, R.C.P. Tucker [11] and Lamel [35] have utilized single-language phone recognizers to label multilingual training-speech corpora, which have then been used to train language-dependent phone recognizers for language ID. In other research, S. Kadambe [36] has studied the effect of applying a lexical-access module after phone recognition to spot (in some sense) words in the phone sequences.

A related approach has been to use a single-language phone recognizer as a front end to a system that uses phonotactic scores to perform language ID. Phonotactics are the language-dependent set of constraints specifying which phonemes are allowed to follow other phonemes. For example, the German word spiel, which is pronounced /sh p iy l/ and might be spelled shpeel in English, begins with a consonant cluster /sh p/ that is rare in English. (Note: The consonant cluster /sh p/ can occur in English only if one word ends in /sh/ and the next begins with /p/, or in a compound word like flashpoint.) This approach is reminiscent of the work of R.J. D'Amore [37-38], J.C. Schmitt [39], and M. Damashek [40], who used n-gram (i.e., a sequence of n symbols) analysis of text documents to perform language and topic identification and clustering. T.A. Albina [41] extended the same technique to cluster speech utterances according to topic. By tokenizing the speech message, i.e., by converting the input waveform to a sequence of phone symbols, we can use the statistics of the resulting symbol sequences to perform either language or topic identification. Hazen [9], Zissman [10], Tucker [11], and M.A. Lund [42] have all developed such language-ID systems by using single-language front-end phone recognizers. Zissman [10] and Yan [43] have extended this work to multiple single-language front ends, for which there need not be a front end in each language to be identified. Meanwhile, Hazen [4] has pursued a single multi-language front-end phone recognizer. Examples of some of these types of systems are explored more fully below.
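The phonotactic idea described above, scoring a tokenized phone sequence against language-dependent n-gram statistics, can be sketched with a toy bigram model. Everything below is invented for illustration (the phone strings, the add-delta smoothing, and the two miniature "languages"); it is not the recognizers or corpora used in this work.

```python
import math
from collections import Counter

def train_bigram(sequences, smoothing=0.5):
    """Estimate smoothed bigram log probabilities log P(b | a) from phone sequences."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    V = len(vocab)
    def logprob(a, b):
        return math.log((bigrams[(a, b)] + smoothing) /
                        (unigrams[a] + smoothing * V))
    return logprob

def score(logprob, seq):
    """Total log likelihood of a phone sequence under one language's bigram model."""
    padded = ["<s>"] + seq + ["</s>"]
    return sum(logprob(a, b) for a, b in zip(padded, padded[1:]))

# Invented toy training data: English-like vs. German-like phone strings.
english = [["s", "p", "iy", "l"], ["f", "l", "ae", "sh"], ["s", "t", "aa", "p"]]
german = [["sh", "p", "iy", "l"], ["sh", "t", "aa", "t"], ["sh", "p", "aa", "t"]]

lm_en = train_bigram(english)
lm_de = train_bigram(german)

test = ["sh", "p", "iy", "l"]  # "spiel" begins with the /sh p/ cluster
print("English LM:", round(score(lm_en, test), 2))
print("German  LM:", round(score(lm_de, test), 2))
# The German model assigns the higher (less negative) log likelihood,
# because the /sh p/ bigram is common in its training data.
```

The maximum-scoring model's language is hypothesized as the language of the utterance, exactly as in the PRLM systems referenced in the text.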

Prosodic features such as duration, pitch, and stress have also been used to distinguish one language from another automatically. For example, S. Hutchins [44] successfully applied prosodic features in two-language applications, e.g., distinguishing English from Spanish, or English from Japanese.

Finally, within the past year, efforts at a number of sites have focused on the use of continuous-speech-recognition systems for language ID [45]. In these systems, one speech recognizer per language is created during training. Each of these recognizers is then run in parallel during testing. The recognizer yielding the output with the highest likelihood is selected as the winner, and the language used to train that recognizer is the hypothesized language of the utterance. Such systems promise high-quality language-ID results because they use higher-level knowledge (words and


word sequences) rather than lower-level knowledge (phones and phone sequences) to make the language-ID decision. Furthermore, a transcription of the utterance can be output as a by-product of language ID. On the other hand, continuous-speech-recognition systems would require many hours of labeled training data in each language to be identified, and they are also the most computationally complex of the algorithms proposed.

Language-ID Cues

Humans and machines can use a variety of cues to distinguish one language from another. The reader is referred to the linguistics literature [46, 47, 14] for in-depth discussions of how specific languages differ from one another, and to Muthusamy [48], who has measured how well humans can perform language ID. We know that the following characteristics differ from language to language:

Phonology. Phone/phoneme sets are different from one language to another, even though many languages share a common subset of phones/phonemes. Phone/phoneme frequencies can also differ; i.e., a phone can occur in two languages, but it might be more frequent in one language than the other. In addition, phonotactics, i.e., the rules governing the sequences of allowable phones/phonemes, can be different, as can the prosodics.

Morphology. Word roots and lexicons are usually different in different languages. Each language has its own vocabulary, and its own manner of forming words.

Syntax. The sentence patterns are different from one language to another. Even when two languages share a word, e.g., the word bin in English (a container) and German (a form of the verb "to be"), the sets of words that precede and follow the word will be different.

Prosody. Duration characteristics, pitch contours, and stress patterns are different from one language to another.

At present, all automatic language-ID systems of which the author is aware take advantage of one or more of these sets of language characteristics in discriminating one language from another.


Algorithms

The algorithms described in the section entitled "Historical Background" have various levels of computational complexity and different requirements for the training data. Our primary goal in this work was to evaluate a few of these techniques in a consistent manner to compare their language-ID capabilities. We tested four language-ID algorithms: (1) Gaussian mixture modeling, (2) single-language phone recognition followed by language modeling, (3) parallel phone recognition followed by language modeling, and (4) parallel phone recognition. Each of these systems is described in detail in this section. The descriptions are preceded by a discussion of the conversion of speech to feature vectors, which is a process common to all four algorithms.

[Block diagram: digitized telephone speech → speech-activity detection → preemphasis → discrete Fourier transform → mel-scale weighting → log → inverse cosine transform → cepstral feature vectors → time differencing → delta-cepstral feature vectors, with relative spectral filtering applied.]

FIGURE 5. The acoustic preprocessing used to convert telephone speech into feature vectors. Digitized telephone speech is passed through a mel-scale filter bank (indicated by the dashed lines above), from which cepstral and delta-cepstral feature vectors are created. A speech-activity detector automatically removes silence from the speech, and relative spectral methodology helps remove telephone-channel effects.

Converting Telephone Speech into Feature Vectors

Before either training or recognition can be performed on a speech utterance, the speech waveforms must be converted from their digital waveform representations (usually 16-bit linear or 8-bit mu-law encodings) to one or more streams of feature vectors. Figure 5 shows a diagram of the mel-scale filter bank that performs this conversion. By using a 20-msec window, the acoustic preprocessing produces one mel-scale cepstral observation vector every 10 msec. This type of front-end processing was studied by S.B. Davis and P. Mermelstein [49]; the version used at Lincoln Laboratory for speech recognition, speaker ID, and language ID was implemented by D.B. Paul [1]. For language ID, only the lowest thirteen coefficients of the mel-cepstrum are calculated (c0 through c12), thereby retaining information that relates to the shape of the speaker's vocal tract while largely ignoring the excitation signal. The lowest cepstral coefficient, c0, is ignored because it contains only overall energy-level information. The next twelve coefficients, c1 through c12, form the cepstral feature vector. Because the mel-scale cepstrum is a relatively orthogonal feature set, in that its coefficients tend not to be linearly related, it has been used widely for many types of digital speech processing.
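The front end described above (a 20-msec window every 10 msec, yielding thirteen mel-cepstral coefficients c0 through c12 per frame) can be sketched in numpy as follows. This is a minimal illustration, not the Lincoln Laboratory implementation; the Hamming window, the 24-filter bank, and the flooring constant are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1000 Hz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, applied to rfft bins."""
    pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_cepstra(signal, sr=8000, win=0.020, hop=0.010, n_filters=24, n_ceps=13):
    """One 13-coefficient mel-cepstral vector (c0..c12) every 10 msec."""
    wlen, step = int(sr * win), int(sr * hop)
    frames = [signal[i:i + wlen] * np.hamming(wlen)
              for i in range(0, len(signal) - wlen + 1, step)]
    fb = mel_filterbank(n_filters, wlen, sr)
    ceps = []
    for fr in frames:
        power = np.abs(np.fft.rfft(fr)) ** 2          # short-time power spectrum
        logmel = np.log(fb @ power + 1e-10)           # log mel-filterbank energies
        # Inverse cosine transform (DCT-II) gives the cepstrum; keep c0..c12.
        n = np.arange(n_filters)
        dct = np.array([np.sum(logmel * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                        for k in range(n_ceps)])
        ceps.append(dct)
    return np.array(ceps)   # shape: (n_frames, 13); c0 carries overall energy

sig = np.random.randn(8000)   # one second of noise at the 8-kHz telephone rate
C = mel_cepstra(sig)
print(C.shape)                # -> (99, 13)
```

As in the text, a language-ID system would discard column 0 (c0, overall energy) and keep c1 through c12 as the cepstral feature vector.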

Difference cepstra are also computed and modeled as feature vectors in order to determine cepstral transition information. A vector of cepstral differences (delta-c0 through delta-c12), called the delta-cepstral vector, is computed for every frame. The vector elements are

delta-c_i(t) = c_i(t + 1) − c_i(t − 1).

Note that delta-c0 is typically included as part of the delta-cepstral vector, thus making thirteen coefficients altogether. For historical reasons relating to the use of tied-mixture GMM systems [50], we process this vector as a separate independent stream of observations, though it could be appended to the cepstral vector to form a twenty-five-dimensional composite vector.
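The delta-cepstral computation can be illustrated in a few lines, assuming the simple two-frame difference delta-c_i(t) = c_i(t+1) − c_i(t−1); other difference windows are also used in practice. Edge frames are replicated here, which is one common convention, not necessarily the paper's.

```python
import numpy as np

def delta(ceps):
    """Frame-to-frame differences: delta[t] = c[t+1] - c[t-1], edges replicated."""
    padded = np.vstack([ceps[:1], ceps, ceps[-1:]])  # repeat first and last frame
    return padded[2:] - padded[:-2]

C = np.arange(12.0).reshape(4, 3)   # 4 frames of 3 toy cepstral coefficients
D = delta(C)
print(D.shape)                       # -> (4, 3), one delta vector per frame
```

Appending D to C column-wise would give the composite vector mentioned in the text; the systems here instead model the two streams independently.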

When training or test speech messages comprise active speech segments separated by long regions of silence, we train or test only on the active speech regions, because the non-speech regions typically contain no language-specific information. The speech-activity detector we use was developed by D.A. Reynolds for preprocessing speech in his speaker-ID system [51]. To separate speech from silence, the detector relies on a time-varying estimate of the instantaneous signal-to-noise ratio.

Because the cepstral feature vectors can be influenced by the frequency response of the communications channel, and because each individual message might be collected over a channel that is different from all other channels, we apply the relative spectral (RASTA) methodology to remove slowly varying linear channel effects from the raw feature vectors [52]. In the process, each feature vector's individual elements, considered to be separate streams of data, are passed through identical filters to remove near-DC components along with some higher-frequency components. For each vector index i, the RASTA-filtered coefficient c'_i is related to the original coefficient c_i as

c'_i(t) = c_i(t) * h(t),

where * denotes the convolution operation, and t is the time index measured in frames. We use the RASTA infinite-impulse-response filter

H(z) = 0.1 z^4 (2 + z^-1 − z^-3 − 2 z^-4) / (1 − 0.98 z^-1).

The effect of RASTA on language-ID performance is roughly comparable to removing the long-term cepstral mean, but with the computational advantage of requiring only a single pass over the input data.
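The RASTA filtering step can be sketched with scipy, using the standard RASTA transfer-function coefficients in causal form (dropping the z^4 advance merely delays the output by four frames). This is an illustrative sketch under those assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(coeffs):
    """Apply the RASTA band-pass IIR filter along time to cepstral streams.

    Causal form of H(z) = 0.1 * z**4 * (2 + z**-1 - z**-3 - 2*z**-4) / (1 - 0.98*z**-1).
    Each column (one cepstral coefficient over time) is filtered identically.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # band-pass numerator
    a = np.array([1.0, -0.98])                        # slow integrator pole
    return lfilter(b, a, coeffs, axis=0)

# A constant (DC) channel offset is strongly attenuated, since sum(b) = 0:
stream = np.ones((200, 12))        # 200 frames, 12 cepstral coefficients
out = rasta_filter(stream)
print(abs(out[-1, 0]) < 0.05)      # steady-state response to DC decays toward 0
```

The zero at DC is what removes the slowly varying channel component, while the pole at 0.98 keeps the pass band wide enough to preserve genuine spectral transitions.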

    Algorithm 1: Gaussian Mixture Model

    A GMM language-ID system served as the simplest algorithm for this study. As shown below, GMM language ID is motivated by the observation that different languages have different sounds and sound frequencies. GMM-based classification has been applied to language ID at several sites [6-8].

    Under the GMM assumption, each feature vector v_t at frame time t is assumed to be drawn randomly according to a probability density that is a weighted sum of unimodal multivariate Gaussian densities:

    p(v_t | λ) = Σ_{k=1}^{N} w_k b_k(v_t),

    where λ is the set of model parameters {w_k, μ_k, Σ_k}, k is the mixture index (1 ≤ k ≤ N), the w_k are the mixture weights constrained such that

    Σ_{k=1}^{N} w_k = 1,

    and the b_k are the multivariate Gaussian densities defined by the means μ_k and the covariance matrices Σ_k.

    For each language l, two GMMs are created: one for the set of cepstral feature vectors {x_t} and one for the set of delta-cepstral feature vectors {y_t}. The GMMs are created as follows:

    • Two independent streams of feature vectors are extracted from training speech spoken in the language l: centisecond mel-scale cepstra (c_1 through c_12) and delta-cepstra (Δc_0 through Δc_12), as described previously.

    • A modified version of the Linde, Buzo, and Gray algorithm [53] is used to cluster each stream of feature vectors, producing forty cluster centers for each stream (i.e., N = 40).

    • Multiple iterations of the estimate-maximize algorithm are run by using the cluster centers as initial estimates for the means μ_k. For each stream, this process produces a more likely set of μ_k, Σ_k, and w_k [54-55].
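To make the two-stream GMM scoring concrete, the following sketch evaluates log p(v_t | λ) for a diagonal-covariance mixture (the covariance structure is our assumption; the article does not specify it) and accumulates the per-frame log likelihoods of the two streams:

```python
import numpy as np

def gmm_logpdf(v, weights, means, variances):
    """log p(v | lambda) for a diagonal-covariance Gaussian mixture
    p(v) = sum_k w_k N(v; mu_k, diag(var_k)); v is (d,), means (N, d)."""
    d = means.shape[1]
    diff = v - means
    log_comp = (-0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
                - 0.5 * np.sum(diff ** 2 / variances, axis=1))
    weighted = log_comp + np.log(weights)
    m = np.max(weighted)                      # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(weighted - m)))

def language_loglik(xs, ys, lam_c, lam_dc):
    """L({x_t},{y_t} | lambda) = sum_t [log p(x_t | lambda^c)
    + log p(y_t | lambda^dc)]; lam_* are (weights, means, variances)."""
    return (sum(gmm_logpdf(x, *lam_c) for x in xs)
            + sum(gmm_logpdf(y, *lam_dc) for y in ys))
```

The classifier then hypothesizes the language whose pair of models maximizes this score.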


    ZISSMAN, Automatic Language Identification of Telephone Speech

    During recognition, an unknown speech utterance is classified by first converting the digitized waveform to feature vectors and then by calculating the log likelihood that the language-l model produced the unknown speech utterance. The log likelihood is defined as

    L({x_t}, {y_t} | λ_l^c, λ_l^dc) = Σ_{t=1}^{T} [ log p(x_t | λ_l^c) + log p(y_t | λ_l^dc) ],

    where λ_l^c and λ_l^dc are the cepstral GMM and delta-cepstral GMM, respectively, for language l, and T is the total time of the utterance. Implicit in this equation are the assumptions that the observations {x_t} are statistically independent of each other, the observations {y_t} are statistically independent of each other, and the two streams are jointly statistically independent of each other. The maximum-likelihood classifier hypothesizes l̂ as the language of the unknown utterance, where

    l̂ = arg max_l L({x_t}, {y_t} | λ_l^c, λ_l^dc).

    The GMM system is simple to train because it requires neither an orthographic transcription nor phonetic labeling of the training speech. The GMM maximum-likelihood recognition is also simple: a C implementation of a two-language classifier can be run easily in real time on a Sun SPARCstation 10.

    Algorithm 2: Phone Recognition Followed by Language Modeling

    The second language-ID approach we tested comprises a single-language phone recognizer followed by language modeling with an n-gram analyzer, as shown in Figure 6. In the PRLM system, training messages in each language l are tokenized by a single-language phone recognizer, the resulting symbol sequence associated with each of the training messages is analyzed, and an n-gram probability-distribution language model is estimated for each language l. During recognition, a test message is tokenized, and the likelihood that its symbol sequence was produced in each of the languages is calculated. The n-gram model that results in the highest likelihood is identified, and the language of that model is selected as the language of the message.

    FIGURE 6. Phone recognition followed by language modeling (PRLM). A single-language phone recognizer is used to tokenize the input speech, i.e., to convert the input waveform into a sequence of phone symbols. The phone sequences are then analyzed by the n-gram analyzer, and a language is hypothesized on the basis of maximum likelihood.

    PRLM was motivated by a desire to use speech sequence information in the language-ID process. We view the approach as a compromise between (1) modeling the sequence information with HMMs trained from unlabeled speech (such systems often perform


    no better than static classification [7-8]), and (2) employing language-dependent parallel phone recognizers trained from labeled speech (such systems, which are the subject of the subsection entitled "Algorithm 4: Parallel Phone Recognition," can be difficult to implement because labeled speech in every language of interest is often not available).

    Though PRLM systems can employ a single-language phone recognizer trained from speech in any language, we focused initially on English front ends because labeled English speech corpora were the most readily available. (Ultimately, we tested single-language front ends in six different languages.) The phone recognizer, implemented with the Hidden Markov Model Toolkit [56], is a network of context-independent phones (monophones) in which each phone model contains three emitting states. In each of the three states, the model emits an observation vector, and the output vector probability densities are modeled as GMMs with six underlying unimodal Gaussian densities per state per stream. The observation streams are the same cepstral and delta-cepstral vectors that are used in the GMM system. Phone recognition is performed via a Viterbi search using a fully connected null-grammar network of monophones, in which the search can exit from one monophone and enter any other monophone. Phone recognition, which dominates PRLM processing time, takes about 1.5 times real time on a Sun SPARCstation 10; i.e., a ten-second utterance takes about fifteen seconds to process.
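The Viterbi search over a fully connected monophone network can be illustrated with a toy decoder. This sketch is ours and is much simplified: each phone is collapsed to a single state (the actual models have three emitting states with GMM output densities), frame log likelihoods are supplied directly, and the self-loop probability is a hypothetical parameter standing in for trained transition probabilities:

```python
import math

def viterbi_null_grammar(obs_loglik, n_phones, self_loop=0.7):
    """Toy Viterbi decode over a fully connected 'null grammar' network:
    any phone may follow any other.  obs_loglik[t][p] is the log
    likelihood of frame t under phone p."""
    stay = math.log(self_loop)
    switch = math.log((1.0 - self_loop) / (n_phones - 1))
    delta = [obs_loglik[0][p] for p in range(n_phones)]
    back = []
    for t in range(1, len(obs_loglik)):
        new_delta, ptr = [], []
        for p in range(n_phones):
            best_q = max(range(n_phones),
                         key=lambda q: delta[q] + (stay if q == p else switch))
            new_delta.append(delta[best_q]
                             + (stay if best_q == p else switch)
                             + obs_loglik[t][p])
            ptr.append(best_q)
        delta, back = new_delta, back + [ptr]
    # trace back the best path from the best final state
    path = [max(range(n_phones), key=lambda p: delta[p])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Collapsing runs of repeated states in the returned path yields the hypothesized phone sequence.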

    Using the English-phone recognizer as a front end, we can train a language model for each language l by running training speech for the language l into the phone recognizer and computing a model for the statistics of the phones and phone sequences that are output by the recognizer. In the analysis, we count the occurrences of n-grams: subsequences of n symbols (phones, in this case). Language ID is performed by accumulating a set of n-gram histograms, one per language, under the assumption that different languages will have different n-gram histograms. We then use interpolated n-gram language models [57] to approximate the n-gram distribution as the weighted sum of the probabilities of the n-gram, the (n-1)-gram, the (n-2)-gram, the (n-3)-gram, and so on.


    An example for a bigram model (n = 2) is

    P̂(w_t | w_{t-1}, w_{t-2}, w_{t-3}, ...) = a_2 P(w_t | w_{t-1}) + a_1 P(w_t) + a_0 P_0,

    where w_{t-1} and w_t are consecutive symbols observed in the phone stream; the P values are ratios of counts observed in the training data, e.g.,

    P(w_t | w_{t-1}) = C(w_{t-1}, w_t) / C(w_{t-1}),

    where C(w_{t-1}, w_t) is the number of times symbol w_{t-1} is followed by w_t, and C(w_{t-1}) is the number of occurrences of symbol w_{t-1}; P_0 is the reciprocal of the number of symbol types; and the a values can either be estimated iteratively with the estimate-maximize algorithm so as to minimize perplexity, or they can be set manually.
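The interpolated bigram estimate, and the way a decoded phone sequence is scored against it, can be sketched as follows. The a values are the paper's settings; the function names and the toy probability interface are ours:

```python
from collections import Counter
import math

def train_bigram(phones):
    """Collect the counts C(w_{t-1}, w_t) and C(w) from one phone stream."""
    return Counter(zip(phones, phones[1:])), Counter(phones)

def interp_prob(w, w_prev, bigrams, unigrams, vocab_size,
                a2=0.399, a1=0.6, a0=0.001):
    """P_hat(w_t | w_{t-1}) = a2 P(w_t | w_{t-1}) + a1 P(w_t) + a0 P0,
    where P0 is the reciprocal of the number of symbol types."""
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_uni = unigrams[w] / sum(unigrams.values())
    return a2 * p_bi + a1 * p_uni + a0 / vocab_size

def sequence_loglik(phones, prob):
    """L(W | lambda) = sum_t log P_hat(w_t | w_{t-1})."""
    return sum(math.log(prob(w, w_prev))
               for w_prev, w in zip(phones, phones[1:]))

def classify(phones, models):
    """Maximum-likelihood language ID: models maps each language to its
    interpolated-bigram probability function; returns the argmax language."""
    scores = {lang: sequence_loglik(phones, p) for lang, p in models.items()}
    return max(scores, key=scores.get)
```

Because a2 + a1 + a0 = 1 and P0 = 1/V, the interpolated estimate sums to one over the vocabulary, so it behaves as a smoothed probability even for bigrams never seen in training.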

    During recognition, the test utterances are first passed through the front-end phone recognizer, which produces a phone sequence W = w_1, w_2, ..., w_T. The log likelihood L that λ_l^bg, the interpolated bigram language model for language l, produced the phone sequence W is

    L(W | λ_l^bg) = Σ_t log p(w_t | w_{t-1}, λ_l^bg).

    For language identification, we use the maximum-likelihood classifier decision rule, which hypothesizes that the language of the unknown utterance is given by

    l̂ = arg max_l L(W | λ_l^bg).

    On the basis of early experiments, we set n = 2, a_2 = 0.399, a_1 = 0.6, and a_0 = 0.001 for all PRLM experiments, because we found that peak performance was obtained for values of a_1 and a_2 in the range from 0.3 to 0.7. Thus far we have found little advantage to using values of n greater than two, and this observation is consistent with other sites [4]. Though not yet effective for PRLM-based language ID, trigrams


    FIGURE 7. Diagram of parallel PRLM. In this example, three single-language phone recognition front ends (in English, Japanese, and Spanish) are used in parallel to tokenize the input speech. The phone sequences output by the front ends are analyzed, and a language (Farsi, French, or Tamil) is hypothesized. Note that the front-end recognizers do not need to be trained in any of the languages to be recognized.

    have been used successfully in other types of language-ID systems [36, 29]. Our settings for the values of a and n are surely related to the amount of training speech available; for example, we might weight the higher-order a values more heavily as the amount of training data increases.

    Algorithm 3: Parallel PRLM

    Although PRLM is an effective means of identifying the language of speech messages (as discussed in the section entitled "Experiments and Results"), we know that the sounds in the languages to be identified do not always occur in the one language that is used to train the front-end phone recognizer. Thus we look

    for a way to incorporate phones from more than one language into a PRLM-like system. For example, Hazen has proposed to train a front-end recognizer on speech from more than one language [4]. Alternatively, our approach is simply to run multiple PRLM systems in parallel, with the single-language front-end recognizers each trained in a different language. This approach requires that labeled training speech be available in more than one language, although the training speech does not need to be available for all, or even any, of the languages to be recognized. Figure 7 shows an example of such a parallel PRLM system [10]. In this example, we have access to labeled speech corpora in English, Japanese, and Spanish, but the


    task at hand is to perform language classification of messages in Farsi, French, and Tamil.

    To perform the classification, we first train three separate PRLM systems: one with an English front end, another with a Japanese front end, and one with a Spanish front end. This parallel PRLM system would have a total of nine n-gram language models: one for each language to be identified (Farsi, French, and Tamil) per each front end (English, Japanese, and Spanish). During recognition, a test message is processed by all three PRLM systems, and their outputs are averaged in the log domain (multiplied in the linear domain), as if each PRLM system were operating independently, to calculate the overall language log-likelihood scores. Note that this approach extends easily to any number of front ends. The only limitation is the number of languages for which labeled training speech is available. The phone-recognizer parameters (e.g., the number of states and the number of Gaussian densities) used in parallel PRLM are identical to those used in PRLM. Parallel PRLM processing time is approximately 1.5 times real time on a Sun SPARCstation 10 per front-end phone recognizer; thus a system with phone recognizers in three languages (e.g., English, Japanese, and Spanish) would take approximately 4.5 times real time.

    Algorithm 4: Parallel Phone Recognition

    The PRLM and parallel PRLM systems perform phonetic tokenization followed by phonotactic analysis. Though this approach is reasonable when labeled training speech is not available in each language to be identified, the availability of such labeled training speech broadens the scope of possible language-ID strategies; for example, it becomes easy to train and use integrated acoustic/phonotactic models. If we allow the phone recognizer to use the language-specific phonotactic constraints during the Viterbi-decoding process (rather than applying those constraints after phone recognition is complete, as is done in PRLM and parallel PRLM), the most likely phone sequence identified during recognition will be optimal with respect to some combination of both the acoustics and phonotactics. The joint acoustic-phonotactic likelihood of that phone sequence would seem to be well suited for language ID. Thus we have developed and tested such a parallel phone recognition (PPR) system, as shown in Figure 8. Previously, Lamel [12] and Muthusamy [13] had proposed similar systems.

    FIGURE 8. Diagram of parallel phone recognition (PPR). Several single-language phone recognition front ends are used in parallel. The likelihoods of the Viterbi paths through each system are compared, from which a language is hypothesized.

    The language-dependent phone recognizers in the PPR language-ID system have the same configuration as the single-language phone recognizer used in PRLM, with a few exceptions. First, the language model is an integral part of the recognizer in the PPR system, whereas it is a postprocessor in the PRLM system. In the PPR system during recognition, the interphone transition probability between two phone models i and j is

    a_ij = s log p(j | i),


    where s is the grammar scale factor, and the p(j | i) values are bigram probabilities derived from the training labels. On the basis of preliminary testing, s = 3 was used in these experiments because performance was seen to have a broad peak near this value. Another difference between the PRLM and PPR phone recognizers is that, although both can use context-dependent phone models, our PRLM phone recognizers use only monophones, while our PPR phone recognizers use the monophones of each language plus the one hundred most commonly occurring right (i.e., succeeding) context-dependent phones. This strategy was motivated by initial experiments showing that context-dependent phones improved PPR language-ID performance but had no effect on PRLM language-ID performance.

    PPR language ID is performed by Viterbi-decoding the test utterance once for each language-dependent phone recognizer. Each phone recognizer finds the most likely path of the test utterance through the recognizer and calculates the log likelihood score (normalized by length) for that best path. During some of the initial experiments, we found that the log likelihood scores were biased; i.e., the scores from one of the language recognizers were higher on average than the scores from another language recognizer. We speculate that this effect might be a result of our using the Viterbi (best path) log likelihood rather than the full log likelihood across all possible paths. Alternatively, the bias might have been caused by a language-specific mismatch between the speakers or text used for the training and testing. Finally, these biases might represent different degrees of mismatch between the HMM assumptions and various natural languages. In any case, to hypothesize the most likely language in our PPR system, we use a modified maximum-likelihood criterion in which a recognizer-dependent bias is subtracted from each log likelihood score prior to applying the maximum-likelihood decision rule. Thus, instead of finding

    l̂ = arg max_l L(p_l | λ_l),

    we find

    l̂ = arg max_l [ L(p_l | λ_l) - B_l ],

    where L(p_l | λ_l) is the log likelihood of the Viterbi path p_l through the language-l phone recognizer, and B_l is the recognizer-dependent bias, which is set to the average of the normalized log likelihoods for all messages processed by the language-l recognizer. The PPR recognizer for each language runs at about two times real time on a Sun SPARCstation 10.
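A sketch of this bias-corrected decision rule (the dictionary interface and variable names are ours):

```python
def ppr_classify(scores_history, new_scores):
    """Bias-corrected maximum-likelihood rule: pick argmax_l [L_l - B_l],
    where B_l is the average of the normalized log likelihoods previously
    produced by the language-l recognizer.  Interface and names are ours."""
    corrected = {}
    for lang, score in new_scores.items():
        history = scores_history[lang]
        bias = sum(history) / len(history)  # recognizer-dependent bias B_l
        corrected[lang] = score - bias
    return max(corrected, key=corrected.get)
```

Subtracting B_l prevents a recognizer that habitually produces high scores from dominating the comparison.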

    Note that PPR systems require labeled speech for every language to be recognized. Therefore, implementing a PPR system can be more difficult than implementing any of the other systems discussed earlier, although Tucker [11] and Lamel [35] have bootstrapped PPR systems by using labeled training speech in only one language.

    Speech Corpus

    The Oregon Graduate Institute Multi-Language Telephone Speech (OGI-TS) Corpus [15] was used to evaluate the performance of each of the four language-ID approaches outlined earlier. The OGI-TS corpus contains messages spoken by different speakers over different telephone channels. Each message, spoken by a unique speaker, comprises responses to ten prompts, four of which elicit fixed text (e.g., "Please recite the seven days of the week" and "Please say the numbers zero through ten") and six of which elicit free text (e.g., "Describe the room from which you are calling" and "Speak about any topic of your choice"). The ten responses contained in each message together comprise about two minutes of speech.

    Table 1 contains a listing of the number of messages per language in each of the four segments of the corpus: initial training, development test, extended training, and final test. Our GMM, PRLM, parallel PRLM, and PPR comparisons were run with the initial-training segment for training and the development-test segment for testing. Because the Hindi messages were not yet available when we performed our preliminary tests, only ten languages were used. Test utterances of forty-five seconds and ten seconds were extracted from the development-test segment according to April 1993 specifications of the National Institute of Standards and Technology (NIST) [58].

    Forty-five-second utterance testing. Language ID was performed on a set of forty-five-second utterances spoken by the development-test speakers. The utter-


    Table 1. The Oregon Graduate Institute Multi-Language Telephone Speech (OGI-TS) Corpus: Number of Messages for the Different Languages*

    Language      Initial Training   Development Test   Extended Training   Final Test
                  Male    Female     Male    Female     Male    Female      Male    Female
    English        33       17        14       6         72       3          16       4
    Farsi          39       11        15       4          8       —          18       2
    French         40       10        15       5          2       —          12       8
    German         25       25         9      11          5       —          15       5
    Hindi          47        3        13       4         25       —          14       6
    Korean         32       17        18       2          3       2          15       5
    Japanese       30       20        15       5          0       —          11       8
    Mandarin       34       15        14       6          8       8           1       1
    Spanish        34       16        16       4         14       5          11       8
    Tamil          43        7        17       3          2       2          19       1
    Vietnamese     31       19        16       4          6       —          13       7

    * The number of messages is equal to the number of different speakers (both male and female). Each message, comprising roughly two minutes of speech, is a set of responses to ten prompts.

    ances were the first forty-five seconds of the responses to the prompt "speak about any topic of your choice." OGI refers to these utterances as stories before the tone, and they are denoted as story-bt. (A tone signaled the speaker when forty-five seconds of speech had been collected, indicating that fifteen seconds remained in the one-minute response.)

    Ten-second utterance testing. Language ID was performed on a set of ten-second cuts from the same utterances used in the forty-five-second testing.

    During the course of this work, OGI provided phonetic labels for six of the languages: English, Japanese, and Spanish labels were provided first, followed by German, Hindi, and Mandarin. For all six languages, labels were provided only for the story-bt utterances. We compared the GMM, PRLM, parallel PRLM, and PPR systems by using only the English, Japanese, and Spanish messages. Additional experiments comparing only the GMM, PRLM, and parallel PRLM systems used messages in all ten languages.

    Though the same OGI TS messages were used totrain each of the four systems, the systems used the


    training data in different ways. The GMMs were trained on the responses to the six free-text prompts. The PRLM back-end language models and the phone recognizers for the parallel PRLM and PPR systems were trained on the story-bt utterances.

    For the PRLM system, three different English front ends were trained:

    • A phone recognizer was trained on the phonetically labeled messages of the OGI-TS English initial-training segment. Models for forty-eight monophones were trained.

    • A second phone recognizer was trained on the entire training set (except for the shibboleth sentences) of the telephone-speech corpus developed jointly by NYNEX, Texas Instruments, and MIT (referred to as the NTIMIT corpus) [59]. The data comprised 3.1 hr of read, labeled telephone speech recorded with a single handset. Models for forty-eight monophones were trained.

    • A third phone recognizer was trained on CREDITCARD excerpts from the Switchboard corpus [60]. The data comprised 3.8 hr of spontaneous labeled telephone speech recorded with many different handsets. Models for forty-two monophones were trained.

    For the parallel PRLM system we conducted further tests after the initial comparisons were completed and as the OGI-TS extended-training segment and Hindi messages became available. The system's single-language front ends were eventually trained in six languages: English, German, Hindi, Japanese, Mandarin, and Spanish. Language-model training was performed on the union of the initial-training, development-test, and extended-training segments. Test utterances were selected according to the March 1994 NIST specification [58], with both forty-five-second utterances and ten-second utterances extracted from the final-test set.

    Table 2. Results (Percent Error) Comparing Four Systems for Two-Language, Forced-Choice Identification of Utterances of Forty-Five-Second and Ten-Second Duration

    System          English-       English-       Japanese-      Average¹
                    Japanese       Spanish        Spanish
                    45 sec 10 sec  45 sec 10 sec  45 sec 10 sec  45 sec 10 sec
    GMM              17     16      17     16      35     36      23     23
    PRLM²             6     12       3     15      12     22       7     16
    Parallel PRLM     9      1       3     12       6      1       6      —
    PPR               6      8       3      8      15     13       8      9

    Standard deviation³: 4 (45 sec), 2 (10 sec)

    ¹ Over all three language pairs
    ² Using the Switchboard corpus
    ³ With the assumption of a binomial distribution

    Experiments and Results

    We compared the four algorithms by performing two-alternative and three-alternative forced-choice classification experiments with the English, Japanese, and Spanish OGI-TS messages. This first set of experiments used the initial-training data for training and the development-test data for testing, as defined in Table 1. For the two-alternative testing, one model was trained on English speech and another on Japanese speech. Test messages spoken in English and

    Japanese were then presented to the system for classification. Similar experiments were run for English versus Spanish and for Japanese versus Spanish. For the three-alternative testing, models were trained in all three languages, and test messages in all three languages were presented to the system for forced-choice classification. Tables 2 and 3 show the results for all of these experiments. In the tables the averages were

    Table 3. Results (Percent Error) Comparing Four Systems for Three-Language, Forced-Choice Identification of Utterances of Forty-Five-Second and Ten-Second Duration

    System          English-Japanese-Spanish
                    45 sec     10 sec
    GMM               35         36
    PRLM¹             11         27
    Parallel PRLM      8         15
    PPR               14         15

    Standard deviation²: 6 (45 sec), 3 (10 sec)

    ¹ Using the Switchboard corpus
    ² With the assumption of a binomial distribution


    Table 4. Full Ten-Language Results (Percent Error)

    System                Ten-Language¹     English/Another²   Language Pairs³
                          45 sec  10 sec    45 sec  10 sec     45 sec  10 sec
    GMM                     47      5         19      16          2      —
    PRLM (NTIMIT)           33     53         12      18          1     16
    PRLM (Switchboard)      28     46          5      12          8     14
    PRLM (OGI-English)      28     46          7      13          8     14
    Parallel PRLM           21     37          8      12          6      1

    Standard deviation⁴: 3, 2, 2

    ¹ Ten-alternative, forced-choice classification
    ² Average of the nine two-alternative, forced-choice experiments with English and one other language from the nine other languages
    ³ Average of the forty-five two-alternative, forced-choice experiments with each pair of languages from the ten different languages
    ⁴ With the assumption of a binomial distribution

    computed with equal weighting per language pair, and the standard deviations were computed with the assumption of a binomial distribution. Generally, the results show that parallel PRLM and PPR perform about equally. This result is not surprising, because the major difference between the two systems for these three languages is the manner in which the language model is applied. For the forty-five-second utterances, Switchboard-based PRLM performs about as well as parallel PRLM and PPR, though it performs worse than parallel PRLM and PPR for the shorter, ten-second utterances.

    Using all ten languages of the OGI-TS corpus, we ran additional experiments to compare PRLM, parallel PRLM, and GMM. (PPR could not be run in this mode because phonetic labels did not exist for all of the languages.) The first two columns of Table 4 show ten-language, forced-choice results. The next two columns show two-language, forced-choice results averaged over English paired with each of the other nine languages. The final two columns show two-language, forced-choice results averaged over all of the forty-five language pairs. Approximate standard deviations are contained in the bottom row. Table 4 shows that parallel PRLM generally performs


    best. Also note that PRLM with a Switchboard front end performs about as well as PRLM with an OGI-TS English front end, but PRLM with an NTIMIT front end performs rather poorly, perhaps in part because of the lack of handset variability.

    Table 5 shows the results of evaluating the parallel PRLM system according to the March 1994 NIST guidelines. With the addition of Hindi, the first two columns refer to eleven-alternative, forced-choice classification; the next two columns refer to an average of the ten two-alternative, forced-choice experiments with English and one other language; and the last two columns refer to an average of the fifty-five two-alternative, forced-choice experiments using each pair of languages. Six front-end phone recognizers (English, German, Hindi, Japanese, Mandarin, and Spanish) were used for this experiment. This second set and all subsequent sets of experiments used the initial-training, development-test, and extended-training data for training, and the final-test data for testing, as defined in Table 1. Table 5 shows the results for our first pass through the final-test evaluation data; i.e., for these results there was no possibility of tuning the system to specific speakers or messages. For discussion of a live demonstration system that


    Table 5. Parallel PRLM Results (Percent Error) Using March 1994 NIST Guidelines

    Eleven-Language      English/Another Language      Language Pairs
    45 sec   10 sec      45 sec   10 sec               45 sec   10 sec
      11       21           4        6                    2        5

    implements the parallel PRLM algorithm, see the sidebar entitled "A Language Identification Demonstration System."

    We performed further analysis of our March 1994 NIST results to determine the effect of reducing the number of front-end parallel PRLM phone recognizers. Figure 9 shows the results of the eleven-language classification task: part (a) shows that reducing the number of channels generally increases the error rate more quickly for the ten-second utterances than for the forty-five-second utterances; part (b) shows that using only one channel, no matter which one it is, greatly increases the error rate; and part (c) shows that omitting any one of the six channels has only a small impact.

    Finally, we measured the within-language performance of three of the PPR front-end recognizers; i.e., we tested the English recognizer with English, the Japanese recognizer with Japanese, and the Spanish recognizer with Spanish. Table 6 shows the within-language phone-recognition performance of these three PPR recognizers. The results are presented in terms of the error rate, calculated by summing three types of errors: substitutions (a phone being misidentified), deletions (a phone not being recognized), and insertions (a nonexistent phone being recognized), and dividing the sum by the true number of phones. Note that for each language the number of equivalence classes (i.e., classes of similar phones that are, for the intent of scoring, considered equivalent) is

    FIGURE 9. Performance of the parallel PRLM system using fewer than six front ends: (a) the average effect of reducing the number of channels, (b) the effect of using only one channel, and (c) the effect of omitting any one of the six channels.


    A LANGUAGE IDENTIFICATION DEMONSTRATION SYSTEM

    A NEAR-REAL-TIME live demonstration version of the parallel PRLM system has been implemented at Lincoln Laboratory. A guest dials an ordinary telephone to connect through the public switched telephone network to the analog-to-digital converter of a Sun SPARCstation. The guest then speaks a short utterance. The resulting speech signal is converted to a sampled data stream and sent in parallel to several other SPARCstations, each of which performs phone recognition in a single language followed by language modeling. The results are sent back to the controlling workstation for final analysis and display. A computer-based speech synthesizer informs the guest of both the maximum-likelihood language hypothesis and the second-best guess. The guest can then review the phone sequences created by each of the phone recognizers and listen to the corresponding speech to gain better insight into the operation of the language-ID system. The monitor image shown in Figure A is typical of the display seen by the guest.

    In this example, a guest spoke a message in German. Below the waveform display are three parallel time-aligned phone transcriptions: the bottom transcription is the output of an English phone recognizer, the middle is from a Japanese phone recognizer, and the top is from a Spanish phone recognizer. The bar graph shows the final language-ID likelihood scores. The hypothesis in this case was correct; German was the language of the message, with English coming in second.

    The graphical user interface was implemented by using the WAVES package from Entropic Research Laboratory.


FIGURE A. Monitor display showing typical results of the language identification system. A message spoken in German is properly recognized by the system as German, with English as the second most likely hypothesis. Phone transcriptions for English, Japanese, and Spanish are also shown.


Table 5. Phone Recognition Results for Within-Language Performance of Individual PPR Front-End Recognizers

Phone        Tokens¹    Insertions  Substitutions  Deletions  Correct²   Monophones  Phone Classes  Error Rate
Recognizer   (number)   (number)    (number)       (number)   (number)   (number)    (number)       (percent)
English      8269       966         2715           1120       4434       52          39             58.1
Japanese     7949       864         945            1730       5274       27          25             44.5
Spanish      7509       733         1631           1021       4857       38          34             45.1

¹ Number of actual tokens in the test set.
² Number of phones identified correctly.
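The error-rate column follows directly from the counts: insertions plus substitutions plus deletions, divided by the number of reference tokens. A quick check in Python, with the values copied from the table:

```python
# (tokens, insertions, substitutions, deletions) per recognizer, from the table
counts = {
    "English":  (8269, 966, 2715, 1120),
    "Japanese": (7949, 864, 945, 1730),
    "Spanish":  (7509, 733, 1631, 1021),
}

def error_rate(tokens, ins, sub, dele):
    """Phone error rate in percent: all error tokens over reference tokens."""
    return 100.0 * (ins + sub + dele) / tokens

rates = {name: round(error_rate(*c), 1) for name, c in counts.items()}
```

The computed rates match the table's 58.1, 44.5, and 45.1 percent; the correct-token column is likewise tokens minus substitutions minus deletions.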

smaller than the number of monophones. The ten-second utterances from the development-test set were used to evaluate the phone-recognition performance. For these evaluations the phone networks included all context-independent and right-context-dependent phones observed in the training data. The results shown in Tables 2 and 5 indicate that the individual phone recognizers do not need to obtain a low phone-recognition error rate in order for the overall PRLM, parallel PRLM, and PPR systems to achieve good language-ID performance.

We believe, although we have not verified it experimentally, that our PRLM phone recognizers, which do not employ any context-dependent phones, have even higher error rates than our PPR phone recognizers. Given that belief, we found it interesting that the output from the PRLM phone recognizers could be used to perform language ID effectively. Several preliminary studies indicate that mutual information, as opposed to phone accuracy, might be a better measure of front-end utility. As suggested by H. Gish [61], mutual information of the front end measures both the resolution of the phone recognizer and its consistency. Consistency, rather than accuracy, is what the language models require; after all, if phone a is always recognized by a two-phone front end as phone b, and phone b is always recognized as phone a, the accuracy of the front end might be zero, but the ability of the language model to perform language ID will be just as high as if the front end had made no mistakes. That bigram performance is better than unigram performance might be due to the fact that we can recognize
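This consistency argument can be checked with a toy simulation: apply a fixed relabeling to every phone string (a perfectly consistent but completely inaccurate front end), and the bigram log-likelihood difference between two language models is unchanged. Everything below is invented for illustration (two-symbol "languages," add-alpha smoothing); it is not the article's actual modeling:

```python
import math
from collections import Counter

def bigram_model(seq, alpha=0.5):
    """Bigram probabilities with add-alpha smoothing, trained on one string."""
    counts = Counter(zip(seq, seq[1:]))
    context = Counter(seq[:-1])
    vocab = len(set(seq))
    return lambda b: (counts[b] + alpha) / (context[b[0]] + alpha * vocab)

def loglik(seq, prob):
    """Log likelihood of a symbol string under a bigram model."""
    return sum(math.log(prob(b)) for b in zip(seq, seq[1:]))

swap = str.maketrans("ab", "ba")          # consistent, 100%-inaccurate front end
lang1_train, lang2_train, test_msg = "aababaab", "abbabbba", "aabab"

# Language-ID evidence: log-likelihood difference between the two models,
# before and after the consistent relabeling.
orig = (loglik(test_msg, bigram_model(lang1_train))
        - loglik(test_msg, bigram_model(lang2_train)))
perm = (loglik(test_msg.translate(swap), bigram_model(lang1_train.translate(swap)))
        - loglik(test_msg.translate(swap), bigram_model(lang2_train.translate(swap))))
```

Because the relabeling is a bijection on symbols, every n-gram count is carried over intact, so `orig` and `perm` are equal.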

bigrams consistently even though we can rarely recognize them accurately.

Additional Experiments

Because of the encouraging preliminary results for our parallel PRLM approach, we focused our attention on ways of boosting the system's language-ID capabilities. In this section we report on efforts to use gender-dependent phonotactic weighting and duration tagging to improve the language-ID performance of parallel PRLM.

Gender-Dependent Channels

The use of gender-dependent acoustic models is a well-known technique for improving speech-recognition performance [62-64]. We were motivated to use gender-dependent front ends and back ends for two reasons: (1) gender-dependent phone recognizers should produce a more reliable tokenization of the input speech relative to their gender-independent counterparts; therefore, n-gram analysis should prove more effective; and (2) the acoustic likelihoods that are output by gender-dependent phone recognizers could be used to weight the phonotactic scores that are output by the interpolated language models. This weighting procedure would represent our first use of acoustic likelihoods in a PRLM-type system.

The general idea of employing gender-dependent channels for language ID is to make a preliminary determination regarding the gender of the speaker of a message, and then use the confidence of that determination to weight the phonotactic evidence from gender-dependent channels. Figure 10 shows a diagram of the system for English front ends.

FIGURE 10. Example of gender-dependent processing for a channel with an English front end. The acoustic likelihoods p(x|λm) for male speakers and p(x|λf) for female speakers (where x is the unknown message) are used to compute Wm, Wf, and Wgi, which are the respective weights for the male, female, and gender-independent channels.

During training, three phone recognizers per front-end language are trained: one from male speech, one from female speech, and one from combined male and female speech. Next, for each language to be identified, three interpolated n-gram language models are trained, one for each of the front ends. The language models associated with the male phone recognizer are trained only on male messages, the female language models only on female messages, and the combined models on both male and female messages.

During recognition, an unknown message x is processed by all three front ends. The acoustic likelihood scores emanating from the male and female front ends are used to compute the a posteriori probability that the message is male:

    Pr(male|x) = p(x|λm) / [p(x|λm) + p(x|λf)],

where p(x|λm) is the likelihood of the best phone-state sequence given the male HMMs λm, and p(x|λf) is the likelihood of the best phone-state sequence given the female HMMs λf. Observing empirically that the cutoff between male and female messages is not absolutely distinct and does not always occur exactly at Pr(male|x) = 0.5, we use Pr(male|x) to calculate three weights:

    Wm  = (Pr(male|x) - K) / (1 - K)   if Pr(male|x) >= K,  and 0 otherwise;
    Wf  = (K - Pr(male|x)) / K         if Pr(male|x) < K,   and 0 otherwise;
    Wgi = 1 - Wm   if Pr(male|x) >= K,  and  1 - Wf   if Pr(male|x) < K,

where Wm, Wf, and Wgi are the weights for the male, female, and gender-independent channels, respectively.
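The weighting step can be sketched in a few lines of Python. The piecewise-linear forms follow the weight equations above as best they can be reconstructed from the scanned source, and the threshold K = 0.8 is an invented placeholder (the article does not state the value of K it used):

```python
def gender_weights(p_male, p_female, K=0.8):
    """Channel weights from male/female acoustic likelihoods.

    p_male and p_female are the best-path acoustic likelihoods p(x|lambda_m)
    and p(x|lambda_f); K is an assumed gender-decision threshold.
    """
    pr_male = p_male / (p_male + p_female)  # a posteriori Pr(male | x)
    if pr_male >= K:
        w_m = (pr_male - K) / (1.0 - K)     # grows from 0 at K to 1 at Pr = 1
        w_f = 0.0
        w_gi = 1.0 - w_m
    else:
        w_m = 0.0
        w_f = (K - pr_male) / K             # grows from 0 at K to 1 at Pr = 0
        w_gi = 1.0 - w_f
    return w_m, w_f, w_gi
```

At Pr(male|x) = K the gender-independent channel receives all the weight; at the extremes the matching gender-dependent channel dominates, which is the shape plotted in Figure 11.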


FIGURE 11. The three weight functions Wm, Wf, and Wgi determined by the gender-dependent processing system shown in Figure 10. The value of each weight is a function of Pr(male|x).

A histogram of the durations for each phone emitted from each recognizer is compiled and the average duration determined. An -L suffix is appended to all phones having duration longer than the average duration for that phone, and an -S suffix is appended to all phones having duration shorter than the average duration for that phone. This modified sequence of phone symbols is then used in place of the original sequence to train the interpolated language models. During recognition, we use the duration thresholds determined during training to apply the same procedure to the output symbols from the phone recognizer.

Performance Results of Enhanced System

The use of gender-dependent phonotactic weighting and duration tagging has resulted in a modest improvement in language-ID performance, as shown in Table 7. The table compares the performance of five parallel PRLM systems using the 1994 NIST guidelines:

Baseline: our first-pass six-channel system from the March 1994 evaluation.

New baseline: a newer version of the baseline system with better silence detection and a better set of language-model interpolation weights (a2 = 0.599, a1 = 0.4, and a0 = 0.001).

Gender: a 16-channel system having three front ends (one male, one female, and one gender-independent) for English, German, Japanese, Mandarin, and Spanish, and one front end for Hindi. There was insufficient female speech to train gender-dependent front ends for Hindi. The results shown in Table 7 represent an attempt to use the gender-dependent acoustic likelihoods that are output by the front-end phone recognizers to improve the phonotactic scores output by the n-gram language models.

Duration: a system that uses a simple technique for modeling duration with -S and -L tags.

Gender and duration: a system that combines the above gender and duration enhancements.

Tables 8 and 9 show confusion matrices for the forty-five-second and ten-second utterances, respectively, for a parallel PRLM system with gender-dependent and duration processing. Each row shows the number of utterances truly spoken in some lan-
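The duration-tagging procedure reduces to a per-phone mean-duration threshold learned from training data. A minimal sketch; the phone labels and duration values below are illustrative, not taken from the article's data:

```python
from statistics import mean

def train_duration_thresholds(training_tokens):
    """Average duration per phone, used as the -S / -L split point.

    training_tokens: iterable of (phone, duration_in_seconds) pairs.
    """
    by_phone = {}
    for phone, dur in training_tokens:
        by_phone.setdefault(phone, []).append(dur)
    return {phone: mean(durs) for phone, durs in by_phone.items()}

def tag(tokens, thresholds):
    """Append -L to longer-than-average phones, -S to the rest."""
    return [phone + ("-L" if dur > thresholds[phone] else "-S")
            for phone, dur in tokens]

# Illustrative data: two phones with invented durations.
thresholds = train_duration_thresholds(
    [("HH", 0.06), ("HH", 0.10), ("AW", 0.10), ("AW", 0.16)])
tagged = tag([("HH", 0.03), ("AW", 0.18)], thresholds)
```

The tagged sequence (here HH-S followed by AW-L) replaces the raw phone sequence when training and scoring the interpolated language models.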



FIGURE 12. One approach to duration tagging, in which an -L suffix is appended to all phones having duration longer than the average duration for that phone, and an -S suffix is appended to all phones having duration shorter than the average duration for that phone. This modified sequence of phone symbols is then used in place of the original sequence to train the interpolated language models.

guage; each column shows the number of utterances classified by the system as spoken in some language. Thus entries along the main diagonal indicate utterances that were identified correctly; off-diagonal entries indicate errors. From studying the confusion matrices, we see that errors are not necessarily correlated with the linguistic closeness of the language pair involved. For example, there are many more Spanish/Hindi confusions than Spanish/French confusions. This result may be due to the small size of the test corpus, which limits our confidence in these statistics.

The confusion matrices in Tables 8 and 9 also provide evidence that the parallel PRLM system has trouble with non-native speakers of a language. For Spanish, an expert Spanish dialectologist listened to each message and classified the dialect of the speaker [66]. Although most of the Spanish speakers were natives of Spain or Latin America, some were born and/or raised in other countries (e.g., the United States and France). Of the thirteen Spanish speakers identified correctly in Table 8, ten were native speakers and three were not. Of the four Spanish speakers identi-
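Reading error rates off a confusion matrix in this convention (rows true, columns hypothesized) is mechanical. The sketch below uses an invented 3x3 matrix rather than the actual entries of Tables 8 and 9:

```python
# Hypothetical confusion matrix: rows are the true language,
# columns are the language hypothesized by the system.
languages = ["English", "Farsi", "French"]
confusion = [
    [61, 2, 5],   # true English
    [3, 47, 1],   # true Farsi
    [2, 4, 41],   # true French
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
overall_error = 100.0 * (1.0 - correct / total)

# Per-language error rate: off-diagonal mass in each row.
per_language_error = {
    lang: 100.0 * (1.0 - confusion[i][i] / sum(confusion[i]))
    for i, lang in enumerate(languages)
}
```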


Table 7. Parallel PRLM Results (Percent Error) with Several Enhancements

                        Eleven-Language    English/Another Language    Language Pairs
System                  45 sec   10 sec    45 sec   10 sec             45 sec   10 sec
Baseline                2        30        4        6                  5        8
New baseline            14       26        5        3                  4        7
Gender                  13       23        2        4                  3        6
Duration                14       23        2        5                  3        6
Gender and duration     11       21        2        4                  2        5

Standard deviation      3                  2                           2


Table 9. Confusion Matrix for Ten-Second Utterances for Parallel PRLM System Using Both Gender-Dependent and Duration Processing

(rows: true language; columns: hypothesized language, in the order English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese)

English: 61 2 2 2 2
Farsi: 2 47 3 2 2 2
French: 5 41 9 2 3
German: 3 3 2 53
Hindi: 3 2 2 51 5
Japanese: 2 2 49 2 5
Korean: 2 34 6
Mandarin: 4 3 2 4
Spanish: 2 6 2 4 4
Tamil: 43
Vietnamese: 2 2 3 3 34

rable to our original approach, but allows for a more reliable separation between male and female speakers and obviates the need for computing the K factor. In other research, the use of even more fine-grained duration tags has been studied by us and by Lockheed-Sanders. Both groups have found that quantizing duration into more than two values (-S and -L) does not improve language-ID performance.

Summary

This article has reviewed the research and development of language-ID systems at Lincoln Laboratory. We began by comparing the performance of four approaches to automatic language ID of telephone-speech messages: Gaussian mixture model (GMM) classification, single-language phone recognition followed by language modeling (PRLM), parallel PRLM, and parallel phone recognition (PPR). The GMM system, which requires no phonetically labeled training speech and runs faster than real time on a conventional UNIX workstation, performed the most poorly. PRLM, which requires phonetically labeled training speech in only one language and runs a bit more slowly than real time on a conventional


workstation, performs respectably as long as the front-end phone recognizer is trained on speech collected over a variety of handsets and channels. Even better results were obtained when multiple front-end phone recognizers were used with either the parallel PRLM or PPR systems. Because the phonetic or orthographic labeling of foreign-language speech is expensive, the high performance obtained with the parallel PRLM system, which can use, but does not require, labeled speech for each language to be recognized, is encouraging.

With respect to a parallel PRLM system, we have shown that the use of gender-dependent front ends in parallel with gender-independent front ends can improve performance. We have also used phone-duration tagging to improve performance. For forty-five-second telephone-speech messages, our very best system yields an 11% error rate in performing eleven-language closed-set tasks and a 2% error rate in performing two-language closed-set tasks.

As automatic speech-recognition systems become available for more and more languages, it is reasonable to believe that the availability of standardized multi-language speech corpora will increase. These


large new corpora should allow us to train and test systems that model language dependencies more accurately than is possible with just language-dependent phone recognizers employing bigram grammars. Language-ID systems that use language-dependent word spotters [36] and continuous-speech recognizers [45] are evolving. These systems are moving beyond the use of phonology for language ID, incorporating both morphologic and syntactic information. It will be interesting to compare the performance and computational complexity of these newer systems to the systems we studied.

For Further Reading

Most sites developing language-ID systems report their work at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), the International Conference on Spoken Language Processing (ICSLP), and the Eurospeech conferences, as well as in the IEEE Transactions on Speech and Audio Processing. Also, the National Institute of Standards and Technology (NIST) has sponsored a series of evaluations for various sites performing language-ID research. Specifications and results may be obtained from Dr. Alvin F. Martin of the NIST Computer Systems Laboratory, Spoken Language Technology Group, in Gaithersburg, Maryland.

Acknowledgments

The author is grateful to Ron Cole and his team at the Center for Spoken Language Understanding at the Oregon Graduate Institute for making available the OGI-TS Corpus. Bill Mistretta and Dave Morgan of Lockheed-Sanders suggested the approach of duration tagging described in the article. Kenney Ng of Bolt Beranek and Newman kindly provided high-quality transcriptions of the Switchboard CREDITCARD corpus. Tina Chou and Linda Sue Sohn helped prepare the OGI data at Lincoln Laboratory for PPR processing. Doug Reynolds, Doug Paul, Rich Lippmann, Beth Carlson, Terry Gleason, Jack Lynch, Jerry O'Leary, Charlie Rader, Elliot Singer, and Cliff Weinstein of the Speech Systems Technology group at Lincoln Laboratory offered significant technical advice. Lippmann, O'Leary, Reynolds, and Weinstein made numerous suggestions for improving this article.


REFERENCES

1. D.B. Paul, "Speech Recognition Using Hidden Markov Models," Linc. Lab. J. 3, 41 (1990).
2. R.C. Rose, "Techniques for Information Retrieval from Speech Messages," Linc. Lab. J. 4, 45 (1991).
3. D.A. Reynolds, "Automatic Speaker Recognition Using Gaussian-Mixture Speaker Models," Linc. Lab. J., in this issue.
4. T.J. Hazen, private communication.
5. Y.K. Muthusamy, E. Barnard, and R.A. Cole, "Reviewing Automatic Language Identification," IEEE Signal Process. Mag. 11, 33 (Oct. 1994).
6. L. Riek, W. Mistretta, and D. Morgan, "Experiments in Language Identification," Technical Report SPCOT-91-002 (Lockheed-Sanders, Nashua, NH, Dec. 1991).
7. S. Nakagawa, Y. Ueda, and T. Seino, "Speaker-Independent, Text-Independent Language Identification by HMM," ICSLP '92 Proc. 2, Banff, Alberta, Canada, 12-16 Oct. 1992, p. 1011.
8. M.A. Zissman, "Automatic Language Identification Using Gaussian Mixture and Hidden Markov Models," ICASSP '93 Proc. 2, Minneapolis, 27-30 Apr. 1993, p. 399.
9. T.J. Hazen and V.W. Zue, "Automatic Language Identification Using a Segment-Based Approach," Proc. Eurospeech '93 2, Berlin, 21-23 Sept. 1993, p. 1303.
10. M.A. Zissman and E. Singer, "Automatic Language Identification of Telephone Speech Messages Using Phoneme Recognition and n-Gram Modeling," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 305.
11. R.C.F. Tucker, M.J. Carey, and E.S. Parris, "Automatic Language Identification Using Sub-Word Models," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 301.
12. L.F. Lamel and J.-L. Gauvain, "Identifying Non-Linguistic Speech Features," Proc. Eurospeech '93 1, Berlin, 21-23 Sept. 1993, p. 23.
13. Y. Muthusamy, K. Berkling, T. Arai, R. Cole, and E. Barnard, "A Comparison of Approaches to Automatic Language Identification Using Telephone Speech," Proc. Eurospeech '93 2, Berlin, 21-23 Sept. 1993, p. 1307.
14. V. Fromkin and R. Rodman, An Introduction to Language (Harcourt Brace Jovanovich, Orlando, FL, 1993).
15. Y.K. Muthusamy, R.A. Cole, and B.T. Oshika, "The OGI Multi-Language Telephone Speech Corpus," ICSLP '92 Proc. 2, Banff, Alberta, Canada, 12-16 Oct. 1992, p. 895.
16. L.F. Lamel, R.H. Kassel, and S. Seneff, "Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus," Proc. DARPA Workshop on Speech Recognition, Palo Alto, CA, 19-20 Feb. 1986, p. 100.
17. R.G. Leonard, "Language Recognition Test and Evaluation," Technical Report RADC-TR-80-83 (RADC/Texas Instruments, Dallas, Mar. 1980).
18. R.G. Leonard and G.R. Doddington, "Automatic Language Identification," Technical Report RADC-TR-74-200/TI-347650 (RADC/Texas Instruments, Dallas, Aug. 1974).
19. R.G. Leonard and G.R. Doddington, "Automatic Classification of Languages," Technical Report RADC-TR-75-264 (RADC/Texas Instruments, Dallas, Oct. 1975).
20. R.G. Leonard and G.R. Doddington, "Automatic Language Discrimination," Technical Report RADC-TR-78-5 (RADC/Texas Instruments, Dallas, Jan. 1978).


21. D. Cimarusti and R.B. Ives, "Development of an Automatic Identification System of Spoken Languages: Phase I," ICASSP '82 Proc. 3, Paris, 3-5 May 1982, p. 1661.
22. J.T. Foil, "Language Identification Using Noisy Speech," ICASSP '86 Proc. 2, Tokyo, 7-10 Apr. 1986, p. 861.
23. F.J. Goodman, A.F. Martin, and R.E. Wohlford, "Improved Automatic Language Identification in Noisy Speech," ICASSP '89 Proc. 1, Glasgow, Scotland, 23-26 May 1989, p. 528.
24. M. Sugiyama, "Automatic Language Recognition Using Acoustic Features," ICASSP '91 Proc. 2, Toronto, 14-17 May 1991, p. 813.
25. Y.K. Muthusamy and R.A. Cole, "Automatic Segmentation and Identification of Ten Languages Using Telephone Speech," ICSLP '92 Proc. 2, Banff, Alberta, Canada, 12-16 Oct. 1992, p. 1007.
26. L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE 77, 257 (1989).
27. A.S. House and E.P. Neuburg, "Toward Automatic Identification of the Language of an Utterance. I. Preliminary Methodological Considerations," J. Acoust. Soc. Am. 62, 708 (1977).
28. M. Savic, E. Acosta, and S.K. Gupta, "An Automatic Language Identification System," ICASSP '91 Proc. 2, Toronto, 14-17 May 1991, p. 817.
29. S. Nakagawa, T. Seino, and Y. Ueda, "Spoken Language Identification by Ergodic HMMs and Its State Sequences," Electronics and Communications in Japan, Part 3, 77, 70 (Feb. 1994).
30. K.P. Li and T.J. Edwards, "Statistical Models for Automatic Language Identification," ICASSP '80 Proc. 3, Denver, 9-11 Apr. 1980, p. 884.
31. K.-P. Li, "Automatic Language Identification Using Syllabic Spectral Features," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 297.
32. L.F. Lamel and J.-L. Gauvain, "Cross-Lingual Experiments with Phone Recognition," ICASSP '93 Proc. 2, Minneapolis, 27-30 Apr. 1993, p. 507.
33. O. Andersen, P. Dalsgaard, and W. Barry, "On the Use of Data-Driven Clustering Technique for Identification of Poly- and Mono-Phonemes for Four European Languages," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 121.
34. K.M. Berkling, T. Arai, and E. Barnard, "Analysis of Phoneme-Based Features for Language Identification," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 289.
35. L.F. Lamel and J.-L. Gauvain, "Language Identification Using Phone-Based Acoustic Likelihoods," ICASSP '94 Proc. 1, Adelaide, Australia, 19-22 Apr. 1994, p. 293.
36. S. Kadambe and J.L. Hieronymus, "Language Identification with Phonological and Lexical Models," ICASSP '95 Proc. 5, Detroit, 9-12 May 1995, p. 3507.
37. R.J. D'Amore and C.P. Mah, "One-Time Complete Indexing of Text: Theory and Practice," Proc. Eighth Int. ACM Conf. on Res. and Dev. in Information Retrieval, p. 155 (1985).
38. R.E. Kimbrell, "Searching for Text? Send an n-Gram!," Byte 13, 297 (May 1988).
39. J.C. Schmitt, "Trigram-Based Method of Language Identification," U.S. Patent 5,062,143 (Oct. 1991).
40. M. Damashek, "Gauging Similarity via n-Grams: Language-Independent Text Sorting, Categorization, and Retrieval of Text," Science 267, 843 (10 Feb. 1995).
41. T.A. Albina, E.G. Bernstein, D.M. Goblirsch, and D.E. Lake, "A System for Clustering Spoken Documents," Proc. Eurospeech '93 2, Berlin, 21-23 Sept. 1993, p. 1371.


42. M.A. Lund and H. Gish, "Two Novel Language Model Estimation Techniques for Statistical Language Identification," Proc. Eurospeech '95 2, Madrid, Spain, 18-21 Sept. 1995, p. 1363.
43. Y. Yan and E. Barnard, "An Approach to Automatic Language Identification Based on Language-Dependent Phone Recognition," ICASSP '95 Proc. 5, Detroit, 9-12 May 1995, p. 3511.
44. S. Hutchins, private communication.
45. S. Mendoza, private communication.
46. B. Comrie, The World's Major Languages (Oxford University Press, New York, 1990).
47. D. Crystal, The Cambridge Encyclopedia of Language (Cambridge University Press, Cambridge, England, 1987).
48. Y.K. Muthusamy, N. Jain, and R.A. Cole, "Perceptual Benchmarks for Automatic Language Identification," ICASSP '94 Proc. 1, 333 (Apr. 1994).
49. S.B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoust. Speech Signal Process. ASSP-28, 357 (1980).
50. X.D. Huang and M.A. Jack, "Semi-Continuous Hidden Markov Models for Speech Signals," Computer Speech and Language 3, 239 (1989).
51. D.A. Reynolds, R.C. Rose, and M.J.T. Smith, "PC-Based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System," I