TRANSCRIPT
Dr. G. Bharadwaja Kumar
Indian Language
Speech
Processing
Speech recognition
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into the corresponding
orthographic representation.
Speech recognition systems can be characterized by many parameters:

Parameter                    Range
Speaking mode                Isolated words to continuous speech
Speaking style               Read speech to spontaneous speech
Enrollment                   Speaker-dependent to speaker-independent
Vocabulary                   Small (<20 words) to large (>20,000 words)
Language model               Finite-state to context-sensitive
Perplexity                   Small (<10) to large (>100)
Signal-to-Noise Ratio (SNR)  High (>30 dB) to low (<10 dB)
Transducer                   Noise-cancelling microphone to telephone
Speaking Style
Read speech
– Planned or read speech may not contain disfluencies
– News
Spontaneous speech
– Extemporaneously generated speech
– Contains disfluencies (hesitations and fillers)
– Often ‘less well formed’ (or ungrammatical)
Vocabulary size
As the vocabulary increases, the number of input-template comparisons which must be made before a best match can be determined also increases.
Enrolment
Some systems require speaker enrollment -- a user must provide samples of his or her speech before using them -- whereas other systems are said to be speaker-independent, in that no enrollment is necessary.
External parameters
In addition, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
The modifications to pronunciation once isolated words are embedded in continuous speech include
Assimilation
Elision
Vowel reduction
Strong and weak forms
Liaison
Contractions
Juncture
Ref: http://cristiancuesta512.blogspot.in/
Source Channel Model
If A represents the acoustic feature sequence extracted from a speech sample, the speech recognition system should yield the word sequence W that best matches A.
W* = argmax_W P(W|A)

Using Bayes’ rule, we can rewrite this as

P(W|A) = P(A|W) P(W) / P(A)

Here, P(A|W) is the likelihood of feature sequence A given the acoustic model of word sequence W.
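As a toy illustration of the argmax above, the sketch below scores two hypothetical word sequences in log space (all candidate strings and scores are invented for illustration; P(A) is dropped because it is constant across hypotheses):

```python
# Toy illustration of W* = argmax_W P(A|W) P(W). The two candidate
# transcriptions and all scores below are invented for illustration.
candidates = {
    "recognize speech":   {"log_p_a_given_w": -12.0, "log_p_w": -3.0},
    "wreck a nice beach": {"log_p_a_given_w": -11.5, "log_p_w": -6.0},
}

def decode(cands):
    # Combine acoustic and language model scores in log space;
    # P(A) cancels out of the argmax.
    return max(cands, key=lambda w: cands[w]["log_p_a_given_w"] + cands[w]["log_p_w"])

print(decode(candidates))  # recognize speech (-15.0 beats -17.5)
```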
Pronunciation Lexicon
Provides pronunciations of words, so the decoder knows which HMMs to use for a given word.
Also provides a list of words that limits the language model complexity and the decoder’s search space.
As a result, an ASR system can only recognize the limited set of words present in the dictionary; this is normally known as closed-vocabulary speech recognition.
Out-of-vocabulary (OOV)
Words that are unknown and appear in test data, for which the phonetic sequence is unknown.
They cannot be recognized and also affect the recognition accuracy of the surrounding in-vocabulary (IV) words.
Four challenges with OOV:
Detecting the presence of the word
Determining its location within the utterance
Recognizing the underlying phonetic sequence
Identifying the spelling of the word
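A minimal sketch of the vocabulary side of this problem: measuring the OOV rate of an utterance against a closed vocabulary (both the vocabulary and the utterance below are invented for illustration):

```python
# Sketch of measuring the OOV rate of test data against a closed
# vocabulary; all words below are invented toy data.
vocabulary = {"the", "water", "is", "hot", "no"}
test_utterance = "there is no hot water today".split()

# Every test word absent from the dictionary is an OOV word.
oov = [w for w in test_utterance if w not in vocabulary]
oov_rate = len(oov) / len(test_utterance)
print(oov, round(oov_rate, 2))  # ['there', 'today'] 0.33
```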
Acoustic Modeling
Sampling: measuring the amplitude of the signal at time t
Microphone (“wideband”): 16,000 Hz (samples/sec)
Telephone: 8,000 Hz (samples/sec)
Why?
– Need at least 2 samples per cycle
– Max measurable frequency is half the sampling rate
– Human speech < 10,000 Hz, so at most 20 kHz is needed
– Telephone speech is filtered at 4 kHz, so 8 kHz is enough
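The Nyquist reasoning above can be written down directly (a minimal stdlib-only sketch; the helper name is my own):

```python
# Minimal sketch of the Nyquist rule: a sampling rate fs can only
# represent frequencies up to fs/2, so telephone speech (8 kHz
# sampling) is limited to 4 kHz while "wideband" microphone speech
# (16 kHz sampling) reaches 8 kHz.
def max_representable(fs_hz):
    # Need at least 2 samples per cycle, so the ceiling is fs/2.
    return fs_hz / 2

print(max_representable(8000))   # 4000.0 Hz (telephone)
print(max_representable(16000))  # 8000.0 Hz (wideband)
```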
Why Frequency Domain
The frequency of a sound is one of its most important physical properties, and it can be easily observed by converting the signal from the time domain to the frequency domain using the FFT.
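To make the point concrete, the sketch below uses a naive stdlib-only DFT (an FFT would compute the same thing faster) to recover the frequency of a pure 5 Hz tone from its samples:

```python
import cmath, math

# Naive DFT sketch (stdlib only): the dominant frequency of a pure
# 5 Hz tone sampled at 64 Hz is directly visible in the spectrum.
fs, n, tone_hz = 64, 64, 5
signal = [math.sin(2 * math.pi * tone_hz * t / fs) for t in range(n)]

def dft_magnitudes(x):
    # Magnitude of each DFT bin up to the Nyquist bin.
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(peak_bin * fs / n)  # 5.0 Hz, the frequency of the input tone
```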
Cepstral coefficients are typically used in speech recognition to characterize spectral envelopes, capturing primarily the formants of speech.
Mobile Recorded Speech
Mel-Frequency Cepstral Coefficient (MFCC)
Most widely used spectral representation in ASR
Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency scale
Separates the source and filter
IDFT (DCT) decorrelates the features
– Improves the diagonal-covariance assumption in HMM modeling
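The “perceptual Mel frequency scale” mentioned above warps frequency to mimic human pitch perception. A minimal sketch of the standard Hz-to-mel conversion used in MFCC pipelines:

```python
import math

# Standard mel-scale conversion used when building the MFCC filterbank:
# equal steps in mel correspond to increasingly large steps in Hz,
# compressing the high frequencies the ear resolves poorly.
def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, useful for placing filterbank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000  (1 kHz is ~1000 mel by design)
print(round(hz_to_mel(4000)))  # 2146  (4x the Hz, only ~2.1x the mel)
```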
Acoustic Modeling
- Mporas, Iosif, et al. "Comparison of speech features on the speech recognition task." Journal of Computer Science 3.8 (2007): 608-616.
HMM/GMM Models
Approaches to Speaker Adaptation
Language models
help any speech recognizer to figure out how likely a word sequence is independent of the acoustics.
play a paramount role in guiding and constraining among large number of alternative word hypotheses in continuous speech recognition.
continuous speech recognition suffers from difficulties such as variation due to sentence structure (prosodies), interaction between adjacent words (crossword co-articulation), and no clear acoustic markers to delineate word boundaries.
play a vital role in resolving acoustic confusions that arise due to co-articulation, assimilation and homophones while decoding.
The perplexity can be roughly interpreted as the average branching factor of the testing data to the language model.
Lower perplexity correlates with better recognition performance, because the speech recognizer needs to consider fewer branches during decoding.
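The branching-factor interpretation can be checked on a degenerate case: a uniform model over V words has perplexity exactly V on any test set (toy numbers, for illustration only):

```python
import math

# Perplexity as average branching factor: 2 to the power of the
# average negative log2 probability per word.
def perplexity(log2_probs):
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A uniform model over a 100-word vocabulary assigns every word
# probability 1/100, so its perplexity is 100 on any test data.
vocab_size = 100
uniform = [math.log2(1 / vocab_size)] * 50  # a 50-word test set
print(round(perplexity(uniform), 2))  # 100.0
```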
N-Gram Language Models
The intuition of the N-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.
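For a bigram model this approximation is P(w_i | history) ≈ P(w_i | w_{i-1}), estimated by maximum likelihood from counts. A sketch on an invented toy corpus:

```python
from collections import Counter

# Maximum-likelihood bigram estimation from a toy corpus
# (the corpus is invented for illustration).
corpus = "the water is hot the water is cold".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # counts of words in the "prev" slot

def p_bigram(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("water", "is"))  # 1.0: "water" is always followed by "is"
print(p_bigram("is", "hot"))    # 0.5: "is" precedes "hot" and "cold"
```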
Smoothing Techniques
Add-1 smoothing or Good-Turing:
– OK for text categorization
For language modeling, the most commonly used method:
– Extended interpolated Kneser-Ney
For very large N-gram collections like the Web:
– Backoff
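Although Kneser-Ney is preferred for language modeling, add-1 (Laplace) smoothing is the simplest way to see the mechanics all smoothing methods share, giving unseen n-grams nonzero probability. A sketch on the same kind of invented toy corpus:

```python
from collections import Counter

# Add-1 (Laplace) smoothing: add 1 to every bigram count so unseen
# bigrams get nonzero probability (toy corpus, illustration only).
corpus = "the water is hot the water is cold".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])
vocab = set(corpus)  # 5 word types

def p_add1(prev, word):
    # Numerator gains +1; denominator gains +V to stay normalized.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(round(p_add1("water", "is"), 2))   # 0.43: seen bigram, (2+1)/(2+5)
print(round(p_add1("water", "hot"), 2))  # 0.14: unseen bigram, (0+1)/(2+5)
```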
Domain Sensitivity
Language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained.
A language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period.
Rosenfeld, Ronald. "Two decades of statistical language modeling: Where do we go from here?." Proceedings of the IEEE 88.8 (2000): 1270-1278.
Given a background model P_B(w|h) and a topic-based model P_T(w|h), it is possible to obtain a final model P_I(w|h), to be used in the second decoding pass, as follows:
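The combination formula itself does not appear in the text; a standard choice in the literature (an assumption here, not necessarily the original slide's exact formula) is linear interpolation of the two models:

```latex
P_I(w \mid h) = \lambda \, P_T(w \mid h) + (1 - \lambda) \, P_B(w \mid h),
\qquad 0 \le \lambda \le 1
```

where the interpolation weight λ is typically tuned on held-out data.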
Complexity
ASR systems often have complexity that is linear in the number of tokens and polynomial in the number of types (e.g., decoding with a trigram language model over a size-N vocabulary has, in the worst case, a complexity of at least O(N^3)).
-- Lin, Hui, and Jeff Bilmes. "Optimal selection of limited vocabulary speech corpora." Twelfth Annual Conference of the International Speech Communication Association. 2011.
Notable speech recognition software engines
Ref- https://en.wikipedia.org/wiki/List_of_speech_recognition_software
System Name   Open Source   Acoustic Modeling
CMU           Yes           GMM/HMM
HTK           No            GMM/HMM
RWTH          Yes           LSTM
Kaldi         Yes           Deep Neural Network
Julius        Yes           GMM/HMM
Challenges in
Indian Language Speech Processing
Dravidian Languages
Major: Telugu, Tamil, Kannada, Malayalam
Very rich in morphology and complex Sandhi rules
Relatively free word order languages
Pronunciation Lexicon
Most of the Indian languages are phonetic in nature, i.e., there exists a one-to-one correspondence between the orthography and the pronunciation in these languages.
For Telugu, a simple rule-based G2P is enough.
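For a phonetic script with one-to-one correspondence, rule-based G2P reduces essentially to table lookup. A minimal sketch, assuming a hypothetical romanized mapping (the table below is invented for illustration; a real Telugu G2P covers the full set of aksharas, vowel signs, and conjuncts):

```python
# Hypothetical one-to-one grapheme-to-phoneme table for a phonetic
# script (invented for illustration, using romanized symbols).
g2p_table = {"k": "k", "a": "a", "m": "m", "l": "l"}

def g2p(word):
    # With one-to-one correspondence, G2P is a per-grapheme lookup.
    return [g2p_table[ch] for ch in word]

print(g2p("kamala"))  # ['k', 'a', 'm', 'a', 'l', 'a']
```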
Tamil does not distinguish between voiced and voiceless stops and lacks symbols for voiced and aspirated stops.
Morph Based Language Models
In two Finnish recognition tasks, relative error rate reductions between 12% and 31% were obtained.
Word fragments obtained using grammatical rules did not outperform the fragments discovered from text.
Hirsimäki, Teemu, et al. "Unlimited vocabulary speech recognition with morph language models applied to Finnish." Computer Speech & Language 20.4 (2006): 515-541.
Phoneme list
Tamil Phonetic Mapping
Grapheme to Phoneme Mapping Softwares
Sequitur G2P
https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html
Sequence-to-Sequence G2P toolkit
https://github.com/cmusphinx/g2p-seq2seq
Phonetisaurus G2P
https://github.com/AdolfVonKleist/Phonetisaurus
Morphology
Application of extensive Sandhi changes sometimes results in telescoping of several words into long strings.
English Sentence: ‘Do you say that there is no hot water?’
Telugu Sentence: vEdinILLu levu aNtavu A?
After Sandhi: vENNILLEvaNtAvA (one word)
– Reference: P. Bhaskara Rao, “Telugu” , Concise Encyclopedia of Languages of the world, Elsevier, pp 1055-1060.
Type-Token Analysis
G. Bharadwaja Kumar et al., “Statistical Analyses of Telugu Text Corpora”, IJDL, Vol. 36, No. 2 (2007)
BNC Corpus for English (100 Million Word Corpus)
UoH Corpus for Telugu ( 40 Million Word Corpus)
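The type-token ratio such analyses compare can be sketched on toy data (both "corpora" below are invented for illustration; morphologically rich languages show many more types for the same number of tokens because inflected and sandhied forms each count as a new type):

```python
# Type-token ratio: distinct word forms (types) over total words
# (tokens). Both toy "corpora" are invented for illustration.
def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

# An analytic language reuses function words like "the"...
analytic = "the house of the man near the house".split()
# ...while an agglutinative one folds them into distinct word forms.
agglutinative = "intlO intiki intinuMdi manishiki daggarlO illu rALLu vunnAyi".split()

print(round(type_token_ratio(analytic), 2))       # 0.62 (5 types / 8 tokens)
print(round(type_token_ratio(agglutinative), 2))  # 1.0  (8 types / 8 tokens)
```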
One of the main problems with LVCSR systems is that the words spoken may not always exist within the system’s vocabulary.
These are called out-of-vocabulary (OOV) words.
This is a predominant problem for morphologically rich and complex languages such as the Dravidian languages.
Thank You
Presentation is available at
http://bharadwajakumar.wordpress.com