8-Speech Recognition: Speech Recognition Concepts, Speech Recognition Approaches, Recognition Theories, Bayes Rule, Simple Language Model, P(A|W), Network Types
7-Speech Recognition (Cont’d): HMM Calculating Approaches, Neural Components, Three Basic HMM Problems, Viterbi Algorithm, State Duration Modeling, Training in HMM
Recognition Tasks
- Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
- Speaker Dependent, Multiple Speaker, and Speaker Independent
- Vocabulary size: Small < 20; Medium > 100, < 1000; Large > 1000, < 10000; Very Large > 10000
Speech Recognition Concepts
[Figure: NLP and speech processing connect text and speech. Speech synthesis maps text (via a phone sequence) to speech; speech recognition and speech understanding map speech back to text.]
Speech recognition is the inverse of speech synthesis.
Speech Recognition Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
Bottom-Up Approach
[Figure: the signal flows through signal processing, feature extraction, and successive segmentation stages up to the recognized utterance. Knowledge sources applied along the way include voiced/unvoiced/silence classification, sound classification rules, phonotactic rules, lexical access, and the language model.]
Top-Down Approach
[Figure: feature analysis feeds a unit-matching system; lexical, syntactic, and semantic hypotheses are tested by an utterance verifier/matcher to produce the recognized utterance. Knowledge sources: inventory of speech recognition units, word dictionary, grammar, and task model.]
Blackboard Approach
[Figure: environmental, acoustic, lexical, syntactic, and semantic processes all read from and write to a shared blackboard.]
Recognition Theories
- Articulatory-Based Recognition: uses the articulatory system for recognition; this theory has been the most successful so far.
- Auditory-Based Recognition: uses the auditory system for recognition.
- Hybrid-Based Recognition: a hybrid of the above theories.
- Motor Theory: models the intended gesture of the speaker.
Recognition Problem
We have a sequence of acoustic symbols and want to find the words expressed by the speaker.
Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
A: acoustic symbols; W: word sequence.
We should find Ŵ so that
P(Ŵ|A) = max_W P(W|A)
Bayes Rule
P(x|y) P(y) = P(x, y)
P(x|y) = P(y|x) P(x) / P(y)
P(W|A) = P(A|W) P(W) / P(A)
Bayes Rule (Cont’d)
Ŵ = argmax_W P(W|A)
  = argmax_W P(A|W) P(W) / P(A)
  = argmax_W P(A|W) P(W)
P(A) does not depend on W, so it drops out of the maximization.
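The decision rule above can be sketched in a few lines of code. The candidate sequences and their probabilities below are purely illustrative, not from any real recognizer:

```python
# Sketch of the Bayes decision rule W_hat = argmax_W P(A|W) P(W).
# The candidates and probabilities are hypothetical, for illustration only.
acoustic = {          # P(A|W): how well each candidate explains the audio
    "recognize speech": 0.0030,
    "wreck a nice beach": 0.0032,
}
language = {          # P(W): prior probability from the language model
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0001,
}

def best_hypothesis(acoustic, language):
    # P(A) is constant over W, so it is dropped from the maximization.
    return max(acoustic, key=lambda w: acoustic[w] * language[w])

print(best_hypothesis(acoustic, language))  # -> recognize speech
```

Note that the acoustically slightly better candidate loses once the language model prior is factored in.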
Simple Language Model
W = w1 w2 w3 … wn
P(W) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) … P(wn|w1,w2,…,wn−1)
Computing this probability is very difficult and needs a very large database, so we use trigram and bigram models instead.
Simple Language Model (Cont’d)
Trigram: P(W) ≈ ∏_i P(wi|wi−1, wi−2)
Bigram: P(W) ≈ ∏_i P(wi|wi−1)
Monogram (unigram): P(W) ≈ ∏_i P(wi)
Simple Language Model (Cont’d)
Computing method:
P(w3|w1,w2) = (number of occurrences of w3 after w1 w2) / (total number of occurrences of w1 w2)
Ad hoc method (interpolation of relative frequencies f):
P(w3|w1,w2) = p1 f(w3|w1,w2) + p2 f(w3|w2) + p3 f(w3)
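The counting and ad-hoc interpolation methods above can be sketched as follows. The tiny corpus and the weights p1, p2, p3 are illustrative assumptions:

```python
# Relative-frequency n-gram estimates and ad-hoc interpolation
# P(w3|w1,w2) = p1 f(w3|w1,w2) + p2 f(w3|w2) + p3 f(w3).
# The corpus and the interpolation weights are made up for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def f3(w1, w2, w3):   # relative frequency of w3 after w1 w2
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

def f2(w2, w3):       # relative frequency of w3 after w2
    return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

def f1(w3):           # relative frequency of w3
    return uni[w3] / N

def p_interp(w1, w2, w3, p1=0.6, p2=0.3, p3=0.1):
    return p1 * f3(w1, w2, w3) + p2 * f2(w2, w3) + p3 * f1(w3)

print(p_interp("the", "cat", "sat"))
```

Because the lower-order frequencies are always mixed in, the estimate stays nonzero even for trigrams never seen in the corpus.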
Error-Producing Factors
- Prosody (recognition should be prosody independent)
- Noise (noise should be prevented)
- Spontaneous speech
P(A|W) Computing Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
Dynamic Time Warping
[Figures: DTW alignment of a test template against a reference template, and the global and local path constraints.]
Search limitations:
- First & end interval
- Global limitation
- Local limitation
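A minimal DTW sketch, using the classic local constraint (steps from the left, lower, and lower-left cells) and fixed endpoints; the sequences and the distance function are illustrative:

```python
# Minimal dynamic time warping: total cost of the best alignment path
# between two feature sequences, with fixed first/end points and the
# local limitation D[i][j] = cost + min(left, down, diagonal).
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # -> 0.0 (sequences align exactly)
```

A global limitation (e.g., a Sakoe-Chiba band restricting |i - j|) would simply skip cells outside the band in the double loop.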
Artificial Neural Network
Simple computational element of a neural network: inputs x0, …, xN−1 with weights w0, …, wN−1 produce
y = φ( Σ_{i=0}^{N−1} wi xi )
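The computational element above can be written directly as code. The slide leaves the activation φ abstract, so a step function is assumed here for illustration:

```python
# Simple neural computational element y = phi(sum_i w_i x_i).
# phi is assumed to be a step activation; the slide leaves it abstract.
def neuron(x, w, phi=lambda s: 1 if s >= 0 else 0):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return phi(s)

# Fixing x0 = 1 lets w0 act as a bias term.
print(neuron([1, 0.5, -0.2], [-0.3, 1.0, 2.0]))  # -> 0
```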
Artificial Neural Network (Cont’d)
Neural Network Types:
- Perceptron
- Time Delay
- Time Delay Neural Network (TDNN) computational element
Artificial Neural Network (Cont’d)
Single Layer Perceptron
[Figure: single layer perceptron with inputs x0, …, xN−1 and outputs y0, …, yM−1.]
Artificial Neural Network (Cont’d)
Three Layer Perceptron
[Figure: three layer perceptron.]
2.5.4.2 Neural Network Topologies
TDNN
[Figure: TDNN architecture.]
2.5.4.6 Neural Network Structures for Speech Recognition
[Figures: neural network structures for speech recognition.]
Hybrid Methods
Hybrid neural network and matched filter for recognition:
[Figure: speech acoustic features pass through delays into a pattern classifier; output units produce the result.]
Neural Network Properties
- The system is simple, but training needs many iterations.
- It does not impose a specific structure.
- Despite its simplicity, the results are good.
- The training set is large, so training should be done offline.
- Accuracy is relatively good.
Pre-processing
- Different preprocessing techniques are employed as the front end of speech recognition systems.
- The choice of preprocessing method is based on the task, the noise level, the modeling tool, etc.
The MFCC Method
- MFCC is based on how the human ear perceives sounds.
- MFCC performs better than other features in noisy environments.
- MFCC was introduced primarily for speech recognition applications, but it also gives good performance in speaker recognition.
- The Mel, the perceptual hearing unit of the human ear, is obtained from the relation:
  mel(f) = 2595 log10(1 + f/700)
Steps of the MFCC Method
Step 1: map the signal from the time domain to the frequency domain using the short-time FFT,
where Z(n) is the speech signal, W(n) is the window function (e.g., Hamming), WF = e^(−j2π/F), m = 0, …, F−1, and F is the length of a speech frame.
Steps of the MFCC Method
Step 2: find the energy of each filter-bank channel.
The number of mel-scale filter banks is M; Wk(j), k = 0, 1, …, M−1, is the transfer function of filter k in the bank.
[Figure: distribution of the filters on the Mel scale.]
MFCCمراحل روش
ل ي طيف و اعمال تبدي: فشرده ساز4 مرحلهDCT MFCCب يجهت حصول به ضرا
47
در رابطه باالL،...،0=nتبه ضراpب ي مرMFCC باشد.يم
The Mel-Cepstrum Method
[Block diagram: time signal → framing → |FFT|² → Mel-scaling → Logarithm → IDCT → low-order coefficients (cepstra) → differentiator → delta & delta-delta cepstra. The outputs are the mel-cepstrum (MFCC) coefficients.]
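The pipeline in the block diagram can be sketched for a single frame as below. This is a simplified illustration, not a production front end: the filter-bank layout, frame size, and sampling rate are assumptions, and the signal is synthetic:

```python
# Single-frame mel-cepstrum sketch: window -> |FFT|^2 -> triangular
# mel filter bank -> log -> DCT-II. Parameters are illustrative.
import numpy as np

def mel(f):       # standard mel-frequency relation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=12, n_ceps=8):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # triangular filters spaced uniformly on the mel scale
    edges = inv_mel(np.linspace(0.0, mel(sr / 2), n_filters + 2))
    energies = np.empty(n_filters)
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[k] = np.sum(spec * np.minimum(up, down))
    logE = np.log(energies + 1e-10)
    # DCT-II of the log filter-bank energies -> cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ logE

t = np.arange(256) / 8000.0
ceps = mfcc_frame(np.sin(2 * np.pi * 440 * t))
print(ceps.shape)  # -> (8,)
```

Delta and delta-delta features would then be obtained by differentiating these coefficients across successive frames.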
Properties of the Mel-Cepstrum (MFCC)
- Maps the mel filter-bank energies, using the DCT, onto the directions in which their variance is maximal.
- The DCT makes the speech features approximately (not completely) independent of one another.
- Good performance in clean environments.
- Reduced performance in noisy environments.
Time-Frequency Analysis
Short-term Fourier Transform: the standard way of frequency analysis, decomposing the incoming signal into its constituent frequency components.
W(n): windowing function; N: frame length; p: step size.
Critical Band Integration
Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.
Bark scale
Feature Orthogonalization
Spectral values in adjacent frequency channels are highly correlated.
The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
Decorrelation is useful to improve the parameter estimation.
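The decorrelation argument can be checked numerically. The sketch below builds synthetic, highly correlated "channels" (an assumption standing in for adjacent spectral channels) and shows that an orthogonal DCT transform pushes the covariance toward the diagonal:

```python
# Numerical check of the decorrelation claim: correlated channels have a
# covariance matrix with large off-diagonal mass; after a DCT the mass
# concentrates on the diagonal. The data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_frames = 8, 5000
# correlated channels: each is a short moving average of shared noise
base = rng.standard_normal((n_ch + 2, n_frames))
X = (base[:-2] + base[1:-1] + base[2:]) / 3.0

k, n = np.meshgrid(np.arange(n_ch), np.arange(n_ch), indexing="ij")
C = np.cos(np.pi * k * (n + 0.5) / n_ch)      # DCT-II basis rows
Y = C @ X

def off_diag_ratio(M):
    # total off-diagonal covariance mass relative to the diagonal
    cov = np.cov(M)
    off = np.abs(cov - np.diag(np.diag(cov))).sum()
    return off / np.abs(np.diag(cov)).sum()

print(off_diag_ratio(X) > off_diag_ratio(Y))  # -> True: DCT decorrelates
```

With nearly diagonal covariance, a Gaussian model only needs the variances, which is exactly the parameter saving the slide points to.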
Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.
P(wj|wk) = 1 if (wk, wj) is a valid word pair, 0 otherwise

Statistical Language Modeling: for W = w1 w2 … wQ,
P(W) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wQ|w1,w2,…,wQ−1)

With an N-gram model:
P(W) = ∏_{i=1}^{Q} P(wi|wi−1, wi−2, …, wi−N+1)

Estimating the trigram probability from counts F(·):
P̂(w3|w1,w2) = F(w1,w2,w3) / F(w1,w2), if F(w1,w2) ≥ p
If F(w1,w2) < p, back off to the bigram estimate P̂(w3|w2) = F(w2,w3) / F(w2); if F(w2) is also below the threshold, use the unigram estimate P̂(w3) = F(w3) / Σ_i F(wi).
Perplexity of the Language Model

Entropy of the source:
H = − lim_{Q→∞} (1/Q) Σ P(w1,w2,…,wQ) log P(w1,w2,…,wQ)
(sum over all sequences w1,…,wQ)

First-order entropy of the source (when the words are independent, so that P(w1,w2,…,wQ) = P(w1) P(w2) … P(wQ)):
H = − Σ_{w∈V} P(w) log P(w)

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out:
H = − lim_{Q→∞} (1/Q) log P(w1,w2,…,wQ)

We often compute H based on a finite but sufficiently large Q:
H = −(1/Q) log P(w1,w2,…,wQ)

H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source.

If the N-gram language model PN(W) is used, an estimate of H is:
Ĥ = −(1/Q) Σ_{i=1}^{Q} log PN(wi|wi−1,…,wi−N+1)

In general:
Ĥ = −(1/Q) log P̂(w1,w2,…,wQ)

Perplexity is defined as:
B = 2^Ĥ = P̂(w1,w2,…,wQ)^(−1/Q)
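The entropy estimate and perplexity are easy to compute once a model assigns probabilities. The unigram model and test sequence below are hypothetical, chosen so the arithmetic is easy to follow:

```python
# Perplexity B = 2^H of an assumed unigram model on a tiny test
# sequence, with H estimated as -(1/Q) * sum_i log2 P(w_i).
import math

model = {"the": 0.5, "cat": 0.25, "sat": 0.25}   # hypothetical unigram model
test_words = ["the", "cat", "the", "sat"]

Q = len(test_words)
H = -sum(math.log2(model[w]) for w in test_words) / Q
B = 2 ** H
print(H, B)  # H = 1.5 bits/word, B = 2^1.5 ≈ 2.83
```

A perplexity of about 2.83 means the model faces, on average, roughly the same difficulty as choosing uniformly among 2.83 words at each step.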
Overall recognition system based on subword units