8-Speech Recognition: Speech Recognition Concepts, Speech Recognition Approaches, Recognition Theories, Bayes Rule, Simple Language Model, P(A|W), Network Types
7-Speech Recognition (Cont’d): HMM Calculating Approaches, Neural Components, Three Basic HMM Problems, Viterbi Algorithm, State Duration Modeling, Training in HMM
Recognition Tasks
- Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
- Speaker Dependent, Multiple Speaker, and Speaker Independent
- Vocabulary size: Small < 20; Medium > 100, < 1000; Large > 1000, < 10000; Very Large > 10000
Speech Recognition Concepts
[Figure: NLP and speech processing connect text and speech. Speech synthesis maps text (via a phone sequence) to speech; speech recognition and speech understanding map speech back to text.]
Speech recognition is the inverse of speech synthesis.
Speech Recognition Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
Bottom-Up Approach
[Figure: the signal flows through signal processing, feature extraction, and successive segmentation stages up to the recognized utterance. Knowledge sources applied along the way include voiced/unvoiced/silence classification, sound classification rules, phonotactic rules, lexical access, and the language model.]
Top-Down Approach
[Figure: feature analysis feeds a unit-matching system; lexical, syntactic, and semantic hypotheses are tested by an utterance verifier/matcher to produce the recognized utterance. Knowledge sources: inventory of speech recognition units, word dictionary, grammar, and task model.]
Blackboard Approach
[Figure: environmental, acoustic, lexical, syntactic, and semantic processes all read from and write to a shared blackboard.]
Recognition Theories
- Articulatory-Based Recognition: uses the articulatory system for recognition; this theory has been the most successful so far.
- Auditory-Based Recognition: uses the auditory system for recognition.
- Hybrid-Based Recognition: a hybrid of the above theories.
- Motor Theory: models the intended gesture of the speaker.
Recognition Problem
We have a sequence of acoustic symbols and want to find the words expressed by the speaker.
Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
A: acoustic symbols; W: word sequence.
We should find Ŵ so that
P(Ŵ|A) = max_W P(W|A)
Bayes Rule
P(x|y) P(y) = P(x, y)
P(x|y) = P(y|x) P(x) / P(y)
P(W|A) = P(A|W) P(W) / P(A)
Bayes Rule (Cont’d)
Ŵ = argmax_W P(W|A)
  = argmax_W P(A|W) P(W) / P(A)
  = argmax_W P(A|W) P(W)
P(A) does not depend on W, so it drops out of the maximization.
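The decision rule above can be sketched in a few lines of code. The candidate sequences and their probabilities below are purely illustrative, not from any real recognizer:

```python
# Sketch of the Bayes decision rule W_hat = argmax_W P(A|W) P(W).
# The candidates and probabilities are hypothetical, for illustration only.
acoustic = {          # P(A|W): how well each candidate explains the audio
    "recognize speech": 0.0030,
    "wreck a nice beach": 0.0032,
}
language = {          # P(W): prior probability from the language model
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0001,
}

def best_hypothesis(acoustic, language):
    # P(A) is constant over W, so it is dropped from the maximization.
    return max(acoustic, key=lambda w: acoustic[w] * language[w])

print(best_hypothesis(acoustic, language))  # -> recognize speech
```

Note that the acoustically slightly better candidate loses once the language model prior is factored in.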
Simple Language Model
W = w1 w2 w3 … wn
P(W) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3) … P(wn|w1,w2,…,wn−1)
Computing this probability is very difficult and needs a very large database, so we use trigram and bigram models instead.
Simple Language Model (Cont’d)
Trigram: P(W) ≈ ∏_i P(wi|wi−1, wi−2)
Bigram: P(W) ≈ ∏_i P(wi|wi−1)
Monogram (unigram): P(W) ≈ ∏_i P(wi)
Simple Language Model (Cont’d)
Computing method:
P(w3|w1,w2) = (number of occurrences of w3 after w1 w2) / (total number of occurrences of w1 w2)
Ad hoc method (interpolation of relative frequencies f):
P(w3|w1,w2) = p1 f(w3|w1,w2) + p2 f(w3|w2) + p3 f(w3)
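The counting and ad-hoc interpolation methods above can be sketched as follows. The tiny corpus and the weights p1, p2, p3 are illustrative assumptions:

```python
# Relative-frequency n-gram estimates and ad-hoc interpolation
# P(w3|w1,w2) = p1 f(w3|w1,w2) + p2 f(w3|w2) + p3 f(w3).
# The corpus and the interpolation weights are made up for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def f3(w1, w2, w3):   # relative frequency of w3 after w1 w2
    return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

def f2(w2, w3):       # relative frequency of w3 after w2
    return bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0

def f1(w3):           # relative frequency of w3
    return uni[w3] / N

def p_interp(w1, w2, w3, p1=0.6, p2=0.3, p3=0.1):
    return p1 * f3(w1, w2, w3) + p2 * f2(w2, w3) + p3 * f1(w3)

print(p_interp("the", "cat", "sat"))
```

Because the lower-order frequencies are always mixed in, the estimate stays nonzero even for trigrams never seen in the corpus.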
Error-Producing Factors
- Prosody (recognition should be prosody independent)
- Noise (noise should be prevented)
- Spontaneous speech
P(A|W) Computing Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
Dynamic Time Warping
[Figures: DTW alignment of a test template against a reference template, and the global and local path constraints.]
Search limitations:
- First & end interval
- Global limitation
- Local limitation
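A minimal DTW sketch, using the classic local constraint (steps from the left, lower, and lower-left cells) and fixed endpoints; the sequences and the distance function are illustrative:

```python
# Minimal dynamic time warping: total cost of the best alignment path
# between two feature sequences, with fixed first/end points and the
# local limitation D[i][j] = cost + min(left, down, diagonal).
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # -> 0.0 (sequences align exactly)
```

A global limitation (e.g., a Sakoe-Chiba band restricting |i - j|) would simply skip cells outside the band in the double loop.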
Artificial Neural Network
Simple computational element of a neural network: inputs x0, …, xN−1 with weights w0, …, wN−1 produce
y = φ( Σ_{i=0}^{N−1} wi xi )
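The computational element above can be written directly as code. The slide leaves the activation φ abstract, so a step function is assumed here for illustration:

```python
# Simple neural computational element y = phi(sum_i w_i x_i).
# phi is assumed to be a step activation; the slide leaves it abstract.
def neuron(x, w, phi=lambda s: 1 if s >= 0 else 0):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return phi(s)

# Fixing x0 = 1 lets w0 act as a bias term.
print(neuron([1, 0.5, -0.2], [-0.3, 1.0, 2.0]))  # -> 0
```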
Artificial Neural Network (Cont’d)
Neural Network Types:
- Perceptron
- Time Delay
- Time Delay Neural Network (TDNN) computational element
Artificial Neural Network (Cont’d)
Single Layer Perceptron
[Figure: single layer perceptron with inputs x0, …, xN−1 and outputs y0, …, yM−1.]
Artificial Neural Network (Cont’d)
Three Layer Perceptron
[Figure: three layer perceptron.]
2.5.4.2 Neural Network Topologies
TDNN
[Figure: TDNN architecture.]
2.5.4.6 Neural Network Structures for Speech Recognition
[Figures: neural network structures for speech recognition.]
Hybrid Methods
Hybrid neural network and matched filter for recognition:
[Figure: speech acoustic features pass through delays into a pattern classifier; output units produce the result.]
Neural Network Properties
- The system is simple, but training needs many iterations.
- It does not impose a specific structure.
- Despite its simplicity, the results are good.
- The training set is large, so training should be done offline.
- Accuracy is relatively good.
Pre-processing
- Different preprocessing techniques are employed as the front end of speech recognition systems.
- The choice of preprocessing method is based on the task, the noise level, the modeling tool, etc.
The MFCC Method
- MFCC is based on how the human ear perceives sounds.
- MFCC performs better than other features in noisy environments.
- MFCC was introduced primarily for speech recognition applications, but it also gives good performance in speaker recognition.
- The Mel, the perceptual hearing unit of the human ear, is obtained from the relation:
  mel(f) = 2595 log10(1 + f/700)
Steps of the MFCC Method
Step 1: map the signal from the time domain to the frequency domain using the short-time FFT,
where Z(n) is the speech signal, W(n) is the window function (e.g., Hamming), WF = e^(−j2π/F), m = 0, …, F−1, and F is the length of a speech frame.
Steps of the MFCC Method
Step 2: find the energy of each filter-bank channel.
The number of mel-scale filter banks is M; Wk(j), k = 0, 1, …, M−1, is the transfer function of filter k in the bank.
[Figure: distribution of the filters on the Mel scale.]
MFCCمراحل روش
ل ي طيف و اعمال تبدي: فشرده ساز4 مرحلهDCT MFCCب يجهت حصول به ضرا
47
در رابطه باالL،...،0=nتبه ضراpب ي مرMFCC باشد.يم
The Mel-Cepstrum Method
[Block diagram: time signal → framing → |FFT|² → Mel-scaling → Logarithm → IDCT → low-order coefficients (cepstra) → differentiator → delta & delta-delta cepstra. The outputs are the mel-cepstrum (MFCC) coefficients.]
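The pipeline in the block diagram can be sketched for a single frame as below. This is a simplified illustration, not a production front end: the filter-bank layout, frame size, and sampling rate are assumptions, and the signal is synthetic:

```python
# Single-frame mel-cepstrum sketch: window -> |FFT|^2 -> triangular
# mel filter bank -> log -> DCT-II. Parameters are illustrative.
import numpy as np

def mel(f):       # standard mel-frequency relation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=12, n_ceps=8):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # triangular filters spaced uniformly on the mel scale
    edges = inv_mel(np.linspace(0.0, mel(sr / 2), n_filters + 2))
    energies = np.empty(n_filters)
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[k] = np.sum(spec * np.minimum(up, down))
    logE = np.log(energies + 1e-10)
    # DCT-II of the log filter-bank energies -> cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ logE

t = np.arange(256) / 8000.0
ceps = mfcc_frame(np.sin(2 * np.pi * 440 * t))
print(ceps.shape)  # -> (8,)
```

Delta and delta-delta features would then be obtained by differentiating these coefficients across successive frames.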
Properties of the Mel-Cepstrum (MFCC)
- Maps the mel filter-bank energies, using the DCT, onto the directions in which their variance is maximal.
- The DCT makes the speech features approximately (not completely) independent of one another.
- Good performance in clean environments.
- Reduced performance in noisy environments.
Time-Frequency Analysis
Short-term Fourier Transform: the standard way of frequency analysis, decomposing the incoming signal into its constituent frequency components.
W(n): windowing function; N: frame length; p: step size.
Critical Band Integration
Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.
Bark scale
Feature Orthogonalization
Spectral values in adjacent frequency channels are highly correlated.
The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
Decorrelation is useful to improve the parameter estimation.
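The decorrelation argument can be checked numerically. The sketch below builds synthetic, highly correlated "channels" (an assumption standing in for adjacent spectral channels) and shows that an orthogonal DCT transform pushes the covariance toward the diagonal:

```python
# Numerical check of the decorrelation claim: correlated channels have a
# covariance matrix with large off-diagonal mass; after a DCT the mass
# concentrates on the diagonal. The data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_frames = 8, 5000
# correlated channels: each is a short moving average of shared noise
base = rng.standard_normal((n_ch + 2, n_frames))
X = (base[:-2] + base[1:-1] + base[2:]) / 3.0

k, n = np.meshgrid(np.arange(n_ch), np.arange(n_ch), indexing="ij")
C = np.cos(np.pi * k * (n + 0.5) / n_ch)      # DCT-II basis rows
Y = C @ X

def off_diag_ratio(M):
    # total off-diagonal covariance mass relative to the diagonal
    cov = np.cov(M)
    off = np.abs(cov - np.diag(np.diag(cov))).sum()
    return off / np.abs(np.diag(cov)).sum()

print(off_diag_ratio(X) > off_diag_ratio(Y))  # -> True: DCT decorrelates
```

With nearly diagonal covariance, a Gaussian model only needs the variances, which is exactly the parameter saving the slide points to.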
Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.
P(wj|wk) = 1 if (wk, wj) is a valid word pair, 0 otherwise

Statistical Language Modeling: for W = w1 w2 … wQ,
P(W) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wQ|w1,w2,…,wQ−1)

With an N-gram model:
P(W) = ∏_{i=1}^{Q} P(wi|wi−1, wi−2, …, wi−N+1)

Estimating the trigram probability from counts F(·):
P̂(w3|w1,w2) = F(w1,w2,w3) / F(w1,w2), if F(w1,w2) ≥ p
If F(w1,w2) < p, back off to the bigram estimate P̂(w3|w2) = F(w2,w3) / F(w2); if F(w2) is also below the threshold, use the unigram estimate P̂(w3) = F(w3) / Σ_i F(wi).
Perplexity of the Language Model

Entropy of the source:
H = − lim_{Q→∞} (1/Q) Σ P(w1,w2,…,wQ) log P(w1,w2,…,wQ)
(sum over all sequences w1,…,wQ)

First-order entropy of the source (when the words are independent, so that P(w1,w2,…,wQ) = P(w1) P(w2) … P(wQ)):
H = − Σ_{w∈V} P(w) log P(w)

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out:
H = − lim_{Q→∞} (1/Q) log P(w1,w2,…,wQ)

We often compute H based on a finite but sufficiently large Q:
H = −(1/Q) log P(w1,w2,…,wQ)

H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source.

If the N-gram language model PN(W) is used, an estimate of H is:
Ĥ = −(1/Q) Σ_{i=1}^{Q} log PN(wi|wi−1,…,wi−N+1)

In general:
Ĥ = −(1/Q) log P̂(w1,w2,…,wQ)

Perplexity is defined as:
B = 2^Ĥ = P̂(w1,w2,…,wQ)^(−1/Q)
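The entropy estimate and perplexity are easy to compute once a model assigns probabilities. The unigram model and test sequence below are hypothetical, chosen so the arithmetic is easy to follow:

```python
# Perplexity B = 2^H of an assumed unigram model on a tiny test
# sequence, with H estimated as -(1/Q) * sum_i log2 P(w_i).
import math

model = {"the": 0.5, "cat": 0.25, "sat": 0.25}   # hypothetical unigram model
test_words = ["the", "cat", "the", "sat"]

Q = len(test_words)
H = -sum(math.log2(model[w]) for w in test_words) / Q
B = 2 ** H
print(H, B)  # H = 1.5 bits/word, B = 2^1.5 ≈ 2.83
```

A perplexity of about 2.83 means the model faces, on average, roughly the same difficulty as choosing uniformly among 2.83 words at each step.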
Overall recognition system based on subword units