
[email protected]

Topics

• Recognition results on Aurora noisy speech database

• Proposal of robust formant estimation from MFCCs

• Availability of real in-car speech databases

• Contact from Pi Research


Robust Formant Prediction from MFCCs

• One of the aims of this integrated project is to use the speech recogniser to provide clean speech information for the speech enhancement component

• Proposal is to use the speech recogniser to provide robust formant information from noisy speech

• Review previous work on predicting pitch from MFCC vectors

• Extension to proposed prediction of formants


Pitch Prediction from MFCCs

• In speech recognition, the most commonly extracted feature is the mel-frequency cepstral coefficient (MFCC)

• It is designed for class discrimination and contains spectral envelope information

• Excitation information (pitch) is lost through the smoothing processes

• A project at UEA aimed to reconstruct speech from MFCC vectors, and therefore needed an additional pitch estimate or a prediction of pitch


MFCC Extraction

• Mel Frequency Cepstral Coefficients (MFCC)
  • designed for speech recognisers
  • simulate human perceptual ability
  • currently give best recognition performance
  • extract information about the vocal tract
  • ignore most speaker information, such as pitch

[Figure: MFCC extraction chain] speech → framing, pre-emphasis and windowing → FFT and magnitude spectrum → mel filterbank → log( ) → DCT → truncation → 13-D MFCCs
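The extraction chain on this slide can be sketched in a few lines of NumPy. This is only an illustrative sketch: the sampling rate, frame length, frame shift, FFT size, number of mel filters and pre-emphasis coefficient are all assumptions chosen for the example, not values given in the slides, and the helper names (`mel_filterbank`, `mfcc`) are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        left, centre, right = bins[j], bins[j + 1], bins[j + 2]
        for b in range(left, centre):        # rising edge of the triangle
            fbank[j, b] = (b - left) / max(centre - left, 1)
        for b in range(centre, right):       # falling edge of the triangle
            fbank[j, b] = (right - b) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate=8000, frame_len=200, frame_shift=80,
         n_fft=256, n_filters=23, n_ceps=13, pre_emph=0.97):
    """Return an (n_frames, n_ceps) array; all defaults are assumed values."""
    # Pre-emphasis: first-order high-pass filter.
    emphasised = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing and Hamming windowing.
    n_frames = 1 + (len(emphasised) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = emphasised[idx] * np.hamming(frame_len)
    # FFT and magnitude spectrum.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Mel filterbank followed by log (floored to avoid log(0)).
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energy = np.log(np.maximum(mag @ fbank.T, 1e-10))
    # DCT-II written out as a matrix, then truncation to n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * n[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return (log_energy @ basis.T)[:, :n_ceps]
```

The truncation step is what discards the fine spectral detail: only the first 13 DCT coefficients, which describe the smooth spectral envelope, are kept.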


Pitch Prediction from MFCC vectors

• There is clearly no global correlation between pitch frequency and spectral envelope (or MFCC vector)

• There does exist a class-dependent correlation - the classes being different speech sounds

• If this class-based correlation can be modelled then prediction of pitch from spectral envelope, or MFCC, should be possible

• Investigate two methods for modelling this correlation
  • GMM
  • HMM
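The difference between global and class-dependent correlation can be illustrated with a toy example. The data below is synthetic and purely an assumption for illustration: a one-dimensional `x` stands in for the MFCC vector, and two made-up classes have opposite relationships between `x` and pitch `f`, so the relationship cancels when the classes are pooled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical class A: pitch rises with the feature value.
x_a = rng.normal(0.0, 1.0, n)
f_a = 120.0 + 20.0 * x_a + rng.normal(0.0, 5.0, n)

# Hypothetical class B: pitch falls with the feature value.
x_b = rng.normal(0.0, 1.0, n)
f_b = 120.0 - 20.0 * x_b + rng.normal(0.0, 5.0, n)

def corr(x, f):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, f)[0, 1])

# Pooled over both classes the correlation is close to zero ...
global_corr = corr(np.concatenate([x_a, x_b]), np.concatenate([f_a, f_b]))
# ... while inside each class it is strong (positive in A, negative in B).
class_corrs = (corr(x_a, f_a), corr(x_b, f_b))
```

A model that first identifies the class and then applies a class-specific mapping can exploit this structure, which is the motivation for the GMM and HMM approaches.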


Class-based GMM Pitch Prediction

Training phase
• Introduce augmented feature vector y = [x, f]
• Model joint distribution by clustering to form a GMM - tested from 64 to 128 clusters

Pitch Prediction
• During prediction stage only have MFCC component x
• Pitch is predicted using MAP algorithm from the means and covariances of the clusters

• Does not fully exploit the class-based correlation between the MFCC vector and pitch

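$$\hat{f}_i = \sum_{k=1}^{K} h_k(x_i) \left[ \mu_k^{f} + \Sigma_k^{fx} \left( \Sigma_k^{xx} \right)^{-1} \left( x_i - \mu_k^{x} \right)^{T} \right]$$

[Figure: plot with axes x (MFCC) and f (pitch)]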
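The prediction step on this slide, pitch estimated from the cluster means and covariances given only the MFCC component x, can be sketched as follows. This is a minimal sketch under stated assumptions: the joint GMM over y = [x, f] is taken as already trained (for example by EM, which is not shown), f is stored as the last dimension, each cluster's contribution is weighted by its posterior probability given x, and the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density for each row of x (shape (N, d))."""
    d = mean.shape[0]
    diff = x - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("nd,de,ne->n", diff, inv, diff)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(expo) / norm

def predict_pitch(x, weights, means, covs):
    """Predict pitch for each MFCC vector in x (shape (N, d)).

    weights: (K,), means: (K, d + 1), covs: (K, d + 1, d + 1), describing
    a joint GMM over y = [x, f] with f in the last position (assumed layout).
    """
    n, d = x.shape
    k = len(weights)
    lik = np.empty((n, k))
    cond = np.empty((n, k))
    for j in range(k):
        mu_x, mu_f = means[j, :d], means[j, d]
        s_xx, s_fx = covs[j, :d, :d], covs[j, d, :d]
        # Weighted likelihood of x under the x-marginal of cluster j.
        lik[:, j] = weights[j] * gaussian_pdf(x, mu_x, s_xx)
        # Conditional mean of f given x inside cluster j.
        cond[:, j] = mu_f + (x - mu_x) @ np.linalg.solve(s_xx, s_fx)
    # Posterior cluster weights, then the weighted sum of conditional means.
    h = lik / lik.sum(axis=1, keepdims=True)
    return (h * cond).sum(axis=1)
```

With well-separated clusters the posterior weights are close to one-hot, so the prediction reduces to the conditional mean of the nearest cluster.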


HMM Pitch Prediction

Training phase
• Model joint distribution of pitch and MFCC using a series of HMMs

Pitch Prediction
• Perform standard Viterbi decoding of MFCC stream in the HMM
• Use model and state sequence information to locate mapping for each MFCC vector and then use MAP to predict pitch

• GMM does not model the temporal correlation of pitch
• GMM clusters are trained unsupervised - may be better to use supervised training

[Figure: HMM states 1 and 2, with axes x (MFCC) and f (pitch)]

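$$\hat{f}_i = \sum_{k=1}^{K} h_{k,q_i,m_i}(x_i) \left[ \mu_{k,q_i,m_i}^{f} + \Sigma_{k,q_i,m_i}^{fx} \left( \Sigma_{k,q_i,m_i}^{xx} \right)^{-1} \left( x_i - \mu_{k,q_i,m_i}^{x} \right)^{T} \right]$$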
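The two-stage procedure on this slide, Viterbi decoding of the MFCC stream followed by a state-specific MAP prediction, can be sketched as below. To keep the example short it makes simplifying assumptions that are not from the slides: a single HMM rather than a series of models, one joint Gaussian per state instead of a mixture, and f stored as the last dimension of each state's mean and covariance. All function names are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian for each row of x."""
    d = mean.shape[0]
    diff = x - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=1) + logdet + d * np.log(2.0 * np.pi))

def viterbi(log_b, log_a, log_pi):
    """Most likely state path; log_b is (T, S), log_a is (S, S), log_pi is (S,)."""
    t_len, n_states = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((t_len, n_states), dtype=int)
    for t in range(1, t_len):
        scores = delta[:, None] + log_a          # scores[i, j]: move from i to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(n_states)] + log_b[t]
    path = np.empty(t_len, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(t_len - 1, 0, -1):            # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

def hmm_predict_pitch(x, means, covs, log_a, log_pi):
    """Decode states from the MFCC stream x (N, d), then predict pitch per frame."""
    n, d = x.shape
    # Emission scores use only the MFCC part of each state's joint Gaussian.
    log_b = np.stack([log_gauss(x, m[:d], c[:d, :d])
                      for m, c in zip(means, covs)], axis=1)
    path = viterbi(log_b, log_a, log_pi)
    f_hat = np.empty(n)
    for i, q in enumerate(path):
        mu_x, mu_f = means[q, :d], means[q, d]
        s_xx, s_fx = covs[q, :d, :d], covs[q, d, :d]
        # Conditional mean of f given x in the decoded state.
        f_hat[i] = mu_f + s_fx @ np.linalg.solve(s_xx, x[i] - mu_x)
    return path, f_hat
```

Because the state sequence is decoded over the whole utterance, the mapping chosen for each frame depends on its temporal context, which the frame-by-frame GMM approach does not capture.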


Pitch Prediction Results

• Aurora database - 200 utterances for training (50 speakers), 90 utterances for testing (23 speakers)

• 42,902 frames in total


Reconstructed Speech

original

MFCC+HMM-based pitch

MFCC + reference pitch


Extension to Formant Prediction

• Prediction of formants may also be possible from MFCC vectors using a similar strategy of modelling the joint distribution

y = [x, f1, f2, f3, f4, …]

• Potentially stronger correlation between formants and MFCCs than between pitch and MFCCs

• Use Brunel formant estimator to provide frequency, bandwidth, amplitude of formants
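One way to read the proposed extension, assuming the same MAP form used for pitch carries over unchanged, is that the scalar pitch is replaced by a vector of formant parameters. The notation below is an assumption made for illustration: $\mathbf{f}$ collects the formant values, $h_k(\mathbf{x}_i)$ is the posterior weight of cluster $k$, $\boldsymbol{\mu}_k^{f}$ becomes a vector and $\Sigma_k^{fx}$ a matrix.

```latex
\mathbf{y} = [\mathbf{x},\ \mathbf{f}], \qquad
\mathbf{f} = [f_1, f_2, f_3, f_4, \dots]

\hat{\mathbf{f}}_i = \sum_{k=1}^{K} h_k(\mathbf{x}_i)
  \left[ \boldsymbol{\mu}_k^{f}
       + \Sigma_k^{fx} \left( \Sigma_k^{xx} \right)^{-1}
         \left( \mathbf{x}_i - \boldsymbol{\mu}_k^{x} \right)^{T} \right]
```

If bandwidth and amplitude are also to be predicted, they could be appended to $\mathbf{f}$ in the same way, at the cost of a larger joint covariance to estimate.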


Why Predict Formants?

• Formant estimation from noisy speech is a difficult task and prone to errors

• Predicting them from MFCCs may be more robust

• Before prediction, noise compensation methods (spectral subtraction/Wiener) can be applied to the MFCCs

• Alternatively, model the joint distribution of noisy MFCCs and formants

• In effect, utilise the correlation information available inside the speech models themselves

• Formant predictions provide the clean speech information necessary for the speech enhancement component of the project
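As an illustration of the noise compensation option, the following is a minimal sketch of magnitude spectral subtraction. It assumes the subtraction is applied to the magnitude spectrum inside the front-end, before the mel filterbank, and that a noise magnitude estimate is already available (for example from non-speech frames). The over-subtraction factor `alpha` and the spectral floor are hypothetical parameters chosen for the example.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Subtract a noise magnitude estimate from each frame.

    noisy_mag: (n_frames, n_bins) magnitude spectra of the noisy speech.
    noise_mag: (n_bins,) estimated noise magnitude spectrum.
    Bins that would become negative are clamped to a small fraction
    of the noisy magnitude instead.
    """
    cleaned = noisy_mag - alpha * noise_mag
    return np.maximum(cleaned, floor * noisy_mag)
```

The compensated magnitudes would then pass through the remaining MFCC stages, so the prediction model sees features closer to those of clean speech.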


Noisy Speech Databases

• Two more noisy speech databases available
  • SpeechDat-Car - Danish
  • SpeechDat-Car - Spanish

• Connected digit strings recorded in a moving car under different driving conditions

• Both hands-free and close-talking microphone

• Available through SIG in COST278 - will request availability to other partners


Pi Research

• Pi Research in Cambridge specialise in data communication in Formula 1 racing

• Made an approach regarding possibility of reducing noise on driver-to-pit crew communication

• Example - down to SNRs of -30dB
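To put the -30 dB figure in perspective, the short calculation below uses the standard power definition of SNR, 10·log10(signal power / noise power). The function names are hypothetical.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two power values."""
    return 10.0 * math.log10(signal_power / noise_power)

def noise_to_signal_power_ratio(snr):
    """How many times larger the noise power is than the signal power."""
    return 10.0 ** (-snr / 10.0)

# At -30 dB the noise power is about 1000 times the speech power.
ratio = noise_to_signal_power_ratio(-30.0)
```

In amplitude terms this corresponds to noise roughly 32 times larger than the speech, since amplitude scales with the square root of power.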


End
