HCSNet December 2005
Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
Phil Green, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield
With thanks to Martin Cooke, Guy Brown, Jon Barker.
Overview
• Visual and Auditory Scene Analysis
• ‘Glimpsing’ in Speech Perception
• Missing Data ASR
• Finding the glimpses
• Current Sheffield Work:
  • Dealing with Reverberation
  • Identifying Musical Instruments
  • Multisource Decoding
  • Speech Separation Challenge
Visual Scenes and Auditory Scenes
Visual scenes:
• Objects are opaque
• Each spatial pixel images a single object
• Object recognition has to cope with occlusion

Auditory scenes:
• Sound is additive
• Each time/frequency pixel receives contributions from many sound sources
• Sound source recognition apparently requires reconstruction.
‘Glimpsing’ in auditory scenes: the dominance effect (Cooke)
Although audio signals add linearly, the occlusion metaphor is a good approximation because of the log-like compression in the auditory system.
Consequently, most regions in a mixture are dominated by one source or the other, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
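The dominance effect can be checked numerically: in a log-compressed domain, the mixture energy in each time/frequency cell is close to the maximum of the source energies (the "log-max" approximation), so most cells are effectively owned by one source. A minimal sketch, with random log-spectra standing in for real speech at 0 dB:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic log-power "spectra" (stand-ins for two talkers at 0 dB SNR).
log_a = rng.normal(0.0, 2.0, size=10_000)
log_b = rng.normal(0.0, 2.0, size=10_000)

# Exact log of the summed powers vs. the element-wise max approximation.
log_mix = np.log(np.exp(log_a) + np.exp(log_b))
log_max = np.maximum(log_a, log_b)

err = log_mix - log_max          # always between 0 and log(2)
ambiguous = np.mean(err > 0.5)   # cells where neither source clearly dominates
print(f"mean error: {err.mean():.3f} nats (max possible {np.log(2):.3f})")
print(f"fraction of ambiguous cells: {ambiguous:.1%}")
```

The error is bounded by log 2 and is only large where the two sources have nearly equal energy, which is the small set of ambiguous regions the slide refers to.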
Can listeners handle glimpses?
The robustness problem in Automatic Speech Recognition
• Current ASR devices cannot tolerate additive noise, particularly if it is unpredictable
• Listeners' noise tolerance is one to two orders of magnitude better in equivalent conditions (Lippmann, 1997)
• Can glimpsing be used as the basis for robust ASR?
Requirements:
• Adapt statistical ASR to the incomplete-data case
• Identify the glimpses

[Figure: clean speech, speech + noise, and the corresponding missing data mask (oracle)]
Classification with Missing Data
A common problem: visual occlusion, sensor failure, transmission losses...

We need to evaluate the likelihood f(x|C) that observation vector x was generated by class C. Assume x has been partitioned into reliable and unreliable parts, (x_r, x_u).

Two approaches:
• Imputation: estimate x_u, then proceed as normal
• Marginalisation: integrate over the possible range of x_u

Marginalisation is preferable if there is no need to reconstruct x.
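For a single diagonal-covariance Gaussian, the two approaches are easy to state in code. A minimal sketch (illustrative, not the talk's implementation; `mask` marks the reliable dimensions, and the imputation uses the simplest possible fill, the class mean):

```python
import numpy as np
from scipy.stats import norm

def log_lik_marginal(x, mask, mu, var):
    """Marginalisation: score only the reliable dimensions.

    For a diagonal Gaussian, the marginal over the unreliable
    dimensions is just the reduced-dimensionality Gaussian.
    """
    r = mask.astype(bool)
    return norm.logpdf(x[r], mu[r], np.sqrt(var[r])).sum()

def log_lik_imputed(x, mask, mu, var):
    """Imputation: fill the unreliable dimensions (here with the
    class mean), then score the full vector as normal."""
    x_hat = np.where(mask.astype(bool), x, mu)
    return norm.logpdf(x_hat, mu, np.sqrt(var)).sum()
```

Marginalisation needs no reconstruction step, which is why it is preferred when x itself is not required downstream.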
The Missing Data Likelihood Computation
In ASR with continuous-density HMMs:
• State distributions are Gaussian mixtures with diagonal covariance
• The marginal is just the reduced-dimensionality distribution
• The integral can be approximated using error functions (erf)
• This is computed independently for each mixture component in the state distribution

Cooke et al. 2001
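A sketch of the per-state computation under those assumptions (hypothetical names; the integral over each unreliable dimension runs from 0 up to the observed value, since the observed mixture energy is an upper bound on the speech energy in a masked cell, and `norm.cdf` supplies the erf-based approximation):

```python
import numpy as np
from scipy.stats import norm

def md_state_log_lik(x, mask, weights, means, vars_):
    """Missing-data log-likelihood of one HMM state (diagonal GMM).

    x            : observed feature vector (spectral energies)
    mask         : 1 = reliable, 0 = unreliable
    weights      : (M,) mixture weights
    means, vars_ : (M, D) per-component parameters
    """
    r = mask.astype(bool)
    per_mix = []
    for w, mu, v in zip(weights, means, vars_):
        sd = np.sqrt(v)
        # Reliable dims: ordinary Gaussian log-density.
        ll = norm.logpdf(x[r], mu[r], sd[r]).sum()
        # Unreliable dims: integrate the Gaussian over [0, x],
        # i.e. a difference of CDFs (the erf approximation).
        p = norm.cdf(x[~r], mu[~r], sd[~r]) - norm.cdf(0.0, mu[~r], sd[~r])
        ll += np.log(np.maximum(p, 1e-300)).sum()
        per_mix.append(np.log(w) + ll)
    return np.logaddexp.reduce(per_mix)
```

Each mixture component is handled independently, exactly as the slide describes, and the per-component scores are combined in the log domain.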
Counter-evidence from bounds
[Figure: observed spectrum x overlaid on the mean spectrum for class C, energy vs. frequency, with reliable and unreliable regions marked]
Class C matches the reliable evidence well, but its mean predicts more energy in the unreliable components than the observed bounds allow, so the bounded marginal provides counter-evidence against it.
Finding the glimpses
Auditory scene analysis identifies spectral regions dominated by a single source
• Harmonicity
• Common amplitude modulation
• Sound source location

Local SNR estimates can be used when the noise is predictable (e.g. stationary): cells with high estimated local SNR are treated as reliable.
Cooke 91
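A minimal sketch of an SNR-derived mask, assuming a stationary noise estimate taken from leading noise-only frames (all names and default values are illustrative):

```python
import numpy as np

def snr_mask(spec, noise_frames=10, threshold_db=0.0):
    """Mark time/frequency cells whose estimated local SNR exceeds
    a threshold as reliable ('glimpses').

    spec : (frames, channels) power spectrogram or rate map
    """
    noise = spec[:noise_frames].mean(axis=0)  # stationary noise estimate
    snr_db = 10.0 * np.log10(spec / np.maximum(noise, 1e-12))
    return snr_db > threshold_db              # boolean missing-data mask
```

Cells above the local SNR threshold are the glimpses handed to the missing data recogniser.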
Harmonicity Masks
• Only meaningful in voiced segments
• Can be combined with SNR masks
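One simple way to realise such a mask (a sketch under assumed conventions, not the exact recipe from the talk) is to keep channels whose centre frequency lies near a harmonic of the frame's F0; the result can then be intersected with an SNR mask:

```python
import numpy as np

def harmonicity_mask(cfs, f0_track, tol=0.05):
    """Keep time/frequency cells whose channel centre frequency lies
    close to a harmonic of the frame's F0 (voiced frames only).

    cfs      : (channels,) filterbank centre frequencies in Hz
    f0_track : (frames,) F0 per frame in Hz, 0 for unvoiced frames
    tol      : relative distance to the nearest harmonic
    """
    mask = np.zeros((len(f0_track), len(cfs)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue                      # unvoiced: leave the frame masked
        harmonic_dist = np.abs(cfs / f0 - np.round(cfs / f0))
        mask[t] = harmonic_dist < tol
    return mask
```

In voiceless frames the mask stays empty, which is why the slide notes that harmonicity masks are only meaningful in voiced segments and are combined with SNR masks elsewhere.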
Aurora Results (Sept 2001)
Average gain over clean baseline under all conditions: 65%
Barker et al. 2001
Missing data masks from spatial location
• Cues for spatial location are used to separate a target source from masking sources
• Interaural Time Difference (ITD) from cross-correlation between left and right binaural signals
• Interaural Level Difference (ILD) from the ratio of energy in the left and right ears
• Soft masks
• Task:
  • Target source: male speaker straight ahead
  • One or two masking sources (also male speakers) at other positions
  • Added reverberation
Sue Harding, Guy Brown
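A sketch of how the two cues might be computed for one frequency channel (illustrative only; a real system works on gammatone filterbank outputs and pools cross-correlograms across frames):

```python
import numpy as np

def itd_ild(left, right, sr=16000, max_itd_s=0.001):
    """Estimate ITD (via cross-correlation) and ILD (via energy ratio)
    for one frequency channel of a binaural signal pair."""
    max_lag = int(max_itd_s * sr)
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left[max(0, -l):len(left) - max(0, l)],
                    right[max(0, l):len(right) - max(0, -l)]) for l in lags]
    itd = lags[int(np.argmax(xcorr))] / sr                      # seconds
    ild = 10.0 * np.log10((left**2).sum() / max((right**2).sum(), 1e-12))
    return itd, ild
```

A soft mask then weights each time/frequency cell by how consistent its (ITD, ILD) pair is with the target's known azimuth, rather than making a hard keep/discard decision.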
[Figure: missing data localisation masks computed with ITD only, ILD only, and combined ILD/ITD; axes: time (frames) vs. frequency channel]
Missing data masks from spatial location (2)

Masks compared: oracle, ITD only, ILD only, and combined ITD and ILD. Best performance is with combined ITD and ILD:
[Figure: recognition accuracy (%) as a function of masker azimuth, 5 to 40 degrees]
MD for reverberant conditions (1)

Palomäki, Brown and Barker have applied MD to the problem of room reverberation:
• Use spectral normalization to deal with the distortion caused by early reflections
• Treat late reverberation as additive noise, and apply standard MD techniques
• Select features which are uncontaminated by reverberation and contain strong speech energy

The approach is based on modulation filtering:
• Each rate map channel is passed through a modulation filter
• Identify periods with enough energy in the filtered output
• Use these to define a mask on the original rate map
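A sketch of the modulation-filtering step under stated assumptions (the passband around syllabic modulation rates and the threshold are illustrative choices, not Palomäki et al.'s published values):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reverb_mask(ratemap, frame_rate=100.0, band=(1.0, 8.0), thresh=0.0):
    """Mask cells where the modulation-filtered rate map is strong.

    ratemap : (frames, channels) auditory rate map
    band    : modulation passband in Hz; speech energy modulates at
              roughly syllabic rates, while late reverberation does not
    """
    b, a = butter(2, [f / (frame_rate / 2) for f in band], btype="band")
    filtered = filtfilt(b, a, ratemap, axis=0)   # per-channel filtering
    return filtered > thresh                     # boolean MD mask
```

Periods with strong filtered output are taken as dominated by direct-path speech and kept; the rest is treated as missing, as the bullets above describe.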
MD for reverberant conditions (2)
• Recognition of connected digits (Aurora 2)
• Reverberated using recorded room impulse responses
• Performance comparable with Brian Kingsbury's hybrid HMM-MLP recognizer

K. J. Palomäki, G. J. Brown and J. Barker (2004) Speech Communication 43 (1-2), pp. 123-142
[Figure: recognition accuracy (%) vs. T60/source-receiver distance, comparing the HMM-MLP baseline, MD with a priori mask, and MD with reverberation mask]
MD for music analysis (1)
• Eggink and Brown have used MD techniques to identify concurrent musical instrument sounds
• Part of a system for transcribing chamber music
• Identify the F0 of the target note, and keep only its harmonics in the MD mask
• Uses a GMM classifier for each instrument, trained on isolated tones and short phrases
• Tested on tones, phrases and commercial CDs
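A sketch of the note-identification step (hypothetical structure; it reuses the harmonic-channel selection idea and the marginal GMM scoring shown earlier):

```python
import numpy as np
from scipy.stats import norm

def identify_instrument(frame, cfs, f0, gmms, tol=0.05):
    """Score each instrument's diagonal GMM on only the harmonic
    (reliable) channels of the target note, marginalising the rest.

    frame : (channels,) spectral feature vector for one frame
    cfs   : (channels,) filterbank centre frequencies in Hz
    gmms  : {name: (weights, means, vars)} per-instrument models
    """
    harm = np.abs(cfs / f0 - np.round(cfs / f0)) < tol   # harmonic channels
    scores = {}
    for name, (w, mu, v) in gmms.items():
        per_mix = [np.log(wk) +
                   norm.logpdf(frame[harm], mk[harm], np.sqrt(vk[harm])).sum()
                   for wk, mk, vk in zip(w, mu, v)]
        scores[name] = np.logaddexp.reduce(per_mix)
    return max(scores, key=scores.get)                   # best instrument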
MD for music analysis (2)
• Example: duet for flute and clarinet
• All instrument tones correctly identified in this example
J. Eggink and G. J. Brown (2003) Proc. ICASSP, Hong Kong, IV, pp. 553-556
J. Eggink and G. J. Brown (2004) Proc. ICASSP, Montreal, V, pp. 217-220
[Figure: fundamental frequency (Hz) vs. time (frames), showing the note tracks identified for flute and clarinet]
Multisource Decoding

Use primitive ASA and local SNR to identify time-frequency regions (fragments) dominated by a single source, i.e. possible segregations S, but NOT to decide up front which segregation is best. Instead, jointly optimise over the word sequence W and the segregation S.

Based on missing data techniques: regions hypothesised as non-speech are treated as missing. The decoding algorithm finds the best subset of fragments to match the speech source.

Barker, Cooke & Ellis 2003
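In symbols, and hedged as a reconstruction from the slide's wording rather than the paper's exact notation, the decoder searches for

```latex
\[
(\hat{W}, \hat{S}) \;=\; \operatorname*{argmax}_{W,\,S}\; P(W, S \mid Y)
\]
```

where Y is the observed spectro-temporal data, S ranges over segregations assembled from the fragments, and W over word sequences, so the word hypothesis and the segregation hypothesis constrain each other during the search.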
Multisource decoding algorithm
Work forward in time, maintaining a set of alternative decodings: parallel Viterbi searches, each based on a choice of speech fragments.
When a new fragment starts, split each decoding in two: is the fragment speech or non-speech?
When the fragment ends, merge decodings which differ only in its interpretation, keeping the more likely.
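A toy sketch of the split/merge bookkeeping (hypothetical `score_step` scorer; a real decoder runs a frame-level Viterbi word search inside each hypothesis):

```python
def multisource_decode(events, score_step):
    """Toy split/merge search over fragment interpretations.

    events     : time-ordered list of ('start', frag) / ('end', frag)
    score_step : hypothetical scorer returning the log-likelihood gain
                 of one step under a labelling {frag: is_speech}
    """
    hyps = [(0.0, {})]                       # (score, labelling of fragments)
    for kind, frag in events:
        if kind == "start":                  # split: speech or non-speech?
            hyps = [(score + score_step(frag, new_lab), new_lab)
                    for score, lab in hyps
                    for new_lab in ({**lab, frag: False}, {**lab, frag: True})]
        else:                                # end: merge decodings that now
            best = {}                        # differ only in this fragment
            for score, lab in hyps:
                rest = {f: s for f, s in lab.items() if f != frag}
                key = frozenset(rest.items())
                if key not in best or score > best[key][0]:
                    best[key] = (score, rest)
            hyps = list(best.values())
    return max(hyps, key=lambda h: h[0])     # best surviving decoding
```

Merging keeps the hypothesis set from growing exponentially: the number of live decodings depends only on the number of simultaneously active fragments.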
Multisource Decoding on Aurora 2
Multisource decoding with a competing speaker
Andre Coy and Jon Barker
• Utterances of male and female speakers mixed at 0 dB
• Voiced regions: soft harmonicity masks from autocorrelation peaks
• Voiceless regions: fragments from 'image processing'
• Gender-dependent HMMs
• Separate decoding for male and female talkers
• 73.7% accuracy on a connected digit task
Informing Multisource Decoding: Work in Progress
Ning Ma, Andre Coy, Phil Green
• HMM duration constraints
• Links between fragments: pitch continuity
• 'Speechiness'
Speech separation challenge
Organisers: Martin Cooke (University of Sheffield, UK), Te-Won Lee (UCSD, USA)
• See http://www.dcs.shef.ac.uk/~martin
• Global comparison of techniques for separating and recognising speech
• Special session of Interspeech 2006, Pittsburgh (USA), 17-21 September 2006
• Task: recognise speech from a target talker in the presence of either stationary noise or other speech
• Training and test data supplied
• One signal per mixture (i.e. the task is "single microphone")
• Speech material: simple sentences from the 'Grid Task', e.g. "place white at L 3 now"