
Mr Background Noise and Miss Speech Perception in:

by Elvira Perez and Georg Meyer

Structure

1. Introduction

2. Background

3. Experiments

4. Conclusions

5. Future research

1. Introduction:

Ears receive mixtures of sounds.

We can tolerate surprisingly high levels of noise and still orient our attention to whatever we want to attend to.

But how can the auditory system do this so accurately?

• Auditory scene analysis (Bregman, 1990) is a theoretical framework that aims to explain auditory perceptual organisation.

• Basics:
– The environment contains multiple objects.
– Decomposition of the mixture into its constituent elements.
– Grouping.

• It proposes two grouping mechanisms:
– 1. 'Bottom-up': primitive cues (F0, intensity, location); grouping based on Gestalt principles.
– 2. 'Top-down': schema-based (speech pattern matching).

1. Introduction:

• Primitive processes (Gestalt; Koffka, 1935):

– Similarity: sounds will be grouped into a single perceptual stream if they are similar in pitch, timbre, loudness or location.

– Good continuation: smooth changes in frequency, intensity, location or spectrum are perceived as changes in a single source, whereas abrupt changes indicate a change of source.

– Common fate: two components of a sound sharing the same kinds of changes at the same time (e.g. onset-offset) will be perceived and grouped as part of the same single source.

– Disjoint locations: a given element in a sound can only form part of one stream at a time (cf. duplex perception).

– Closure: when parts of a sound are masked or occluded, that sound is perceived as continuous if there is no direct sensory evidence to indicate that it has been interrupted.

1. Introduction:

• Criticisms: too simplistic. Whatever cannot be explained by the primitive processes is attributed to the schema-based processes.

• Primitive processes only work in the lab.

• Sine-wave replicas of utterances (Remez et al., 1992):
– Phonetic principles of organisation find a single speech stream, whereas auditory principles find several simultaneous whistles.
– Grouping is by phonetic coherence rather than by simple auditory coherence.

• Previous studies (Meyer and Barry, 1999; Harding and Meyer, 2003) have shown that perceptual streams may also be formed from complex features such as formants in speech, and not just from low-level sounds like tones.

2. Background:

The perception of a synthetic /m/ changes to /n/ when it is preceded by a vowel with a high second formant and there is no transition between the vowel and the consonant.

[Figure: schematic spectrograms of vowel-nasal syllables. The syllable is heard as /m/ when the grouped F2 matches that of an /m/, and as /n/ when the grouped F2 matches that of an /n/.]

3. Experiments (baseline):

• The purpose of these studies is to explore how noise (a chirp) affects speech perception.

• The stimulus used is a vowel-nasal syllable which is perceived as /en/ if presented in isolation but as /em/ if it is presented with a frequency modulated sine wave in the position where the second formant transition would be expected.

• In the three experiments participants categorised the synthetic syllable they heard as /em/ or /en/.

• The direction, duration, and position of the chirp were the manipulated variables.

The perception of the nasal changes from /n/ to /m/ when a chirp is added between the vowel and the nasal F2.

[Figure: stimulus spectrogram. Formant frequency (Hz) axis marked at 375, 800, 2000 and 2700; vowel followed by nasal; time axis marked at 100 and 200 ms.]

3. Experiments

• Chirp down means that the frequency of its sine wave changes from 2 kHz to 1 kHz.

• Chirp up means that its frequency changes from 1 kHz to 2 kHz.
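As a rough illustration, the sketch below synthesises the two chirp types with scipy. The 30 ms duration and 44.1 kHz sampling rate are assumptions for the example, not necessarily the experimental values.

```python
import numpy as np
from scipy.signal import chirp

fs = 44100   # sampling rate in Hz (assumed for illustration)
dur = 0.030  # 30 ms chirp; the experiments used durations up to 30 ms
t = np.linspace(0, dur, int(fs * dur), endpoint=False)

# 'Chirp down': sine-wave frequency glides linearly from 2 kHz to 1 kHz.
chirp_down = chirp(t, f0=2000, t1=dur, f1=1000, method='linear')

# 'Chirp up': sine-wave frequency glides linearly from 1 kHz to 2 kHz.
chirp_up = chirp(t, f0=1000, t1=dur, f1=2000, method='linear')
```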

3. Experiments

• Subjects: 7 male and 6 female.

• Their task was always the same: after each signal presentation, subjects were asked to judge whether the syllable sounded more like an /em/ or an /en/ by pressing the appropriate button on a computer graphics tablet.

• There was no transition between the vowel and the nasal. A chirp between the vowel and the nasal F2 was added to the signal.

3. Experiments

Experiment 1 Baseline/Direction

[Figure: stimulus schematic (vowel 100 ms, nasal 200 ms) with up and down chirps, and p(/m/) from 0.0 to 1.0 for the ctrl, chirp up and chirp down conditions.]

In 80% of the trials the participants heard the difference between the up and the down chirp.

Experiment 2 Duration

[Figure: stimulus schematic (vowel 100 ms, nasal 200 ms) and p(/m/) as a function of chirp duration (0-30 ms) for down and up chirps.]

Experiment 3 Position

[Figure: p(/m/) as a function of chirp position relative to the midpoint (-100 to +100 ms, plus control).]

4. Conclusions:

Chirps of 4 ms and 10 ms duration, independent of their direction, are apparently integrated into the speech signal and change the percept from /en/ to /em/.

For longer chirps the direction matters, but for chirp durations up to 30 ms the majority of stimuli are still heard as /em/.

Subjects very clearly hear two objects, so some scene analysis is taking place: the chirp is not completely integrated into the speech.

Duplex perception with one ear.

It seems that listeners can also discriminate the direction of motion of the chirp when they focus their attention on it, so that a higher level of auditory processing takes place (80% accuracy).

The data suggest that a low-level spectro-temporal analysis of the transitions is carried out because:

(1) the spectral structure of the chirp is very different from the harmonically structured speech sound, and

(2) the direction of the chirp's motion has little significance in the percept.

These results complement the auditory scene analysis theoretical framework.

Mr. Background Noise

• Do human listeners actively generate representations of background noise to improve speech recognition?

• Hypothesis: recognition performance should be highest if the spectral and temporal structure of the interfering noise is regular, so that a good noise model can be generated, as opposed to random noise.

• Noise prediction.

Experiments 4 & 5

• Same stimuli as before (down chirp).

• The amplitude of the chirp varies (5 conditions).

• Background noise (down chirps):
– Quantity: lots vs. few
– Quality: random vs. regular

• Two-alternative forced-choice (2AFC) categorisation task.

• Threshold shifts.

[Figure: sequences of /en/ tokens illustrating the regular and the irregular background-noise conditions.]

[Figure: probability of hearing /m/ as a function of chirp amplitude, with threshold marked.]

Each point in the scatter is the mean threshold over all subjects for a given session. The solid lines show the Boltzmann fit (Eq. (1)) for each individual subject in the five different conditions. All the fits have the same upper and lower asymptotes.

y = A_2 + (A_1 - A_2) / (1 + e^{(x - x_0)/dx})    (Eq. 1)
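A minimal sketch of fitting Eq. (1) with scipy, using made-up psychometric data; the amplitudes and p(/m/) values below are hypothetical, not the experimental results.

```python
import numpy as np
from scipy.optimize import curve_fit

def boltzmann(x, a1, a2, x0, dx):
    """Boltzmann sigmoid of Eq. (1): asymptotes a1 and a2,
    midpoint (threshold) x0, slope parameter dx."""
    return a2 + (a1 - a2) / (1 + np.exp((x - x0) / dx))

# Hypothetical data: probability of hearing /m/ at five chirp amplitudes.
amplitude = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
p_m = np.array([0.05, 0.20, 0.55, 0.85, 0.95])

# p0 gives rough starting values for the four parameters.
params, _ = curve_fit(boltzmann, amplitude, p_m, p0=[0.0, 1.0, 0.5, 0.1])
a1, a2, x0, dx = params
print(f"fitted threshold x0 = {x0:.3f}, slope dx = {dx:.3f}")
```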

[Figure: Experiment 4 /m/ thresholds by condition (control, fewrand, fewreg, lotsrand, lotsreg).]

lots vs. few (t = -3.34, df = 38, p = 0.001). control vs. lots (t = -3.34, df = 38, p = 0.001).

No effect between random and regular.

Exp. 4 rand/reg
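For illustration, a sketch of the kind of independent-samples comparison reported above; the threshold values are simulated, and a group size of 20 per condition is assumed to match df = 38.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated per-session /m/ thresholds (20 sessions per condition).
thresh_few = rng.normal(0.30, 0.10, 20)   # 'few' background chirps
thresh_lots = rng.normal(0.45, 0.10, 20)  # 'lots' of background chirps

t, p = stats.ttest_ind(thresh_few, thresh_lots)
df = len(thresh_few) + len(thresh_lots) - 2
print(f"few vs. lots: t({df}) = {t:.2f}, p = {p:.4f}")
```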

• Two aspects change from Exp. 4 to Exp. 5:
– The amplitude scale of the chirps.
– The 'lots' condition now includes 100/20'' where before it was 170/20''.

[Figure: Experiment 5 /m/ thresholds by condition (control, fewrand, fewreg, lotsrand, lotsreg).]

lots vs few (t = 2.27, df = 38, p = 0.05). control vs. lots (t = 3.12, df = 38, p < 0.05).

No effect between random and regular.

Exp. 5 rand/reg

Conclusions

Only the amount of background noise seems to affect the performance of the recognition task.

The sequential grouping of the background noise seems to be an irrelevant cue for improving auditory stream segregation and, therefore, speech perception.

Counterintuitive phenomenon.

Attention must be focused on an object (the background noise) for a change in that object to be detected (Rensink et al., 1997).

Change deafness: may occur as a function of (not) attending to the relevant stimulus dimension (Vitevitch, 2003).

SPEECH:

• Lexical (meaning)

• Indexical (age, gender, emotions of the talker)

• Acoustical (loudness, pitch, timbre)

• The irrelevant sound effect (ISE; Colle & Welsh, 1976) disrupts serial recall.

• The level of meaning (reversed vs. forward speech), the predictability of the sequence (random vs. regular), and the semantic or physical similarity of the irrelevant sound to the target material seem to have little impact on the focal task (Jones et al., 1990).

• Changing state: the degree of variability or physical change within an auditory stream is the primary determinant of the degree of disruption in the focal task.

[Figure: frequency-time schematics contrasting a changing single stream (one stream) with two unchanging streams.]

Experiment 6

• Shannon et al. (1999) database of CVC syllables.

• Constant white noise + high frequency burst + low frequency burst.

• 3 Conditions: Control, background noise 1 stream, background noise 2 streams.

• Does auditory stream segregation require attention?

Results

Is there a confounding effect?

Tukey test p < 0.001

[Figure: recognition thresholds (0-110) by condition (control, 1 stream, 2 streams), annotated 'more quantity of noise'.]
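A sketch of a Tukey HSD comparison across the three conditions, using statsmodels and simulated thresholds; the group means and sizes below are invented for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Simulated recognition thresholds, 20 listeners per condition.
scores = np.concatenate([
    rng.normal(40, 8, 20),  # control
    rng.normal(70, 8, 20),  # background noise, 1 changing stream
    rng.normal(55, 8, 20),  # background noise, 2 steady streams
])
groups = ['control'] * 20 + ['1stream'] * 20 + ['2streams'] * 20

# Pairwise Tukey HSD over all three condition pairs.
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```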

Procedure

White noise: 68 dB
High-frequency burst: 71 dB
Low-frequency burst: 67 dB
Speech signal: 59-64 dB (mean 61.5 dB)
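Relative levels in dB correspond to amplitude ratios of 10^(ΔdB/20). A minimal sketch of scaling signals to the listed levels, taking the 68 dB white noise as the unit-RMS reference; the signals themselves are placeholders.

```python
import numpy as np

def scale_to_level(signal, level_db, ref_db=68.0):
    """Scale `signal` so its RMS sits (level_db - ref_db) dB relative
    to a unit-RMS reference (the 68 dB white-noise masker here)."""
    target_rms = 10 ** ((level_db - ref_db) / 20)
    return signal * target_rms / np.sqrt(np.mean(signal ** 2))

rng = np.random.default_rng(2)
fs = 44100
noise = scale_to_level(rng.standard_normal(fs), 68.0)             # white noise, 68 dB
burst_high = scale_to_level(rng.standard_normal(fs // 10), 71.0)  # high burst, 71 dB
burst_low = scale_to_level(rng.standard_normal(fs // 10), 67.0)   # low burst, 67 dB
speech = scale_to_level(rng.standard_normal(fs), 61.5)            # speech, mean 61.5 dB
```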

Conclusions

• A changing-state stream is more disruptive than two steady-state streams.

• Working memory or STM and the phonological loop (Baddeley, 1999).

• Auditory streaming can occur in the absence of focal attention.

5. Future research:

• Streaming and focal attention

• Streaming and location (diotic vs dichotic presentation)

Thank you

References:

Bregman, A.S. (1990). Auditory Scene Analysis: The Perceptual Organisation of Sound. MIT Press, Cambridge, MA.

Harding, S. and Meyer, G. (2003). Changes in the perception of synthetic nasal consonants as a result of vowel formant manipulations. Speech Communication, 39, 173-189.

Meyer, G. and Barry, W. (1999). Continuity-based grouping affects the perception of vowel-nasal syllables. In: Proc. 14th International Congress of Phonetic Sciences, University of California.

Table 1: Klatt synthesizer parameters. Fx: formant frequency (Hz), B: bandwidth (Hz), A: amplitude (dB). The nasals have nasal formants at 250 Hz and nasal zeros at 750 Hz (/m/) and 1000 Hz (/n/).

          F1    B    A     F2    B    A     F3    B    A
Vowel     375   75   60    2000  75   60    2700  75   50
/m/       250   60   60    1000  60   60    2700  300  60
/n/       250   60   60    2000  60   60    2700  300  60

Spectrogram of the stimulus used in the three experiments, with vowel formants at 375, 2000, and 2700 Hz corresponding to an /e/, and with nasal prototype formants at 250, 2000 and 2700 Hz corresponding to an /n/. There were no transitions between the vowel and the nasal.
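For reference, Table 1 re-encoded as a small data structure; the field layout and names are my own for illustration, not Klatt synthesizer parameter names.

```python
# (formant frequency Hz, bandwidth Hz, amplitude dB) for F1-F3,
# as listed in Table 1 for each stimulus segment.
KLATT_PARAMS = {
    "vowel": {"F1": (375, 75, 60), "F2": (2000, 75, 60), "F3": (2700, 75, 50)},
    "/m/":   {"F1": (250, 60, 60), "F2": (1000, 60, 60), "F3": (2700, 300, 60)},
    "/n/":   {"F1": (250, 60, 60), "F2": (2000, 60, 60), "F3": (2700, 300, 60)},
}
```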