Speech Enhancement
Speech Processing

Tom Bäckström

Aalto University

October 2015

Introduction

- Many modern speech processing applications are mobile.
  - We use mobile phones at home, at the office, on the bus, in cars, on the street, in nightclubs, on the toilet, while jogging, etc.
- Such environments often feature background noises:
  - Computer hum, television noise, passing cars, in-car noise, music in the “background”, wind sounds, clothing scratching the microphone, etc.
  - These noises are independent of (uncorrelated with) the speech sound.
- In addition, we often find distortions, where the speech sound is modified by the environment:
  - Room acoustics often cause reverberation, which changes the speech signal itself. The modification is thus dependent on (correlated with) the desired speech signal.
  - The microphone and loudspeaker are often located near each other, whereby there will be feedback known as echo.

Introduction

- Our objective is to improve the communication experience in adverse (noisy and distorted) environments.
  - We want to attenuate or remove background noises.
  - We want to reduce reverberation (the effect of room acoustics).
- Usually we do not want to change the “original” noise-free signal in any way.
- However, in some (rare) applications we do want to improve the original (non-distorted) speech signal.
  - Attenuating the effect of pathologies in the speech production system. For example, cancer operations in the throat area can cause damage to the speech production organs.
  - Improving intelligibility of speech for the hearing-impaired (hearing aids).
- We must always specify clearly what our goal actually is. The goal depends on the application.

Introduction – Some approaches

- People-based approaches (low-tech):
  - TALK LOUDER! In-to-na-te clear-ly. Go to a silent room. Wait until the bus has passed. (Intensity, quality, spatial and temporal modifications)
- Hardware solutions:
  - Physical protection from distortions (wind-screens, noise protectors etc.), better/more hardware such as several microphones/loudspeakers, more powerful loudspeakers etc.
- Software solutions (the interesting stuff):
  - Noise attenuation algorithms, dereverberation algorithms, source separation, intelligibility enhancement, echo cancellation etc.
- Some of the most important methods:
  - Spectral subtraction, Wiener filtering, Kalman filters, beamforming

Introduction

In the following we will discuss typical enhancement methods:
- Noise attenuation with spectral subtraction
- Multi-microphone methods for noise estimation and beamforming
- Dereverberation

Noise attenuation – Assumption of independence (uncorrelated sources)

- Speech and background noises can usually be treated as independent sources.
- We assume that there is no acoustic coupling of the sources.
  - Acoustic coupling would imply that two sources influence each other. For example, hitting a drum near a piano can make the piano strings vibrate, whereby the two sources influence each other.
  - Typical background noises (computer hum, street noise, television noises etc.) do not significantly influence a speech sound, whereby assuming independence is relatively safe.
- Room acoustics (reverberation) is coupled with the speech signal and will be treated separately.

Noise attenuation – Additive noise model

- Independent (uncorrelated) noise sources $V(z)$ can be modelled as additive noise onto a speech signal $S(z)$ such that the observation is
  $$X(z) = S(z) + V(z).$$
- Our task is to estimate the speech signal $S(z)$ when $X(z)$ is known.
- The sources are, according to our assumption, uncorrelated, which means that the expected correlation is zero:
  $$E[S(z)V(z)] = 0.$$
- To be able to estimate $S(z)$ we then quite clearly need to know something more about $V(z)$.

Noise attenuation – Noise estimation and modelling

- To be able to remove noise, we first need to estimate noise characteristics or statistics.
- We need to find sections of the signal which contain noise only (non-speech).
- One approach would be to use voice activity detection (VAD) to find non-speech segments.
  - Assuming we have a good VAD this can be effective.
  - Works if we assume that noise is stationary, such that the noise in non-speech parts is similar to the noise in speech parts.
  - VADs are accurate only at low noise levels.
- Alternatively, we can look at characteristics of speech to make educated guesses:
  - Low-energy parts of a signal are most likely non-speech (energy of noise is smaller than energy of noise+speech).
  - Stationary parts of a signal are most likely non-speech (speech is non-stationary).

Noise attenuation – Noise estimation and modelling using a VAD

[Figure: block diagram with the blocks VAD, Estimate noise and Attenuate noise between the input signal and the output signal.]


- Most enhancement methods operate in the spectral domain (of the STFT type).
- We need to specify a model for noise in the spectral domain.
  - The spectrum is complex-valued (phase and magnitude).
  - We can model magnitude as average energy,
    $$|\hat V(z)|^2 := \frac{1}{N} \sum_k |V_k(z)|^2,$$
    where $k$ goes over the $N$ non-speech frames identified by the VAD.
  - Alternatively, with a minimum-statistics approach, we can define
    $$|\hat V(z)|^2 := \min_k |V_k(z)|^2$$
    over recent frames, whereby we do not need a VAD.
  - It is difficult to say much about the phase, whereby we usually assume it is random (no modelling).
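The two noise models above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the `window` parameter and the toy data are hypothetical, the input is assumed to be an array of per-frame energies $|X_k(z)|^2$ with one row per frame, and the plain minimum shown here tends to underestimate the average noise energy.

```python
import numpy as np

def noise_energy_vad(power_frames, is_speech):
    """Mean energy per frequency bin over the frames a VAD marked as non-speech."""
    power_frames = np.asarray(power_frames, dtype=float)
    noise_frames = power_frames[~np.asarray(is_speech, dtype=bool)]
    if noise_frames.shape[0] == 0:
        raise ValueError("no non-speech frames available")
    return noise_frames.mean(axis=0)

def noise_energy_min_stats(power_frames, window=50):
    """Minimum energy per bin over the most recent `window` frames (no VAD needed)."""
    power_frames = np.asarray(power_frames, dtype=float)
    return power_frames[-window:].min(axis=0)

# Toy data (hypothetical): 4 frames x 3 bins, frames 1 and 2 contain speech.
power = np.array([[1.0, 2.0, 1.0],
                  [9.0, 8.0, 7.0],
                  [8.0, 9.0, 6.0],
                  [3.0, 2.0, 1.0]])
vad = np.array([False, True, True, False])

noise_mean = noise_energy_vad(power, vad)             # [2.0, 2.0, 1.0]
noise_min = noise_energy_min_stats(power, window=4)   # [1.0, 2.0, 1.0]
```

The VAD-based average uses only the frames flagged as non-speech, whereas the minimum runs over all recent frames and relies on speech pauses occurring within the window.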


[Figure: Input speech signal, noise and noisy speech (SNR 0 dB). Top: amplitude over time (s) for speech, noise and noisy speech, with the VAD decision (speech / non-speech). Bottom: spectra, magnitude (dB) over frequency (Hz), of clean speech, noisy speech, true noise and estimated noise.]


- Noise estimation is an art form!
  - Estimation of stationary noises is still somewhat reasonable, but time-varying noise is very difficult.
- Most estimation methods are heuristic:
  - It is very hard to design a method which works for all noise types.
  - It is hard to even know what all types of noises there are.
  - It is “easier” to model speech and say that everything which is not speech is noise, but since speech is a very rich and varied signal, speech modelling is also not an easy task.
- Most noise attenuation algorithms just assume that we already have a noise estimate and leave the implementation of noise estimation for the engineers...

Noise attenuation – Spectral subtraction

Our signal model is $X(z) = S(z) + V(z)$.
- Given a noise energy estimate $|\hat V(z)|^2$ and an observation $X(z)$, our task is to estimate the speech signal $S(z)$.
- Note that we do not have the phase of the noise, that is, we have $|\hat V(z)|^2$ but not $V(z)$, and therefore we cannot just calculate $S(z) = X(z) - V(z)$.
- Heuristic approach:
  - We can subtract energies, $|\hat S(z)|^2 = |X(z)|^2 - |\hat V(z)|^2$, whereby $|\hat S(z)| = \sqrt{|X(z)|^2 - |\hat V(z)|^2}$.
  - We don't know the phase, so let's just keep the noisy phase, $\angle \hat S(z) = \angle X(z)$.
- The estimated speech signal is then
  $$\hat S(z) = e^{i\angle X(z)} \sqrt{|X(z)|^2 - |\hat V(z)|^2} = X(z) \sqrt{\frac{|X(z)|^2 - |\hat V(z)|^2}{|X(z)|^2}}.$$
- This method is known as spectral subtraction.


[Figure: magnitude spectra (dB) over frequency (Hz). Top: noisy input signal (S, S+V, V). Bottom: spectral subtraction (S+V, S', S).]


- Observe that sometimes the speech energy estimate $|\hat S(z)|^2 = |X(z)|^2 - |\hat V(z)|^2$ can be negative.
  - $V(z)$ is a random variable and $|\hat V(z)|^2$ is its variance. ⇒ 50% of the time $|V(z)|^2 < |\hat V(z)|^2$.
- We can therefore limit $|\hat S(z)|^2$ to non-negative values:
  $$|\hat S(z)|^2 = \begin{cases} |X(z)|^2 - |\hat V(z)|^2 & \text{when } |X(z)|^2 > |\hat V(z)|^2 \\ 0 & \text{otherwise.} \end{cases}$$
- That is, whenever we have overestimated $|\hat V(z)|^2$, we subtract a bit less, and if we have underestimated $|\hat V(z)|^2$, we subtract more.
- This is a biased estimate which more often removes too much.
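As a minimal sketch of the rule above, assuming the noisy spectrum of one frame and a per-bin noise energy estimate are already available, spectral subtraction with clamping to non-negative energies could look as follows. The function name and the small constant guarding against division by zero are illustrative choices, not part of the method as stated.

```python
import numpy as np

def spectral_subtraction(X, noise_energy):
    """Estimate the speech spectrum from a noisy spectrum X and a noise energy estimate.

    Energies are subtracted, negative results are clamped to zero,
    and the noisy phase is kept by applying a real-valued gain to X.
    """
    X = np.asarray(X, dtype=complex)
    energy_x = np.abs(X) ** 2
    energy_s = np.maximum(energy_x - noise_energy, 0.0)
    gain = np.sqrt(energy_s / np.maximum(energy_x, 1e-12))  # guard against |X| = 0
    return gain * X

# Three bins: noise below, above, and absent relative to the observation.
X = np.array([3.0 + 4.0j, 1.0 + 0.0j, 0.0 + 2.0j])
S_hat = spectral_subtraction(X, noise_energy=np.array([9.0, 4.0, 0.0]))
```

In the first bin the energy drops from 25 to 16, so the magnitude becomes 4 while the phase stays that of `X`; in the second bin the noise estimate exceeds the observation, so the output is clamped to zero.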


[Figure: vector diagram built up in four steps, showing s, v and x = s + v, the candidate estimate s'?, the estimates s' and v', and the error e = s − s'.]

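Noise attenuation – Wiener filtering

Our signal model is $X(z) = S(z) + V(z)$.
- A very similar rule, known as the Wiener filter, can be derived by minimizing the squared error between the original speech signal $S(z)$ and an estimate $\hat S(z)$:
  $$\min_{\hat S(z)} E\left[\left|S(z) - \hat S(z)\right|^2\right]$$
- If we define a multiplicative model $\hat S(z) = A(z)X(z)$ we have
  $$E\left[\left|S(z) - \hat S(z)\right|^2\right] = E\left[|S(z) - A(z)X(z)|^2\right] = E\left[|S(z) - A(z)(S(z) + V(z))|^2\right].$$
- The minimum is found by setting the derivative to zero:
  $$\begin{aligned}
  0 &= \frac{\partial}{\partial A(z)} E\left[|S(z) - A(z)(S(z) + V(z))|^2\right] \\
    &= -E\left[(S^*(z) + V^*(z))\,(S(z) - A(z)(S(z) + V(z)))\right] \\
    &= E\left[-|S(z)|^2 + A(z)|S(z)|^2 + S^*(z)A(z)V(z) - V^*(z)S(z) + V^*(z)A(z)S(z) + A(z)|V(z)|^2\right] \\
    &= E\left[A(z)|V(z)|^2 - |S(z)|^2 + A(z)|S(z)|^2\right],
  \end{aligned}$$
  where the cross terms between $S(z)$ and $V(z)$ vanish because speech and noise are uncorrelated.
- It follows that the Wiener filter is defined as
  $$A(z) = \frac{E\left[|S(z)|^2\right]}{E\left[|V(z)|^2 + |S(z)|^2\right]} \approx \frac{|X(z)|^2 - |\hat V(z)|^2}{|X(z)|^2},$$
  where the expectations are replaced by the observed $|X(z)|^2$ and the noise estimate $|\hat V(z)|^2$, and the estimate of clean speech is obtained by
  $$\hat S(z) = A(z)X(z) = X(z)\left[\frac{|X(z)|^2 - |\hat V(z)|^2}{|X(z)|^2}\right].$$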
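The result of the derivation can be checked numerically. The sketch below is an illustration rather than part of the original material: it assumes complex Gaussian speech and noise coefficients with energies 4 and 1 in a single frequency bin, sweeps a real-valued gain over a grid, and compares the gain with the lowest empirical squared error against the Wiener gain $E[|S|^2] / (E[|S|^2] + E[|V|^2])$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Uncorrelated complex Gaussian samples with E|S|^2 = 4 and E|V|^2 = 1 (assumed values).
S = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) * np.sqrt(4.0 / 2.0)
V = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) * np.sqrt(1.0 / 2.0)
X = S + V

# Empirical mean squared error of the estimate A * X for a grid of gains A.
gains = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean(np.abs(S - a * X) ** 2) for a in gains])
best_gain = gains[np.argmin(mse)]

# Gain predicted by the Wiener formula.
wiener_gain = 4.0 / (4.0 + 1.0)
```

With these settings the grid minimum should land close to the predicted gain of 0.8, up to sampling noise and the grid resolution of 0.01.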


[Figure: magnitude spectra (dB) over frequency (Hz). Top: noisy input signal (S, S+V, V). Bottom: Wiener filtering (S+V, S', S).]

Noise attenuation

- The similarity between spectral subtraction and Wiener filtering is obvious:
  $$\hat S(z) = X(z)\sqrt{\frac{|X(z)|^2 - |\hat V(z)|^2}{|X(z)|^2}} \qquad \text{(linear energy filtering)}$$
  $$\hat S(z) = X(z)\left[\frac{|X(z)|^2 - |\hat V(z)|^2}{|X(z)|^2}\right] \qquad \text{(Wiener filtering)}$$
- Further methods can be easily constructed, for example
  $$\hat S(z) = X(z)\,\frac{|X(z)| - |\hat V(z)|}{|X(z)|} \qquad \text{(linear magnitude)}$$
- When we have a VAD, we can decide to remove more noise from non-speech frames.
- To make sure the speech content is not distorted, we can remove less noise than given by the above rules.
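To compare the three rules, it helps to write each gain as a function of the ratio between the observed energy and the noise energy estimate. The sketch below assumes that this ratio is given in dB and is at least 0 dB; the function name and the dictionary keys are illustrative.

```python
import numpy as np

def attenuation_gains(ratio_db):
    """Gain of each rule as a function of 10*log10(|X|^2 / noise energy estimate)."""
    xi = 10.0 ** (np.asarray(ratio_db, dtype=float) / 10.0)
    return {
        "linear energy": np.sqrt(np.maximum(1.0 - 1.0 / xi, 0.0)),
        "wiener": np.maximum(1.0 - 1.0 / xi, 0.0),
        "linear magnitude": np.maximum(1.0 - 1.0 / np.sqrt(xi), 0.0),
    }

gains = attenuation_gains(np.array([3.0, 10.0, 20.0]))
gains_db = {name: 20.0 * np.log10(g) for name, g in gains.items()}
```

For ratios of at least 0 dB, the linear energy rule attenuates least and the linear magnitude rule attenuates most, with the Wiener gain (the square of the linear energy gain) in between.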


[Figure: Attenuation factors. Attenuation (dB) as a function of the SNR of the noisy signal (dB) for the linear energy, Wiener and linear magnitude rules.]

Noise attenuation – Time-domain formulation (advanced topic)

The Wiener-filtering rule can also be readily defined in the time domain.
- Suppose $x_n$, $s_n$ and $v_n$ are the observed noisy speech, the clean speech and the noise signals, respectively.
- We define a filter $a_n$ such that our estimate of the clean speech signal is $\hat s_n = a_n * x_n$.
- The optimal filter vector $a = [a_0 \dots a_{N-1}]^T$ is (with a similar derivation as above)
  $$a = R_x^{-1} r_s$$
  and the estimated output is
  $$\hat s_n = a^T x = r_s^T R_x^{-T} x,$$
  where $R_x$ is the autocorrelation matrix of $x_n$ and $r_s$ is a vector of the autocorrelation of $s_n$.
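A small numerical sketch of the time-domain rule follows. Several things here are assumptions made for illustration only: the speech autocorrelation is approximated as the autocorrelation of the noisy signal minus that of a noise-only recording (which relies on speech and noise being uncorrelated), the "speech" is a synthetic AR(1) process, and the helper names are hypothetical.

```python
import numpy as np

def autocorr(x, n_lags):
    """Biased sample autocorrelation for lags 0 .. n_lags-1."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / len(x) for k in range(n_lags)])

def wiener_taps(x, noise_only, n_taps=8):
    """Solve R_x a = r_s, with r_s approximated as r_x - r_v (assumption)."""
    r_x = autocorr(x, n_taps)
    r_s = r_x - autocorr(noise_only, n_taps)
    lags = np.abs(np.arange(n_taps)[:, None] - np.arange(n_taps)[None, :])
    R_x = r_x[lags]                       # symmetric Toeplitz autocorrelation matrix
    return np.linalg.solve(R_x, r_s)

rng = np.random.default_rng(1)
n = 20000
excitation = rng.standard_normal(n)
s = np.zeros(n)
for i in range(1, n):                     # AR(1) stand-in for a correlated speech signal
    s[i] = 0.95 * s[i - 1] + excitation[i]
v = 2.0 * rng.standard_normal(n)          # white noise
x = s + v

a = wiener_taps(x, v)
s_hat = np.convolve(x, a)[:n]             # causal FIR filtering of the noisy signal
err_before = np.mean((s - x) ** 2)
err_after = np.mean((s - s_hat) ** 2)
```

In a real system the noise-only recording would have to come from non-speech segments; here the true noise is used only to keep the example short.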

Performance measures

We have thus obtained a number of different noise attenuation methods. How do we know which one to choose? Which method is the best one?

- Listen to the output!
  - Listen to it in the same environment and with the same hardware as the intended application.
  - Listen to the whole range of speakers (female, male, children, different ethnic groups) and languages of the intended application.
  - Listen with the whole range of background noises in the intended application.
  - More about testing methodology later.
- Use objective measures:
  - Signal to noise ratio.
  - Speech distortion index.
  - Noise reduction factor.

Performance measures – Signal to noise ratio

- The signal to noise ratio is a generic measure for signal distortion and noise. Before noise attenuation we have
  $$\mathrm{SNR} = \frac{|S(z)|^2}{|V(z)|^2}.$$
- To measure the SNR after noise attenuation, we need to determine the amount of noise after processing. If the estimated signal is $\hat S(z)$ then the error is $S(z) - \hat S(z)$, whereby the output SNR is
  $$\mathrm{SNR} = \frac{|S(z)|^2}{|S(z) - \hat S(z)|^2}.$$
- The SNR does not, however, differentiate between different types of effects.
  - Some methods can effectively remove noise, but also distort the speech signal, whereas
  - other methods can guarantee minimal distortion of the speech signal, but at the cost of lower noise attenuation.

Performance measures – Speech distortion index

- To quantify how much a method distorts the desired speech signal we can measure how much filtering modifies a clean signal.
- Filtering a clean signal means that we multiply the clean signal $S(z)$ by the filter $A(z)$, whereby the amount of modification is $A(z)S(z) - S(z)$.
- In addition we should normalize with the magnitude of $S(z)$ to obtain the speech distortion index
  $$\mathrm{SDI} = \frac{|A(z)S(z) - S(z)|^2}{|S(z)|^2}.$$
- Clearly we have $\mathrm{SDI} = 0$ if the original signal is preserved, and the SDI grows with an increasing amount of distortion.

Performance measures – Noise reduction factor

- In a similar manner we can quantify how much noise $V(z)$ is attenuated by a filter $A(z)$ using the noise reduction factor
  $$\mathrm{NRF} = \frac{|V(z)|^2}{|A(z)V(z)|^2}.$$
- Here $\mathrm{NRF} = 1$ indicates that there is no noise attenuation, and the higher the NRF, the better.
- In many cases it is more important to preserve the original speech signal (keep SDI low) than to maximize NRF, because human listeners find distortions annoying.
- For a speech recognition application, though, it might be more important to obtain the best SNR even at the cost of a higher SDI, because (or if) all noise and distortions reduce the accuracy of the speech recognizer equally.
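The speech distortion index and the noise reduction factor are straightforward to compute when the gain, the clean speech and the noise are known, as in a simulation. The sketch below sums energies over all frequency bins of a frame, which is an aggregation choice made here; the function names and toy values are illustrative.

```python
import numpy as np

def speech_distortion_index(A, S):
    """Energy of the modification A*S - S relative to the energy of S."""
    return np.sum(np.abs(A * S - S) ** 2) / np.sum(np.abs(S) ** 2)

def noise_reduction_factor(A, V):
    """Noise energy before filtering divided by noise energy after filtering."""
    return np.sum(np.abs(V) ** 2) / np.sum(np.abs(A * V) ** 2)

S = np.array([1.0 + 1.0j, 2.0, -1.0j])   # toy clean speech spectrum
V = np.array([0.5, -0.5j, 0.25])         # toy noise spectrum
A = np.full(3, 0.5)                      # constant gain of 0.5 in every bin

sdi = speech_distortion_index(A, S)      # 0.25
nrf = noise_reduction_factor(A, V)       # 4.0
```

A constant gain of 0.5 illustrates the trade-off: the noise energy is reduced by a factor of 4, but the speech is modified by a quarter of its energy as well.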

Performance measures – Result integration

- Above, we have defined performance measures for individual frequencies, such as $\mathrm{SNR} = \frac{|S(z)|^2}{|S(z) - \hat S(z)|^2}$, but in practice we often want to obtain the average over all frequencies
  $$\mathrm{SNR} = \frac{\int_0^\pi |S(e^{i\theta})|^2 \, d\theta}{\int_0^\pi |S(e^{i\theta}) - \hat S(e^{i\theta})|^2 \, d\theta},$$
  which in practice corresponds to the signal energy divided by the error energy.
- This measure gives the instantaneous SNR, that is, the SNR within one frame.
- To obtain the SNR of a longer segment of a signal, we would usually take the mean of all SNRs (the frame-wise mean SNR).
- It is entirely possible to calculate the SNR of the complete signal in one super-long frame.
- However, the frame-wise SNR is usually closer to how human perception would evaluate quality.
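The difference between the frame-wise mean SNR and the SNR of one super-long frame can be seen in a toy example. The sketch below works in the time domain, where the energy ratio is equivalent, and averages the frame-wise values in dB, which is an assumption since the averaging domain is not specified above; the function name and signal values are illustrative.

```python
import numpy as np

def frame_snr_db(s, s_hat, frame_len):
    """SNR in dB for each non-overlapping frame: signal energy over error energy."""
    s = np.asarray(s, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    n_frames = len(s) // frame_len
    s_f = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    e_f = (s - s_hat)[:n_frames * frame_len].reshape(n_frames, frame_len)
    return 10.0 * np.log10(np.sum(s_f ** 2, axis=1) / np.sum(e_f ** 2, axis=1))

# A loud frame followed by a quiet frame, with the same error level in both.
s = np.concatenate([np.ones(4), 0.1 * np.ones(4)])
s_hat = s - 0.1

per_frame = frame_snr_db(s, s_hat, frame_len=4)       # about 20 dB and 0 dB
mean_framewise = per_frame.mean()                     # about 10 dB
global_snr = frame_snr_db(s, s_hat, frame_len=8)[0]   # about 17 dB
```

The single long frame is dominated by the loud segment, whereas the frame-wise mean gives the quiet, badly degraded frame equal weight.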

Performance measures

[Figure: four panels over time (ms): amplitude of the noisy input signal, signal to noise ratio (dB), speech distortion index (dB) and noise reduction factor (dB), each for the linear energy, Wiener and linear magnitude rules.]

Performance measures – Perceptual distortion measures

- In applications where we want to optimize perceptual quality, we can include models for human perception in distortion measures.
- Suppose $W(z)$ is a filter which approximates the accuracy of human hearing (perception).
- Then we can apply perceptual weighting to the signal error $S(z) - \hat S(z)$ by multiplication with $W(z)$, to obtain a perceptual SNR criterion
  $$\mathrm{pSNR} = \frac{|W(z)S(z)|^2}{|W(z)(S(z) - \hat S(z))|^2}.$$
- The filter can be specified for example by
  - a fixed filter approximating perception accuracy (Bark, ERB or Mel scales), or
  - a time-variant filter which also includes masking effects.

Beamforming

- Performance of a speech communication system in noise can obviously be improved if we have access to multiple microphones.
  - More data means better quality.
  - Uncorrelated noise sources (sensor noise) can be attenuated more easily.
  - More importantly, spatial differences between microphones can be used to distinguish between sources.
- Beamforming refers to methods which use spatial information to extract a specific source from a sound scene.
  - If microphones are at different distances from a source, then the sound will arrive at the microphones at different times.
  - Delaying microphone signals appropriately will make the desired source have the same phase in all channels, while other sources are (hopefully) out of phase.
  - We can add the delayed sensor signals, whereby in-phase components add up and out-of-phase components attenuate each other.

Beamforming

[Figure: microphone array with microphones x0 ... x4 receiving a plane wave s, with time delays ∆T1 ... ∆T4 between the microphones.]

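Beamforming – Delay-and-sum

- Let $x_{n,k}$ be the $n$th sample from the $k$th microphone and
  $$x_{n,k} = s_{n-T_k} + v_{n-L_k},$$
  where $T_k$ and $L_k$ are the delays of the speech and noise sources at the $k$th microphone.
- We can delay the $k$th microphone signal by $\Delta T_k$ (chosen to match $T_k$) such that adding the microphone signals yields
  $$\hat s_n = \frac{1}{K}\sum_{k=1}^{K} x_{n+\Delta T_k,k} = \frac{1}{K}\sum_{k=1}^{K} \left(s_n + v_{n-L_k+\Delta T_k}\right) = s_n + \frac{1}{K}\sum_{k=1}^{K} v_{n-L_k+\Delta T_k}.$$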

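Beamforming – Delay-and-sum

- The output SNR of the delay-and-sum beamformer with two microphones ($K = 2$) is
  $$\begin{aligned}
  \mathrm{SNR}_{out} &= \frac{E\left[|s_n|^2\right]}{E\left[\left|s_n + \frac{1}{2}(v_{n-D_1} + v_{n-D_2})\right|^2\right]} \\
  &= \frac{E\left[|s_n|^2\right]}{E\left[|s_n|^2\right] + \frac{1}{2}E\left[|v_n|^2\right] + \frac{1}{2}E\left[v_{n-D_1}v_{n-D_2}\right]} \\
  &\ge \frac{E\left[|s_n|^2\right]}{E\left[|s_n|^2\right] + E\left[|v_n|^2\right]} = \mathrm{SNR}_{in},
  \end{aligned}$$
  where $D_k$ denotes the remaining delay of the noise in channel $k$ after alignment.
- Since $E\left[v_{n-D_1}v_{n-D_2}\right] \le E\left[|v_n|^2\right]$, a delay-and-sum beamformer will always improve speech quality, if the delays are estimated correctly.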
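A toy simulation of delay-and-sum with four microphones is sketched below. The simplifications are assumptions made for illustration: delays are integer numbers of samples, shifts are circular (`np.roll`), both speech and noise are white Gaussian sequences, and the speech delays are known exactly. The delay values themselves are arbitrary.

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Advance each microphone signal by its speech delay and average the channels."""
    mics = np.asarray(mics, dtype=float)
    out = np.zeros(mics.shape[1])
    for x_k, t_k in zip(mics, delays):
        out += np.roll(x_k, -t_k)         # circular shift, acceptable for a toy example
    return out / mics.shape[0]

rng = np.random.default_rng(0)
n = 10000
s = rng.standard_normal(n)                # stand-in for the speech source
v = rng.standard_normal(n)                # a single white noise source
T = [0, 3, 5, 8]                          # speech delay at each microphone (samples)
L = [0, 1, 9, 2]                          # noise delay at each microphone (samples)

mics = np.array([np.roll(s, t) + np.roll(v, l) for t, l in zip(T, L)])
s_hat = delay_and_sum(mics, T)

noise_in = np.mean((mics[0] - s) ** 2)    # noise energy at the reference microphone
noise_out = np.mean((s_hat - s) ** 2)     # residual noise energy after beamforming
```

Because the noise copies end up at different lags after alignment and white noise is uncorrelated across lags, the residual noise energy should be roughly one quarter of the input noise energy for four microphones.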

Beamforming – MVDR and other beamformers

- In general, beamformers can be formulated as spatial filtering, and usually we operate in the frequency domain. We can define a beamformer as
  $$\hat S(z) = \sum_k \alpha_k X_k(z),$$
  where $X_k(z)$ is the $k$th microphone signal and $\alpha_k$ is a weighting coefficient.
- Delay-and-sum corresponds to phase rotations $\alpha_k = e^{i\theta_k}$.
- A common optimization criterion for finding $\alpha_k$ is Minimum-Variance Distortionless Response (MVDR), which consists of two parts:
  - Distortionless Response = A signal from the desired direction shall pass through the filter unmodified (constrain to SDI = 0).
  - Minimum-Variance = Energy (variance) arriving from any other direction is minimized (minimize noise = maximize NRF).
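The two MVDR requirements lead to a well-known closed-form solution, which is not derived here: with a noise covariance matrix $R$ and a steering vector $d$ describing the desired direction, the weights are $w = R^{-1}d / (d^H R^{-1} d)$, and the coefficients of the beamformer above are $\alpha_k = w_k^*$. The sketch below assumes a single frequency bin, a hypothetical four-microphone array with one interferer plus sensor noise, and phase-step steering vectors chosen purely for illustration.

```python
import numpy as np

def steering_vector(n_mics, phase_step):
    """Phase rotations of a plane wave across an equally spaced array (toy model)."""
    return np.exp(-1j * phase_step * np.arange(n_mics))

def mvdr_weights(R, d):
    """Closed-form MVDR solution w = R^-1 d / (d^H R^-1 d)."""
    R_inv_d = np.linalg.solve(R, d)
    return R_inv_d / (d.conj() @ R_inv_d)

K = 4
d = steering_vector(K, 0.0)                     # desired source direction
g = steering_vector(K, 1.0)                     # interfering source direction
R = 0.1 * np.eye(K) + np.outer(g, g.conj())     # sensor noise plus interferer covariance

w = mvdr_weights(R, d)
alpha = w.conj()                                # coefficients applied to the microphone spectra
w_das = d / K                                   # delay-and-sum weights for comparison

response = alpha @ d                            # gain towards the desired direction
var_mvdr = np.real(np.vdot(w, R @ w))           # residual noise variance with MVDR
var_das = np.real(np.vdot(w_das, R @ w_das))    # residual noise variance with delay-and-sum
```

Both weight vectors pass the desired direction with unit gain, but the MVDR weights additionally use the noise covariance to suppress the interferer.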

Dereverberation

- An interesting consequence of beamforming is that it not only filters independent sources, it also removes sounds from the desired source which are reflected and arrive from other directions.
  - For example, when speaking in an office room, tables, walls, ceilings and floors will reflect the sound in different directions and some of these reflections will be picked up by the microphones.
- The beamformer thus removes a part of the influence of the room reverberation and thereby performs dereverberation.
- Beamforming is, though, only one dereverberation approach, and next we will discuss another important approach, namely deconvolution.

Dereverberation – Signal model

Reverberation means that a signal is reflected from objects such that the reflections arrive at the microphone with different delays.

- Let $s_n$ be the direct speech signal which reaches the microphone through the direct path without reflections.
- If we have $K$ reflections, each arriving with a delay of $\Delta T_k$, then these components can be written as $s_{n-\Delta T_k}$.
- The observed signal $x_n$ is then a sum of the direct and reflected components plus uncorrelated noise $v_n$,
  $$x_n = s_n + \sum_{k=1}^{K} h_k s_{n-\Delta T_k} + v_n = \sum_{k=0}^{K} h_k s_{n-\Delta T_k} + v_n,$$
  where $h_0 = 1$ and $\Delta T_0 = 0$ represent the direct path.
- By assuming that the delays are smaller than some number $N$ we can write the observed signal as a convolution
  $$x_n = s_n * h_n + v_n,$$
  where $h_n$ is known as the room impulse response.

Dereverberation – Problem formulation

- The dereverberation problem is to find an estimate $\hat s_n$ of the speech signal $s_n$ given an observation $x_n = s_n * h_n + v_n$ such that some distance measure $d(s_n, \hat s_n)$ is minimized.
- As usual, we can minimize for example the mean square error $d(s_n, \hat s_n) = \|s_n - \hat s_n\|^2$ or a perceptually weighted distance $d(s_n, \hat s_n) = \|w_n * (s_n - \hat s_n)\|^2$, where $w_n$ is a perceptual weighting filter.
- Convolution is thus an integral part of our signal model, whereby the inverse of convolution (= deconvolution) should obviously be part of the solution.
- By finding a filter $a_n$ such that filtering the room impulse response $h_n$ with $a_n$ yields the Dirac delta, $h_n * a_n \approx \delta_n$, we can remove reverberation:
  $$\hat s_n = a_n * x_n = a_n * h_n * s_n + a_n * v_n \approx s_n + a_n * v_n.$$
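For a well-behaved room response the inverse filter can be written down directly. The sketch below uses a hypothetical two-tap response (direct path plus one reflection at half amplitude) whose inverse is a decaying geometric sequence, truncated to 20 taps, and it leaves out the noise term to isolate the deconvolution step.

```python
import numpy as np

h = np.array([1.0, 0.5])                  # toy room response: direct path plus one reflection
a = (-0.5) ** np.arange(20)               # truncated inverse of 1 + 0.5 z^-1
delta_approx = np.convolve(h, a)          # close to a unit impulse, with a tiny tail

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)             # stand-in for the direct speech signal
x = np.convolve(s, h)                     # reverberant observation without noise
s_hat = np.convolve(a, x)[:len(s)]        # dereverberated estimate
```

The only deviation from a perfect impulse is the last tap of `delta_approx`, which comes from truncating the inverse and shrinks as more taps are used.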

Dereverberation – Some challenges

- Determining inverse filters is not trivial.
  - With a given room impulse response $H(z)$, the “best” inverse filter is $A(z) = H^{-1}(z)$, but if $H(z)$ contains zeros, $H(z_k) = 0$, then $A(z)$ will go to infinity at those points, $A(z_k) \to \infty$. That is, $A(z)$ is impossible to realize!
  - If we find an approximation $A(z) \approx H^{-1}(z)$, then we still have large values for $A(z_k)$. That means that even small measurement errors and round-off errors will be multiplied by big numbers. ⇒ Noise is amplified.
- Blind estimation of $H(z)$ is non-trivial.
  - We do not have simple methods for estimating room impulse responses from real-life signals. We can measure them from synthetic inputs (hand-claps in a room already give reasonable estimates), but we cannot expect that a normal dialog would begin with hand-claps.
  - Measurements are also corrupted by measurement noise $v_n$, whereby room impulse measurements are distorted.
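One common way to keep an approximate inverse bounded near the zeros of $H(z)$ is to add a small regularization constant to the denominator, $A = H^* / (|H|^2 + \epsilon)$. This is a technique brought in here for illustration and is not prescribed above; the function name, the toy response and the value of `eps` are assumptions.

```python
import numpy as np

def regularized_inverse(h, n_fft, eps):
    """Frequency-domain inverse of h with regularization: conj(H) / (|H|^2 + eps)."""
    H = np.fft.rfft(h, n_fft)
    return np.conj(H) / (np.abs(H) ** 2 + eps), H

# Toy response 1 - z^-4, which has zeros on the unit circle (for example at z = 1).
h = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
eps = 1e-2
A, H = regularized_inverse(h, n_fft=64, eps=eps)

max_gain = np.max(np.abs(A))              # bounded by 1 / (2 * sqrt(eps))
```

Where $|H|$ is large the product $A \cdot H$ stays close to one, and at the zeros the gain falls to zero instead of diverging, so the price of bounded noise amplification is incomplete dereverberation at those frequencies.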

Dereverberation – Some challenges

- Even when we achieve perfect dereverberation of the desired signal, the independent noise part $v_n$ remains.
- Room impulse responses can be very long in time, more than 300 ms is not unusual, yet analysis frames are in the range 20 ms ... 50 ms. ⇒ We need methods which take into account inter-frame (between-frame) dependencies.

Quality Enhancement Discussion

- We have seen that we have a wide range of efficient methods for (single-channel) noise attenuation as well as multi-channel methods (beamforming).
  - These are standard methods used frequently in mobile devices, even if users are often not aware of them.
  - These methods work! They improve quality.
- Some fundamental questions, however, remain.
  - Our signal model for background noise is additive, $X(z) = S(z) + V(z)$. The inverse of addition is subtraction, yet spectral subtraction methods are expressed as multiplications, $\hat S(z) = A(z)X(z)$! How can we expect to invert an addition by multiplication? ⇒ Inherent bias!
  - In basic beamforming methods, we try to extract a single direction. However, reflections do contain information about the desired source. Removing reflections removes parts of the desired signal. For best performance we should use all available information!


- The most important discrepancy or deficiency is, however, signal phase.
  - Most methods simply use the noisy phase and modify only the magnitude.
  - If the noise magnitude is small, then the error we make this way is small. In very low-SNR scenarios, however, we run into problems.

⇒ Phase is currently a hot research topic in speech and audio.


- Real-life environments often feature background noises and other distortions ⇒ We need speech enhancement algorithms for attenuating the effects of such distortions.
- Noise-attenuation methods attempt to remove background noises.
  - In spectral subtraction we multiply (sic!) the signal components with a gain factor such that the contribution of noise is reduced.
  - In beamforming, we use time differences between microphones to obtain a better estimate of the desired signal when the source location is known.
  - We obtain both noise attenuation and dereverberation effects.
- Dereverberation methods attempt to reverse the effect of room acoustics on the desired speech signal.
  - We end up with a deconvolution problem, which is challenging.

Intelligibility and signal improvement

- Usually, the objective of speech enhancement is signal restoration.
  - We assume that the signal has a distortion which we try to cancel.
- Sometimes, however, the output sounds better than the original! We can also try to develop algorithms which improve the signal.
  - For hearing aids, we often want to improve intelligibility.
  - If you listen to a mobile device in an environment with loud noise, you might want to increase the loudness of the speech sound (with respect to the noise).
  - In forensics/surveillance applications you might know who is speaking but you cannot make out what the person is saying.
  - For the elderly and people with hearing disorders, you might want to improve the intelligibility of TV programs.
- The main difficulty of such methods is that we lack proper perceptual criteria for optimization and measurement methods for intelligibility.

Conclusion

- Speech enhancement is an exciting field where a lot of techniques are already available and in widespread use, but plenty of challenges still remain.
  - We can restore distorted signals to obtain an estimate of the “original” signal (standard techniques exist).
  - We can potentially improve signals to obtain better-than-original signals (research is ongoing).
- Since speech is such an important part of our daily lives, enhancement techniques can have a large impact.
  - Potential to improve quality of life.
  - Potential to destroy privacy.
- We need an ethics discussion in the speech processing community (it has already started).
