Speech recognition

The mel scale, named by Stevens, Volkmann, and Newman in 1937,[1] is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The reference point between this scale and normal frequency measurement is defined by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. Above about 500 Hz, larger and larger intervals are judged by listeners to produce equal pitch increments. As a result, four octaves on the hertz scale above 500 Hz are judged to comprise about two octaves on the mel scale. The name mel comes from the word melody, indicating that the scale is based on pitch comparisons.

A popular formula to convert $f$ hertz into $m$ mels is:[2]

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

There is no single mel-scale formula.[3] The popular formula from O'Shaughnessy's book can be expressed with different log bases:

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) = 1127 \ln\!\left(1 + \frac{f}{700}\right)$$

The corresponding inverse expressions are:

$$f = 700\left(10^{m/2595} - 1\right) = 700\left(e^{m/1127} - 1\right)$$
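As a sanity check on these expressions, here is a minimal MATLAB sketch (the names hz2mel and mel2hz are illustrative, not from the text):

% O'Shaughnessy mel conversions (base-10 form) and their inverses
hz2mel = @(f) 2595 * log10(1 + f/700);     % hertz -> mel
mel2hz = @(m) 700 * (10.^(m/2595) - 1);    % mel -> hertz

hz2mel(1000)           % approx 1000 mels, matching the 1000 Hz reference point
mel2hz(hz2mel(440))    % round trip recovers 440 Hz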

Curves and tables for psychophysical pitch scales had been published since Steinberg's 1937[4] curves based on just-noticeable differences of pitch. More curves soon followed in Fletcher and Munson's 1937[5], Fletcher's 1938[6], Stevens's 1937[1], and Stevens and Volkmann's 1940[7] papers, using a variety of experimental methods and analysis approaches.

In 1949 Koenig published an approximation based on separate linear and logarithmic segments, with a break at 1000 Hz.[8]

Gunnar Fant proposed the current popular linear/log formula in 1949, but with the 1000 Hz corner frequency.[9]

An alternate expression of the formula, not depending on the choice of log base (the ratio of logarithms is the same in any base), is noted in Fant (1968):[10][11]

$$m = \frac{1000}{\log 2}\,\log\!\left(1 + \frac{f}{1000}\right)$$

In 1976, Makhoul and Cosell published the now-popular version with the 700 Hz corner frequency.[12] As Ganchev et al. have observed, "The formulae [with 700], when compared to [Fant's with 1000], provide a closer approximation of the Mel scale for frequencies below 1000 Hz, at the price of higher inaccuracy for frequencies higher than 1000 Hz."[13] Above 7 kHz, however, the situation is reversed, and the 700 Hz version again fits better.
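To see the crossover numerically, a brief MATLAB sketch (the two constant sets are the published ones; the tabulated test frequencies are arbitrary choices):

% Compare the 700 Hz and 1000 Hz corner-frequency mel formulas
f = [500 1000 2000 4000 7000 10000];            % test frequencies (Hz)
mel700  = 2595 * log10(1 + f/700);              % 700 Hz corner (Makhoul & Cosell)
mel1000 = (1000/log10(2)) * log10(1 + f/1000);  % 1000 Hz corner (Fant)
disp([f' mel700' mel1000'])                     % columns: Hz, mel(700), mel(1000)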

Data by which some of these formulas are motivated are tabulated in Beranek (1949), as measured from the curves of Stevens and Volkman:[14]

A cepstrum /ˈkɛpstrəm/ is the result of taking the Fourier transform (FT) of the logarithm of the estimated spectrum of a signal. There is a complex cepstrum, a real cepstrum, a power cepstrum, and a phase cepstrum. The power cepstrum in particular finds applications in the analysis of human speech.

The name "cepstrum" was derived by reversing the first four letters of "spectrum". Operations on cepstra are labelled quefrency analysis, liftering, or cepstral analysis.

The power cepstrum was defined in a 1963 paper by Bogert et al.[1] The power cepstrum of a signal is defined as the squared magnitude of the Fourier transform of the logarithm of the squared magnitude of the Fourier transform of a signal.[2]

$$\text{power cepstrum of signal} = \left|\,\mathcal{F}\!\left\{\log\!\left(\left|\mathcal{F}\{f(t)\}\right|^{2}\right)\right\}\right|^{2}$$
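A direct transcription of this definition into MATLAB might look like the following sketch (eps guards the logarithm against zero-valued bins, a detail the definition leaves open):

x = randn(1024,1);                        % any test signal
X = fft(x);                               % Fourier transform
pc = abs(fft(log(abs(X).^2 + eps))).^2;   % power cepstrum, per the definition above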

A short-time cepstrum analysis was proposed by Schroeder and Noll for application to pitch determination of human speech.[3][4][5]

The complex cepstrum was defined by Oppenheim in his development of homomorphic system theory.[6] The complex cepstrum of a signal is defined as the Fourier transform of the logarithm (with unwrapped phase) of the Fourier transform of the signal. This is sometimes called the spectrum of a spectrum.

$$\text{complex cepstrum of signal} = \mathrm{FT}\bigl(\log(\mathrm{FT}(\text{the signal})) + j2\pi m\bigr)$$

(where m is the integer required to properly unwrap the angle or imaginary part of the complex log function)

The real cepstrum uses the logarithm function defined for real values. The real cepstrum is related to the power cepstrum by: power cepstrum = 4 · (real cepstrum)², and it is related to the complex cepstrum as real cepstrum = 0.5 · (complex cepstrum + time reversal of complex cepstrum).

Figure: Steps in forming the cepstrum from a time history.

The complex cepstrum uses the complex logarithm function defined for complex values. The phase cepstrum is related to the complex cepstrum as phase cepstrum = (complex cepstrum − time reversal of complex cepstrum)².

The complex cepstrum holds information about magnitude and phase of the initial spectrum, allowing the reconstruction of the signal. The real cepstrum uses only the information of the magnitude of the spectrum.

Many texts define the process as FT → abs() → log → IFT, i.e., that the cepstrum is the "inverse Fourier transform of the log-magnitude Fourier spectrum".[7][8]
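In MATLAB, that common FT → abs() → log → IFT definition is essentially a one-liner (a sketch; eps again guards the logarithm):

x = randn(1024,1);                    % any test signal
c = ifft(log(abs(fft(x)) + eps));     % real cepstrum: IFT of the log-magnitude spectrum
c = real(c);                          % the imaginary part is only rounding error here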

The kepstrum, which stands for "Kolmogorov equation power series time response", is similar to the cepstrum and has the same relation to it as statistical average has to expected value, i.e. cepstrum is the empirically measured quantity while kepstrum is the theoretical quantity.[9][10]

Applications

The cepstrum can be seen as information about rate of change in the different spectrum bands. It was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions. It has also been used to determine the fundamental frequency of human speech and to analyze radar signal returns. Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.[5]

The autocepstrum is defined as the cepstrum of the autocorrelation. The autocepstrum is more accurate than the cepstrum in the analysis of data with echoes.

The cepstrum is a representation used in homomorphic signal processing, to convert signals (such as a source and filter) combined by convolution into sums of their cepstra, for linear separation. In particular, the power cepstrum is often used as a feature vector for representing the human voice and musical signals. For these applications, the spectrum is usually first transformed using the mel scale. The result is called the mel-frequency cepstrum or MFC (its coefficients are called mel-frequency cepstral coefficients, or MFCCs). It is used for voice identification, pitch detection and much more. The cepstrum is useful in these applications because the low-frequency periodic excitation from the vocal cords and the formant filtering of the vocal tract, which convolve in the time domain and multiply in the frequency domain, are additive and in different regions in the quefrency domain.

Cepstral concepts

The independent variable of a cepstral graph is called the quefrency. The quefrency is a measure of time, though not in the sense of a signal in the time domain. For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch that is 44100/100 = 441 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch. Note that a pure sine wave should not be used to test the cepstrum for its pitch determination from quefrency as a pure sine wave does not contain any harmonics. Rather, a test signal containing harmonics should be used (such as the sum of at least two sines where the second sine is some harmonic (multiple) of the first sine).
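Following the paragraph's suggestion, here is a small MATLAB sketch of cepstral pitch detection on a harmonic test signal (the harmonic amplitudes and the quefrency search range are illustrative choices):

fs = 44100;  t = (0:fs-1)/fs;             % one second of samples at 44100 Hz
f0 = 441;                                 % fundamental, so the period is 100 samples
x = sin(2*pi*f0*t) + 0.5*sin(2*pi*2*f0*t) + 0.25*sin(2*pi*3*f0*t);

c = real(ifft(log(abs(fft(x)) + eps)));   % real cepstrum of the test signal
q = 0:length(c)-1;                        % quefrency axis (samples)
mask = q >= 50 & q <= 400;                % search plausible pitch periods only
[~, i] = max(c(mask));                    % largest cepstral peak in that range
qpk = q(mask);  qpk = qpk(i);             % its quefrency, in samples
pitch = fs / qpk                          % expect approx 441 Hz (44100/100)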

Liftering

Playing further on the anagram theme, a filter that operates on a cepstrum might be called a lifter. A low-pass lifter is similar to a low-pass filter in the frequency domain. It can be implemented by multiplying the cepstrum by a window in the quefrency domain; when the result is converted back to the frequency domain, it yields a smoother signal.
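A minimal sketch of such a low-pass lifter in MATLAB (the cutoff of 30 quefrency samples is an arbitrary illustrative choice):

x = randn(1024,1);                        % any test signal
c = real(ifft(log(abs(fft(x)) + eps)));   % real cepstrum
N = length(c);  cut = 30;                 % lifter cutoff in quefrency samples
w = zeros(N,1);                           % quefrency-domain window
w([1:cut+1, N-cut+1:N]) = 1;              % keep low quefrencies (both symmetric halves)
smoothed = real(fft(c .* w));             % back to the log-spectral domain, now smoother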

Convolution

A very important property of the cepstral domain is that the convolution of two signals can be expressed as the addition of their complex cepstra:

$$\widehat{x_1 * x_2} = \hat{x}_1 + \hat{x}_2$$
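This is easy to verify numerically. The sketch below uses the real cepstrum, for which the additivity also holds (log|FT| is exactly additive under convolution) and which avoids the phase unwrapping the complex cepstrum requires:

% Convolution in time becomes addition in the cepstral domain
N = 256;
x = randn(N,1);  h = randn(N,1);
y = ifft(fft(x) .* fft(h));               % circular convolution of x and h
rc = @(s) real(ifft(log(abs(fft(s)) + eps)));
err = max(abs(rc(y) - (rc(x) + rc(h))))   % near zero, up to rounding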

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression.

MFCCs are commonly derived as follows:[1][2]

1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

There can be variations on this process, for example, differences in the shape or spacing of the windows used to map the scale.[3] The European Telecommunications Standards Institute in the early 2000s defined a standardised MFCC algorithm to be used in mobile phones.[4]
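To make the five steps concrete, here is a compact, self-contained MATLAB sketch for a single frame (the frame content, filterbank size, frequency grid, and the choice to keep 13 coefficients are all illustrative assumptions, not a standard; dct is from the Signal Processing Toolbox):

% MFCCs for one windowed frame, following steps 1-5 above
fs = 16000;  N = 512;                        % sampling rate and frame length
frame = randn(N,1) .* hamming(N);            % stand-in for a windowed speech frame

% Step 1: Fourier transform; keep the one-sided power spectrum
P = abs(fft(frame)).^2;  P = P(1:N/2+1);
fHz = (0:N/2)' * fs/N;                       % frequency of each bin (Hz)

% Step 2: triangular filters equally spaced on the mel scale
M = 20;                                      % number of filterbank channels
hz2mel = @(f) 2595*log10(1 + f/700);
mel2hz = @(m) 700*(10.^(m/2595) - 1);
edges = mel2hz(linspace(hz2mel(0), hz2mel(fs/2), M+2));  % band edge frequencies
H = zeros(M, N/2+1);                         % filterbank matrix
for m = 1:M
    rise = (fHz' - edges(m))   / (edges(m+1) - edges(m));
    fall = (edges(m+2) - fHz') / (edges(m+2) - edges(m+1));
    H(m,:) = max(0, min(rise, fall));        % triangular response for channel m
end

% Steps 3-5: log band energies, then DCT; the MFCCs are its amplitudes
mfccs = dct(log(H * P + eps));
mfccs = mfccs(1:13);                         % keep the first 13, a common choice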

Applications

MFCCs are commonly used as features in speech recognition systems, such as the systems which can automatically recognize numbers spoken into a telephone. They are also common in speaker recognition, which is the task of recognizing people from their voices.[5]

MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification, audio similarity measures, etc.[6]

Noise sensitivity

MFCC values are not very robust in the presence of additive noise, and so it is common to normalise their values in speech recognition systems to lessen the influence of noise. Some researchers propose modifications to the basic MFCC algorithm to improve robustness, e.g. by raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the DCT, which reduces the influence of low-energy components.[7]

Hamming window

If x is an n-element input signal and t = 0:n-1 is the corresponding sample-index vector, a Hamming window can be applied as:

w = 0.54 - 0.46*cos(2*pi*t/(n-1));  % Hamming window of length n
wx = x.*w;                          % windowed signal

or, using the built-in MATLAB function:

w = hamming(L)          % symmetric Hamming window of length L
w = hamming(L,'sflag')  % 'sflag' is 'symmetric' (default) or 'periodic'

In signal processing, a window function (also known as an apodization function or tapering function[1]) is a mathematical function that is zero-valued outside of some chosen interval. For instance, a function that is constant inside the interval and zero elsewhere is called a rectangular window, which describes the shape of its graphical representation. When another function or waveform/data-sequence is multiplied by a window function, the product is also zero-valued outside the interval: all that is left is the part where they overlap; the "view through the window". Applications of window functions include spectral analysis, filter design, and beamforming. In typical applications, the window functions used are non-negative smooth "bell-shaped" curves,[2] though rectangle, triangle, and other functions can be used.

A more general definition of window functions does not require them to be identically zero outside an interval, as long as the product of the window multiplied by its argument is square integrable, that is, that the function goes sufficiently rapidly toward zero.[3]

Applications

Applications of window functions include spectral analysis and the design of finite impulse response filters.

Spectral analysis

The Fourier transform of the function cos ωt is zero, except at frequency ±ω. However, many other functions and waveforms do not have convenient closed form transforms. Alternatively, one might be interested in their spectral content only during a certain time period.

In either case, the Fourier transform (or something similar) can be applied on one or more finite intervals of the waveform. In general, the transform is applied to the product of the waveform and a window function. Any window (including rectangular) affects the spectral estimate computed by this method.

Figure 1: Zoomed view of spectral leakage

Windowing

Windowing of a simple waveform, like cos ωt causes its Fourier transform to develop non-zero values (commonly called spectral leakage) at frequencies other than ω. The leakage tends to be worst (highest) near ω and least at frequencies farthest from ω.
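The effect is easy to reproduce; a brief MATLAB sketch (the bin offsets are arbitrary illustrative choices):

N = 64;  n = (0:N-1)';
x_on  = cos(2*pi*10.0*n/N);                    % tone lands exactly on DFT bin 10
x_off = cos(2*pi*10.5*n/N);                    % tone falls halfway between bins
plot(0:N-1, 20*log10(abs(fft(x_on)) + eps), 'o-', ...
     0:N-1, 20*log10(abs(fft(x_off)) + eps), 'x-');
xlabel('DFT bin');  ylabel('Magnitude (dB)');  % the off-bin tone leaks into every bin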

If the waveform under analysis comprises two sinusoids of different frequencies, leakage can interfere with the ability to distinguish them spectrally. If their frequencies are dissimilar and one component is weaker, then leakage from the larger component can obscure the weaker one’s presence. But if the frequencies are similar, leakage can render them unresolvable even when the sinusoids are of equal strength.

The rectangular window has excellent resolution characteristics for sinusoids of comparable strength, but it is a poor choice for sinusoids of disparate amplitudes. This characteristic is sometimes described as low-dynamic-range.

At the other extreme of dynamic range are the windows with the poorest resolution. These high-dynamic-range low-resolution windows are also poorest in terms of sensitivity; that is, if the input waveform contains random noise close to the frequency of a sinusoid, the response to the noise, compared to the sinusoid, will be higher than with a higher-resolution window. In other words, the ability to find weak sinusoids amidst the noise is diminished by a high-dynamic-range window. High-dynamic-range windows are probably most often justified in wideband applications, where the spectrum being analyzed is expected to contain many different components of various amplitudes.

In between the extremes are moderate windows, such as Hamming and Hann. They are commonly used in narrowband applications, such as the spectrum of a telephone channel. In summary, spectral analysis involves a tradeoff between resolving comparable strength components with similar frequencies and resolving disparate strength components with dissimilar frequencies. That tradeoff occurs when the window function is chosen.

Figure: Comparison of two window functions in terms of their effects on equal-strength sinusoids with additive noise. The sinusoid at bin −20 suffers no scalloping and the one at bin +20.5 exhibits worst-case scalloping. The rectangular window produces the most scalloping but also narrower peaks and a lower noise floor, so a third sinusoid with amplitude −16 dB would be detectable in the rectangularly windowed spectrum, but not in the lower image.

Discrete-time signals

When the input waveform is time-sampled, instead of continuous, the analysis is usually done by applying a window function and then a discrete Fourier transform (DFT). But the DFT provides only a coarse sampling of the actual DTFT spectrum. Figure 1 shows a portion of the DTFT for a rectangularly windowed sinusoid. The actual frequency of the sinusoid is indicated as "0" on the horizontal axis. Everything else is leakage, exaggerated by the use of a logarithmic presentation. The unit of frequency is "DFT bins"; that is, the integer values on the frequency axis correspond to the frequencies sampled by the DFT. So the figure depicts a case where the actual frequency of the sinusoid happens to coincide with a DFT sample,[note 1] and the maximum value of the spectrum is accurately measured by that sample. When it misses the maximum value by some amount [up to 1/2 bin], the measurement error is referred to as scalloping loss (inspired by the shape of the peak). But the most interesting thing about this case is that all the other samples coincide with nulls in the true spectrum. (The nulls are actually zero-crossings, which cannot be shown on a logarithmic scale such as this.) So in this case, the DFT creates the illusion of no leakage. Despite the unlikely conditions of this example, it is a common misconception that visible leakage is some sort of artifact of the DFT. But since any window function causes leakage, its apparent absence (in this contrived example) is actually the DFT artifact.

Noise bandwidth

The concepts of resolution and dynamic range tend to be somewhat subjective, depending on what the user is actually trying to do. But they also tend to be highly correlated with the total leakage, which is quantifiable. It is usually expressed as an equivalent bandwidth, B. Think of it as redistributing the DTFT into a rectangular shape with height equal to the spectral maximum and width B.[note 2][4] The more leakage, the greater the bandwidth. It is sometimes called noise equivalent bandwidth or equivalent noise bandwidth, because it is proportional to the average power that will be registered by each DFT bin when the input signal contains a random noise component (or is just random noise). A graph of the power spectrum, averaged over time, typically reveals a flat noise floor, caused by this effect. The height of the noise floor is proportional to B. So two different window functions can produce different noise floors.
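For a sampled window w of length N, the equivalent noise bandwidth in DFT bins is commonly computed as N·Σw² / (Σw)²; a quick MATLAB check against figures quoted elsewhere in this document:

% Equivalent noise bandwidth, in DFT bins
nenbw = @(w) numel(w) * sum(w.^2) / sum(w)^2;
nenbw(hann(1024))      % approx 1.50 bins, as quoted for the Hann window
nenbw(hamming(1024))   % approx 1.36 bins (this document's figure label says B = 1.37)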

Processing gain

In signal processing, operations are chosen to improve some aspect of quality of a signal by exploiting the differences between the signal and the corrupting influences. When the signal is a sinusoid corrupted by additive random noise, spectral analysis distributes the signal and noise components differently, often making it easier to detect the signal's presence or measure certain characteristics, such as amplitude and frequency. Effectively, the signal to noise ratio (SNR) is improved by distributing the noise uniformly, while concentrating most of the sinusoid's energy around one frequency. Processing gain is a term often used to describe an SNR improvement. The processing gain of spectral analysis depends on the window function, both its noise bandwidth (B) and its potential scalloping loss. These effects partially offset, because windows with the least scalloping naturally have the most leakage.

For example, the worst possible scalloping loss from a Blackman–Harris window (below) is 0.83 dB, compared to 1.42 dB for a Hann window. But the noise bandwidth is larger by a factor of 2.01/1.5, which can be expressed in decibels as: 10 log₁₀(2.01/1.5) ≈ 1.27 dB. Therefore, even at maximum scalloping, the net processing gain of a Hann window exceeds that of a Blackman–Harris window by: 1.27 + 0.83 − 1.42 = 0.68 dB. And when we happen to incur no scalloping (due to a fortuitous signal frequency), the Hann window is 1.27 dB more sensitive than Blackman–Harris. In general (as mentioned earlier), this is a deterrent to using high-dynamic-range windows in low-dynamic-range applications.
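The arithmetic in this paragraph can be reproduced directly:

bw_penalty = 10*log10(2.01/1.5)         % approx 1.27 dB noise-bandwidth difference
net_gain   = bw_penalty + 0.83 - 1.42   % approx 0.68 dB in favour of Hann at worst-case scalloping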

Filter design

Main article: Filter design

Windows are sometimes used in the design of digital filters, in particular to convert an "ideal" impulse response of infinite duration, such as a sinc function, to a finite impulse response (FIR) filter design. That is called the window method.[5][6]
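A minimal sketch of the window method (the cutoff, length, and window choice are illustrative):

% Window-method FIR design: ideal sinc impulse response shaped by a Hamming window
N  = 101;  n = (0:N-1)' - (N-1)/2;     % tap indices, symmetric about the centre
fc = 0.2;                              % cutoff as a fraction of the sampling rate
h  = 2*fc*sinc(2*fc*n);                % ideal (infinite) lowpass response, truncated
h  = h .* hamming(N);                  % taper the truncation
% roughly equivalent, with the Signal Processing Toolbox: h = fir1(N-1, 2*fc);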

Window examples

Terminology:

N represents the width, in samples, of a discrete-time, symmetrical window function w[n]. When N is an odd number, the non-flat windows have a singular maximum point. When N is even, they have a double maximum.

A common desire is for an asymmetrical window called DFT-even[7] or periodic, which has a single maximum but an even number of samples (as required by the FFT algorithm). Such a window would be generated by the MATLAB function hann(512,'periodic'), for instance. Here, that window would be generated with N = 513 and by discarding the 513th element of the sequence.
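That construction is easy to verify in MATLAB:

w_sym = hann(513);                 % symmetric window of odd length
w_per = hann(512,'periodic');      % DFT-even window
max(abs(w_per - w_sym(1:512)))     % zero: periodic = symmetric with the last sample dropped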

n is an integer, with values 0 ≤ n ≤ N − 1. Thus, these are lagged versions of functions whose maximum occurs at n = 0.

Each figure label includes the corresponding noise equivalent bandwidth metric (B),[note 2] in units of DFT bins. As a guideline, windows are divided into two groups on the basis of B: a lower-B group (better resolution) and a higher-B group (better dynamic range). The Gauss, Kaiser, and Poisson windows are parametric families that span both groups, though only one or two examples of each are shown.

Hamming window

Hamming window; B=1.37

The "raised cosine" with these particular coefficients was proposed by Richard W. Hamming. The window is optimized to minimize the maximum (nearest) side lobe, giving it a height of about one-fifth that of the Hann window, a raised cosine with simpler coefficients.[13][14]

[note 3]

with

Page 12: speech recognition

instead of both constants being equal to 1/2 in the Hann window. The constants are

approximations of values and , which cancel the first sidelobe of the Hann window by

placing a zero at frequency .[7] Approximation of the constants to two decimal places substantially lowers the level of sidelobes,[7] to a nearly equiripple condition.[14]

unlagged version:

MATLAB code for FFT

Examples

A common use of Fourier transforms is to find the frequency components of a signal buried in a noisy time domain signal. Consider data sampled at 1000 Hz. Form a signal containing a 50 Hz sinusoid of amplitude 0.7 and a 120 Hz sinusoid of amplitude 1, and corrupt it with some zero-mean random noise:

Fs = 1000;                % Sampling frequency
T = 1/Fs;                 % Sample time
L = 1000;                 % Length of signal
t = (0:L-1)*T;            % Time vector

% Sum of a 50 Hz sinusoid and a 120 Hz sinusoid
x = 0.7*sin(2*pi*50*t) + sin(2*pi*120*t);
y = x + 2*randn(size(t)); % Sinusoids plus noise

plot(Fs*t(1:50),y(1:50))
title('Signal Corrupted with Zero-Mean Random Noise')
xlabel('time (milliseconds)')

NFFT = 2^nextpow2(L);     % Next power of 2 from length of y
Y = fft(y,NFFT)/L;
f = Fs/2*linspace(0,1,NFFT/2);

% Plot single-sided amplitude spectrum.
plot(f,2*abs(Y(1:NFFT/2)))
title('Single-Sided Amplitude Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('|Y(f)|')

% EXAMPLE  Simple demo of the MFCC function usage.
%
%   This script is a step-by-step walk-through of computation of the
%   mel frequency cepstral coefficients (MFCCs) from a speech signal
%   using the MFCC routine.
%
%   See also MFCC, COMPARE.

%   Author: Kamil Wojcicki, September 2011

    % Clean-up MATLAB's environment
    clear all; close all; clc;

    % Define variables
    Tw = 25;                % analysis frame duration (ms)
    Ts = 10;                % analysis frame shift (ms)
    alpha = 0.97;           % preemphasis coefficient
    M = 20;                 % number of filterbank channels
    C = 12;                 % number of cepstral coefficients
    L = 22;                 % cepstral sine lifter parameter
    LF = 300;               % lower frequency limit (Hz)
    HF = 3700;              % upper frequency limit (Hz)
    wav_file = 'sp10.wav';  % input audio filename

    % Read speech samples, sampling rate and precision from file
    [ speech, fs, nbits ] = wavread( wav_file );

    % Feature extraction (feature vectors as columns)
    [ MFCCs, FBEs, frames ] = ...
                    mfcc( speech, fs, Tw, Ts, alpha, @hamming, [LF HF], M, C+1, L );

    % Generate data needed for plotting
    [ Nw, NF ] = size( frames );                 % frame length and number of frames
    time_frames = [0:NF-1]*Ts*0.001+0.5*Nw/fs;   % time vector (s) for frames
    time = [ 0:length(speech)-1 ]/fs;            % time vector (s) for signal samples
    logFBEs = 20*log10( FBEs );                  % compute log FBEs for plotting
    logFBEs_floor = max(logFBEs(:))-50;          % get logFBE floor 50 dB below max
    logFBEs( logFBEs<logFBEs_floor ) = logFBEs_floor; % limit logFBE dynamic range

    % Generate plots
    figure('Position', [30 30 800 600], 'PaperPositionMode', 'auto', ...
           'color', 'w', 'PaperOrientation', 'landscape', 'Visible', 'on' );

    subplot( 311 );
    plot( time, speech, 'k' );
    xlim( [ min(time_frames) max(time_frames) ] );
    xlabel( 'Time (s)' );
    ylabel( 'Amplitude' );
    title( 'Speech waveform');

    subplot( 312 );
    imagesc( time_frames, [1:M], logFBEs );
    axis( 'xy' );
    xlim( [ min(time_frames) max(time_frames) ] );
    xlabel( 'Time (s)' );
    ylabel( 'Channel index' );
    title( 'Log (mel) filterbank energies');

    subplot( 313 );
    imagesc( time_frames, [1:C], MFCCs(2:end,:) ); % HTK's TARGETKIND: MFCC
    %imagesc( time_frames, [1:C+1], MFCCs );       % HTK's TARGETKIND: MFCC_0
    axis( 'xy' );
    xlim( [ min(time_frames) max(time_frames) ] );
    xlabel( 'Time (s)' );
    ylabel( 'Cepstrum index' );
    title( 'Mel frequency cepstrum' );

    % Set color map to grayscale
    colormap( 1-colormap('gray') );

    % Print figure to pdf and png files
    print('-dpdf', sprintf('%s.pdf', mfilename));
    print('-dpng', sprintf('%s.png', mfilename));

% EOF