Final Project on Homomorphic Deconvolution
(Use of cepstral analysis to deconvolve pitch information from vocal tract information in speech production)
by Rahul Jaiswal (101630556)


Post on 15-Nov-2015




Contents:

Aim
Motivation
Theory
MATLAB Program
Plots & Results
Applications


There are many algorithms for detecting the pitch of a speech signal. The method used here is the cepstral method, which is more reliable than many older, computationally extensive approaches.

The main aim of this project is to understand the motivation behind cepstral analysis of speech, to understand the basic cepstral-analysis approach used to separate vocal tract information from source information, to understand the liftering concept, and to develop a pitch determination method.

Homomorphic deconvolution is an algorithm designed to estimate the pitch (fundamental frequency) and the vocal tract information of a digital recording of speech, or of a musical note or tone; cepstral analysis is the method used to achieve this. Any signal coming out of a system is due to both the input excitation and the response of the system. From the signal processing point of view, the output of a system can be treated as the convolution of the input excitation with the system response. At times we need each of the components separately for study and processing; the process of separating the two components is termed deconvolution.

Human speech is very closely related to this concept. When a speech signal is produced, it passes through two stages: excitation ("source") and signal shaping ("filter"). The objective of cepstral analysis is to separate the speech into its source and filter components without any prior knowledge of either, so that the information can be used in various speech processing applications.

There are two types of sounds, voiced and unvoiced. We will mainly concentrate on the voiced segments, which include the vowels. Voiced sounds are produced by exciting the time-varying vocal tract system with a periodic impulse sequence, while unvoiced sounds are produced by exciting it with a random noise sequence. The resulting speech can be considered the convolution of the respective excitation sequence with the vocal tract filter characteristics. Let x(n) be the excitation sequence and h(n) the vocal tract filter sequence; then the speech sequence y(n) can be expressed as follows:

y(n) = x(n) * h(n) (1)

or, in the frequency domain,

Y(w) = X(w)·H(w) (2)

As we can see, this is very similar to the ordinary time-domain convolution we have been doing, but in this case we do not have any knowledge about either X(w) or H(w). This is the reason we use cepstral analysis: the cepstrum converts convolution in the time domain into addition in the quefrency domain, so we obtain a linear combination of the contributions of x(n) and h(n) in the quefrency domain.
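This convolution-to-addition property is easy to check numerically. The following NumPy sketch (the signals, sizes, and names are illustrative stand-ins, not from the project) verifies that the real cepstrum of a convolved signal equals the sum of the individual cepstra:

```python
import numpy as np

# Illustrative stand-ins: x plays the role of the excitation, h the vocal tract.
rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)
h = np.exp(-np.arange(n) / 8.0)

X, H = np.fft.fft(x), np.fft.fft(h)
Y = X * H  # circular convolution of x and h in the time domain

# Real cepstrum: inverse DFT of the log magnitude spectrum.
cep = lambda S: np.fft.ifft(np.log(np.abs(S))).real

# Since log|Y| = log|X| + log|H|, the cepstra add:
assert np.allclose(cep(Y), cep(X) + cep(H))
```

The assertion holds because the logarithm turns the spectral product into a sum, and the inverse DFT is linear.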

    Basic Principles of Cepstral analysis:

The main idea of homomorphic deconvolution is to convert the product Y(w) = X(w)·H(w) into a sum by applying a logarithm. The complex cepstrum is defined as the inverse Fourier transform of the logarithm of the Fourier transform of the input signal, which brings the signal back to the time, or quefrency, domain. In the transform domain the resulting signal can be written as

log Y(w) = log X(w) + log H(w) (3)

So now, in the quefrency domain, the vocal tract components are represented by the slowly varying components concentrated near the lower quefrency region, and the excitation (pitch) components are represented by the fast varying components in the higher quefrency region.

The figure below gives a pictorial view of the steps that convert a speech signal to its cepstral-domain representation:

y(n) → Windowing → y_w(n) → DFT → Y_w(w) → Log|Y_w(w)| → IDFT → c(n)

In the figure, y(n) is the speech signal and y_w(n) is the windowed frame, obtained by multiplying y(n) by a Hamming window w(n). We then perform an N-point DFT of the windowed frame to obtain Y_w(w), take the log magnitude spectrum Log|Y_w(w)|, and finally perform an IDFT of Log|Y_w(w)| to obtain the cepstrum c(n). The cepstrum thus obtained contains the vocal tract information linearly combined with the pitch information; the two can be separated using the liftering technique discussed next.
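The block diagram above can be sketched directly in code. A minimal NumPy version (the 8 kHz sampling rate, 140 Hz test tone, and function name are my own illustrative choices, not from the project):

```python
import numpy as np

def real_cepstrum(frame):
    """Windowing -> DFT -> log magnitude -> IDFT, as in the block diagram."""
    w = np.hamming(len(frame))                   # tapering window
    spectrum = np.fft.fft(frame * w)             # Y_w(w)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # Log|Y_w(w)|; eps avoids log(0)
    return np.fft.ifft(log_mag).real             # c(n), the real cepstrum

fs = 8000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 140 * t)              # toy "voiced" frame
c = real_cepstrum(frame)
half = c[: len(c) // 2]                          # the cepstrum is symmetric
```

Since the log magnitude spectrum of a real frame is even, the resulting cepstrum is symmetric, which is why only its first half is kept.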


    Basic Principles of Liftering Technique:

A lifter is a filter that operates in the quefrency domain. A low-time lifter is analogous to a low-pass filter in the frequency domain: it is implemented by multiplying the cepstrum by a window in the quefrency domain, and converting back to the frequency domain yields a smoother signal. We will use low-time liftering to obtain a smoothed formant envelope and then estimate the formant frequencies from it; a high-time lifter is used to obtain the pitch information.

Let's discuss both of these in detail:

    Pitch Estimation (High Time Liftering):

The pitch information typically appears as periodic peaks located beyond roughly 12-20 samples in the cepstrum. Making use of this assumption, we use a high-time liftering window to separate the pitch component from the cepstrum. The high-time lifter used here is simply the complement of the low-time lifter described later.

We represent the high-time lifter as H[n]. With cutoff L0,

H[n] = 1 for L0 ≤ n ≤ N/2
H[n] = 0 otherwise


The figure shows the high-time liftering window.

We multiply the cepstrum c(n) by this high-time lifter H[n] and then look for the highest peak. The liftered cepstrum contains many periodic peaks, but the sample index at which the highest peak occurs gives the pitch period of the speech signal s(n); since we already know the sampling frequency, the pitch follows directly from this sample value.
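As a sketch of this procedure, the following NumPy fragment builds a synthetic voiced frame with a known 50-sample pitch period (so 160 Hz at fs = 8000 Hz) and recovers the pitch by high-time liftering; the signal, the L0 = 15 cutoff, and the function name are illustrative assumptions, not the project's exact code:

```python
import numpy as np

def cepstral_pitch(frame, fs, L0=15):
    """High-time liftering: zero the low-quefrency (vocal tract) region,
    then the strongest remaining cepstral peak gives the pitch period."""
    w = np.hamming(len(frame))
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame * w)) + 1e-8)).real
    half = c[: len(c) // 2].copy()
    half[:L0] = 0.0                      # high-time lifter H[n]
    period = int(np.argmax(half))        # pitch period in samples
    return fs / period

fs = 8000
excitation = np.zeros(512)
excitation[::50] = 1.0                   # impulse train, 50-sample period
h = 0.9 ** np.arange(64)                 # toy "vocal tract" impulse response
frame = np.convolve(excitation, h)[:512]
f0 = cepstral_pitch(frame, fs)           # should land near 160 Hz
```

The highest cepstral peak past the lifter cutoff sits at the pitch period, so dividing the sampling frequency by its index gives the fundamental frequency.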

    Formant estimation (Low Time Liftering):

Low-time liftering is applied to the cepstrum of the speech signal to obtain the formant estimate. Formants are defined as "the spectral peaks of the sound spectrum of the voice"; a formant is often measured as an amplitude peak in the frequency spectrum of the sound, using a spectrogram.

The low-time liftering window used in this project for extracting the vocal tract characteristics is:

L[n] = 1 for 0 ≤ n < L0
L[n] = 0 otherwise


L0 typically lies between 12 and 20 samples.

    The characteristics of the window are drawn below in time domain:

We have taken L0 = 15 samples, so L[n] = 1 for the first 15 samples and 0 for the rest.

Cl[n] = c[n]·L[n]

The vocal tract characteristics are obtained by multiplying the cepstrum c(n) by this window L[n] and then taking the DFT. The DFT of the low-time liftered signal gives the log magnitude of the vocal tract spectrum, where H(w) is the frequency response of the vocal tract.

    So,

log|H(w)| = DFT[Cl[n]]

All the details of the vocal tract, such as the formant frequencies, can be read off this vocal tract spectrum. As mentioned in the first paragraph of this topic, the spectrum has local peaks; each of these peaks represents a formant frequency, and these frequencies differ from vowel to vowel.
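The low-time liftering step can likewise be sketched. This NumPy fragment (illustrative names; the L0 = 15 cutoff and 10000-point FFT mirror the choices in the MATLAB code) keeps only the first L0 cepstral samples and reads formant candidates off the local maxima of the resulting log envelope:

```python
import numpy as np

def cepstral_formants(c, fs, L0=15, nfft=10000):
    """Low-time liftering: keep only c[0:L0], transform back with a long DFT,
    and take local maxima of the smooth log envelope as formant candidates."""
    liftered = np.zeros(nfft)
    liftered[:L0] = c[:L0]                             # Cl[n] = c[n] * L[n]
    envelope = np.fft.fft(liftered).real[: nfft // 2]  # ~ log|H(w)|
    peaks = [k for k in range(1, len(envelope) - 1)
             if envelope[k - 1] < envelope[k] > envelope[k + 1]]
    return [(k * fs / nfft, envelope[k]) for k in peaks]  # (Hz, log magnitude)
```

Feeding it the cepstrum of a single synthetic resonance should yield a dominant envelope peak near the resonance frequency; the candidate with the largest log magnitude is the formant estimate.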


    Main Function:

    % *************************************************************************
    % This script separates the formant frequency (vocal tract information)
    % from the pitch information in voiced human speech.
    % SYNOPSIS:
    % -------------------------------------------------
    % final_op = speech_synth(fs,p,N,rN,vowel,method)
    % -------------------------------------------------
    % *************************************************************************
    clear all; clc; close all;

    %% Loading the file.
    [FileName,PathName] = uigetfile('*.wav','Select the WAV file to process');
    ln = input('Please enter the liftering window length (typical value: 10-40): ');
    FilePath = strcat(PathName, FileName);
    [y,fs] = audioread(FilePath);
    [PATHSTR,NAME,EXT] = fileparts(FilePath);

    %% Checking the validity of the file.
    if (strcmp(EXT,'.wav') || strcmp(EXT,'.mp3') || strcmp(EXT,'.au'))
        %% Dual-channel to single-channel audio conversion.
        y = y(:,1);
        %% Taking a small frame of audio data.
        y = y(2000:5000);
        y = double(y);
        N = length(y);
        t = (0:N-1)/fs;                          % time in seconds
        figure;
        subplot(2,2,1); plot(t,y);               % speech signal
        legend('Speech Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        y_th = y./(1.1*abs(max(y)));             % normalization
        y_th = y_th(1:N);
        subplot(2,2,2); plot(t,y_th);            % normalized & framed signal
        legend('Normalized Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        w = hamming(N);                          % Hamming window
        y_w = y_th.*w;
        subplot(2,2,3); plot(t,y_w);             % windowed signal
        legend('Windowed Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        y_fft = fft(y_w,N);
        k = 0:N-1;
        subplot(2,2,4); plot(k,abs(y_fft));      % magnitude of the DFT
        legend('DFT'); xlabel('Frequency bin'); ylabel('Amplitude');
        c = ifft(log(abs(y_fft)));               % real cepstrum
        figure, plot(c);
        axis([0,N/2,-1,1]);   % the cepstrum is symmetric, so only half is shown
        legend('Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        y_c = c(1:floor(length(c)/2));
        % Function for performing liftering.
        [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(y_c,fs,N,ln);
        t1 = 1:floor(N/2);
        figure, plot(t1,y_formant);              % low-time liftered cepstrum
        axis([0,N/2,-1,1]);
        legend('Low Time Liftered Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        % Multiplying the cepstrum with the high-time lifter gives the pitch estimate.
        figure, plot(t1,y_Pitch);
        axis([0,N/2,-1,1]);
        legend('High Time Liftered Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        %% Formant estimation.
        figure, plot(formant_ceps);
        hold on; plot(formant,Magnitude_F,'ko'); hold off;
        legend('Formant Spectrum'); xlabel('Frequency bin'); ylabel('Amplitude in LOG');
    else
        error('The file is invalid. Please upload only wav, mp3 or au file formats');
    end


    Liftering Function:

    function [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(c,fs,N,ln)
    %% Low-quefrency lifter of length ln for obtaining the vocal tract estimate.
    L = zeros(length(c),1);
    L(1:ln) = 1;
    y_formant = real(c.*L);          % low-time liftered cepstrum (vocal tract)

    %% High-time lifter.
    H = zeros(length(c),1);
    H(ln:floor(N/2)) = 1;
    y_Pitch = real(c.*H);            % high-time liftered cepstrum (pitch)

    % The location of the maximum of y_Pitch gives the pitch period in samples.
    [y_Pitchvalue, y_Pitchlocation] = max(y_Pitch);
    p_period = y_Pitchlocation;
    p_frequency = (1/p_period)*fs;

    %% Formant estimation.
    formant_ceps = fft(y_formant,10000);
    formant_ceps = formant_ceps(1:5000);
    formant_ceps = real(formant_ceps);
    % Local maxima of the smoothed log spectrum are the formant candidates.
    t = 1;
    for i = 2:length(formant_ceps)-1
        if (formant_ceps(i-1) < formant_ceps(i)) && (formant_ceps(i) > formant_ceps(i+1))
            formant(t) = i;
            Magnitude_F(t) = formant_ceps(i);
            t = t + 1;
        end
    end
    end


    Vowel: A_Front

    Symbol: a

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 135.5932 Hz, F1 = 850 Hz, F2 = 1600 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: ae

Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 129.0323 Hz, F1 = 500 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BackwardSchwa

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 140.3509 Hz, F1 = 250 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BackwardsEpsilon

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 139.1304 Hz, F1 = 350 Hz, F2 = 1250 Hz

    Figure 1


    Figure 2

    Figure 3


Figure 4

    Figure 5


    Vowel: Barred I

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 140.3509 Hz, F1 = 1500 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BarredO

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 146.789 Hz, F1 = 350 Hz, F2 = 1200 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: BarredU

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 148.1481 Hz, F1 = 1350 Hz

    Figure 1


Vowel: CapitalOE

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 139.1304 Hz, F1 = 370 Hz, F2 = 1900 Hz

    Figure 1


Vowel: Capital U

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 148.1481 Hz, F1 = 400 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: Capital Y

Pitch = 145.4545 Hz, F1 = 1350 Hz

    Figure 1


    Figure 5


    Vowel: Caret

    Symbol:

Pitch = 139.1304 Hz, F1 = 500 Hz, F2 = 1500 Hz

    Figure 1


    Vowel: e

    Symbol: e

Pitch = 141.5929 Hz, F1 = 390 Hz, F2 = 2300 Hz

    Figure 1


Vowel: i

Symbol: i

Pitch = 146.789 Hz, F1 = 240 Hz, F2 = 2200 Hz

    Figure 1


Vowel: o

Symbol: o

Pitch = 142.8571 Hz, F1 = 240 Hz, F2 = 2200 Hz

    Figure 1


Vowel: u

Symbol: u

Pitch = 149.5327 Hz, F1 = 250 Hz, F2 = 595 Hz

    Figure 1


Vowel: U (Girl's Voice)

The sound clip is of a girl's voice, so we expect a higher pitch than the male voice tested earlier.

Pitch = 237.0968 Hz, F1 = 300 Hz, F2 = 600 Hz

    Figure 1


Vowel: U (My Voice)

Pitch = 146.5116 Hz, F1 = 250 Hz, F2 = 600 Hz

    Figure 1


    Vowel: Y

    Symbol: y

Pitch = 155.3398 Hz, F1 = 235 Hz, F2 = 2100 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


This kind of speech deconvolution has many applications. It is widely used in speech recognition, since different speakers have different pitch, and vowels can easily be recognized from their formant frequencies, as shown above.

Pitch estimation is also used to distinguish anger from neutral emotion, in lie detection, and in similar tasks; angry speech generally has a higher pitch.

The concept is also used in automatic music transcription.

References:

http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ (vowel formant frequencies)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2453&rep=rep1&type=pdf (homomorphic deconvolution)
https://www.wikipedia.org/ (general definitions)
http://www.freesound.org/ (vowel .wav files)