Final Project on Homomorphic Deconvolution
(Use of cepstral analysis to deconvolve pitch information from vocal tract information in speech production)
by Rahul Jaiswal (101630556)


Post on 15-Nov-2015




Contents:

Aim
Motivation
Theory
MATLAB Program
Plots & Results
Applications


There are many algorithms for detecting the pitch of a speech signal. The method used here is the cepstral method, which is more reliable than many older, computationally extensive approaches.

The main aim of this project is to understand the motivation behind cepstral analysis of speech, to understand the basic cepstral-analysis approach used to separate vocal tract information from source information, to understand the liftering concept, and to develop a pitch determination method.

Homomorphic deconvolution is an algorithm designed to estimate the pitch (fundamental frequency) and the vocal tract information of a digital recording of speech, or of a musical note or tone; cepstral analysis is the method used to achieve this. Any signal coming out of a system is due to both the input excitation and the response of the system. From the signal processing point of view, the output of a system can be treated as the convolution of the input excitation with the system response. At times we need each of the components separately for study and processing; the process of separating the two components is termed deconvolution.

Human speech is very closely related to this concept. When a speech signal is produced, it passes through two stages: excitation ("source") and signal shaping ("filter"). The objective of cepstral analysis is to separate the speech into its source and filter components without any prior knowledge of either, so that the information can be used in various speech processing applications.

There are two types of sounds, voiced and unvoiced. We will mainly concentrate on the voiced segments, which include the vowels. Voiced sounds are produced by exciting the time-varying vocal tract system with a periodic impulse sequence, while unvoiced sounds are produced by exciting it with a random noise sequence. The resulting speech can be considered the convolution of the respective excitation sequence with the vocal tract filter characteristics. Let x(n) be the excitation sequence and h(n) the vocal tract filter sequence; then the speech sequence y(n) can be expressed as follows:

y(n) = x(n) * h(n) (1)

or, in the frequency domain,

Y(w) = X(w)·H(w) (2)

As we can see, this is very similar to the ordinary time-domain convolution we have been doing, but in this case we do not have any knowledge about either X(w) or H(w). This is the reason we use cepstral analysis: the cepstrum converts convolution in the time domain into addition in the quefrency domain, so we obtain a linear combination of the contributions of x(n) and h(n) in the quefrency domain.
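This convolution-to-addition property is easy to check numerically. The following NumPy sketch (the signals, sizes, and names are illustrative stand-ins, not from the project) verifies that the real cepstrum of a convolved signal equals the sum of the individual cepstra:

```python
import numpy as np

# Illustrative stand-ins: x plays the role of the excitation, h the vocal tract.
rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)
h = np.exp(-np.arange(n) / 8.0)

X, H = np.fft.fft(x), np.fft.fft(h)
Y = X * H  # circular convolution of x and h in the time domain

# Real cepstrum: inverse DFT of the log magnitude spectrum.
cep = lambda S: np.fft.ifft(np.log(np.abs(S))).real

# Since log|Y| = log|X| + log|H|, the cepstra add:
assert np.allclose(cep(Y), cep(X) + cep(H))
```

The assertion holds because the logarithm turns the spectral product into a sum, and the inverse DFT is linear.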

    Basic Principles of Cepstral analysis:

The main idea of homomorphic deconvolution is to convert the product Y(w) = X(w)·H(w) into a sum by applying a logarithm. The complex cepstrum is defined as the inverse Fourier transform of the logarithm of the Fourier transform of the input signal, which brings the signal back to the time, or quefrency, domain. In the transform domain the resulting signal can be written as

log Y(w) = log X(w) + log H(w) (3)

So now, in the quefrency domain, the vocal tract components are represented by the slowly varying components concentrated near the lower quefrency region, and the excitation (pitch) components are represented by the fast varying components in the higher quefrency region.

The figure below gives a pictorial view of the steps that convert a speech signal to its cepstral-domain representation:

y(n) → Windowing → y_w(n) → DFT → Y_w(w) → Log|Y_w(w)| → IDFT → c(n)

In the figure, y(n) is the speech signal and y_w(n) is the windowed frame, obtained by multiplying y(n) by a Hamming window w(n). We then perform an N-point DFT of the windowed frame to obtain Y_w(w), take the log magnitude spectrum Log|Y_w(w)|, and finally perform an IDFT of Log|Y_w(w)| to obtain the cepstrum c(n). The cepstrum thus obtained contains the vocal tract information linearly combined with the pitch information; the two can be separated using the liftering technique discussed next.
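The block diagram above can be sketched directly in code. A minimal NumPy version (the 8 kHz sampling rate, 140 Hz test tone, and function name are my own illustrative choices, not from the project):

```python
import numpy as np

def real_cepstrum(frame):
    """Windowing -> DFT -> log magnitude -> IDFT, as in the block diagram."""
    w = np.hamming(len(frame))                   # tapering window
    spectrum = np.fft.fft(frame * w)             # Y_w(w)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # Log|Y_w(w)|; eps avoids log(0)
    return np.fft.ifft(log_mag).real             # c(n), the real cepstrum

fs = 8000
t = np.arange(2048) / fs
frame = np.sin(2 * np.pi * 140 * t)              # toy "voiced" frame
c = real_cepstrum(frame)
half = c[: len(c) // 2]                          # the cepstrum is symmetric
```

Since the log magnitude spectrum of a real frame is even, the resulting cepstrum is symmetric, which is why only its first half is kept.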


    Basic Principles of Liftering Technique:

A lifter is a filter that operates in the quefrency domain. A low-time lifter is analogous to a low-pass filter in the frequency domain: it is implemented by multiplying the cepstrum by a window in the quefrency domain, and converting back to the frequency domain yields a smoother signal. We will use low-time liftering to obtain a smoothed formant envelope and then estimate the formant frequencies from it; a high-time lifter is used to obtain the pitch information.

Let's discuss both of these in detail:

    Pitch Estimation (High Time Liftering):

The pitch information typically appears as periodic peaks located beyond roughly 12-20 samples in the cepstrum. Making use of this assumption, we use a high-time liftering window to separate the pitch component from the cepstrum. The high-time lifter used here is simply the complement of the low-time lifter described later.

We represent the high-time lifter as H[n]. With cutoff L0,

H[n] = 1 for L0 ≤ n ≤ N/2
H[n] = 0 otherwise


The figure shows the high-time liftering window.

We multiply the cepstrum c(n) by this high-time lifter H[n] and then look for the highest peak. The liftered cepstrum contains many periodic peaks, but the sample index at which the highest peak occurs gives the pitch period of the speech signal s(n); since we already know the sampling frequency, the pitch follows directly from this sample value.
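As a sketch of this procedure, the following NumPy fragment builds a synthetic voiced frame with a known 50-sample pitch period (so 160 Hz at fs = 8000 Hz) and recovers the pitch by high-time liftering; the signal, the L0 = 15 cutoff, and the function name are illustrative assumptions, not the project's exact code:

```python
import numpy as np

def cepstral_pitch(frame, fs, L0=15):
    """High-time liftering: zero the low-quefrency (vocal tract) region,
    then the strongest remaining cepstral peak gives the pitch period."""
    w = np.hamming(len(frame))
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame * w)) + 1e-8)).real
    half = c[: len(c) // 2].copy()
    half[:L0] = 0.0                      # high-time lifter H[n]
    period = int(np.argmax(half))        # pitch period in samples
    return fs / period

fs = 8000
excitation = np.zeros(512)
excitation[::50] = 1.0                   # impulse train, 50-sample period
h = 0.9 ** np.arange(64)                 # toy "vocal tract" impulse response
frame = np.convolve(excitation, h)[:512]
f0 = cepstral_pitch(frame, fs)           # should land near 160 Hz
```

The highest cepstral peak past the lifter cutoff sits at the pitch period, so dividing the sampling frequency by its index gives the fundamental frequency.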

    Formant estimation (Low Time Liftering):

Low-time liftering is applied to the cepstrum of the speech signal to obtain the formant estimate. Formants are defined as "the spectral peaks of the sound spectrum of the voice"; a formant is often measured as an amplitude peak in the frequency spectrum of the sound, using a spectrogram.

The low-time liftering window used in this project for extracting the vocal tract characteristics is:

L[n] = 1 for 0 ≤ n < L0
L[n] = 0 otherwise


L0 typically lies between 12 and 20 samples.

    The characteristics of the window are drawn below in time domain:

We have taken L0 = 15 samples, so L[n] = 1 for the first 15 samples and 0 for the rest.

Cl[n] = c[n]·L[n]

The vocal tract characteristics are obtained by multiplying the cepstrum c(n) by this window L[n] and then taking the DFT. The DFT of the low-time liftered signal gives the log magnitude of the vocal tract spectrum, where H(w) is the frequency response of the vocal tract.

    So,

log|H(w)| = DFT[Cl[n]]

All the details of the vocal tract, such as the formant frequencies, can be read off this vocal tract spectrum. As mentioned in the first paragraph of this topic, the spectrum has local peaks; each of these peaks represents a formant frequency, and these frequencies differ from vowel to vowel.
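The low-time liftering step can likewise be sketched. This NumPy fragment (illustrative names; the L0 = 15 cutoff and 10000-point FFT mirror the choices in the MATLAB code) keeps only the first L0 cepstral samples and reads formant candidates off the local maxima of the resulting log envelope:

```python
import numpy as np

def cepstral_formants(c, fs, L0=15, nfft=10000):
    """Low-time liftering: keep only c[0:L0], transform back with a long DFT,
    and take local maxima of the smooth log envelope as formant candidates."""
    liftered = np.zeros(nfft)
    liftered[:L0] = c[:L0]                             # Cl[n] = c[n] * L[n]
    envelope = np.fft.fft(liftered).real[: nfft // 2]  # ~ log|H(w)|
    peaks = [k for k in range(1, len(envelope) - 1)
             if envelope[k - 1] < envelope[k] > envelope[k + 1]]
    return [(k * fs / nfft, envelope[k]) for k in peaks]  # (Hz, log magnitude)
```

Feeding it the cepstrum of a single synthetic resonance should yield a dominant envelope peak near the resonance frequency; the candidate with the largest log magnitude is the formant estimate.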


    Main Function:

    % *************************************************************************
    % This script separates the formant frequency (vocal tract information)
    % from the pitch information in voiced human speech.
    % SYNOPSIS:
    % -------------------------------------------------
    % final_op = speech_synth(fs,p,N,rN,vowel,method)
    % -------------------------------------------------
    % *************************************************************************
    clear all; clc; close all;

    %% Loading the file.
    [FileName,PathName] = uigetfile('*.wav','Select the WAV file to process');
    ln = input('Please enter the liftering window length (typical value: 10-40): ');
    FilePath = strcat(PathName, FileName);
    [y,fs] = audioread(FilePath);
    [PATHSTR,NAME,EXT] = fileparts(FilePath);

    %% Checking the validity of the file.
    if (strcmp(EXT,'.wav') || strcmp(EXT,'.mp3') || strcmp(EXT,'.au'))
        %% Dual-channel to single-channel audio conversion.
        y = y(:,1);
        %% Taking a small frame of audio data.
        y = y(2000:5000);
        y = double(y);
        N = length(y);
        t = (0:N-1)/fs;                          % time in seconds
        figure;
        subplot(2,2,1); plot(t,y);               % speech signal
        legend('Speech Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        y_th = y./(1.1*abs(max(y)));             % normalization
        y_th = y_th(1:N);
        subplot(2,2,2); plot(t,y_th);            % normalized & framed signal
        legend('Normalized Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        w = hamming(N);                          % Hamming window
        y_w = y_th.*w;
        subplot(2,2,3); plot(t,y_w);             % windowed signal
        legend('Windowed Signal'); xlabel('Time (s)'); ylabel('Amplitude');
        y_fft = fft(y_w,N);
        k = 0:N-1;
        subplot(2,2,4); plot(k,abs(y_fft));      % magnitude of the DFT
        legend('DFT'); xlabel('Frequency bin'); ylabel('Amplitude');
        c = ifft(log(abs(y_fft)));               % real cepstrum
        figure, plot(c);
        axis([0,N/2,-1,1]);   % the cepstrum is symmetric, so only half is shown
        legend('Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        y_c = c(1:floor(length(c)/2));
        % Function for performing liftering.
        [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(y_c,fs,N,ln);
        t1 = 1:floor(N/2);
        figure, plot(t1,y_formant);              % low-time liftered cepstrum
        axis([0,N/2,-1,1]);
        legend('Low Time Liftered Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        % Multiplying the cepstrum with the high-time lifter gives the pitch estimate.
        figure, plot(t1,y_Pitch);
        axis([0,N/2,-1,1]);
        legend('High Time Liftered Cepstrum'); xlabel('Quefrency'); ylabel('Amplitude');
        %% Formant estimation.
        figure, plot(formant_ceps);
        hold on; plot(formant,Magnitude_F,'ko'); hold off;
        legend('Formant Spectrum'); xlabel('Frequency bin'); ylabel('Amplitude in LOG');
    else
        error('The file is invalid. Please upload only wav, mp3 or au file formats');
    end


    Liftering Function:

    function [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(c,fs,N,ln)
    %% Low-quefrency lifter of length ln for obtaining the vocal tract estimate.
    L = zeros(length(c),1);
    L(1:ln) = 1;
    y_formant = real(c.*L);          % low-time liftered cepstrum (vocal tract)

    %% High-time lifter.
    H = zeros(length(c),1);
    H(ln:floor(N/2)) = 1;
    y_Pitch = real(c.*H);            % high-time liftered cepstrum (pitch)

    % The location of the maximum of y_Pitch gives the pitch period in samples.
    [y_Pitchvalue, y_Pitchlocation] = max(y_Pitch);
    p_period = y_Pitchlocation;
    p_frequency = (1/p_period)*fs;

    %% Formant estimation.
    formant_ceps = fft(y_formant,10000);
    formant_ceps = formant_ceps(1:5000);
    formant_ceps = real(formant_ceps);
    % Local maxima of the smoothed log spectrum are the formant candidates.
    t = 1;
    for i = 2:length(formant_ceps)-1
        if (formant_ceps(i-1) < formant_ceps(i)) && (formant_ceps(i) > formant_ceps(i+1))
            formant(t) = i;
            Magnitude_F(t) = formant_ceps(i);
            t = t + 1;
        end
    end
    end


    Vowel: A_Front

    Symbol: a

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 135.5932 Hz, F1 = 850 Hz, F2 = 1600 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: ae

Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 129.0323 Hz, F1 = 500 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BackwardSchwa

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 140.3509 Hz, F1 = 250 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BackwardsEpsilon

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 139.1304 Hz, F1 = 350 Hz, F2 = 1250 Hz

    Figure 1


    Figure 2

    Figure 3


Figure 4

    Figure 5


    Vowel: Barred I

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 140.3509 Hz, F1 = 1500 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


    Vowel: BarredO

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 146.789 Hz, F1 = 350 Hz, F2 = 1200 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: BarredU

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 148.1481 Hz, F1 = 1350 Hz

    Figure 1


Vowel: CapitalOE

    Symbol:

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 139.1304 Hz, F1 = 370 Hz, F2 = 1900 Hz

    Figure 1


Vowel: Capital U

Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.

Figure 2: Cepstrum c(n) of the original signal.

Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.

Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.

Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.

From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case:

Pitch = 148.1481 Hz, F1 = 400 Hz, F2 = 1400 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


Vowel: Capital Y

Pitch = 145.4545 Hz, F1 = 1350 Hz

    Figure 1


    Figure 5


    Vowel: Caret

    Symbol:

Pitch = 139.1304 Hz, F1 = 500 Hz, F2 = 1500 Hz

    Figure 1


    Vowel: e

    Symbol: e

Pitch = 141.5929 Hz, F1 = 390 Hz, F2 = 2300 Hz

    Figure 1


Vowel: i

Symbol: i

Pitch = 146.789 Hz, F1 = 240 Hz, F2 = 2200 Hz

    Figure 1


Vowel: o

Symbol: o

Pitch = 142.8571 Hz, F1 = 240 Hz, F2 = 2200 Hz

    Figure 1


Vowel: u

Symbol: u

Pitch = 149.5327 Hz, F1 = 250 Hz, F2 = 595 Hz

    Figure 1


Vowel: U (Girl's Voice)

The sound clip is of a girl's voice, so we expect a higher pitch than the male voice tested earlier.

Pitch = 237.0968 Hz, F1 = 300 Hz, F2 = 600 Hz

    Figure 1


Vowel: U (My Voice)

Pitch = 146.5116 Hz, F1 = 250 Hz, F2 = 600 Hz

    Figure 1


    Vowel: Y

    Symbol: y

Pitch = 155.3398 Hz, F1 = 235 Hz, F2 = 2100 Hz

    Figure 1


    Figure 2

    Figure 3


    Figure 4

    Figure 5


This kind of speech deconvolution has many applications. It is widely used in speech recognition, since different speakers have different pitch, and vowels can easily be recognized from their formant frequencies, as shown above.

Pitch estimation is also used to distinguish anger from neutral emotion, in lie detection, and in similar tasks; angry speech generally has a higher pitch.

The concept is also used in automatic music transcription.

References:

http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ (vowel formant frequencies)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2453&rep=rep1&type=pdf (homomorphic deconvolution)
https://www.wikipedia.org/ (general definitions)
http://www.freesound.org/ (vowel .wav files)