

Proceedings of

Workshop on

Image and Signal Processing

December 28-29, 2007

Indian Institute of Technology Guwahati

Guwahati - 781039, India.

Organized by

Co-Sponsored by

Department of Biotechnology

Government of India


Published by the Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.

Printed in India by the Centre of Mass Media Communication, IIT Guwahati.

Copyright © 2007 by the Department of ECE, IIT Guwahati.

Author Disclaimer

While the author and the publisher believe that the information and the guidance given in this work are correct, all parties must rely upon their own skill and judgment when making use of it. Neither the author nor the publisher assumes any liability to anyone for any loss or damage caused by any error or omission in the work, whether such error or omission is the result of negligence or any other cause. Any and all such liability is disclaimed.

Copyright and copying

All rights are reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher.

Table of Contents
(Workshop on Image and Signal Processing)
(WISP-2007)

Preface ........................................................... vii
Workshop Organization ............................................ viii
List of Reviewers .................................................. ix
Biographies of Invited Speakers .................................... xi

Image & Video Processing

Effect of Orthogonalization on FLD based Algorithms ................ 2
  Noushath S, Ashok Rao, Hemantha Kumar G

Match Score Fusion for Person Verification System using Face and Speech ................ 6
  Raghavendra R, Ashok Rao, Hemantha Kumar G

Removal of Degradations in Video Sequences using Decision based Adaptive Spatio Temporal Median Algorithm ................ 9
  S. Manikandan, D. Ebenezer

ROI Based Approach for the User Assisted Separation of Reflections from a Single Image ................ 13
  M. A. Ansari, R. S. Anand

Compression of Multi-Lead ECG Signals and Retinal Images using 2-D Wavelet Transform and SPIHT Coding Scheme for Mobile Telemedicine ................ 18
  L. N. Sarma, S. R. Nirmala, M. Sabarimalai Manikandan, S. Dandapat

Specialized Text Binarization Technique for Camera-Based Document Images ................ 28
  T. Kasar, J. Kumar, A. G. Ramakrishnan

A Multi-lingual Character Recognition System based on Subspace Methods and Neural Classifiers ................ 32
  Manjunath Aradhya V. N, Ashok Rao, Hemantha Kumar G

Two-Level Search Algorithm for Motion Estimation ................ 36
  Deepak J. Jayaswal, Mukesh A. Zaveri


Speech & Audio Processing

Four-way Classification of Place of Articulation of Marathi Unvoiced Stops from Burst Spectra ................ 42
  Veena Karjigi, Preeti Rao

Explicit Segmentation of Speech Signals using Bach Filter-Banks ................ 47
  Ranjani H. G, Ananthakrishnan G, A. G. Ramakrishnan

Temporal and Spectral Processing for Enhancement of Noisy Speech ................ 51
  P. Krishnamoorthy, S. R. M. Prasanna

Partial Encryption of GSM Coded Speech using Speech-Specific Knowledge ................ 57
  V. Anil Kumar, A. Mitra, S. R. M. Prasanna

Exploring Suprasegmental Features using LP Residual for Audio Clip Classification ................ 63
  Anvita Bajpai, B. Yegnanarayana

Spectral Estimation using CLP Model and its Applications in Speech Processing ................ 67
  Gupteswar Sahu, S. Dandapat

Speech Synthesis

Text to Speech Synthesis System for Mobile Applications ................ 74
  K. Partha Sarathy, A. G. Ramakrishnan

Sinusoidal Analysis and Music Inspired Filter Bank for Training Free Speech Segmentation for TTS ................ 78
  Ranjani H. G, Ananthakrishnan G, A. G. Ramakrishnan

Novel Scheme to Handle the Missing Phonetic Contexts in Speech Synthesis, based on Human Perception and Speech ................ 82
  Laxmi Narayana M, A. G. Ramakrishnan

Detection of Positions of Packet Loss in Reconstructed VOIP Speech ................ 86
  Anupam Mandal, C. Chandra Sekhar, K. R. Prasanna Kumar, G. Athithan

Defining Syllables and their Stress Labels in Tamil TTS Corpus ................ 92
  Laxmi Narayana M, A. G. Ramakrishnan

Grapheme to Phoneme Conversion for Tamil Speech Synthesis ................ 96
  A. G. Ramakrishnan, Laxmi Narayana M


Speech & Speaker Recognition

Designing Quadratic Spline Wavelet for Subband Based Speaker Classification ................ 102
  Hemant A. Patil, T. K. Basu

Speaker Recognition in a Multispeaker Environment ................ 106
  R. Kumara Swamy, B. Yegnanarayana

Speaker Recognition in Limited Data Condition ................ 110
  H. S. Jayanna, S. R. M. Prasanna

Emotion Recognition using Multilevel Prosodic Information ................ 115
  K. Sreenivasa Rao, S. R. M. Prasanna, T. Vidya Sagar

Wavelet Based Speaker Recognition under Stressed Condition ................ 121
  J. Rout, S. Dandapat

VLSI

A 2.4 GHz CMOS Differential LNA with Multilevel Pyramidically Wound Symmetric Inductor in 0.18 μm Technology ................ 128
  Genemala Haobijam, Deepak Balemarthy, Roy Paily

A Design Overview of Low Error Rate 800 MHz 6-bit CMOS FLASH ADCs ................ 134
  Niket Agrawal, Roy Paily

Acquisition, Storage and Analysis of Real Signals using FPGA ................ 140
  P. Chopra, A. Kapoor, R. Paily

Author Index ................ 145


Preface

The recent decades have witnessed tremendous progress in the field of Signal Processing, which has touched almost all walks of our daily life. At the international level, there exist a number of platforms that allow researchers and technologists in this area to exchange their ideas regularly. In India, we have often felt the need for more accessible avenues for facilitating dialogue among researchers. Motivated by this, in 2006 Prof. B. Yegnanarayana initiated a workshop christened WISP at the Indian Institute of Technology Madras with the aim of providing an accessible platform for Signal Processing researchers within the country. It is heartening that we got the opportunity to hold WISP-2007 at the Indian Institute of Technology Guwahati (IITG). We highly appreciate the C. R. Rao Advanced Institute of Mathematics, Statistics and Computer Science (AIMSCS), Hyderabad for joining hands in organizing this workshop in the North East region.

The technical program of WISP-2006 involved invited talks from eminent speakers and was well received by both academia and industry. The technical program of WISP-2007 also includes a call for papers in addition to the invited talks from eminent speakers. The call for papers found a good response: we received 43 research papers, out of which 28 papers were selected for poster presentation based on the recommendations of the reviewers from IISc and the IITs. We are very thankful to all the reviewers for their valuable time and effort in ensuring the quality.

We are fortunate and thankful to Prof. G. V. Anand, IISc Bangalore; Dr. P. S. Goel, MoES, Govt. of India; Prof. H. Hermansky, IDIAP, Switzerland; Prof. V. Kannan, University of Hyderabad; Dr. V. Thyagarajan, Motorola India; Prof. Vijaykumar B., CMU, USA; and Prof. B. Yegnanarayana, IIIT Hyderabad, who have kindly consented to deliver the invited talks at the workshop. The invited talks cover a wide range of topics from Signal Processing theory and applications. This workshop also features a special session on Music Therapy by Ms. Sunita Bhuyan, a well-known violinist of the Hindustani style.

We are grateful to the Ministry of Human Resource Development (MHRD), Govt. of India; the Department of Science and Technology (DST), Govt. of India; the Department of Biotechnology (DBT), Govt. of India; the Council of Scientific and Industrial Research (CSIR), India; the North Eastern Council (NEC), Shillong; the Indian Space Research Organisation (ISRO); Hewlett-Packard (HP) India; and Motorola India for providing generous partial financial support for conducting this workshop.

We thank the members of the advisory committee and the organizing committee for their support and co-operation. We express our sincere gratitude to Prof. Gautam Barua, Director, IITG; Prof. S. B. Rao, Director, AIMSCS, Hyderabad; and Prof. P. K. Bora, Head, Dept. of ECE, IITG for their constant support and valuable guidance. We thank Prof. K. Ramachandran, Head, Centre for Mass Media Communication (MMC), IITG for allowing the in-house printing of these proceedings, and the staff of MMC, in particular Mr. Bipin Kataki, Mr. Amitabh Bordoloi and Mr. Bishnu Tamuli, for their excellent work in printing. Our special thanks to Mr. L. N. Sharma, Mr. H. S. Jayanna, Mr. P. Krishnamoorthy and Mr. M. S. Manikandan for their assistance in bringing out these proceedings.

We welcome all the delegates to the exciting technical program of WISP-2007 and trust that the workshop serves its primary objective of sparking creative dialogue among the delegates. We wish you all a pleasant stay on the picturesque campus of IIT Guwahati.

Guwahati                                          S. Dandapat
December 13, 2007                                 S. R. M. Prasanna
                                                  P. R. Sahu
                                                  Rohit Sinha

Workshop Organization

Chair

S. Dandapat, IIT Guwahati

Convenors

S. R. M. Prasanna, IIT Guwahati

Rohit Sinha, IIT Guwahati

Advisory Committee

G. Barua, IIT Guwahati

S. B. Rao, ISI Kolkata

K. C. Bhattacharya, NESAC Shillong

P. K. Bora, IIT Guwahati

A. Chaudhury, Gauhati University

M. K. Chaudhuri, Tezpur University

R. Chellappa, Univ. of Maryland, USA

A. Mahanta, IIT Guwahati

J. Medhi, Gauhati University

P. J. Narayanan, IIIT Hyderabad

V. K. Saraswat, DRDO Hyderabad

V. Thyagarajan, Motorola India

Organizing Committee

R. Bhattacharjee, IIT Guwahati

A. K. Gogoi, IIT Guwahati

C. Mahanta, IIT Guwahati

S. Majhi, IIT Guwahati

A. K. Mishra, IIT Guwahati

A. Mitra, IIT Guwahati

H. B. Nemade, IIT Guwahati

R. Paily, IIT Guwahati

A. Rajesh, IIT Guwahati

J. S. Sahambi, IIT Guwahati

P. R. Sahu, IIT Guwahati

K. R. Singh, IIT Guwahati

List of Reviewers

Arun Kumar, IIT Delhi

T. K. Basu, IIT Kharagpur

P. K. Bora, IIT Guwahati

C. Chandra Sekhar, IIT Madras

C. Mahanta, IIT Guwahati

S. Majhi, IIT Guwahati

A. K. Mishra, IIT Guwahati

Preeti S. Rao, IIT Bombay

A. G. Ramakrishnan, IISc Bangalore

G. Saha, IIT Kharagpur

J. S. Sahambi, IIT Guwahati

P. R. Sahu, IIT Guwahati

R. Sinha, IIT Guwahati

K. Sreenivasa Rao, IIT Kharagpur

S. Umesh, IIT Kanpur

Vinod Kumar, IIT Roorkee

P. Viswanath, IIT Guwahati

Biographies of Invited Speakers

G. V. Anand

Professor, Dept. of Electrical Communication Engineering,

Indian Institute of Science, Bangalore - 560 012, India.

Email: [email protected]

Homepage: http://ece.iisc.ernet.in/~anandgv/

Prof. G. V. Anand obtained his Ph.D. degree in Electrical Communication Engineering from IISc in 1971, and his M.Sc. (Physics) and B.Sc. from Osmania University in 1964 and 1962, respectively. Since 1969 he has held various positions in the Dept. of ECE, IISc Bangalore, and since 1994 he has been a professor in the same department. He was a Commonwealth Academic Staff Fellow of the Department of Electronics & Electrical Engineering, University College London, during 1978-1979. He also held a Visiting Scientist position at the Naval Physical & Oceanographic Laboratory, Kochi, during 1996-1997. He is a Fellow of the Indian Academy of Sciences, Fellow of the Indian National Academy of Engineering, Fellow of the Institute of Electronics and Telecommunication Engineering, and Fellow of the Acoustical Society of India. His research interests include ocean acoustics, signal processing and fractal image compression. He has several publications to his credit in these areas.

Title of the talk:

Non-Gaussian Signal Processing


P. S. Goel

Secretary, Ministry of Earth Sciences, Government of India,

Mahasagar Bhavan, Block-12, CGO Complex, Lodhi Road, New Delhi – 110003

Email: [email protected]

Homepage: http://www.incois.gov.in/Incois/homepages/chairman.htm

Born on April 20, 1947 in Moradabad (U.P.), India, Dr. P. S. Goel had his education at the University of Jodhpur [B.E. Hons. (Electrical Engineering)], the Indian Institute of Science [M.E. (Applied Electronics & Servomechanism)], and Bangalore University [Ph.D.]. Before taking over as Secretary, Department of Ocean Development (now Ministry of Earth Sciences) in 2005, Dr. Goel was the Director of the ISRO Satellite Centre. Dr. Goel has specialized in Applied Electronics & Servomechanism and their application to problems of satellites. He has published over 100 research papers in international journals and conferences.

Title of the talk:

India Inc. - Success and Failures in Engineering


H. Hermansky

Professor and Senior Researcher, IDIAP Research Institute,

Centre du Parc, Av. des Prés-Beudin 20, Case Postale 592, CH-1920 Martigny, Switzerland.

Email: [email protected]

Homepage: http://www.bme.ogi.edu/~hynek/

Prof. Hynek Hermansky works at the IDIAP Research Institute, Martigny, Switzerland. He has been working in speech processing for over 30 years, previously as a Research Fellow at the University of Tokyo, a Research Engineer at Panasonic Technologies, Santa Barbara, California, a Senior Member of the research staff at US WEST Advanced Technologies, and Professor and Director of the Center for Information Processing, OHSU, Portland, Oregon. He is a Fellow of the IEEE, a Member of the Board of the International Speech Communication Association, and a Member of the Editorial Boards of Speech Communication and of Phonetica. He holds 5 US patents and has authored or coauthored over 130 papers in reviewed journals and conference proceedings. He holds a Dr.-Eng. degree from the University of Tokyo and a Dipl.-Ing. degree from the Brno University of Technology, Czech Republic. His main research interests are in acoustic processing for speech and speaker recognition.

Title of the talk:

Modulation Spectrum in Speech Processing: Humans and Machines


V. Kannan

Professor, Department of Mathematics and Statistics,

University of Hyderabad, Hyderabad - 500 046, India.

Email: [email protected]

Homepage: http://202.41.85.103/faculty/kannan/index.html

Prof. V. Kannan obtained his Ph.D. from Madurai University in 1972. He is currently a professor in the Department of Mathematics and Statistics, University of Hyderabad, and is also Pro Vice-Chancellor of the University of Hyderabad. He is a Fellow of the Indian Academy of Sciences, Fellow of the Indian National Science Academy, and Fellow of the A.P. Academy of Sciences. Recently, the Indian Science Congress Association conferred on him the prestigious Srinivasa Ramanujan Award. His research interests include topological dimensions and chaos theory. He has several publications to his credit in these areas.

Title of the talk:

Theory of Chaos: An Introduction


B. Vijayakumar

Professor, Department of Electrical & Computer Engineering,

Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Email: [email protected]

Homepage: http://www.ece.cmu.edu/directory/details/92

Prof. B. Vijayakumar received his B.Tech and M.Tech degrees in Electrical Engineering from the Indian Institute of Technology Kanpur and his Ph.D. in Electrical Engineering from Carnegie Mellon University, Pittsburgh, USA. Since 1982, he has been a faculty member in the Department of Electrical and Computer Engineering (ECE) at Carnegie Mellon, where he is now a professor. Prof. Kumar served as Associate Head of the Department of ECE from 1994 to 1996 and as the acting Head of the Department during 2004-2005. Prof. Kumar's research interests include automatic target recognition algorithms, biometric recognition methods, coding, and signal processing for data storage systems. He is currently leading the coding and signal processing research efforts in the Data Storage Systems Center (DSSC) and the biometrics research efforts in CyLab at Carnegie Mellon. He has several publications to his credit in these areas.

Title of the talk:

Iris Verification: Recent Advances


B. Yegnanarayana

Professor and Microsoft Chair, International Institute of Information Technology,

Gachibowli, Hyderabad - 500032, India

Email: [email protected]

Homepage: http://www.iiit.ac.in/faculty/ynarayan.php

Prof. B. Yegnanarayana was born in India in 1944. He received the B.E., M.E. and Ph.D. degrees in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, India in 1964, 1966 and 1974, respectively. He was a Lecturer from 1966 to 1974 and an Assistant Professor from 1974 to 1978 in the Department of Electrical Communication Engineering, Indian Institute of Science. From 1978 to 1980, he was a Visiting Associate Professor of Computer Science at Carnegie Mellon University, Pittsburgh, USA. From 1980 to 2006, he was a professor in the Department of Computer Science and Engineering, Indian Institute of Technology Madras. He is currently professor and Microsoft Chair at the International Institute of Information Technology, Hyderabad. His research interests include signal processing, speech, vision, neural networks and man-machine interfaces. He has published several papers in reviewed journals in these areas.

Title of the talk:

New Approaches to Speech Signal Processing


SUNITA BHUYAN

601, Sunshine, Raheja Vihar, Chandivli Farm Road,

Powai, Mumbai-400072.

Email: [email protected]

Sunita Bhuyan is a violinist of the Hindustani style and has carved a niche for herself as an upcoming musician of India. A recipient of the Indira Gandhi Priyadarshini award for Music, Sunita received her initial training from her mother Minoti Khaund, an eminent violinist and disciple of Pt. V. G. Jog. The mother-daughter duo of Minoti and Sunita have performed duets all over the country and abroad, regaling audiences and the press alike with their Jugalbandi. Sunita has attained a master's degree in Hindustani Music from Prayag Sangeet Samiti with a distinction, and has studied advanced music under the apprenticeship of the violin maestro, Padma Bhushan awardee, the late Pt. V. G. Jog. She has successfully imbibed Pt. Jog's Tantrakari style along with her inherent melody. Sunita has performed on several prestigious platforms in India and abroad, such as the India International Centre, Delhi; the National Gallery of Modern Art, Mumbai; Rabindra Natya Mandir, Mumbai; the India Habitat Centre; and Shankardev Kalakshetra, Assam. Along with pure classical music, Sunita has been performing a variety of light music melodies, experimenting with jazz, folk and popular music to reach out to a diverse spectrum of audiences. Sunita also conducts workshops and demonstrations on topics such as music appreciation, music and meditation, and music therapy, which have attracted very interesting audiences in the West.

Special Session Titled:

Music Therapy

Image & Video Processing

EFFECT OF ORTHOGONALIZATION ON FLD BASED ALGORITHMS

Noushath S,1 Ashok Rao,2 Hemantha Kumar G1

1Dept of Studies in Computer Science, University of Mysore, Mysore-570 006, [email protected]
2Dept of Computer Science, J.S.S. College of Arts, Commerce and Science, Mysore-570 025, [email protected]

ABSTRACT

Beginning with the standard Fisherface algorithm [7], many flavors of Fisher Linear Discriminant (FLD) algorithms have been proposed in the literature for face recognition. Their performance is widely admired, especially under varied facial expressions and lighting configurations. However, they tend to perform poorly when many or all of the Fisher discriminant vectors are used for feature extraction. This is due to the possibility of linear dependence among the Fisher discriminant vectors. Hence, in this work, we suggest a remedy to this problem by cascading an orthogonalization process with the regular FLD algorithms. We have conducted experiments on the ORL database under both clean and noisy conditions, and the results affirm our claims.

Index Terms— Face recognition; FLD algorithm; Linear dependency; Orthogonalization; Gram-Schmidt decomposition.

1. INTRODUCTION

Face recognition research has grown tremendously, predominantly due to the rising concern for security and verification tasks. Though research has been conducted for several decades now, the problem of face recognition is still largely unresolved due to the inherent technical difficulties involved. Many algorithms have been proposed in the literature, amongst which the subspace based FLD method [7] (also known as LDA) is very popular due to its ability to maximize the between-class scatter while minimizing the within-class scatter. In this paper, we identify a demerit of FLD based algorithms and emphasize the need for a preprocessing step to overcome this drawback.

The rest of the paper is organized as follows: In Section 2, we enumerate the drawbacks of conventional FLD algorithms and suggest a suitable preprocessing step. Experimental results and a comparative study are given in Section 3. The conclusions are drawn in Section 4.

2. NEED FOR ORTHOGONALIZATION

Many extended algorithms have been proposed for face recognition based on the FLD algorithm. Though these algorithms are efficient, we have identified the following major drawbacks in them:

1. They perform well only when a small number of projection vectors is used for feature extraction.

2. Selecting the optimal number of projection vectors to yield better recognition accuracy is a highly subjective task.

3. In addition, when features are extracted using many or all of the projection vectors, their classification performance deteriorates significantly.

This is due to the redundancy that gets introduced among the Fisher discriminant vectors when a larger number of projection vectors is used. The Foley-Sammon Discriminant (FSD) transform [1] was one remedy to this; it exhibits higher classification performance in face recognition than the Fisher linear discriminant because it eliminates the dependency among the discriminant vectors. However, its theory is complex and computationally intensive. Therefore, it is highly desirable to incorporate a strategy that makes these algorithms perform as well as, or better than, the case when the optimal number of projection vectors is used. Hence, in this paper, we suggest an equivalent way of removing the dependency among the Fisher discriminant vectors through a simple, yet effective, process known as orthogonalization. For this purpose, we use the Gram-Schmidt decomposition process, which is described in the following section.

Gram-Schmidt Orthonormal Decomposition

Suppose w_1, w_2, ..., w_d are the Fisher discriminant vectors and v_1, v_2, ..., v_d are the corresponding orthogonalized projection vectors. Let v_1 = w_1 and assume that k vectors v_1, v_2, ..., v_k (1 ≤ k ≤ d-1) have already been calculated. The (k+1)th orthogonalized Fisher discriminant vector v_{k+1} is calculated as follows:

v_{k+1} = w_{k+1} - \sum_{i=1}^{k} \frac{v_i^T w_{k+1}}{v_i^T v_i} v_i    (1)

By computing the dot product between any two vectors, we can easily verify the orthogonality between them. Note that the orthonormal vectors v_1, v_2, ..., v_d are used for feature extraction instead of the original discriminant vectors w_1, w_2, ..., w_d. This step should be used as a preprocessing step before extracting the features based on FLD methods.
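As a concrete illustration, here is a minimal NumPy sketch of this Gram-Schmidt step; the function name and the final normalization to unit length are our own additions for clarity and are not prescribed by the paper.

```python
import numpy as np

def gram_schmidt(W):
    """Orthogonalize the columns of W (each column is a Fisher
    discriminant vector w_k) following Eq. (1), then normalize."""
    d = W.shape[1]
    V = np.zeros_like(W, dtype=float)
    V[:, 0] = W[:, 0]                          # v_1 = w_1
    for k in range(1, d):
        w = W[:, k].astype(float)
        v = w.copy()
        for i in range(k):                     # subtract projections onto v_1 .. v_k
            vi = V[:, i]
            v -= (vi @ w) / (vi @ vi) * vi
        V[:, k] = v
    # optional: scale to unit norm so that the vectors are orthonormal
    return V / np.linalg.norm(V, axis=0, keepdims=True)

# quick check: off-diagonal dot products should be numerically zero
W = np.random.randn(50, 10)
V = gram_schmidt(W)
print(np.max(np.abs(V.T @ V - np.eye(10))))    # ~1e-15
```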

3. EXPERIMENTAL RESULTS

We have taken samples of the ORL database (www.uk.research.att.com/facedatabase.html) to ascertain the behavior of FLD based algorithms in their conventional and orthogonalized forms. The experiments are conducted under both clean and noisy conditions. The number of projection vectors d is controlled by the following equation:

\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \geq \theta    (2)

where N is the total number of eigenvalues and θ is a preset threshold. The algorithms are realized on a P4 3 GHz PC (with 1 GB RAM) using the Matlab 7 programming platform.
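For illustration, a small sketch (our own, not from the paper) of how d can be chosen from the eigenvalue spectrum according to Eq. (2):

```python
import numpy as np

def num_projection_vectors(eigvals, theta=0.98):
    """Smallest d such that the leading-d eigenvalue energy fraction
    reaches the threshold theta, as in Eq. (2)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    ratio = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(ratio, theta) + 1)

print(num_projection_vectors([5.0, 3.0, 1.0, 0.5, 0.2]))    # -> 5 for theta = 0.98
```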

3.1. Performance under clean data

First, an experiment was conducted by varying the number of training samples. The best recognition accuracy is obtained by varying the number of projection vectors from 1 to 45 in steps of 1. The values in parentheses (refer to Table 1) correspond to the number of projection vectors for which the best recognition accuracy was attained.

It is to be noted from Table 1 that most of the FLD algorithms obtained their optimal recognition accuracy only when a few eigenvectors (projection vectors) are used for feature extraction. For this, we had to repeat the experiment several times, varying the number of eigenvectors, and choose the one which yields the best recognition accuracy. However, this test condition may not be suitable under real-time conditions, as it involves a series of monotonous and routine computations just to classify an image to the correct class.

In the second type of experiment, the recognition accuracy is computed for a fixed number of projection vectors d controlled by Eq. (2). In this experiment, we compute the accuracy of all the FLD based algorithms for a 98% confidence interval (i.e., by setting θ to 0.98 in Eq. (2)). This experiment is repeated for a varying number of training samples on the ORL database. The results are depicted in Table 2. It is ascertained that, when a larger number of projection vectors is used, the performance of the orthogonalized versions is significantly better than that of their conventional counterparts.

Note that PCA based methods are inherently orthogonal in nature. The reason is that they use the covariance matrix (which is a symmetric matrix) for eigenvalue decomposition. Symmetric matrices always yield real eigenvalues, and the corresponding eigenvectors can be chosen orthogonal since they are guaranteed to be linearly independent. However, in the case of FLD based methods, the matrix S_w^{-1} S_b of the transformation is not symmetric. Under such circumstances (for general non-symmetric matrices), there is a possibility of obtaining linearly dependent eigenvectors due to repeated eigenvalues. Therefore, the chances of redundant information among the Fisher discriminant vectors are higher. This is the reason why FLD based methods perform poorly when many or all projection vectors are used for feature extraction.
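To see the contrast numerically, the following small sketch (our own illustration, not part of the paper) compares the eigenvectors of a symmetric covariance-like matrix with those of a non-symmetric product of the S_w^{-1} S_b type; the stand-in scatter matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

# symmetric (PCA-like) case: eigenvectors are orthogonal
S = A @ A.T
_, U = np.linalg.eigh(S)
print(np.max(np.abs(U.T @ U - np.eye(6))))        # ~1e-15

# non-symmetric (FLD-like) case: eigenvectors need not be orthogonal
Sw = A @ A.T + np.eye(6)          # stand-in within-class scatter
Sb = rng.standard_normal((6, 6))
Sb = Sb @ Sb.T                    # stand-in between-class scatter
_, W = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
W = np.real(W)
print(np.max(np.abs(W.T @ W - np.eye(6))))        # clearly non-zero
```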

3.2. Performance under noisy data

It would be interesting to check the performance of the orthogonalized algorithms under noisy conditions. For this purpose, we conduct a similar kind of experiment using the first five clean samples of ORL as training data. We have modeled five different noise contaminations, namely salt-and-pepper, Gaussian, Exponential, Weibull and Beta, using the respective continuous distributions in their discretized form (the Matlab command imnoise was used to generate the salt-and-pepper noise). The noisy images thus generated are used as test samples; a rough sketch of this contamination step is given after the list below. The average recognition accuracy (over ten experiments, with the noise density varying from 0.1 to 1.0 in steps of 0.1) for the different noise conditions is shown in Table 2. Under Weibull and exponential noise, the conventional (non-orthogonal) versions perform better. Some observations from this experiment are:

1. The Fisherface method outperformed its orthogonalized counterpart under all noise conditions by a good margin.

2. Conversely, the two-dimensional versions of the FLD algorithms (except the FLD method) perform better in their orthogonalized form than in their original form. This is because, in the case of 2D versions, there is a high possibility of capturing the advantage of orthogonality with respect to both spatial and feature variations.

3. The 2D2LDA algorithm (in its conventional form) is very robust under exponential and Weibull conditions.
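As an example of how such test data can be produced, here is a rough Python analogue of salt-and-pepper contamination at a given noise density (the paper itself used the Matlab imnoise command; this sketch only approximates that behaviour, and the function name is ours):

```python
import numpy as np

def salt_and_pepper(img, density, rng=None):
    """Corrupt a fraction `density` of pixels, half to 0 (pepper)
    and half to 255 (salt), roughly mimicking Matlab's imnoise."""
    rng = rng or np.random.default_rng()
    noisy = img.copy()
    mask = rng.random(img.shape) < density
    salt = rng.random(img.shape) < 0.5
    noisy[mask & salt] = 255
    noisy[mask & ~salt] = 0
    return noisy

clean = np.full((112, 92), 128, dtype=np.uint8)      # ORL images are 112 x 92
print((salt_and_pepper(clean, 0.1) != 128).mean())   # ~0.1 of pixels corrupted
```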

There are many other orthogonalization techniques available. One can use any of them as a preprocessing step, but the Gram-Schmidt method is computationally the least complex [2]. However, it would be an interesting issue to study the impact of different orthogonal spaces with respect to computational and classification performance.

We have attempted in this direction and reported their behavior in Table 3 (QR via HH and QR via GS indicate the implementation of QR decomposition through the Householder transformation and the Gram-Schmidt decomposition, respectively). For this experiment, we have considered the first five samples of ORL for training and the remaining five samples as testing samples. Table 3 presents the results for a 98% confidence interval. It is to be noted that, in every case, the orthogonalized versions outperformed the corresponding regular versions.

4. CONCLUSIONS

In this paper, we have identified the need for a preprocessing step for FLD based algorithms. For this purpose, we have suggested the Gram-Schmidt decomposition model, which obtains the orthonormal projection vectors of the conventional Fisher discriminant vectors. We have conducted extensive experiments and compared the performance of the orthogonalized versions of the FLD algorithms with their regular counterparts. For this purpose, we have considered both clean data samples and data contaminated by five different noise conditions.

5. REFERENCES

[1] J. Duchene and S. Leclercq. An optimal transformation for discriminant and principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):978-983, 1988.

[2] Todd K. Moon and Wynn C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Pearson Education, low price edition, 2004.

[3] Ming Li and Baozong Yuan. 2DLDA: A statistical linear discriminant analysis for image matrix. Pattern Recognition Letters, 26(5):527-532, 2005.

[4] Noushath S, Hemantha Kumar G, Manjunath Aradhya V. N, and Shivakumara P. Divide-and-conquer strategy incorporated Fisher linear discriminant analysis for efficient face recognition. In International Conference on Advances in Pattern Recognition, pages 40-45, 2007.

[5] Noushath S, Hemantha Kumar G, and Shivakumara P. 2D2LDA: An efficient approach for face recognition. Pattern Recognition, 39(7):1396-1400, 2006.

[6] Noushath S, Hemantha Kumar G, and Shivakumara P. Diagonal Fisher linear discriminant analysis for efficient face recognition. Neurocomputing, 69:1711-1716, 2006.

[7] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using Class Specific Linear Projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.


Table 1. Best recognition accuracy of FLD based methods for varying number of training samples and principal components (values in parentheses: number of projection vectors at which the best accuracy was attained)

Methods                 Number of Training Samples
                        2          4          5          6          8
FLD [7]                 85.75(40)  94.50(29)  96.25(31)  97.75(10)  99.25(13)
2DLDA [3]               91.25(2)   96.50(5)   98.00(7)   99.25(6)   99.50(7)
A2DLDA [5]              88.75(8)   95.25(8)   97.00(9)   98.50(10)  99.75(3)
2D2LDA                  89.75(8)   96.25(11)  97.50(8)   98.75(9)   99.75(5)
DiaFLD [6]              89.75(8)   95.25(4)   97.25(8)   98.50(4)   99.25(5)
DiaFLD + 2DFLD [6]      89.25(7)   95.75(7)   98.00(7)   98.75(7)   99.75(4)
dcFLD [4]               87.00(12)  94.75(18)  96.50(13)  98.50(12)  100.00(6)

Table 2. Comparing results of FLD based methods under clean and noise conditions (for 98% confidence interval)

                    Number of Training Samples (Clean)     Different Noise Conditions
Methods             2      4      5      6      8      Gaussian  Salt-and-Pepper  Exponential  Weibull  Beta
FLD                 85.00  87.00  89.50  91.25  95.25  94.75     67.00            73.25        59.75    99.50
Ortho-FLD           89.25  94.00  96.00  97.50  99.00  89.00     61.25            24.25        32.00    90.50
2DLDA               70.50  82.50  86.75  91.75  98.00  76.75     54.25            77.50        71.75    77.75
Ortho-2DFLD         87.25  93.50  95.50  96.75  99.00  86.75     59.25            24.50        31.75    97.00
A2DLDA              70.25  85.00  90.00  93.50  98.25  87.25     58.25            60.50        48.25    89.50
Ortho-A2DFLD        88.00  94.75  97.00  97.50  99.25  86.75     59.75            16.00        29.25    95.50
2D2LDA              27.25  51.00  60.00  68.75  85.75  73.25     49.25            92.50        96.25    77.00
Ortho-2D2LDA        86.00  93.25  95.50  96.75  99.00  86.50     59.00            37.75        38.50    94.75
DiaFLD              76.75  86.75  89.00  93.00  97.50  87.25     58.75            74.50        66.75    91.25
Ortho-DiaFLD        87.25  93.50  95.50  97.00  99.00  91.00     62.75            39.50        39.75    98.00
DiaFLD+2DFLD        27.00  49.50  59.25  68.25  86.50  54.25     40.25            87.50        89.25    60.00
Ortho-DiaFLD+2DFLD  69.75  82.25  85.25  90.75  97.00  75.50     52.00            45.25        43.25    79.75

Table 3. Results obtained under different approaches to orthogonalization

Orthogonalization   FLD    dcFLD  2DLDA  A2DLDA  2D2LDA  DiaFLD  DiaFLD+2DFLD
Approaches
QR via HH           96.75  95.50  95.50  96.25   96.00   96.00   93.75
QR via GS           96.00  95.00  95.50  97.00   95.50   95.50   85.25


MATCH SCORE FUSION FOR PERSON VERIFICATION SYSTEM USING FACE AND SPEECH

Raghavendra.R,1 Ashok Rao,2 Hemantha Kumar.G1

1Dept of Studies in Computer Science, University of Mysore, Mysore-570 006, [email protected]

2Dept of Computer Science, J.S.S. College of Arts, Commerce and Science, Mysore-570 025, [email protected]

ABSTRACT

Multimodal biometric person verification has been gaining popularity in recent years as it outperforms unimodal person verification. This paper presents a person verification system using speech and face data. The verification system comprises two classifiers whose scores are fused using the sum rule after normalization. The experiments are carried out on the VidTIMIT database. The experimental results show that a face expert designed using Two-Dimensional Principal Component Analysis, and a speech expert using Linear Prediction Cepstral Coefficients as the feature extractor and a Gaussian Mixture Model with 16 mixtures as the opinion generator, provide a Half Total Error Rate of 1.95%.

Index Terms— Biometrics, Information fusion, Multimodal biometrics, Identity verification

1. INTRODUCTION

The area of biometric verification has been receiving a lot of attention in recent years. Biometrics is defined as the science of recognizing an individual based on physiological or behavioral traits [1]. The majority of biometric systems deployed in real-world applications are unimodal [2]. The performance of unimodal biometric systems is limited by noise in the sensed data, intra-class variation, spoof attacks and so on [1]. These limitations can be overcome by including multiple sources of information for establishing the identity. Such a system is called a multimodal biometric system, and these systems are expected to be more reliable due to the presence of multiple, independent pieces of evidence. A multimodal biometric system integrates the information presented by multiple biometric systems. The information can be consolidated at various levels; the process of combining the information provided by multiple biometric systems is called fusion. Fusion can be accomplished at various levels in a biometric system, broadly classified as (i) fusion prior to matching and (ii) fusion after matching [2]. Fusion prior to matching includes sensor level and feature level fusion. However, fusion at this level is difficult to achieve in practice [1]. To fuse the data at the sensor level, the multiple cues must be compatible and the correspondence between points in the raw data must be known in advance [2]. Feature level fusion refers to combining the different feature sets extracted from multiple biometric systems. However, fusion at this level is difficult because the feature sets of the various modalities may not be compatible, and concatenation is not possible when the feature sets are incompatible. Most multimodal biometric systems therefore fuse information at the match score level or the decision level. Fusion at the decision level is considered to be rigid due to the availability of only limited information [1]. Thus, fusion at the match score level is preferred, as it is relatively easy to access and combine the scores presented by the different unimodal biometric systems [2].

A large number of commercial biometric systems use fingerprint, face, iris, retina, palmprint, speech and hand geometry. Each of these modalities has its own drawbacks. Techniques based on the iris and retina are very reliable but not well suited for end-users [1]. Identification through voice and face is natural and easily accepted by end-users. A lot of work has been done in recent years in the fields of face and speaker verification.

Kittler et al. [3] combine classifiers in a probabilistic Bayesian framework and fuse the scores of three traits (speech, frontal face and profile face). Several ways of implementing the fusion are described, such as sum, product, max, min, median and majority voting, of which the sum rule outperforms the remainder. Chibelushi et al. [4] combined information from speech and face images: the speech expert uses three versions of the E-set from the English alphabet, with perceptually-weighted linear prediction (PLP) cepstra used for feature extraction, while the face expert uses Zernike moment invariants to represent the features, and decision level fusion is employed using a weighted sum rule. Brunelli et al. [5] combine speech, eye, nose and mouth information for person verification: features of speech are extracted using Mel Frequency Cepstral Coefficients (MFCC) and vector quantization, features of the eyes, nose and mouth are extracted using a grey level similarity measure, and all features are combined using a geometric average.


Ben-Yacoub et al. [6] combine speech and face information. The speech expert uses LPCC with a sphericity measure and LPCC with a Hidden Markov Model (HMM) for text independent and text dependent speaker verification, respectively, while the face expert uses an elastic graph method for feature extraction. Five different methods of fusion are used, namely Support Vector Machines, C4.5, a multilayer perceptron, a Bayesian classifier and Fisher's linear discriminant. Sanderson et al. [7] combine speech and face information: the speech expert uses MFCC for feature extraction and a Gaussian Mixture Model (GMM) as the opinion generator, the face expert uses the PCA technique for feature extraction, and secondary classifiers such as SVM and a Bayesian classifier are used for fusion.

In this paper, a multimodal person identity verification system is presented which relies on the face and speech modalities. The face expert uses Two-Dimensional Principal Component Analysis (2DPCA) as the algorithm for feature extraction. The speech expert employs text independent speaker verification using LPCC and also MFCC for feature extraction, with a GMM as the opinion generator for different numbers of mixtures, namely 4, 8, 16 and 32. The score level fusion of the expert decisions is made using the sum rule. The experiments are carried out on the VidTIMIT database [8].

The remainder of the paper is organized as follows: Section 2 presents the verification models, Section 3 shows the experimental results, and Section 4 draws the conclusions.

2. VERIFICATION MODEL

2.1. Face verification

A full face verification system consists of three stages: (i) face localization and segmentation, (ii) normalization, and (iii) feature extraction and classification. In this paper we explore the subspace based 2DPCA for feature extraction. As opposed to conventional PCA, 2DPCA is based on the two-dimensional image matrix and does not require conversion of the matrix to a one-dimensional vector. Instead, an image covariance matrix can be constructed directly from the image matrix [9]. This results in a smaller image covariance matrix, thereby taking less time to determine the corresponding eigenvectors. In addition, the recognition accuracy of 2DPCA is better than that of PCA [9].

The facial image is normalized to a 70 × 65 pixel window which contains the eyes, nose, ears and hair style. 2DPCA based feature extraction is performed as described in [9]; the number of dominant eigenvalues chosen in the experiments is ten. After transformation by 2DPCA, a feature matrix is obtained for each image. Then, a nearest neighbor classifier is used for classification.
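For reference, a minimal sketch of 2DPCA feature extraction in the spirit of [9] is given below; the function names, the nearest-neighbour matching by Frobenius distance, and the random toy data are our own illustrative choices rather than details taken from the paper.

```python
import numpy as np

def fit_2dpca(images, n_components=10):
    """images: array (n, h, w). Returns the dominant eigenvectors (w, n_components)
    of the image covariance matrix G = mean((A - mean)^T (A - mean))."""
    mean = images.mean(axis=0)
    centered = images - mean
    G = np.mean(np.einsum('nij,nik->njk', centered, centered), axis=0)
    eigvals, eigvecs = np.linalg.eigh(G)               # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :n_components], mean    # keep the dominant ones

def project(images, eigvecs):
    """Feature matrix Y = A X for each image A."""
    return images @ eigvecs                             # (n, h, n_components)

# toy usage with random data standing in for 70 x 65 face windows
train = np.random.rand(20, 70, 65)
X, _ = fit_2dpca(train, n_components=10)
feats = project(train, X)
probe = project(np.random.rand(1, 70, 65), X)
nearest = np.argmin(np.linalg.norm(feats - probe, axis=(1, 2)))   # 1-NN match
print(feats.shape, nearest)
```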

2.2. Speaker verification

Text independent speaker verification is used. The speech expert consists of two parts: speech feature extraction and an opinion expert. We employ LPCC and MFCC as two different cases of feature extraction, and a GMM is used as the opinion expert. The GMM is the main tool used in speaker verification and is trained using the Expectation Maximization (EM) algorithm [10].

The speech signal is first preprocessed by removing silence zones. A high emphasis filter H(z) = 1 - 0.95 z^{-1} is applied to the speech signal, which is then divided into analysis frames using a Hamming window; each frame is 25 ms long with a frame advance of 10 ms. For each frame we calculate 15th order LPCC and MFCC coefficients as features. Each of these feature sets is analyzed using a GMM separately. In a GMM, the distribution of the feature vectors of each person is modeled using a mixture of Gaussians as follows [10][11]:

p(x) = \sum_{i=1}^{M} p_i b_i(x)    (1)

b_i(x) = \frac{1}{|2\pi\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}    (2)

Here μ_i and Σ_i represent the mean and covariance of the ith mixture. Given the training data and the number of mixtures, the parameters μ_i, Σ_i and p_i are learnt using the EM algorithm. Thus, a speaker-specific model is built by finding the proper parameters of the GMM from the speaker's own feature set. In our experiments we use 4, 8, 16 and 32 Gaussian mixtures in a GMM.
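A rough end-to-end sketch of this speech expert is shown below, assuming librosa for MFCC extraction and scikit-learn's GaussianMixture for the EM-trained model; the 0.95 pre-emphasis, 25 ms/10 ms framing, 15 coefficients and 16 mixtures follow the paper, while everything else (library choice, diagonal covariances, the scoring helper) is our own illustrative assumption.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(signal, sr=16000):
    """Pre-emphasis H(z) = 1 - 0.95 z^-1, then 15 MFCCs per 25 ms frame
    with a 10 ms frame advance (Hamming window)."""
    emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=15,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window='hamming')
    return mfcc.T                                    # (n_frames, 15)

def train_speaker_model(train_signals, sr=16000, n_mixtures=16):
    """Fit one GMM (EM algorithm) on all training frames of a speaker."""
    feats = np.vstack([mfcc_features(s, sr) for s in train_signals])
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
    return gmm.fit(feats)

def opinion(gmm, test_signal, sr=16000):
    """Average per-frame log-likelihood under the claimed speaker's model."""
    return gmm.score(mfcc_features(test_signal, sr))

# toy usage with random noise standing in for utterances
rng = np.random.default_rng(0)
model = train_speaker_model([rng.standard_normal(16000) for _ in range(3)])
print(opinion(model, rng.standard_normal(16000)))
```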

3. EXPERIMENTAL SETUP

The experiments are carried out on the VidTIMIT database [8]. This database comprises video and audio recordings of 45 people, recorded in 3 sessions, with 10 sentences per person across the three sessions. The video of each person is stored as a sequence of JPEG images with a resolution of 512 × 384 pixels. From this database, 15 users are selected for true claims and 10 users are selected as impostors. Sessions 1 and 2 are used as training data. To measure performance, Session 3 is used for obtaining the expert opinions for impostor and true claims. Four utterances from each of the 10 impostors are used to simulate impostor claims against the remaining 15 persons; thus we have 600 impostor claims and 30 true claims. The expert opinions are normalized using Z-score normalization and mapped onto the interval [0,1] using a sigmoid function [10].
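The normalization and sum-rule fusion step described above can be sketched as follows (our own minimal illustration; the z-score statistics and the decision threshold would in practice be estimated from development data, which the paper does not detail):

```python
import numpy as np

def z_sigmoid(scores, mean, std):
    """Z-score normalization followed by a sigmoid mapping onto [0, 1]."""
    z = (np.asarray(scores, dtype=float) - mean) / std
    return 1.0 / (1.0 + np.exp(-z))

def sum_rule_fusion(face_scores, speech_scores):
    """Unweighted sum of the two normalized expert opinions."""
    return np.asarray(face_scores) + np.asarray(speech_scores)

# toy example: two claims scored by both experts (hypothetical statistics)
face = z_sigmoid([0.82, 0.31], mean=0.5, std=0.2)
speech = z_sigmoid([-41.0, -55.0], mean=-48.0, std=5.0)
fused = sum_rule_fusion(face, speech)
decisions = fused >= 1.0            # hypothetical accept/reject threshold
print(fused, decisions)
```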

3.1. Performance measure

Biometric verification systems are evaluated using two error measures: the False Accept Rate (FAR) and the False Reject Rate (FRR). The FAR is defined as the ratio of the number of impostor accepts to the total number of impostor claims. The FRR is defined as the ratio of the number of genuine claims rejected to the total number of genuine claims. These two errors can be combined into the Total Error Rate (TER), defined as the ratio of the total number of impostor accepts and genuine rejects to the total number of claims.


Table 1. Performance of the speech and face opinion experts

Modality  Method        FAR   FRR  HTER
Speech    LPCC, GMM4    9.4   0    4.7
          LPCC, GMM8    4.4   0    2.2
          LPCC, GMM16   4.1   0    2.1
          MFCC, GMM4    9.9   0    4.95
          MFCC, GMM8    9.7   0    4.85
          MFCC, GMM16   6.88  0    3.44
          MFCC, GMM32   8.0   0    4.0
Face      2DPCA         4.0   3.8  3.9

Table 2. Results obtained using the SUM rule

Fusion                 FAR   FRR  HTER
LPCC, GMM4, 2DPCA      4.5   3.3  3.9
LPCC, GMM8, 2DPCA      4.0   0    2.0
LPCC, GMM16, 2DPCA     3.3   0.6  1.95
MFCC, GMM4, 2DPCA      9.5   0    4.75
MFCC, GMM8, 2DPCA      9.4   0    4.7
MFCC, GMM16, 2DPCA     4.56  0    2.28
MFCC, GMM32, 2DPCA     4.5   0    2.25

FAR and FRR are functions of a threshold that controls the tradeoff between the two error rates. In practical applications, this threshold is estimated to reach the Equal Error Rate (EER) or to minimize the Half Total Error Rate (HTER). The HTER is defined as:

HTER = \frac{FAR + FRR}{2} \times 100\%    (3)
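A small sketch of how these error rates can be computed from the fused scores (our own illustration; the variable names, toy score distributions and threshold sweep are assumptions, not taken from the paper):

```python
import numpy as np

def far_frr_hter(genuine_scores, impostor_scores, threshold):
    """FAR, FRR and HTER (in %) at a given decision threshold,
    following Eq. (3): HTER = (FAR + FRR) / 2."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    far = 100.0 * np.mean(impostor >= threshold)   # impostors wrongly accepted
    frr = 100.0 * np.mean(genuine < threshold)     # genuine claims wrongly rejected
    return far, frr, (far + frr) / 2.0

# sweep thresholds and pick the one minimizing HTER (toy scores)
rng = np.random.default_rng(1)
gen = rng.normal(1.4, 0.2, 30)      # 30 true claims, as in the paper's setup
imp = rng.normal(0.9, 0.2, 600)     # 600 impostor claims
best = min((far_frr_hter(gen, imp, t) for t in np.linspace(0, 2, 201)),
           key=lambda r: r[2])
print("FAR %.2f%%  FRR %.2f%%  HTER %.2f%%" % best)
```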

3.2. Experimental results

The results of the experiments are shown in Tables 1 and 2; the FAR, FRR and HTER for each combination are included. From Table 1 it can be seen that the combination of LPCC and GMM16 outperforms all others. The combination of MFCC with GMM4 results in the poorest performance of the speech expert, with an HTER of 4.95%. The face expert using 2DPCA exhibits an HTER of 3.9%.

Table 2 shows that the score level fusion of the expert decisions using the sum rule gives better performance than the individual expert decisions. The best results are obtained for 16 mixtures using LPCC features.

4. CONCLUSION

In this paper we present a novel score level fusion of a face expert, which employs 2DPCA for feature extraction, with a speech expert, which uses LPCC and MFCC with different numbers of GMM mixtures. The fusion of these experts is achieved using the sum rule, owing to its simplicity. The experiments are carried out on the VidTIMIT database. The best result is obtained using the combination of LPCC, a GMM with 16 mixtures, and 2DPCA, which gives an HTER of 1.95%.

5. REFERENCES

[1] A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics, Springer-Verlag, 2006.

[2] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, pp. 4-20, 2004.

[3] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, 1998.

[4] C. Chibelushi, F. Deravi, and J. Mason, "Voice and facial image integration for person recognition," 1993.

[5] R. Brunelli and D. Falavigna, "Person identification using multiple cues," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, pp. 955-965, 1995.

[6] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, "Fusion of face and speech data for person identity verification," IEEE Transactions on Neural Networks, vol. 10, pp. 1065-1074, 1999.

[7] C. Sanderson and K. K. Paliwal, "Noise compensation in a person verification system using face and multiple speech features," Pattern Recognition, vol. 36, no. 2, pp. 293-302, 2003.

[8] C. Sanderson and K. K. Paliwal, "Identity verification using speech and face information," Digital Signal Processing, vol. 14, no. 5, pp. 449-480, 2004.

[9] Jian Yang, David Zhang, Alejandro F. Frangi, and Jing-Yu Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 131-137, 2004.

[10] D. Reynolds, A Gaussian mixture modeling approach to text independent speaker identification, Technical Report 967, Lincoln Laboratory, Massachusetts Institute of Technology, 1993.

[11] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.


REMOVAL OF DEGRADATIONS IN VIDEO SEQUENCES USING DECISION BASED ADAPTIVE SPATIO TEMPORAL MEDIAN ALGORITHM

S.Manikandan, D.Ebenezer

Sri Krishna College of Engineering and Technology

Coimbatore, Tamilnadu, India

ABSTRACT

Line scratches, blotches, dirt and impulsive noise are the most annoying artifacts present in images and videos. A new decision based adaptive spatio-temporal median type algorithm is proposed for the removal of all these artifacts from videos with improved performance. Only those pixels of the video frames which are detected as noisy are replaced with values estimated using the proposed algorithm. Based on the amount of noise present in the window, the mode of operation and the window size are selected for processing the corrupted pixel. The adaptive rood pattern search block matching algorithm is used for motion estimation of the image sequences. The advantage of the proposed algorithm is that a single algorithm with improved performance can replace the several independent algorithms otherwise required for removing different artifacts from image sequences. The proposed algorithm produces better results visually, and the MSE and PSNR are also better than those of other image sequence restoration algorithms.

Index Terms— Line scratches, blotch removal, adaptive median based algorithm

1. INTRODUCTION Video sequences are often corrupted by noise, e.g., due to bad reception of television pictures. Typical problems are ‘blotches’ or ‘Dirt and Sparkle’ due to missing information. Scratches can be introduced due to dirt in the apparatus and noise can be introduced in the recording process. For removal of these artifacts, generally separate methods[1][2][3][5] are employed. White lines and black lines are considered as line scratches by Milady [5] for image sequences. According to him, a positive type film suffers from bright scratches and negative film suffers from dark scratches. Milady has considered only the dark scratches; if bright scratches exist he inverted them and used the same algorithm. Alexandre [6] gives a remedy for removing the blotches and line scratches in images. He has considered only vertical lines (which are narrow) and the blotches as impulsive with constant intensity having

irregular shapes. Kokaram [2,3] has given a method for the removal of scratches and the restoration of missing data in image sequences based on Wiener filtering and Bayesian techniques. For impulsive noise, it has been shown recently that an adaptive-length algorithm provides a better solution, with better edge and fine-detail preservation. Several adaptive algorithms are available for the removal of impulse noise in images and videos. However, none of these algorithms addresses white lines, black lines, blotches and missing data in image sequences together. The objective of this paper is to propose a decision based adaptive length median algorithm that can simultaneously remove impulsive noise, white lines, black lines, missing data and blotches while preserving fine details and edges. The advantage of the proposed algorithm is that a single algorithm with improved performance can replace the several independent algorithms otherwise required for removing different artifacts from image sequences.

2. DECISION BASED ADAPTIVE SPATIO TEMPORAL MEDIAN ALGORITHM

The proposed decision based adaptive spatio temporal median algorithm consists of an adaptive median based algorithm, motion estimation and compensation using the block-matching adaptive rood pattern search (ARPS) algorithm, and a temporal median filtering algorithm. The block diagram is shown in Fig. 2.

2.1 Adaptive Median based algorithm

The adaptive median based algorithm is a nonlinear algorithm that is adaptive in choosing the filtering window size. The probability of returning a noisy pixel as the estimated pixel is very low. The proposed algorithm reduces streaking and preserves edges and fine details. Pixels with the values 0 or 255 are considered noisy. The decision steps are as follows (a short code sketch is given after the list):

1. If the pixel is noisy, first perform 3 x 3 median filtering and replace the noisy pixel by the median value.


If the median value is also noisy in the 3 x 3 window, then

2. Perform 5 x 5 median filtering and replace the noisy pixel by the median value. If the median value is noisy even in the 5 x 5 window, then

3. Take the 3 x 3 window and count the number of uncorrupted pixels in it.

4. If the number of uncorrupted pixels is odd, compute the median of the uncorrupted pixels and substitute it for the noisy pixel.

5. If the number of uncorrupted pixels is even, sort the uncorrupted pixels, take the two centermost values, and substitute their mean for the noisy pixel.
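The decision logic of steps 1-5 can be summarized in a short sketch. The listing below is a minimal, illustrative Python version of the per-pixel decision, assuming 8-bit frames in which the values 0 and 255 mark corrupted pixels; the function and variable names are ours, not the authors'.

import numpy as np

def is_noisy(v):
    # Pixels at the extreme values 0 and 255 are treated as corrupted.
    return v == 0 or v == 255

def window(img, i, j, size):
    # Extract a size x size neighbourhood, clipped at the frame borders.
    h = size // 2
    return img[max(i - h, 0):i + h + 1, max(j - h, 0):j + h + 1]

def restore_pixel(img, i, j):
    """Decision-based adaptive median estimate for a single noisy pixel."""
    if not is_noisy(img[i, j]):
        return img[i, j]                      # leave uncorrupted pixels untouched
    for size in (3, 5):                       # steps 1-2: grow the window if needed
        med = np.median(window(img, i, j, size))
        if not is_noisy(med):
            return med
    w = window(img, i, j, 3)                  # steps 3-5: fall back to clean pixels only
    clean = np.sort(w[(w != 0) & (w != 255)])
    if clean.size == 0:
        return img[i, j]                      # nothing reliable in the neighbourhood
    if clean.size % 2 == 1:
        return clean[clean.size // 2]         # odd count: median of clean pixels
    mid = clean.size // 2
    return (int(clean[mid - 1]) + int(clean[mid])) // 2   # even count: mean of two centre values

Applying restore_pixel at every location flagged as corrupted reproduces the decision sequence above; the spatio-temporal stage described in Section 2.2 then operates on the motion-compensated frames.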

2.1.1 Illustrations

Let Pmax = 255 and Pmin = 0 be the corrupted pixel values, and let P(i,j) ≠ 0, 255 represent uncorrupted pixels. If P(i,j) = 0 or 255, the following cases are considered.

Step 1: Consider a 3 x 3 window with the typical pixel values shown below, and let n be the number of corrupted pixels in the window. If P(i,j) ≠ 0, 255, the pixels are unaltered. For the array shown there are no corrupted pixels, so all pixels are left unaltered.

123 214 156
236 167 214
123 234  56

Step 2: If the number of corrupted pixels n in the window W(i,j) is less than or equal to 4 (n ≤ 4), a 2-D window of size 3 x 3 is selected and the median operation is performed by row sorting, column sorting and diagonal sorting. The corrupted P(i,j) is replaced by the median value.

Corrupted matrix:
255 214 123
  0 255 214
123 234   0

Row sorting:
123 214 255
  0 214 255
  0 123 234

Column sorting:
  0 123 234
  0 214 255
123 214 255

Diagonal sorting:
  0 123 123
  0 214 255
234 214 255

Step 3: If the number of corrupted pixels is in the range 5 ≤ n ≤ 12, 5 x 5 median filtering is performed and the corrupted values are replaced by the median value.

Corrupted matrix:
123   0 156 255 234
255   0 214  98   0
  0 234 255 133 190
199 255 234 255   0
255 167 210 198 178

Row sorting:
  0 123 156 234 255
  0   0  98 214 255
  0 133 190 234 255
  0 199 234 255 255
167 178 198 210 255

Column sorting:
  0   0  98 210 255
  0 123 156 214 255
  0 133 190 234 255
  0 178 198 234 255
167 199 234 255 255

Diagonal sorting:
  0   0  98 210 167
  0 123 156 178 255
  0 133 190 234 255
  0 214 198 234 255
255 199 234 255 255

Step 4: The number of corrupted pixels is n ≥ 13 (a typical case is shown below). Increasing the window size further may lead to blurring, so 3 x 3 median filtering is chosen. With a small window and excessive noise, the median itself may turn out to be a noise pixel. In this case, count the number of uncorrupted pixels in the 3 x 3 window: if the count is odd, find the median of the uncorrupted pixels and substitute it for the noisy pixel; if the count is even, average the two centermost of the sorted uncorrupted pixels and substitute that value.

Corrupted frame window:
123   0 156 255 234
255 255 123 255   0
  0 145 255 133 145
199   0 255   0 255
255 167   0 198 178

3 x 3 window around the corrupted centre pixel:
255 123 255
145 255 133
  0 255   0

Here (145, 123, 133) are the uncorrupted pixels in the window; their median is 133, which replaces the noisy value:
255 123 255
145 133 133
  0 255   0

Suppose instead that only two uncorrupted values, 145 and 123, are present:
255 123 255
145 255   0
  0 255   0

Then their average, (145 + 123)/2 = 134, replaces the noisy pixel value:
255 123 255
145 134   0
  0 255   0

2.2. Motion estimation and Temporal Filtering

Motion estimation and compensation techniques [4] track the scratch across frames and offer a good way to sniff out line scratches. Prediction and interpolation are used to estimate motion vectors for video denoising. For fast motion prediction, the commonly used technique is the block matching (BM) motion estimator. The motion vector is obtained by minimizing a cost function measuring the mismatch between a block and each predictor candidate. Motion estimation (ME) gives the motion vector of each pixel or block of pixels, which is an essential tool for determining motion trajectories. Due to the motion of objects in the scene (i.e. of corresponding regions in an image sequence), the same region does not occur in the same place in the previous frame as in the current one. The ARPS algorithm makes use of the fact that the general motion in a frame is usually coherent, i.e. if the macro blocks around the current macro block moved in a particular direction, there is a high probability that the current macro block will have a similar motion vector. The algorithm uses the motion vector of the macro block to its immediate left to predict its own motion vector. The rood pattern search directly puts the search in an area where there is a high probability of finding a good matching block. The point that has the least weight becomes the origin for subsequent search steps, and the search pattern is changed to the small diamond search pattern (SDSP). The procedure keeps applying the SDSP until the least weighted point is found at the center of the SDSP. The main advantage of this algorithm over diamond search (DS) is that if the predicted motion vector is (0, 0), it does not waste computational time on the large diamond search pattern (LDSP); rather, it directly starts using the SDSP. The temporal median filter smooths out sharp transitions in intensity at each pixel position; it not only denoises the whole frame and removes blotches but also helps in stabilizing illumination fluctuations. Temporal median filtering removes the temporal noise in the form of small dots and streaks found in some videos. In this approach, dirt is viewed as a temporal impulse (a single-frame incident) and is hence treated by inter-frame processing, taking into account at least three consecutive frames.
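As a rough illustration of this temporal stage, the sketch below (our simplification, in Python) takes the pixel-wise median over three consecutive frames; the ARPS-based motion compensation is assumed to have been applied to the neighbouring frames beforehand and is not shown.

import numpy as np

def temporal_median(prev_frame, curr_frame, next_frame):
    """Pixel-wise median over three consecutive (motion-compensated) frames.

    Treating dirt and blotches as single-frame impulses, the median of the
    previous, current and next frames suppresses them while leaving static
    content untouched.
    """
    stack = np.stack([prev_frame, curr_frame, next_frame], axis=0)
    return np.median(stack, axis=0).astype(curr_frame.dtype)

Because dirt and blotches rarely persist across three frames, the temporal median passes static content through unchanged while suppressing single-frame impulses.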

3. RESULTS

The proposed decision based adaptive spatio temporal median algorithm is tested using 20 frames from the film "Dharmathyin Thalaivan". Fig. 3(a, b, c and d) shows the different types of degradations in frames of the film. Fig. 3(e, f, g and h) shows the results of the proposed algorithm. The results show that the proposed algorithm preserves image details well and at the same time sufficiently clears the noise in the non-detailed parts of the image. Fig. 1 shows the PSNR comparison of different algorithms, namely the general median filter, the recursive weighted median filter using the median controlled algorithm, the simple temporal median algorithm, the recursive weighted median filter using Lin's algorithm, and the proposed algorithm. Table 1 shows the MSE values of the different filters for frame 11 of the film "Dharmathyin Thalaivan".

4. CONCLUSION

In this paper a decision based adaptive spatio-temporal video denoising algorithm was presented, which combines an adaptive median based algorithm for denoising, block based motion estimation and compensation techniques, and temporal median filtering to improve the PSNR and MSE of the image sequences. The proposed strategy can considerably reduce undesirable noise within a video sequence and achieves improved objective and subjective quality compared to similar algorithms. In future work we intend to refine our motion estimation framework in order to deal with occlusion and moving block edges, i.e. to refine motion vectors for blocks undergoing two or more different motions.



5. REFERENCES

[1] L. Joyeux, O. Buisson, B. Besserer, and S. Boukir, "Detection and Removal of Line Scratches in Motion Picture Films".
[2] A. Kokaram, "Detection and removal of line scratches in degraded motion picture sequences," in Signal Processing VIII, vol. 1, pp. 5-8, September 1996.
[3] A. C. Kokaram, Motion Picture Restoration, Springer-Verlag, 1998.
[4] Yao Nie, "Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation," IEEE Trans. on Image Processing, vol. 11, no. 12, pp. 1422-1429, Dec. 2002.
[5] S. Milady and S. Kasaei, "A New method for a fast detection and seamless restoration of line scratches in motion pictures," in Proc. of 15th IASTED International Conference on Modeling and Simulation, pp. 413-417, Mar. 2004.
[6] Alexandre Ulisses Silva and Luis Corte-Real, "Removal of Blotches and Line Scratches from Film and Video Sequences Using a Digital Restoration Chain," Proceedings of IEEE NSIP'99, p. 177.

Table 1. MSE comparison of different algorithms

Algorithm type               MSE
General median algorithm     294.21
RWM using Lin's algorithm    221.56
Temporal median              180.26
ARWM algorithm               129.06
Proposed algorithm            71.65

Fig. 1. PSNR comparison graph of the different algorithms.

Fig. 2. Block diagram of the proposed method, with blocks: read in frames; ARPS block matching (frame-wise/pixel-wise, window size 3); motion estimation; motion detection; blotch detection; MC (motion-compensated) filtering; temporal median filter; adaptive median based filtering (pixel-wise, adaptive window); write out frames.

Fig. 3. (a)-(d) Corrupted frames from the film 'Dharmathyin Thalaivan'; (e)-(h) the corresponding frames denoised using the proposed algorithm.



ROI BASED APPROACH FOR THE USER ASSISTED SEPARATION OF REFLECTIONS FROM A SINGLE IMAGE

M.A. Ansari and R.S. Anand

Department of Electrical Engineering, Indian Institute of Technology Roorkee-247667 India. E-mail: [email protected], [email protected]

ABSTRACT

In this paper, we present a technique that works on arbitrarily complex images, simplifying the problem by allowing user assistance: the user manually marks certain edges or areas in the image as belonging to one of the two layers. Separating reflections from a single image is a massively ill-posed problem. We therefore focus on a slightly easier problem in which the user marks a small number of gradients as belonging to one of the layers. This is still an ill-posed problem, so we use a prior derived from the statistics of natural scenes: that derivative filters have sparse distributions. We show how to efficiently find the most probable decompositions under this prior using linear programming. Our results show the clear advantage of a technique based on natural scene statistics rather than one that simply assumes a Gaussian distribution.

Index Terms: Linear superposition, Complex images, Separating reflections, Derivative filters, ROI (Region of Interest), Linear programming.

I. INTRODUCTION

When we take a picture through transparent glass, the image we obtain is often a linear superposition of two images: the image of the scene beyond the glass plus the image of the scene reflected by the glass.

The Problem: Mathematically, the problem is massively ill-posed. The input image I(x, y) is a linear combination of two unknown images I1(x,y) and I2(x,y) given by:

I(x, y) = I1(x, y) + I2(x, y)    (1)

Obviously, there are an infinite number of solutions to equation (1): the number of unknowns is twice the number of equations. Additional assumptions are needed. On the related problem of separating shading and reflectance, impressive results have been obtained using a single image. These approaches make use of the fact that edges due to shading and edges due to reflectance have different

statistics, e.g. shading edges tend to be monochromatic. Unfortunately, in the case of reflections, the two layers have the same statistics, so the approaches used for shading and reflectance are not directly applicable [1,2]. A method was presented that used a prior on images to separate reflections with no user intervention [7]. While impressive results were shown on simple images, the technique used a complicated optimization that often failed to converge on complex images. Fig. 1(d) shows the Mona Lisa image with manually marked gradients: blue gradients are marked as belonging to the Mona Lisa layer and red gradients as belonging to the reflection layer. The user can either label individual gradients or draw a polygon to indicate that all gradients inside the polygon belong to one of the layers. This kind of user assistance seems quite natural in the application we are considering: imagine a Photoshop plug-in that a tourist can use to post-process images taken with reflections. As long as the user needs only to mark a small number of edges, this seems a small price to pay. Even when the user marks a small number of edges, the problem is still ill-posed. Consider an image with a million pixels and assume the user marks a hundred edges. Each marked edge gives an additional constraint for the problem in equation (1). However, even with these additional equations, the total number of equations is only a million and one hundred, far fewer than the two million unknowns. Unless the user marks every single edge in the image, additional prior knowledge is needed.

The Need: Fig. 1(a) shows the room in which Leonardo's Mona Lisa is displayed at the Louvre. In order to protect the painting, the museum displays it behind transparent glass. While this enables viewing of the painting, it poses a problem for the many tourists who want to photograph it (fig. 1(b)). Fig. 1(c) shows a typical picture taken by a tourist: the wall across from the painting is reflected by the glass, and the picture captures this reflection superimposed on the Mona Lisa image.


A similar problem occurs in various similar settings: photographing window dressings, jewels and archaeological items protected by glass. Professional photographers attempt to solve this problem by using a polarizing lens. By rotating the polarizing lens appropriately, one can reduce (but not eliminate) the reflection [5]. As suggested, the separation can be improved by capturing two images with two different rotations of the polarizing lens and taking an optimal linear combination of the two images. An alternative solution is to use multiple input images in which the

reflection and the non-reflected images have different motions. By analyzing the movie sequence, the two layers can be recovered [8]. A similar approach is applied to stereo pairs. While the approaches based on polarizing lenses or stereo images may be useful for professional photographers, they seem less appealing for a consumer level application [5]. Viewing the image in figure 1(c), it seems that the information for the separation is present in a single image. Can we use computer vision to separate the reflections from a single image?


Fig. 1. (a)-(b) The scene near the Mona Lisa in the Louvre. The painting is housed behind glass to protect it from the many tourists. (c) A photograph taken by a tourist at the Louvre. The photograph captures the painting as well as the reflection of the wall across the room. (d) The user assisted reflection problem. We assume the user has manually marked gradients as belonging to the painting layer or the reflection layer, and we wish to recover the two layers.

The Solution: As noted above, this kind of user assistance seems quite natural in the application we are considering: a Photoshop plug-in that a tourist can use to post-process images taken through glass. Practically speaking, there are more than four lakh (400,000) pixels in an average image, and it is not possible for the user to pinpoint every pixel as belonging to one of the layers; as long as the user only needs to mark a small number of edges, this seems a small price to pay. The probability of a single derivative is given by equation (2).

Given the histograms of derivative filter responses, we follow earlier work in using them to define a distribution over images, assuming that derivative filters are independent over space and orientation, so that our prior over images is

Pr(I) ∝ Π_{i,k} Pr(f_{i,k} · I)    (3)

where f · I denotes the inner product between a linear filter f and an image I, and f_{i,k} is the kth derivative filter centered on pixel i. We use two orientations (horizontal and vertical) and two orders (first-derivative as well as second-derivative filters). The probability of a single derivative is given by equation (2), and equation (3) gives the probability of a single layer. We follow earlier work in defining the probability of a decomposition I1, I2 as the product of the probabilities of each layer (i.e. assuming the two layers are independent): Pr(I1, I2) = Pr(I1) Pr(I2).

II. OPTIMIZATION

Now, we are ready to state the problem formally. We are given an input image I and two sets of image locations S1, S2 such that gradients at locations in S1 belong to layer 1 and gradients at locations in S2 belong to layer 2. We wish to find two layers I1, I2 such that:

1. The two layers sum to form the input image: I = I1 + I2.

2. The gradients of I1 at all locations in S1 agree with the gradients of the input image I, and similarly the gradients of I2 at all locations in S2 agree with the gradients of I. Subject to these two constraints we wish to maximize the probability of the layers, Pr(I1, I2) = Pr(I1) Pr(I2), given by equation (3). Our approximation proceeds in two steps. We first approximate the sparse distribution with a Laplacian prior. This leads to a convex optimization problem whose global maximum can be found using linear programming. We then use the solution with the Laplacian prior as an initial condition for a simple, iterative maximization of the sparse prior.

Exactly Maximizing a Laplacian Prior Using Linear Programming: Under the Laplacian approximation, we


approximate Pr(I) with a Laplacian prior Pr_L(I) of the form

Pr_L(I) ∝ exp( − Σ_{i,k} |f_{i,k} · I| / s )    (4)

To find the best decomposition under the Laplacian approximation we need to minimize

J1(I1, I2) = Σ_{i,k} |f_{i,k} · I1| + |f_{i,k} · I2|    (5)

subject to the two constraints given above: that I1 + I2 = I and that the two layers agree with the labeled gradients. This is an L1 minimization with linear constraints. We can turn it into an unconstrained minimization by substituting I2 = I − I1, so that we wish to find a single layer I1 that minimizes

J2(I1) = Σ_{i,k} |f_{i,k} · I1| + |f_{i,k} · (I − I1)|    (6)

This minimization can be performed exactly using linear programming, because the derivatives are linear functions of the unknown image. To see this, define v to be a vectorized version of the image I1; then we can rewrite J2 as

J2(v) = ||Av − b||_1    (7)

where || · ||_1 is the L1 norm, the matrix A has rows that correspond to the derivative filters, and the vector b contains either input-image derivatives or zeros, so that equation (7) is equivalent to equation (6). Minimization of equation (7) can be done by introducing slack variables z+, z− and solving

min Σ_i (z_i^+ + z_i^−)   subject to   Av − b = z^+ − z^−,  z^+ ≥ 0,  z^− ≥ 0

The idea is that at the optimal solution one of the variables z_i^+, z_i^− is zero and the other is equal to |(Av − b)_i|. The above problem is a standard linear program, and we use a linear programming package to solve it.
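As a concrete illustration of this slack-variable construction, the sketch below solves min_v ||Av − b||_1 with an off-the-shelf solver (SciPy's linprog; the paper's own linear programming package is not named, so this is only an assumed stand-in, practical for small problems).

import numpy as np
from scipy.optimize import linprog
from scipy.sparse import eye, hstack, csr_matrix

def l1_min(A, b):
    """Solve min_v ||A v - b||_1 via the standard slack-variable LP.

    Variables are [v, z+, z-] with A v - b = z+ - z-, z+, z- >= 0,
    and the objective sum(z+) + sum(z-).
    """
    A = csr_matrix(A)
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(2 * m)])   # cost only on the slacks
    A_eq = hstack([A, -eye(m), eye(m)], format="csr")   # A v - z+ + z- = b
    bounds = [(None, None)] * n + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:n]

At the optimum, one of z_i^+, z_i^- is zero for each row, so their sum equals |(Av − b)_i|, which is exactly the L1 objective.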

Optimization of the Sparse Prior Using Iterated Linear Programming: To find the most likely decomposition under the sparse prior we need to maximize the probability of the two layers as given by equation (3). Using the same algebra as in the previous section, this is equivalent to finding a vector v that minimizes

J3(v) = Σ_i ρ((Av − b)_i)    (8)

where ρ(x) is the negative log probability. ρ(x) is similar to a robust error measure, and hence minimizing J3 is not a convex optimization problem. Since we use a mixture model to describe the sparse prior, we can use expectation-maximization (EM) to iteratively improve the probability of the decomposition. In the E step we calculate the expectation of the hidden mixture assignments h_i, and in the M step we use this expected value and optimize the expected complete-data log likelihood. A standard derivation shows that the EM algorithm reduces to the following.

E step: calculate two weights w1, w2 for every row of the matrix A. The proportionality constant is set so that w1(i) + w2(i) = 1 for all i.

M step: perform a weighted L1 minimization of the form min_v ||D(Av − b)||_1, with D a diagonal matrix whose elements are determined by the weights from the E step.

At every iteration, we provably decrease the cost function J3 in equation (8). The optimization in the M step is performed using the same linear programming software as in the Laplacian approximation. Three EM iterations are usually sufficient.
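The EM loop can then be sketched as iterated, reweighted calls to an L1 solver (for example the l1_min function from the previous sketch, passed in as solve_l1). The two-component mixture-of-Laplacians parameters (scales s and mixing weights pi) and the exact form of the row weights are our assumptions for illustration, not values taken from the paper.

import numpy as np
from scipy.sparse import diags

def laplace_pdf(r, s):
    # Laplacian density with scale s, used for the mixture responsibilities.
    return np.exp(-np.abs(r) / s) / (2.0 * s)

def em_reweighted_l1(A, b, v0, solve_l1, s=(1.0, 10.0), pi=(0.7, 0.3), n_iter=3):
    """Iterate E (responsibilities) and M (weighted L1 minimization) steps."""
    v = v0
    for _ in range(n_iter):
        r = A @ v - b                              # residual of each derivative constraint
        w1 = pi[0] * laplace_pdf(r, s[0])
        w2 = pi[1] * laplace_pdf(r, s[1])
        d = (w1 / s[0] + w2 / s[1]) / (w1 + w2 + 1e-12)   # row weights (assumed form)
        D = diags(d)
        v = solve_l1(D @ A, D @ b)                 # M step: weighted L1 problem
    return v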

III. THE ALGORITHM

1. Load an image in which some regions are fused with a reflection.
2. Ask the user to mark some points, either by point marking or by polygon marking. These points are generally marked with a brush thickness ranging from 1 pixel to 10 pixels, depending on their relative position in the image and their density. The points are marked according to the following scheme:
   a. Blue points: the user marks some points on the image as belonging to one of the layers. These points are considered a coherent part of that layer and treated as such while separating the other layer from it.
   b. Red points: the user marks some other points on the image as belonging to the other layer. These points are considered a coherent part of the other layer and treated as such while separating it from the first.
3. Store these points in an array.
4. Collect the image data in an array of floating-point numbers.
5. Create a derivative filter (-1 0 1) in the x direction for the image.
6. The dimension of the filter matrix is image size x image size, where image size is the total number of pixels in the image. Store it as Gx (see the sketch after this list).
7. Similarly create four more derivative filters for the other direction (y) and the other orientations, namely Gy, Gxx, Gyy and Gxy.
8. Merge these five derivative filters into one matrix A1. Remove any filter entries at the boundaries of the image, as these pixels do not have all the neighbours required by the filters.
9. Stack A1 twice to create a sparse matrix A (dimensions of A: 10 x image size rows, image size columns).
10. This A serves as the LHS of the equation Ax = b.
11. To create the RHS, apply the filter A1 to the image array I. The result is an array of size (5 x image size, 1).
12. Merge this array with another array of the same size with all entries zero.
13. We now have an RHS of size (10 x image size, 1) and an LHS of size (10 x image size, image size).
14. To bring the marked points into the equation, create an array of length 10 x image size and initialize every element to 1.
15. Multiply the rows of A and b with weights of 4 or 100 depending on the marked pixels: rows corresponding to blue-marked pixels are multiplied by 100 and those of the other colour by 4. The remaining 9 x image size rows are each multiplied by a weight depending on the corresponding pixel in the image.
16. We now have the final equation Ax = b. To solve it, premultiply A by A' for the LHS and premultiply b by A' for the RHS. This system is solved with the Matlab equation solver and the result is stored in x (of length image size).
17. This is the initial guess for the solution. If the user provides an initial guess, this step is skipped and the user's values are stored directly in x. The whole result depends on this initial guess.
18. We then use EM iterations to refine the solution obtained as the initial guess in x.
19. Store the values of A in another sparse matrix oA, b in oB and the weight vector f in oF.
20. Calculate the residual of the solution by substituting x back into exp(abs(Ax - b)). If the residual is small for every pixel, move on to the next iteration; otherwise create another vector E of size image size with the residual values calculated in this step.
21. Solve another equation with LHS (A' * E * A) and RHS (A' * E * b).
22. Store these values in the same x.
23. This is the M step of the iteration. Steps 19 to 22 are repeated once more.
24. Begin the E step of the iteration: calculate the probability distribution as given in the formulae.
25. Find two weights, using the step above, for different constant values of s and Π in the formulae.
26. Sum these weights and take the weighted average.
27. Store these in a diagonal matrix and premultiply A and b with this matrix (in the same fashion as the weights were applied for the initial guess). Store the results in oA and oB respectively.
28. Repeat the EM iteration steps 18 to 26 n times, where n ranges from 3 to 15 depending on the required output quality and execution time.
29. The values in x are moderated (to remove any negative entries) and the result is displayed as an image.
30. To obtain the other image (the reflection), subtract the values of x from I (the problem image) and display the moderated result as another image.
31. We are left with two images, the original and the reflection, separated from the problem image.

Figure 3. Results: (a) input image; (b)-(d) decomposition and separation of reflections.

IV. RESULTS AND DISCUSSIONS

We show results of our algorithm on images of scenes with reflections. Four images were taken for the test, and we had no control over the camera parameters or the compression methods used. For color images we ran the algorithm separately on the R, G and B channels. Fig. 3 above shows the input images with labeled gradients and our results. The Laplacian prior gives good results, although some ghosting effects can still be seen (i.e. there are remainders of layer 2 in the reconstructed layer 1). These ghosting effects are fixed by the sparse prior. Good results can be obtained with a Laplacian prior when more labeled gradients are provided. The non-sparse nature of the Gaussian distribution is highly noticeable, causing the decomposition to split edges into two low-contrast edges rather than putting the entire contrast into one of the layers.


Fig. 4 below shows the input image and the software results of the separation of reflections. In the results of Fig. 5, the technique was applied to removing shading artifacts. For this problem, the same algorithm was applied in the log domain, and the results show almost complete removal of the shading artifacts.

Figure 4. Input image & software results of separation of reflections

( a ) ( b ) ( c ) ( d )

Figure 5. Removing shading artifacts (a) original image. (b) Labeled image. (c-d) decomposition results.

Since we are using an off-the-shelf linear programming package, we are not taking advantage of the spatial properties of the optimization problem. The current run time of the linear programming for images of size 240x320 is a few minutes on a standard PC. We have not performed an extensive comparison of linear programming packages, so with other packages the run times may be significantly faster. We are currently working on deriving specific algorithms for minimizing L1 cost functions on image derivatives. Since this is a convex problem, local minima are not an issue, and so a wide range of iterative algorithms may be used. In preliminary experiments, we have found that a multigrid algorithm can minimize such cost functions significantly faster. We are also investigating using a mixture of Gaussians rather than a mixture of Laplacians to describe sparse distributions. This leads to M steps in which L2 minimizations need to be performed, and there is a wide range of efficient solvers for such minimizations.

We are also investigating the use of features other than derivatives to describe the statistics of natural images. Our experience shows that when stronger statistical models are used, fewer labeled points are needed to achieve a good separation. We hope that using more complex statistical models will still enable us to perform the optimization efficiently. This may lead to algorithms that separate reflections from a single image without any user intervention.

V. REFERENCES

[1] H. Farid and E. H. Adelson, "Separating reflections from images by use of independent components analysis," Journal of the Optical Society of America, 16(9), pp. 2136-2145, 1999.
[2] G. D. Finlayson, S. D. Hordley, and M. S. Drew, "Removing shadows from images," European Conf. on Computer Vision, 2002.
[3] A. Levin, A. Zomet, and Y. Weiss, "Learning to perceive transparency from the statistics of natural scenes," Advances in Neural Information Processing Systems, vol. 15, 2002.
[4] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, 381, pp. 607-609, 1996.
[5] Y. Shechner, J. Shamir, and N. Kiryati, "Polarization-based decorrelation of transparent layers: The inclination angle of an invisible surface," Int. Conf. on Computer Vision, pp. 814-819, 1999.
[6] E. P. Simoncelli, "Bayesian denoising of visual images in the wavelet domain," in P. Müller and B. Vidakovic, editors, Wavelet Based Models, 1999.
[7] R. Szeliski, S. Avidan, and P. Anandan, "Layer extraction from multiple images containing reflections and transparency," Conf. on Computer Vision and Pattern Recognition, 2000.
[8] M. Tappen, W. T. Freeman, and E. H. Adelson, "Recovering intrinsic images from a single image," in S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 2002.


COMPRESSION OF MULTI-LEAD ECG SIGNALS AND RETINAL IMAGES USING 2-D WAVELET TRANSFORM AND SPIHT CODING SCHEME FOR MOBILE TELEMEDICINE

L. N. Sarma, S. R. Nirmala, M. Sabarimalai Manikandan, and S. Dandapat

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

Email:{lns,nirmala,msm,samaren}@iitg.ernet.in

ABSTRACT

The objective of a mobile telemedicine system is to assist health-care centers in providing prehospital care for patients at remote locations. Due to channel and storage capacity limitations, biosignals and medical images must be compressed before transmission and storage. This reduces transmission time and also increases the number of users that can be served at a time under limited bandwidth conditions. This paper presents a compression method for medical images and multi-lead ECG signals in a single coder architecture for mobile telemedicine applications. The proposed method employs preprocessing, the 2-D discrete wavelet transform (DWT), and coding of the transformed wavelet coefficients using the efficient modified set partitioning in hierarchical trees (SPIHT) scheme. The retinal images are compressed directly, and the quality of the reconstructed images is assessed using a laplacian mean square error (LMSE) map, a structural similarity index (SSIM) map and a mean opinion score (MOS) test. The same coder is used for the compression of the preprocessed multi-lead ECG signals. A two-dimensional (2-D) array is constructed, which exploits the inter-beat and inter-channel correlation. The segmented blocks are compressed and decompressed using the SPIHT coder. The reconstructed signal quality is evaluated using the PRD, WWPRD and WEDD measures. The quality of the retinal images and ECG signals is good at a compression ratio of 8. The deficiencies of the PRD and WWPRD are shown in this work. This paper also describes our ongoing research into the applicability of diagnostic distortion measures in automatic quality-controlled compression of medical signals in a mobile telemedicine system.

Index Terms: electrocardiogram, fundus image, SPIHT, multi-lead ECG, WEDD.

1. INTRODUCTION

Today, the growing aging population in India requires an increasing number of healthcare professionals and facilities. Healthcare is one of the basic needs of society, and medical scientists and technologists have continued to drive the development of new advances in order to improve the quality of our healthcare system [2]-[6]. With emerging cellular technologies, patients can access healthcare services not only from hospitals, but also from rural healthcare centers, ambulances, ships, trains, airplanes, and homes. A major challenge is how to provide better healthcare services to an increasing number of people using limited financial and human resources. The country is geographically large, with many towns and villages located in remote rural areas, but they are interconnected via cellular networks. Different types of data such as biosignals, medical images and diagnostic video are used for the diagnosis of diseases. A high quality real-time audio link, still images and video are also essential for communicating instructions between the therapist, the caregiver and the patient. In healthcare applications, where inherently large volumes of digitized biosignals and images are present, signal and image compression is indispensable to reduce transmission time. Although recent advances in computer technology have increased data storage capacity and communication speeds, they alone are not sufficient to overcome this problem. Image compression techniques can be employed to reduce the data volume to a more manageable size without significantly compromising image quality. Data compression balances the situation between the limited capacity of cellular networks and the limited user demand. There are two basic kinds of medical image compression: lossless and lossy. The choice between the two depends on the system requirements. Lossless compression is fully reversible, with no loss of information, but is typically limited to a compression ratio of the order of 3:1. Lossy compression is irreversible and loses information, but it can provide more than a 10:1 compression ratio with small or imperceptible differences between the original and reconstructed images. In a lossy method, the image compression algorithm should achieve a tradeoff between compression ratio and image quality. Higher compression ratios produce lower image quality and vice versa. Quality and compression can also vary according to input image characteristics and content. In general, a major design goal of any lossy compression method is to obtain the best visual quality at the lowest bit rate. Wavelet-based compression of images for storage or transmission has gained favor in recent years.

WISP-2007, IIT Guwahati, India, 28-29 Dec. 2007_________________________________________________________________________________________________________________18

Page 43: Workshop on Image and Signal Processing - Indian … Proc-2.pdfDirector, AIMSCS, Hyderabad and Prof. P. K. Bora, Head, Dept. of ECE, IITG for their constant support and valuable guidance

Several different approaches have appeared in the literature for compressing the wavelet coefficient array by exploiting its statistical properties. These methods include vector quantization (VQ), direct adaptive arithmetic or entropy coding, and zerotree quantization [9], [10], [11]. Image compression methods based on the wavelet transform and zerotree quantization have been very popular. A popular image coding technique featuring progressive transmission is embedded zerotree wavelet (EZW) coding [7]. EZW exploits the interband dependencies within the wavelet-transformed image by using efficient data models, zerotree coding, and conditional entropy coding. Among the algorithms based on the wavelet transform and zerotree quantization, the set partitioning in hierarchical trees (SPIHT) algorithm is well known for its simplicity and efficiency [8]. We pick the SPIHT coder because it is an improved version of the original EZW coder. A number of studies have looked at digital compression of medical images with relative preservation of image quality following compression. Many medical images are compressed using the EZW and SPIHT coding schemes. Digital imaging is widely used for diabetic retinopathy screening, and the storage and transmission of digital images can be facilitated by image compression. Very few methods for the compression of retinal images are reported in the literature. Retinal images have been compressed using JPEG (Joint Photographic Experts Group) and the wavelet transform [1]. The JPEG and wavelet image compression methods were applied at five different levels to 11 eyes with subtle retinal abnormalities and to 4 normal eyes. It was shown that the JPEG and wavelet methods are both suitable for compression of retinal images. For situations where digital image transmission time and costs should be minimized, wavelet compression of a 1.5 MB image to 15 KB is recommended. The multi-lead ECG signals are segmented and aligned to form a 2-D matrix. This 2-D representation of the multi-lead ECG signals is then compressed using the efficient SPIHT coder employed for medical image compression, in order to reduce the cost and size of the mobile telemedicine system. Compression of ECG can be achieved by exploiting the intra-beat, inter-beat and inter-lead correlation between samples. Many image compression methods have been employed for the compression of 2-D representations of single-lead ECG, and high compression ratios were shown with the use of longer-duration ECG signals. These methods are well suited to offline processing but not to real-time ECG transmission, where only short-duration ECG segments are transmitted. In this paper, the well-known SPIHT coder is employed for the compression of retinal images and of the 2-D representation of multi-lead ECG signals. The performance of the SPIHT coder for both types of signal is evaluated in terms of subjective and perceptual metrics. The quality of the reconstructed signals is also evaluated using the recently reported wavelet energy based diagnostic distortion (WEDD) measure. The rest of this paper is organized as follows. 2-D transform coding methods and distortion measures are discussed in Section 2. The compression of retinal images and quality assessment are presented in Section 3. The quality of the reconstructed images is analyzed using the SSIM and LMSE measures. Section 4 proposes multi-lead ECG signal compression using the 2-D SPIHT coder. The performance of the proposed scheme is examined with a block size of 64 x 64. Finally, conclusions are presented in Section 5.

2. 2-D TRANSFORM CODING METHODS AND DISTORTION MEASURES

Transform coding is a widely used method of compressing image information. In transform based methods, the images are transformed from the spatial domain to the frequency domain [7], [8], [9], [11]. The goal of a transform such as the Karhunen-Loeve transform (KLT), discrete Fourier transform (DFT), Hadamard transform (HT), slant transform (ST) or discrete cosine transform (DCT) is to decorrelate the original signal, and this decorrelation generally results in the signal energy being redistributed among only a small set of transform coefficients. By transmitting or storing that small set of coefficients, the goal of compression is achieved. The human visual system (HVS) is more sensitive to energy at low spatial frequencies than at high spatial frequencies. Therefore, compression can be achieved by quantizing the coefficients so that the important (low-frequency) coefficients are transmitted and the remaining coefficients are discarded. It has been shown that the DCT approaches the performance of the KLT for image coding. JPEG is capable of obtaining reasonable compression performance at high and intermediate bit rates, but at low bit rates the quality is poor. The breakdown at high compression is mainly due to the blocking artifacts caused by the initial partitioning of the image into square blocks within the DCT-based decorrelating module and the block based quantization module. With respect to functionality, JPEG supports lossy and lossless compression, but not simultaneously. In addition, progressive refinement of the image (scalability in quality and resolution) is hard to support. Since the introduction of the wavelet transform as a signal processing tool, a variety of wavelet based algorithms have advanced the limits of compression performance well beyond that of the current commercial JPEG image compression standard. Because of the nature of the image signal and the mechanism of human vision, the transform used must accept nonstationarity and be well localized in both the space and frequency domains. The DWT offers adaptive spatial-frequency resolution (better spatial resolution at high frequencies and better frequency resolution at low frequencies) that is well suited to the properties of an HVS. A wavelet image compression system can be created by selecting a type of wavelet function, a quantizer, and a statistical coder. JPEG2000 clearly outperforms JPEG at low bit rates and has as an important property its lossy-to-lossless coding functionality; that is, the capability to

WISP-2007, IIT Guwahati, India, 28-29 Dec. 2007_________________________________________________________________________________________________________________19

Page 44: Workshop on Image and Signal Processing - Indian … Proc-2.pdfDirector, AIMSCS, Hyderabad and Prof. P. K. Bora, Head, Dept. of ECE, IITG for their constant support and valuable guidance

start from lossy compression at a very high compression ratio and to progressively refine the data by sending detail information, eventually up to the stage where a lossless decompression is obtained. These algorithms achieve twice as much compression as the baseline JPEG coder does at the same quality. However, the ringing artifacts occurring at high compression ratios (mainly in the vicinity of edges and/or in textured regions) are generally less disturbing for the HVS than the JPEG blocking artifacts. Shapiro made a breakthrough with his embedded zerotree wavelet (EZW) coding algorithm, which has become the core technology of many successful image coders [7]. EZW exploits the interband dependencies within the wavelet-transformed image by using efficient data models, zerotree coding, and conditional entropy coding. EZW applies successive approximation quantization (SAQ) to provide a multi-precision representation of the coefficients and to facilitate embedded coding. The SPIHT algorithm, a refinement of EZW, further exploits the self-similarity of the wavelet transform at different scales that results from structuring the data in hierarchical trees that are likely to be highly correlated. The essential difference of the SPIHT algorithm with respect to EZW is the way trees of coefficients are partitioned and sorted [8]. SPIHT coding is selected because it displays exceptional characteristics over several properties at once, including good image quality with a high peak signal-to-noise ratio (PSNR), fast coding and decoding, a fully progressive bit-stream, applicability to lossless compression, the possibility of combination with error protection, and the ability to code to an exact bit rate or PSNR.

2.1. The SPIHT Coder

One interesting property of the SPIHT coding scheme is its progressive coding capability, where the signal quality can be improved gradually as the bit rate increases. SPIHT codes a DWT image based on a zerotree structure [11]. It contains two main procedures: the sorting pass and the refinement pass. In the sorting pass, significant coefficients (coefficients whose magnitudes are greater than the current threshold) are sought out, and their most significant bits are output during the following refinement pass. After one sorting pass and one refinement pass, which can be combined as one scan pass, the coding process continues until the expected bit rate is reached. During SPIHT coding, the information indicating zerotree structure and significance is stored in three temporary lists: the list of significant coefficients (LSP), which stores the coordinates of significant coefficients; the list of insignificant coefficients (LIP), which stores the coordinates of isolated zeros (insignificant coefficients whose descendants may be significant); and the list of insignificant sets (LIS), which stores the coordinates of zerotree roots. It can be outlined in four steps as follows: (1) Output n = ⌊log2(max_{(i,j)} |c_{i,j}|)⌋, where c_{i,j} is the DWT coefficient of the image (called transform coefficient hereafter) at coordinate (i, j); the initial threshold is T = 2^n. (2) (Sorting pass) Output μ_n, the number of coefficients such that T ≤ |c_{i,j}| < 2T, together with their coordinates (i, j) and signs. (3) (Refinement pass) Output the nth most significant bit of the coefficients such that |c_{i,j}| ≥ 2T, following the same order used to output their coordinates in previous sorting passes. (4) Decrease n by 1, halve T and go to Step (2).
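The threshold schedule of steps (1)-(4) can be illustrated with a much simplified bit-plane pass over the coefficient array. The sketch below is not SPIHT itself (it omits the LIP/LIS/LSP lists and zerotree coding); it only shows how coefficients become significant as the threshold is halved.

import numpy as np

def bitplane_passes(coeffs, n_passes=4):
    """Illustrative sorting/refinement schedule over wavelet coefficients.

    coeffs: 2-D array of DWT coefficients. Yields, per pass, the threshold
    T = 2**n and the coordinates that become significant at that threshold,
    mimicking steps (1)-(4); the real SPIHT coder additionally encodes
    zerotrees and the refinement bits of already-significant coefficients.
    """
    n = int(np.floor(np.log2(np.max(np.abs(coeffs)))))
    significant = np.zeros(coeffs.shape, dtype=bool)
    for _ in range(n_passes):
        T = 2 ** n
        newly = (np.abs(coeffs) >= T) & ~significant      # sorting pass: T <= |c| < 2T
        yield T, list(zip(*np.nonzero(newly)))
        significant |= newly
        n -= 1                                            # halve the threshold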

2.2. Image Distortion Measures

The image quality can be evaluated objectively and subjectively. Objective measures are mathematically computable distortion measures, and various numerical objective measures have been proposed for image quality assessment. Some of them include the mean square error (MSE), laplacian mean square error (LMSE), normalized absolute error (NAE), maximum absolute error (MaxAE), peak mean square error (PMSE), root mean square error (RMSE), peak signal to noise ratio (PSNR), normalized cross correlation (NK) and structural content (SC) [26]-[31]. Among these, the most widely used objective measures are the MSE and PSNR, given by

MSE = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} (X(m,n) − Y(m,n))²    (1)

PSNR = 10 log10 [ (2^b − 1)² / ( (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} (X(m,n) − Y(m,n))² ) ]    (2)

The digital retinal image used in the tests has blood vessels whose contrast is different from the background. The sudden change in intensity levels makes edge error measures more suitable for evaluating the quality of the image. Hence the Laplacian mean square error (LMSE), defined as the MSE between the Laplacian of the original image and the Laplacian of the decompressed image, is computed for the evaluation. The LMSE reflects distortion in the edges more than distortion in the smooth regions. It is observed that distortion in a smooth region results in a lower value of LMSE, while the same amount of distortion in an edge region results in a higher value of LMSE for the same number of distorted pixels. It would be helpful to know whether a given test image is distorted before applying the quality assessment procedure. The Laplacian of the reconstructed image gives an overall idea of whether the reconstructed image is distorted without requiring a reference image. The Laplacian produces closed edge contours if the image meets certain smoothness constraints. It is also observed that a properly scaled Laplacian image exhibits small closed-loop-like formations in regions blurred (over-smoothed) by wavelet compression at moderate and high compression ratios. This indicates that the Laplacian localizes the distorted region by showing loop-like formations. Hence measuring the amount of smoothness also indicates the quality of the reconstructed image.

LMSE = [ Σ_{m=1}^{M} Σ_{n=1}^{N} (O(X(m,n)) − O(Y(m,n)))² ] / [ Σ_{m=1}^{M} Σ_{n=1}^{N} (O(X(m,n)))² ]    (3)

WISP-2007, IIT Guwahati, India, 28-29 Dec. 2007_________________________________________________________________________________________________________________20

Page 45: Workshop on Image and Signal Processing - Indian … Proc-2.pdfDirector, AIMSCS, Hyderabad and Prof. P. K. Bora, Head, Dept. of ECE, IITG for their constant support and valuable guidance

SSIM(x, y) = [ (2 μx μy + C1)(2 σxy + C2) ] / [ (μx² + μy² + C1)(σx² + σy² + C2) ]    (4)

where X and Y are the original and reconstructed images respectively, of size M x N, b is the number of bits per pixel, and O(X(m,n)) = X(m+1, n) + X(m−1, n) + X(m, n+1) + X(m, n−1) − 4X(m, n). Since the images are ultimately viewed by a human observer, an HVS-based measure, the structural similarity index (SSIM), is also used to evaluate the image quality [30]. Any image compression method must ensure that there is no significant loss of diagnostic information for visual examination as well as for subsequent medical image analysis. Hence, in addition to the objective measures, the perception and diagnosis based subjective evaluation quantified by the Mean Opinion Score (MOS) is also considered. The MOS is obtained by comparing various diagnostic features such as the optic disk, macula, blood vessel structure and pathological features between the original and compressed images. We focused on the experts' assessment of diagnostically acceptable quality of compressed images to ensure sufficient diagnostic accuracy.
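For reference, equations (1)-(3) translate directly into code; the sketch below assumes 8-bit images and uses the five-point Laplacian of equation (9), with the image borders left at zero.

import numpy as np

def laplacian(img):
    # Five-point Laplacian O(X): sum of the four neighbours minus 4x the centre.
    x = img.astype(float)
    out = np.zeros_like(x)
    out[1:-1, 1:-1] = (x[2:, 1:-1] + x[:-2, 1:-1] +
                       x[1:-1, 2:] + x[1:-1, :-2] - 4.0 * x[1:-1, 1:-1])
    return out

def mse(x, y):
    return np.mean((x.astype(float) - y.astype(float)) ** 2)

def psnr(x, y, bits=8):
    return 10.0 * np.log10(((2 ** bits - 1) ** 2) / mse(x, y))

def lmse(x, y):
    lx, ly = laplacian(x), laplacian(y)
    return np.sum((lx - ly) ** 2) / np.sum(lx ** 2)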

2.3. ECG Distortion Measures

The PRD is a normalized value that indicates the error between two signals. The PRD is given by (rmse/rmss) x 100, where rmse and rmss denote the root mean square (rms) values of the error and of the original signal, respectively. Generally, the PRD corresponds to normalizing the average error by the RMS value of the original signal with different offsets [24], [25]. Based on the normalization, three different PRD measures, PRD1, PRD2 and PRD3, are defined in the literature [25]. The PRD1 is defined as

PRD1 = sqrt( Σ_{n=1}^{N} [x(n) − x̃(n)]² / Σ_{n=1}^{N} [x(n) − μo − K]² ) × 100    (5)

where x(n) denotes the original signal with N samples, x̃(n) denotes the reconstructed signal and μo denotes the mean of the original signal. In many compression methods, a baseline value (K = 1024 for MITA database records) is added for storage purposes. In PRD1, the normalized value is calculated with exclusion of the μo and K values. In PRD2, the normalized value is calculated with inclusion of μo and exclusion of K. In PRD3, the normalized value is calculated with inclusion of both the mean μo and the baseline value. For a given pair of original and reconstructed signals the PRD3 value is very low compared to the values of PRD1 and PRD2, so comparing PRDs computed with different offsets is meaningless. Also, reconstruction with a low PRD/RMSE does not necessarily guarantee clinical acceptance. The PRD measure does not weigh the actual diagnostic relevance of the compression method, which should remove only redundant and/or irrelevant information without sacrificing diagnostic information [24]. A survey of distortion measures was presented in our previous work [25].
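A direct transcription of the three PRD variants described above (K is the storage baseline, 1024 for the MITA records mentioned in the text):

import numpy as np

def prd(x, x_rec, variant=1, K=1024):
    """PRD1/PRD2/PRD3 as defined in equation (5) and the accompanying text."""
    x = np.asarray(x, dtype=float)
    e = x - np.asarray(x_rec, dtype=float)
    if variant == 1:                      # exclude mean and baseline
        ref = x - x.mean() - K
    elif variant == 2:                    # include mean, exclude baseline
        ref = x - K
    else:                                 # include mean and baseline
        ref = x
    return 100.0 * np.sqrt(np.sum(e ** 2) / np.sum(ref ** 2))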

2.4. Wavelet Transform, Wavelet Energies and WEDD Measure

The wavelet transform involves a varied time-frequency window and provides good localization in both the time and frequency domains, which gives good performance in analyzing nonstationary signals [17], [18], [19]. The information can be organized in a hierarchical scheme of nested subspaces called multiresolution analysis in L²(R). This makes it possible to organize the information in a particular structure to distinguish, for example, trends or shapes associated with long scales from the corresponding local details at short scales. It provides a direct estimation of local energies at different scales. The relative wavelet energy concept gives a suitable tool for detecting and characterizing specific events in the time and frequency planes. These localization and energy compaction characteristics are successfully exploited in feature classification and compression methods. Meanwhile, these relative energies are used as dynamic weights in the WEDD measure [25]. In this work, the WEDD measure is presented in terms of an energy-weighted PRD of each subband. The WEDD is defined as

WEDD = Σ_{l=1}^{J+1} w_l WPRD_l,    l = 1, 2, ..., (J+1)    (6)

where J is the number of decomposition levels (five here), w_l is the energy-based weight for the lth subband and WPRD_l is the PRD measure for the lth subband. The dynamic weights are calculated as

w_l = Σ_{k=1}^{K_l} d_l²(k) / [ Σ_{m=1}^{J+1} Σ_{k=1}^{K_m} d_m²(k) ]    (7)

where w_l is the weight for the lth subband, K_l denotes the number of wavelet coefficients in the lth subband and d_l(k) is the kth wavelet coefficient of the original signal in the lth subband. The wavelet PRD (WPRD) is defined as

WPRD_l = sqrt( Σ_{k=1}^{K_l} [d_l(k) − d̃_l(k)]² / Σ_{k=1}^{K_l} [d_l(k)]² )    (8)

where WPRD_l is the error in the lth subband, d_l(k) is the kth wavelet coefficient of the original signal in the lth subband and d̃_l(k) is the kth wavelet coefficient of the reconstructed signal in the lth subband. This measure gives an idea of the structured error, since it estimates the error between the local waves of the ECG signal. The WEDD is more sensitive to changes in the shape of diagnostic features while being insensitive to smoothing of background noise. It is observed that the correlation between clinical quality and the WEDD is much higher than for the other objective quality measures [25].
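Equations (6)-(8) translate almost line for line into code. The sketch below uses PyWavelets for the five-level decomposition; the choice of wavelet family is ours, since it is not specified in this section.

import numpy as np
import pywt

def wedd(x, x_rec, wavelet="bior4.4", levels=5):
    """Wavelet energy based diagnostic distortion, equations (6)-(8)."""
    orig = pywt.wavedec(x, wavelet, level=levels)        # J+1 subbands
    rec = pywt.wavedec(x_rec, wavelet, level=levels)
    energies = np.array([np.sum(d ** 2) for d in orig])
    weights = energies / energies.sum()                  # equation (7)
    wprd = np.array([np.sqrt(np.sum((d - dr) ** 2) / np.sum(d ** 2))
                     for d, dr in zip(orig, rec)])       # equation (8)
    return float(np.sum(weights * wprd))                 # equation (6)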


3. RETINAL IMAGES COMPRESSION

The numerical objective quality measures MSE, PSNR, LMSE and SSIM (a measure inspired by the HVS) are computed to quantify the distortion in the reconstructed image after decompression. The spatially varying laplacian error generates a distortion map of the diagnostic vessel features, and an SSIM map is generated in the same way. The LMSE error map is simple to generate, as it does not require the computation of the mean, variance and standard deviation needed for the SSIM map. The laplacian error map and the SSIM map are compared and studied; for a meaningful comparison, the laplacian images are inverted so that brighter regions indicate lower distortion and darker regions indicate higher distortion. The laplacian error map visualizes errors in the blood vessels better than the SSIM map. The LMSE at each image quality is computed as the mean square error between the laplacians of the original and reconstructed images. In the experiments it is observed that distortion in the smooth regions results in a low LMSE value, but as soon as the regions containing vessels start getting distorted, the LMSE value increases sharply. This shows that the LMSE reflects distortion in the edges (blood vessels) more clearly than distortion in the smooth regions.

3.1. Image Quality Assessment

A retinal image possesses certain common features that help define its quality. The diagnostic features of a retinal image are the optic disc (OD), the macula (MC, the central vision region) and the blood vessel structure. Blood vessel detection is a critical topic in automatic retinal image processing, since blood vessel morphology is an important indicator of many diseases such as diabetes and hypertension. Abnormalities of blood vessels include changes in color, width and tortuosity, and neovascular generation. Morphological changes in the retinal vessel structure help in detecting and grading diabetic retinopathy (DR), and the measured changes can then be applied in a variety of clinical studies: screening, diagnosis and evaluation of treatment. To identify blood vessel abnormality in large-scale screening, it is essential to detect the blood vessel map quickly and accurately; different measurements can then be made to help doctors in making a diagnosis. Two strategies have generally been employed for the detection of blood vessels in retinal images: edge detection, and vessel tracking, which needs a priori knowledge of a starting position in the image. The former is used in this paper as it is simple to apply and analyze. In this work, the 2-D laplacian operator based edge detector expressed by (9), which plays an important role in image processing applications, is used for detecting the blood vessels.

O(X(m,n)) = X(m+1, n) + X(m-1, n) + X(m, n+1) + X(m, n-1) - 4X(m, n) \qquad (9)
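For clarity, a minimal sketch of the 5-point laplacian of Eq. (9) and of the LMSE between the laplacians of the original and reconstructed images is shown below, assuming grayscale images held as NumPy arrays; border pixels are simply left at zero and all names are illustrative.

```python
import numpy as np

def laplacian(img):
    """5-point discrete laplacian of Eq. (9); the one-pixel border is left at zero."""
    img = np.asarray(img, dtype=float)
    out = np.zeros_like(img)
    out[1:-1, 1:-1] = (img[2:, 1:-1] + img[:-2, 1:-1] +
                       img[1:-1, 2:] + img[1:-1, :-2] -
                       4.0 * img[1:-1, 1:-1])
    return out

def lmse(original, reconstructed):
    """Laplacian MSE; the per-pixel absolute difference gives the laplacian error map."""
    diff = laplacian(original) - laplacian(reconstructed)
    return float(np.mean(diff ** 2))
```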

The first stage of edge detection is the evaluation of derivatives of the image intensity; Gaussian smoothing filters are used to make the differentiation more immune to noise. When the two-dimensional laplacian is applied to the reconstructed image, it detects the blood vessels present in the image. Fig. 1 shows the original image, the reconstructed image, and the laplacians of the original and reconstructed images. The laplacian MSE (LMSE) is calculated from the corresponding pixel values of the laplacians of the original and reconstructed images. Since edge information is an image property to which the HVS is highly sensitive, the LMSE has proved successful in our experiments, as it captures information relating to the blood vessel features. The distortion can be computed locally for every pixel,

Fig. 1: (a) Original image (b) Reconstructed image (c) Laplacian of original image (d) Laplacian of reconstructed image

yielding perceptual distortion maps for better visualization of the spatial distribution of distortions. The pixelwise error between the laplacian of the original image and the laplacian of the reconstructed image gives the laplacian error map, as demonstrated in Fig. 2. Such a distortion map can help the expert make a proper decision on the diagnostic quality of the image, and can therefore be more useful and reliable than a global measure in quality assessment applications. The image quality is also evaluated by computing the traditional MSE, PSNR and MSSIM, and the performance of the LMSE is compared against these measures. As before, the laplacian images are inverted so that brighter regions indicate lower distortion and darker regions indicate



Fig. 2: Laplacian error map for (a) CR=4 (b) CR=8 (c) CR=12 (d) CR=16

higher distortion. The laplacian error map gives a clearer visualization of the errors in the blood vessels than the SSIM map, as shown in Fig. 3. As noted above, the LMSE is far more sensitive to distortion in the vessel (edge) regions than in the smooth regions.


Fig. 3: Laplacian error map and SSIM map for (a) CR=12 (b) CR=16

To illustrate this, the laplacian of the 04-test image is taken and a large number of pixels in the nonvessel area are altered manually, which results in a low LMSE value; when only a few vessel pixels are altered instead, the LMSE value is much higher. These effects are shown in Fig. 4. In this example, altering approximately 40,000 nonvessel pixels results in an LMSE of 3, whereas

around 1500 altered vessel pixels gives an LMSE of 20. Hence the LMSE can be used as a good quality measure for retinal images. In some applications, such as telemedicine, it would be helpful


Fig. 4: (a) Laplacian of original image (b) Nonvessel pixels distorted (c) Some vessel pixels distorted (d) More vessel pixels distorted

to know whether a given test image is useful for diagnosis before applying the quality assessment procedure. This is possible if we can obtain an overall qualitative view of the degree of distortion in the reconstructed image. The laplacian produces closed edge contours if the image meets certain smoothness constraints. At low compression ratios, the smoothing in the image is not strong enough to produce such closed-loop formations, whereas a properly scaled laplacian image exhibits small closed-loop formations in the blurred (over-smoothed) regions caused by wavelet compression at moderate and high compression ratios. This indicates that the laplacian localizes the distorted smooth regions by showing loop-like formations, as shown in Fig. 5. Hence the laplacian of the reconstructed image gives an overall idea of whether the reconstructed image is distorted, without requiring a reference image. While conducting the subjective evaluation of images by medical experts, no constraints were placed on viewing time, viewing distance or illumination (lighting) conditions; the experts were allowed to simulate the conditions of their everyday work. The focus was on the experts' assessment of diagnostically acceptable quality of the compressed images, to ensure sufficient diagnostic accuracy. The recent UQI and SSIM measures are widely used and perform well as quality measures in video and web-browsing image compression, because they match well with the general human observer as the final receiver. In a telehealth-care monitoring system, however, the reconstructed images are examined by experts for the detection and diagnosis of various diseases, and slight variations in intensity values may affect the experts' diagnostic accuracy. The images were scored with an MOS of 4.125 by the general subjects but 3.5 by the experts; the experts inferred that the fundus image is likely to


possess some disease. Hence there is a need for quality measures that quantify diagnostic distortion as well as the general perceived image quality. From these experiments, it is found that the image quality is good at a compression ratio of 8:1.


Fig. 5: Formation of closed contours for (a) CR=16 (b) CR=32

4. MULTI-LEAD ECG SIGNALS COMPRESSION

Heartbeat signals generally show considerable similarity between adjacent beats, along with short-term correlation between adjacent samples. Temporal beat-alignment methods can be employed to obtain more efficient ECG compression using long-duration single-lead ECG signals [12]-[16], but the achievable compression ratio is reduced when fewer cycles are employed for compression, and these 2-D compression methods have not been tested on short-duration multi-lead ECG signals, which would enable real-time processing. Mammen and Ramamurthi proposed a multi-lead ECG compression scheme based on classified vector quantization (CVQ) of the m-AZTEC algorithm, which exploited the correlation that exists across channels in the duration parameter [20]; the application of CVQ exploited the cross-channel correlation in the value parameter. Although m-AZTEC provides a high compression ratio, the fidelity of the reconstructed signal is not acceptable because of the step-like discontinuities that occur in the reconstructed ECG signal. Cetin et al. presented a multi-lead ECG compression method based on multirate signal processing and transform coding techniques [21]; it achieves a high compression ratio by exploiting the inherent correlation among the ECG leads. The recorded ECG signals are decorrelated using either an approximate KL transform or the DCT, and the resulting uncorrelated signals are coded using various single-lead ECG coding methods. The quality of the reconstructed signals was evaluated using PRD, which is not a good measure of the distortion introduced by a transform coder: both transforms have noise-filtering characteristics, and the PRD criterion does not weigh the actual diagnostic relevance of compression techniques that should remove only redundant information without sacrificing diagnostic performance. Cohen and Zigel presented compression of multi-lead ECG through multi-lead long-term

prediction (LTP) [22]. The reported method is based on beat segmentation, template matching, adaptive multi-lead codebook creation, LTP coding and coding of the residual using VQ, and its compression performance was shown to be better than that of previously reported methods. Miaou and Yen extended the adaptive vector quantization reported for single-lead ECG to a multichannel adaptive vector quantization (MC-AVQ) algorithm [23], in which AVQ is performed on each channel and across channels to exploit the correlation within a channel and across channels, respectively. The MC-AVQ first performs single-channel AVQ for each ECG channel and then encodes the collection of resulting indexes using an index codebook. Its coding performance was superior to that of single-channel AVQ in CDR, with no additional loss of reconstruction fidelity. However, the complexity and performance of VQ and its variants depend on the selection of good codebook parameters, including the codebook size, vector dimension and distortion threshold.

In this work, a novel two-dimensional (2-D) DWT approach to multi-lead ECG compression is proposed. The short-duration multi-lead ECG signals are first cut and beat-aligned to form a 2-D array; the 2-D DWT is then applied to this array and the transformed wavelet coefficients are coded using the efficient SPIHT coder. The proposed 2-D array construction exploits the inter-lead correlation rather than the intra-lead correlation, which comprises short-term (intrabeat) and long-term (interbeat) correlation. The 2-D array is formed from short-duration multi-lead signals rather than the longer-duration single-lead signals used in the reported methods.
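To make the array construction concrete, the following is an illustrative sketch under several assumptions: the R-peak indices are already available and shared across leads, period normalization is done by simple linear resampling to a hypothetical beat_len, amplitude normalization is omitted, and the SPIHT stage is only indicated; none of these choices are prescribed by the paper.

```python
import numpy as np
import pywt

def beats_to_matrix(signal, r_peaks, beat_len=256):
    """Cut one lead at its R-peaks and resample every beat to beat_len samples
    (period normalization); the segment mean is removed as well."""
    rows = []
    for start, stop in zip(r_peaks[:-1], r_peaks[1:]):
        beat = np.asarray(signal[start:stop], dtype=float)
        t = np.linspace(0, len(beat) - 1, beat_len)
        beat = np.interp(t, np.arange(len(beat)), beat)
        rows.append(beat - beat.mean())
    return np.vstack(rows)

def multilead_array(leads, r_peaks, beat_len=256):
    """Stack the aligned beats of all leads row-wise, so that the subsequent
    2-D DWT can exploit the inter-lead (cross-channel) correlation."""
    return np.vstack([beats_to_matrix(x, r_peaks, beat_len) for x in leads])

# 2-D wavelet decomposition of the constructed array; the resulting
# coefficients would then be fed to a SPIHT coder (not shown here):
# coeffs = pywt.wavedec2(multilead_array(leads, r_peaks), "bior4.4", level=4)
```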

4.1. 2-D Representation of Multi-Lead ECG Signals and Compression

To perform ECG compression using the 2-D DWT and SPIHT, a block coding method is employed. The QRS complex of each heartbeat is detected, and the signal is cut into segments that are aligned with one another to form a 2-D data array [12]. By picking up the R-wave positions, a segment from one R-wave to the next is defined to delineate the ECG segments, and the RR interval is stored as the beat information for the period transformation. Since the lengths of the heartbeat segments generally differ, period normalization is employed to align the beat lengths; mean removal and amplitude normalization further increase the correlation between the ECG beat segments. In summary, the proposed method is implemented in the following steps: QRS detection on the 1-D ECG signal; period segmentation; preprocessing (mean removal, period and amplitude normalization); construction of an M × N 2-D array; slicing the constructed 2-D array into L × L blocks for 2-D wavelet decomposition and SPIHT coding; and coding of the beat-length information and the mean value of each segment. To reveal the


Fig. 6: Strong (a) inter-beat correlation and (b) inter-lead correlation among consecutive ECG cycles from the rearranged ECG matrix after period normalization and matrix conversion.

[Fig. 7 panels: for each of the eight ECG leads, the original signal, its reconstruction at CR = 8:1 (per-lead WEDD/PRD1 values: 2.75/9.65, 3.09/7.49, 2.23/7.54, 2.89/6.41, 2.91/7.91, 3.45/8.47, 2.97/7.33 and 1.95/6.47) and the error signal are plotted over 2048 samples.]

Fig. 7: Original, reconstructed and error signals of the multi-lead ECG signals. The number of ECG beats is nine and the block size is 64 × 64. The reconstructed signals are at CR = 8:1 and the error distribution is shown for each reconstruction.


visual quality of the reconstructed signals, 2048 samples of the original signal and of the signals decompressed at a CR of 8:1 for all ECG records are reproduced in Fig. 7. The proposed method preserves the diagnostic information present in the multi-lead ECGs of our experimental data records. These experimental results also show that the PRD and WWPRD fail to reflect the diagnostic behavior of the compression method adequately.

5. CONCLUSION

In this work, the compression of retinal images and multi-lead ECG signals is studied. The compression methodology is based on preprocessing, the DWT and the SPIHT coding scheme. The two different source signals are compressed on the same coder platform to reduce the cost of the mobile telemedicine system, and the performance of the SPIHT coder is also good for multi-lead ECG signals. The reconstructed images and multi-lead ECG signals are evaluated using objective and subjective measures. The deficiency of some of the quality measures for evaluating the reconstruction of retinal images and ECG signals is demonstrated with different sets of experiments, and the importance of a diagnostic distortion measure for the evaluation of medical images and ECG signals is discussed.

6. REFERENCES

[1] R. H. Eikelboom, K. Yogesan, C. J. Barry, I. J. Constable, M. Tay-Kearney, L. Jitskaia, P. H. House, "Methods and Limits of Digital Image Compression of Retinal Images for Telemedicine," Investigative Ophthalmology and Visual Science, Vol. 41, No. 7, pp. 1916-1924, 2000.
[2] K. Hung and Y. T. Zhang, "Implementation of a WAP-based telemedicine system for patient monitoring," IEEE Trans. Inf. Technol. Biomed., Vol. 7, No. 2, pp. 101-107, 2003.
[3] M. F. A. Rasid and B. Woodward, "Bluetooth telemedicine processor for multichannel biomedical signal transmission via mobile cellular networks," IEEE Trans. Inf. Technol. Biomed., Vol. 9, No. 1, pp. 167-174, 2005.
[4] C. H. Salvador, M. P. Carrasco, M. A. González de Mingo, A. M. Carrero, J. M. Montes, L. S. Martín, M. A. Cavero, I. F. Lozano, and J. L. Monteagudo, "Airmed-Cardio: A GSM and internet services-based system for out-of-hospital follow-up of cardiac patients," IEEE Trans. Inf. Technol. Biomed., Vol. 9, No. 1, pp. 73-85, 2005.
[5] M. S. Manikandan and S. Dandapat, "An efficient wavelet-based cardiac signal compression for telemedicine," in Proc. Indian Conf. Medical Informatics and Telemedicine, IIT Kharagpur, India, pp. 132-136, 2006.
[6] L. Hadzievski, B. Bojovic, V. Vukcevic, P. Belicev, S. Pavlovic, Z. V. Pokrajcic, M. Ostojic, "A novel mobile transtelephonic system with synthesized 12-lead ECG," IEEE Trans. Inf. Technol. Biomed., Vol. 8, No. 4, pp. 428-438, 2004.
[7] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Process., Vol. 41, No. 12, pp. 3445-3462, 1993.
[8] A. Said and W. A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., Vol. 6, No. 3, pp. 243-250, 1996.
[9] J. D. Villasenor, "Alternatives to the discrete cosine transform for irreversible tomographic image compression," IEEE Trans. Med. Imag., Vol. 12, pp. 803-811, 1993.
[10] H. Lee, Y. Kim, E. A. Riskin, A. H. Rowberg, M. S. Frank, "A predictive classified vector quantizer and its subjective quality evaluation for X-ray CT images," IEEE Trans. Med. Imag., Vol. 14, pp. 397-406, 1995.
[11] P. Schelkens, A. Munteanu, J. Barbarien, M. Galca, X. Giro-Nieto, and J. Cornelis, "Wavelet Coding of Volumetric Medical Datasets," IEEE Trans. Med. Imag., Vol. 22, No. 3, pp. 441-458, 2003.
[12] H. Lee and K. M. Buckley, "ECG data compression using cut and align beats approach and 2-D transforms," IEEE Trans. Biomed. Eng., Vol. 46, No. 5, pp. 556-565, 1995.
[13] J. J. Wei, C. J. Chang, N. K. Chou, G. J. Jan, "ECG data compression using truncated singular value decomposition," IEEE Trans. Biomed. Eng., Vol. 5, No. 4, pp. 290-295, 2004.
[14] A. Bilgin, M. W. Marcellin, M. I. Altbach, "Compression of electrocardiogram signals using JPEG2000," IEEE Trans. Consum. Electron., Vol. 49, No. 4, pp. 833-840, 2003.
[15] S. C. Tai, C. C. Sun, W. C. Yan, "2-D ECG compression method based on wavelet transform and modified SPIHT," IEEE Trans. Biomed. Eng., Vol. 52, No. 6, pp. 999-1008, 2005.
[16] H. H. Chou, Y. J. Chen, Y. C. Shiau, T. S. Kuo, "An Effective and Efficient Compression Algorithm for ECG Signals With Irregular Periods," IEEE Trans. Biomed. Eng., Vol. 53, No. 6, pp. 1198-1205, 2006.
[17] C. Li, C. Zheng, C. Tai, "Detection of ECG characteristic points using wavelet transforms," IEEE Trans. Biomed. Eng., Vol. 42, No. 1, pp. 21-28, 1995.
[18] S. Kadambe, R. Murray, G. F. B. Bartels, "Wavelet transform-based QRS complex detector," IEEE Trans. Biomed. Eng., Vol. 46, No. 7, pp. 838-847, 1999.
[19] J. P. Martínez, R. Almeida, S. Olmos, A. P. Rocha, P. Laguna, "A wavelet based ECG delineator: evaluation on standard databases," IEEE Trans. Biomed. Eng., Vol. 51, No. 4, pp. 570-581, 2004.
[20] C. P. Mammen and B. Ramamurthi, "Vector quantization for compression of multichannel ECG," IEEE Trans. Biomed. Eng., Vol. 37, pp. 821-825, 1990.
[21] A. E. Cetin, H. Koymen, M. C. Aydin, "Multichannel ECG data compression by multirate signal processing and transform domain coding techniques," IEEE Trans. Biomed. Eng., Vol. 40, pp. 495-499, 1993.
[22] A. Cohen and Y. Zigel, "Compression of multichannel ECG through multichannel long-term prediction," IEEE Eng. Med. Biol. Mag., Vol. 16, No. 4, pp. 109-115, 1998.
[23] S. G. Miaou and H. L. Yen, "Multichannel ECG compression using multichannel adaptive vector quantization," IEEE Trans. Biomed. Eng., Vol. 48, No. 10, pp. 1203-1207, 2001.
[24] A. S. Al-Fahoum, "Quality assessment of ECG compression techniques using a wavelet-based diagnostic measure," IEEE Trans. Inf. Technol. Biomed., Vol. 10, No. 1, pp. 182-191, 2006.
[25] M. S. Manikandan and S. Dandapat, "Wavelet energy based diagnostic distortion measure for ECG," Biomedical Signal Processing and Control, Vol. 2, No. 2, pp. 80-96, 2007.
[26] P. Cosman, R. M. Gray, R. A. Olshen, "Evaluating the Quality of Compressed Medical Images: SNR, Subjective Rating and Diagnostic Accuracy," Proc. IEEE, Vol. 82, No. 6, pp. 919-932, 1994.
[27] ITU-R Recommendation BT.500-10, "Methodology for the subjective assessment of the quality of the television pictures," 2000.
[28] A. P. Bradley, "Wavelet Visible Difference Predictor," IEEE Trans. on Image Processing, Vol. 8, No. 5, pp. 717-730, 1999.
[29] Z. Wang and A. C. Bovik, "A Universal Image Quality Index," IEEE Signal Processing Letters, Vol. 9, No. 3, pp. 81-84, 2002.
[30] Z. Wang, A. C. Bovik, H. Sheikh, E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Trans. Image Processing, Vol. 13, No. 4, pp. 600-612, 2004.
[31] M. Basu, "Gaussian-Based Edge Detection Methods - A Survey," IEEE Trans. Systems, Man and Cybernetics - Part C: Applications and Reviews, Vol. 32, No. 3, pp. 252-260, 2002.


SPECIALIZED TEXT BINARIZATION TECHNIQUE FOR CAMERA-BASED DOCUMENT IMAGES

T Kasar, J Kumar and A G Ramakrishnan

Medical Intelligence and Language Engineering Laboratory, Department of Electrical Engineering, Indian Institute of Science

Bangalore, INDIA - 560 [email protected], [email protected], [email protected]

ABSTRACT

Complex color documents with both graphics and text, where

the text varies in color and size, call for specialized binariza-

tion techniques. We propose a novel method for binarization

of color documents whereby the foreground text is output as

black and the background as white regardless of the polar-

ity of foreground and background shades. The method em-

ploys an edge-based connected component approach to de-

termine text-like components and binarize them individually.

The threshold for binarization and the logic for inverting the

output are derived from the image data and do not require

any manual tuning. Unlike existing binarization methods,

our technique can handle documents with multi-colored texts

with different background shades. The method is applica-

ble to documents having text of widely varying sizes, usually

not handled by local binarization methods. Experiments on

a broad domain of target document types illustrate the effec-

tiveness and adaptability of the method.

Index Terms— Binarization, Color documents, Camera-

based document analysis

1. INTRODUCTION

In acquiring document images, there has been an increased

use of cameras as an alternative to traditional flat-bed scan-

ners and research towards camera based document analysis

is growing [1]. Digital cameras are compact, easy to use,

portable and offer a high-speed non-contact mechanism for

image acquisition. The use of cameras has greatly eased doc-

ument acquisition and has enabled human interaction with

any type of document. It has several potential applications

like licence plate recognition, road sign recognition, digital

note taking, document archiving and wearable computing. At

the same time, it has also presented us with much more chal-

lenging images for any recognition task. Traditional scanner-

based document analysis systems fail against this new and

promising acquisition mode. Camera images suffer from un-

even lighting, low resolution, blur, and perspective distortion.

Overcoming these challenges will help us tap the potential

advantages of camera-based document analysis.

2. REVIEW OF EARLIER WORK

Binarization often precedes any document analysis and recog-

nition procedures. It is critical to achieve robust binarization

since any error introduced in this stage will affect the sub-

sequent processing steps. The simplest and earliest method

is the global thresholding technique that uses a single thresh-

old to classify image pixels into foreground or background

class. Global thresholding techniques are generally based on

histogram analysis [2, 3]. They work well for images with

well separated foreground and background intensities. How-

ever, most of the document images do not meet this condi-

tion and hence the application of global thresholding meth-

ods is limited. Camera-captured images often exhibit non-

uniform brightness because it is difficult to control the imag-

ing environment as much as we can with the scanner. As

such, global binarization methods are not suitable for cam-

era images. On the other hand, local methods use a dynamic

threshold across the image according to the local informa-

tion. These approaches are generally window-based and the

local threshold for a pixel is computed from the gray values

of the pixels within a window centred at that particular pixel.

Niblack [4] proposed a binarization scheme where the thresh-

old is derived from the local image statistics. The sample

mean μ(x, y) and the standard deviation σ(x, y) within a win-

dow W centred at the pixel location (x,y) are used to compute

the threshold T(x, y) as follows:

T(x, y) = μ(x, y)− k σ(x, y), k = 0.2 (1)

Trier and Jain [5] evaluated 11 popular local thresholding

methods on scanned documents and reported that Niblack’s

method performs the best for optical character recognition

(OCR). The method works well if the window encloses at

least 1-2 characters. However, in homogeneous regions larger

than the size of the window, the method produces a noisy



Fig. 1. An example image with multi-colored textual con-

tent and its gray level histogram. A conventional binarization

technique, using a fixed foreground-background polarity, will

treat some of the characters as background pixels leading to

the loss of some textual information.

output since the expected sample variance becomes the back-

ground noise variance. Sauvola and Pietikainen [6] proposed

an improved version of the Niblack’s method by introducing

a hypothesis that the gray values of the text are close to 0

(Black) while the background pixels are close to 255 (White).

The threshold is computed with the dynamic range of stan-

dard deviation (R) which has the effect of amplifying the con-

tribution of standard deviation in an adaptive manner.

T(x, y) = \mu(x, y) \left[ 1 + k \left( \frac{\sigma(x, y)}{R} - 1 \right) \right] \qquad (2)

where the parameters R and k are set to 128 and 0.5 respec-

tively. This method minimizes the effect of background noise

and is more suitable for document images. However, Sauvola

method fails for images where the assumed hypothesis is not

met and accordingly, Wolf and Jolion [7] proposed an im-

proved threshold estimate by taking the local contrast mea-

sure into account.

T(x, y) = (1 - a)\,\mu(x, y) + a M + a \, \frac{\sigma(x, y)}{S_{max}} \left( \mu(x, y) - M \right) \qquad (3)

where M is the minimum value of the grey levels of the whole

image, Smax is the maximum value of the standard deviations

of all the windows of the image and ‘a’ is a parameter fixed

at 0.5. This method combines Savoula’s robustness with re-

spect to background textures and the segmentation quality of

Niblack’s method. However it requires two passes since one

of the threshold decision parameters, Smax, is the maximum of

the standard deviations over all the windows of the image.
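For reference, a minimal sketch of the three local thresholds of Eqs. (1)-(3) for a single pixel is given below; a complete binarizer would slide the window over the image (or use integral images), and the parameter values simply follow the text while the function name is illustrative.

```python
import numpy as np

def local_thresholds(window, img_min, s_max, k_niblack=0.2,
                     k_sauvola=0.5, R=128.0, a=0.5):
    """Thresholds of Eqs. (1)-(3) for the pixel at the centre of `window`.
    `img_min` is the minimum grey level of the whole image and `s_max` the
    maximum window standard deviation, so Wolf's method needs a first pass."""
    mu = float(np.mean(window))
    sigma = float(np.std(window))
    t_niblack = mu - k_niblack * sigma                          # Eq. (1)
    t_sauvola = mu * (1.0 + k_sauvola * (sigma / R - 1.0))      # Eq. (2)
    t_wolf = ((1.0 - a) * mu + a * img_min
              + a * (sigma / s_max) * (mu - img_min))           # Eq. (3)
    return t_niblack, t_sauvola, t_wolf
```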

With the recent developments on document types, more

specialized binarization techniques are required to handle com-

plex documents having both graphics and text. We often en-

counter text of different colors in a document image as shown

in Fig. 1. Conventional methods assume that the polarity of

the foreground-background intensity is known a priori. The

text is generally assumed to be either bright on a dark back-

ground or vice versa. Binarization using a single threshold

on such images, without a priori information of the polarity

of foreground-background intensities, will lead to loss of tex-

tual information as some of the text may be assigned as back-

ground. The characters once lost cannot be retrieved back

and are not available for further processing. Possible solu-

tions need to be sought to overcome this drawback so that any

type of document could be properly binarized without the loss

of textual information.

3. SPECIALIZED TEXT BINARIZER

Text is the most important information in a document. We

propose a novel method to binarize camera-captured color

document images, whereby the foreground text is output as

black and the background as white irrespective of the original

polarities of foreground-background shades. The proposed

method uses an edge-based connected component approach

and determines the threshold for each component individu-

ally. Canny edge detection [8] is performed individually on

each channel of the color image and the edge map E is ob-

tained by combining the three edge images as follows

E = ER ∨ EG ∨ EB (4)

Here, ER, EG and EB are the edge images corresponding to

the three color channels and ∨ denotes the logical OR op-

eration. We have used the thresholds 0.2 and 0.3 for the

hysteresis thresholding step of Canny edge detection. An 8-

connected component labeling follows the edge detection step

and the associated bounding box information is computed.

We call each component, thus obtained, an edge-box (EB).

We make some sensible assumptions about the document and

use the area and the aspect ratios of the EBs to filter out the

obvious non-text regions. The aspect ratio is constrained to

lie between 0.1 and 10 to eliminate highly elongated regions.

The size of the EB should be greater than 15 pixels but smaller

than 1/5th of the image dimension to be considered for fur-

ther processing. Since edge detection captures both the inner

and outer boundaries of the characters, it is possible that an

EB may completely enclose one or more EBs as illustrated

in Fig. 2(a). If a particular EB has exactly one or two EBs

that lie completely inside it, the internal EBs can be conve-

niently ignored as it corresponds to the inner boundaries of

the text characters. On the other hand, if it completely en-

closes three or more EBs, only the internal EBs are retained

while the outer EB is removed as such a component does not

represent a text character. Thus, the unwanted components

are filtered out by subjecting each edge component to the fol-

lowing constraint:

if (Nint < 3) { Reject EBint, Accept EBout }

else { Reject EBout, Accept EBint }

where EBint denotes the EBs that lie completely inside the

current EB under consideration and Nint is the number of


(a)

(b)

Fig. 2. (a) Edge-boxes for the English alphabet and numer-

als. Note that there is no character that completely encloses

more than two edge components. (b) The foreground and the

background pixels of each edge component

EBint. These constraints on the edge components effectively

remove the obvious non-text elements, while retaining all the

text-like elements. Only these filtered set of EBs are consid-

ered for binarization.
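A rough sketch of the edge-box generation and filtering logic described above is given below, using OpenCV for the per-channel Canny edge maps and the 8-connected component labelling; the Canny thresholds, the interpretation of the size limit as the larger box side, and the helper names are assumptions of this illustration.

```python
import cv2
import numpy as np

def edge_boxes(color_img, lo=50, hi=150):
    """Edge map E = ER | EG | EB (Eq. 4) and the bounding boxes of its
    8-connected components; the Canny thresholds are placeholders."""
    edges = np.zeros(color_img.shape[:2], np.uint8)
    for channel in cv2.split(color_img):
        edges |= cv2.Canny(channel, lo, hi)
    n, _, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    return [tuple(stats[i, :4]) for i in range(1, n)]   # (x, y, w, h); label 0 is background

def contains(outer, inner):
    """True if bounding box `inner` lies completely inside `outer`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox < ix and oy < iy and ix + iw < ox + ow and iy + ih < oy + oh

def filter_edge_boxes(boxes, img_shape):
    """Aspect-ratio/size limits followed by the containment rule: an EB that
    encloses fewer than three EBs keeps only itself, otherwise only the
    enclosed EBs are kept (the size limit is taken on the larger box side)."""
    max_dim = max(img_shape[:2]) / 5.0
    boxes = [b for b in boxes
             if 0.1 <= b[2] / float(b[3]) <= 10.0 and 15 < max(b[2], b[3]) < max_dim]
    rejected = set()
    for i, outer in enumerate(boxes):
        inner = [j for j, b in enumerate(boxes) if j != i and contains(outer, b)]
        if len(inner) < 3:
            rejected.update(inner)    # inner boundaries of a character: ignore them
        else:
            rejected.add(i)           # a box enclosing 3+ boxes is treated as non-text
    return [b for i, b in enumerate(boxes) if i not in rejected]
```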

For each EB, we estimate the foreground and background

intensities and the threshold is computed individually. Fig.

2(b) shows the foreground and the background pixels which

are used for obtaining the threshold and inversion of the bi-

nary output. The foreground intensity is computed as the

mean gray-level intensity of the pixels that correspond to the

edge pixels.

F_{EB} = \frac{1}{N_E} \sum_{(x,y) \in E} I(x, y) \qquad (5)

where E represents the edge pixels, I(x,y) represents the in-

tensity value at the pixel (x,y) and NE is the number of edge

pixels in an edge component. For obtaining the background

intensity, we consider three pixels each at the periphery of the

corners of the bounding box as follows

B = {I(x−1, y−1), I(x−1, y), I(x, y−1), I(x+w+1, y−1), I(x+w, y−1), I(x+w+1, y), I(x−1, y+h+1), I(x−1, y+h), I(x, y+h+1), I(x+w+1, y+h+1), I(x+w, y+h+1), I(x+w+1, y+h)}

where (x, y) represent the coordinates of the top-left corner

of the bounding-box of each edge component and w and h are

its width and height, respectively. The local background in-

tensity is then computed as the median intensity of these 12

background pixels.

BEB = median(B) (6)

Assuming that each character is of uniform color, we bina-

rize each edge component using the estimated foreground in-

tensity as the threshold (TEB). Depending on whether the

foreground intensity is higher or lower than that of the back-

ground, each binarized output is suitably inverted so that the

foreground text is always black and the background, always

white.

T_{EB} = \begin{cases} F_{EB}, & \text{if } F_{EB} < B_{EB} \\ 255 - F_{EB}, & \text{if } F_{EB} > B_{EB} \end{cases} \qquad (7)
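A minimal sketch of the per-component threshold and inversion step of Eqs. (5)-(7) is shown below, assuming the grayscale image, the component's edge-pixel mask and its bounding box are available; clipping of the twelve periphery pixels at the image border is omitted for brevity, and the names are illustrative.

```python
import numpy as np

def binarize_component(gray, edge_mask, box):
    """Binarize one edge-box so that text is black (0) on white (255)."""
    x, y, w, h = box
    f_eb = float(gray[edge_mask].mean())                 # Eq. (5): foreground intensity
    periphery = [(x - 1, y - 1), (x - 1, y), (x, y - 1),
                 (x + w + 1, y - 1), (x + w, y - 1), (x + w + 1, y),
                 (x - 1, y + h + 1), (x - 1, y + h), (x, y + h + 1),
                 (x + w + 1, y + h + 1), (x + w, y + h + 1), (x + w + 1, y + h)]
    b_eb = float(np.median([gray[v, u] for u, v in periphery]))   # Eq. (6)
    patch = gray[y:y + h + 1, x:x + w + 1]
    if f_eb < b_eb:                                      # dark text on bright background
        return np.where(patch <= f_eb, 0, 255).astype(np.uint8)
    return np.where(patch >= f_eb, 0, 255).astype(np.uint8)   # bright text: inverted output
```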

4. EXPERIMENTS

The test images used in this work are acquired using a Sony

digital still camera at a resolution of 1280× 960. The images

are taken from both physical documents such as book covers

and newspapers and non-paper document images like text on

3-D real world objects. Fig. 3 compares the results of our

method with some popular local binarization techniques, viz,

Niblack’s method, Sauvola’s method and Wolf’s method on a

document image having large variation in text sizes with the

smallest and the largest components being 5 × 16 and 414

× 550 respectively. Clearly, these local binarization methods

fail when the size of the window is smaller than the stroke

width. A large character is broken up into several compo-

nents and undesirable voids occur within thick characters. It

requires a priori knowledge of the polarity of foreground-

background intensities as well. On the other hand, our method

can deal with characters of any size and color as it only uses

edge connectedness. The generality of the algorithm is tested

on 50 complex color document images and is found to have a

high adaptivity and performance. Some results of binarization

using our method are shown in Fig. 4. The algorithm deals

only with the textual information and it does not threshold the

edge components that were already filtered out. In the result-

ing binary images, as desired, all the text regions are output

as black while the background as white, irrespective of their

colors in the input images.


(a) Input image (b) Niblack (c) Sauvola

(d) Wolf (e) Proposed

Fig. 3. Comparison of the proposed method with some pop-

ular local binarization methods for a document image with

large variation in text size. While our method is able to han-

dle characters of any size and color, all other methods fail to

binarize properly the components larger than the size of the

window (35 × 35 used here).

5. CONCLUSIONS AND FUTURE WORK

We have developed a specialized binarization technique well-

suited for camera-based document images. It has good adapt-

ability without the need for manual tuning and can be applied

to a broad domain of target document types. It simultane-

ously handles the ambiguity of the polarity of the foreground-

background intensities and the algorithm’s dependency on the

parameters. The edge-box analysis captures all the characters,

irrespective of size and color, thereby enabling us to perform

local binarization without the need to specify any window.

The proposed method retains the useful textual information

more accurately and thus, has a wider range of target docu-

ment types compared to conventional methods.

The edge detection method is good in finding the charac-

ter boundaries irrespective of the foreground-background po-

larity. However, if the background is textured, the edge com-

ponents may not be detected correctly due to edges from the

background. This can affect the performance of our edge-box

filtering strategy. Overcoming these challenges is considered

as a future extension to this work.

6. REFERENCES

[1] D. Doermann, J. Liang, and H. Li, "Progress in camera-based document image analysis," ICDAR, vol. 1, pp. 606-615, 2003.
[2] J. N. Kapur, P. K. Sahoo, and A. K. C. Wong, "A new method for gray-level picture thresholding using the entropy of the histogram," Computer Vision Graphics Image Process., vol. 29, pp. 273-285, 1985.

Fig. 4. Some examples of binarization results obtained using

the proposed method. All the text regions are output as black

and the background as white, irrespective of their original col-

ors in the input images.

[3] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Systems Man Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[4] W. Niblack, "An introduction to digital image processing," Prentice Hall, pp. 115-116, 1986.
[5] O. D. Trier and A. K. Jain, "Goal-directed evaluation of binarization methods," IEEE Trans. PAMI, vol. 17, no. 12, pp. 1191-1201, 1995.
[6] J. Sauvola and M. Pietikainen, "Adaptive document image binarization," Pattern Recognition, vol. 33, pp. 225-236, 2000.
[7] C. Wolf and J. M. Jolion, "Extraction and recognition of artificial text in multimedia documents," Pattern Analysis and Applications, vol. 6, no. 4, pp. 309-326, 2003.
[8] J. Canny, "A computational approach to edge detection," IEEE Trans. PAMI, vol. 8, no. 6, pp. 679-698, 1986.


A MULTI-LINGUAL CHARACTER RECOGNITION SYSTEM BASED ON SUBSPACE METHODS AND NEURAL CLASSIFIERS

Manjunath Aradhya V N1, Ashok Rao2 and Hemantha Kumar G1

1Dept of Studies in Computer Science, University of Mysore, Mysore, INDIA - 570006mukesh [email protected]

2Dept of Computer Science, J.S.S. College, Ooty Road, Mysore, INDIA - [email protected]

ABSTRACT

In this paper, we propose a character recognition system for

printed multi-lingual south Indian scripts (Kannada, Telugu,

Tamil and Malayalam) pertaining to consonants, vowels and

modifiers. We extract features by some popular subspace

methods like Principal Component Analysis (PCA), 2Dimen-

sional - PCA (2D-PCA), Diagonal-PCA, Fisher Linear Dis-

criminant Analysis (FLD), and 2D-FLD and for classification

purpose we use different neural network techniques such as

Radial Basis Function, Generalized Regression Neural Net-

work, Probabilistic Neural Network, Self-Organizing Maps,

and Linear Vector Quantization for recognizing. The system

shows good performance for scripts printed as clear document

and even for documents corrupted by variety of noise.

1. INTRODUCTION

Optical Character Recognition (OCR) is one of the most suc-

cessful approaches to automatic pattern recognition for its

various application potential in banks, post offices, defense

organizations, reading aid for the blind, library automation,

language processing and multi-media design. India is the

most multi-lingual multi-script country, where a single doc-

ument page (e.g. passport application form, an examination

question paper, a money order form, bank account opening

application form, train reservation form etc.,) may contain

text in two or more language scripts. Optical Character Recog-

nition is of special significance for a multi-lingual country

like India having sixteen major national and over hundred re-

gional languages. Several methods have been proposed in the

literature for recognizing English, Chinese, Japanese and Ara-

bic script. However, for Indian scripts some pioneering work

has been done on Telugu [1], Bangla and Devnagari. The ma-

jor contribution in this area is the Devnagari OCR system de-

veloped by Chaudhuri B.B [2], which is commercially avail-

able. Another important OCR system is also proposed in the

literature that can read two Indian language scripts: Bangla

and Devnagari (Hindi), the most popular ones in Indian sub-

continent by Chaudhuri B.B. and U. Pal [3]. The character

recognition process from printed documents containing Hindi

and Telugu text is described in [4]. Some of the works related

to Telugu, Tamil and Kannada OCR can be seen in [5, 6, 7].

A survey on Indian script character recognition is presented

in [8]. The paper discusses a review of OCR work done on

Indian language scripts and different methodologies applied

in OCR development in international scenario.

Subspace analysis is an effective technique for dimension-

ality reduction, which aims at finding good representation in

low-dimensional space of inherently high-dimensional data.

There are many popular methods to extract features, amongst

which PCA and FLD are the state-of-the art methods used

widely in the areas of computer vision and pattern recogni-

tion [9]. Using these techniques, an image can be efficiently

represented as belonging to a subspace of low dimensional-

ity such that it retains much information of the original image

space. The features in such subspace provide more salient and

richer information for recognition than the raw image. Hence,

in this paper we study the performance of different subspace

methods under multi-lingual character image of south Indian

scripts. Different neural networks techniques have also been

addressed to show the performance of the system. To the best

of our knowledge, this is the first report of its kind to rec-

ognize under multi-lingual characters of south Indian scripts

using subspace methods and neural classifiers.

The organization of this paper is as follows. In section

2, an overview on subspace methods and artificial neural net-

works are reported. Experimental results are reported in sec-

tion 3. Finally, discussion and conclusions are drawn at the

end.

2. OVERVIEW OF SUBSPACE METHODS ANDARTIFICIAL NEURAL NETWORKS

Feature extraction is the identification of appropriate content

to characterize the component images distinctly. There are

many popular methods to extract features, amongst which

PCA and FLD are the state-of-the art methods used in data

representation technique widely used in the area of computer


vision and pattern recognition. Using these techniques, an

image can be efficiently represented as a feature vector in a

low dimension space. The features in such subspace provide

more salient and richer information for recognition than the

raw image. In this work, we have explored some of the pop-

ular subspace methods such as PCA [10], 2D-PCA [11], Dia-

PCA [12], FLD [13] and 2D-FLD [14] methods for feature

extraction.

Artificial neural networks are simplified models of highly

interconnected neural computing elements that have the abil-

ity to respond to input stimuli and to learn and adapt to the

environment. In this work we have used different neural net-

works such as Radial basis function (RBF), generalized re-

gression neural network (GRNN), probabilistic neural net-

work (PNN), self-organizing maps (SOM) and linear vector

quantization (LVQ). Detailed description regarding these ar-

chitectures can be seen in [15].
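As a simple illustration of the subspace step, the sketch below extracts PCA features from flattened 50x50 character images and classifies with a nearest-neighbour rule; this stands in for the neural classifiers actually used in the paper, and all names and the choice k = 40 are illustrative.

```python
import numpy as np

def pca_features(train_imgs, k=40):
    """Project flattened 50x50 character images onto the top-k principal axes."""
    X = np.array([im.ravel() for im in train_imgs], dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)  # rows of vt = principal directions
    basis = vt[:k].T                                          # (2500, k) projection matrix
    return mean, basis, (X - mean) @ basis                    # training features

def classify(img, mean, basis, train_feats, train_labels):
    """Nearest-neighbour classification in the k-dimensional subspace."""
    f = (img.ravel().astype(float) - mean) @ basis
    return train_labels[int(np.argmin(np.linalg.norm(train_feats - f, axis=1)))]
```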

3. EXPERIMENTAL RESULTS

This section deals with checking the performance of proposed

methods with a dataset containing various characters pertain-

ing to south Indian languages. Each experiment is repeated

25 times by varying the number of projection vectors k (where k = 1, 2, ..., 20, 25, 30, 35, 40, and 45). Since k has a consider-

able impact on recognition accuracy, we chose the value that

corresponds to best classification result on the image set. All

our experiments are carried out on P4, 3GHz CPU and 512

MB RAM memory under Matlab 7.0 platform.

3.1. Experiment on printed clear data

We have evaluated the performance of the proposed system

of the recognition process on consonants and vowels of four

south Indian scripts. The characters are collected in a sys-

tematic manner from the printed pages scanned on a HP 2400

scanjet scanner. The size of the input character is normal-

ized to 50x50. In this experiment, we consider four fonts for

each of these languages and totally around 1,00,000 samples

for testing purpose. The total number of classes pertaining

to consonants, vowels and modifiers numbered 881. Results

pertaining to printed clear data is reported in Table 1. Ta-

ble 1 shows the performance comparison of different sub-

space methods when combined with different neural classi-

fiers (Recognition accuracy is defined as the ratio of correctly

classified images to the total test samples). It can be seen from

the Table 1 that, 2D-PCA with LVQ performs better compared

to other methods in terms of accuracy. The results obtained

from GRNN and PNN classifiers are also competitive when

compared to other NN classifier.

3.2. Experiment on printed data corrupted by noise

While one wishes for noise-free data, this is almost impossible

due to image acquisition problems. To work with data cor-

rupted by noise and to show that the proposed method is robust,

we conducted a series of experiments on noisy images. For

this we have modeled different noise condition using five dif-

ferent continuous distributions namely Beta, Weibull, Expo-

nential, Gaussian (Wide band), & Salt and Pepper noise and

used in their discretized version. For this, we randomly select

one character image from each class and generate 10 corre-

sponding noisy images for each distribution by varying noise

density from 0.1 to 1.0. In total for each distribution we have

8810 noisy character image for testing. Some of the sample

noisy images are shown in Figure 1. Performance comparison

with different subspace method combined with different NN

classifiers for five different noise distributions are reported in

Table 2. Moreover, we have noticed the following from this

experiment:

1. FLD-based methods are more robust under all the noise

conditions considered.

2. FLD-PNN combination exhibits better robustness un-

der beta noise when compared to other methods.

3. 2D-FLD with GRNN shows robust behavior under Weibull

and Exponential noise conditions.

4. 2D-FLD with PNN method performs well under Gaus-

sian (Narrow and Wide band) and salt and pepper noise

conditions.

4. DISCUSSION AND CONCLUSIONS

Research and evolution in the field of printed OCR for In-

dian scripts has many specific and generic applications. Rec-

ognizing multi-lingual characters for Indian scripts is really

interesting and challenging. Selection of a feature extrac-

tion method is probably the single most important factor in

achieving high recognition performance. Earlier multilingual

works consider Devnagari and Bengali which are feature wise

”closer” to each other. Here we have considered Tamil and

Malayalam (feature wise closer) and Kannada and Telugu (fea-

ture wise closer) but between one group and another, signif-

icant difference exists. In this paper we have successfully

shown the performance of subspace methods and neural clas-

sifiers on recognizing multi-lingual characters of south Indian

scripts. Performance of a system on clean and noisy data dis-

played the effectiveness of the proposed methods. 2D-PCA

with LVQ performed well under clean data. Performance of

FLD with GRNN and PNN based methods showed encour-

aging results for noisy data. This work is being extended to

include modified characters and mixtures of Devnagari, Dra-

vidian scripts and English characters.


Fig. 1. Samples of the six different noise distributions, from top: Beta, Weibull, Exponential, Gaussian (narrow and wide band) and Salt and Pepper

5. REFERENCES

[1] Rajasekharan S. N. S. and Deekshatulu B. L., "Generation and recognition of printed Telugu characters," Computer Graphics and Image Processing, vol. 6, pp. 335-360, 1977.
[2] Chaudhuri B. B. and Pal U., "A complete Bangla OCR system," Pattern Recognition, vol. 31, pp. 531-549, 1998.
[3] Chaudhuri B. B. and Pal U., "An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)," in Proceedings of Intl Conf on Document Analysis and Recognition, 1997, pp. 1011-1015.
[4] Jawahar C. V., Pavan Kumar M. N. S. S. K., and Ravi Kiran S. S., "A bilingual OCR for Hindi-Telugu documents and its applications," in Proceedings of Intl Conf on Document Analysis and Recognition, 2003, pp. 656-660.
[5] Atul Negi, Chakravarthy B., and Krishna B., "An OCR system for Telugu," in Proceedings of Intl Conf on Document Analysis and Recognition, 2001, pp. 10-13.
[6] Seethalakshmi R., Sreeranjani T. R., and Balachandar, "Optical character recognition for printed Tamil text using Unicode," Journal of Zhejiang University Science, vol. 6A(11), pp. 1297-1305, 2005.
[7] Ashwin T. V. and P. S. Sastry, "A font and size-independent OCR system for printed Kannada documents using support vector machines," Journal of Sadhana, vol. 27, pp. 35-58, 2002.
[8] Pal U. and B. B. Chaudhuri, "Indian script character recognition: A survey," Pattern Recognition, vol. 37, pp. 1887-1898, 2004.
[9] Jain and Li, Handbook of Face Recognition, Springer, 2005.
[10] Turk M. and Pentland A., "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, pp. 71-86, 1991.
[11] Yang J., Zhang D., Frangi A. F., and Yang J., "Two-dimensional PCA: a new approach to appearance-based face representation and recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, pp. 131-137, 2004.
[12] Daoqiang Z., Zhou Z. H., and Chen S., "Diagonal principal component analysis for face recognition," Pattern Recognition, vol. 39, pp. 140-142, 2006.
[13] Belhumeur P. N., Hespanha J. P., and Kriegman D. J., "Eigenfaces vs Fisherfaces: Recognition using class specific linear projection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 711-720, 1997.
[14] Huilin X., Swamy M. N. S., and Ahmad M. O., "Two-dimensional FLD for face recognition," Pattern Recognition, vol. 38, pp. 1121-1124, 2005.
[15] Patterson D. W., Artificial Neural Networks, Prentice Hall, 1995.


Table 1. Performance Comparison for different NN classifiers under clean data

Subspace Methods    RBF(%)    GRNN(%)    PNN(%)    LVQ(%)    SoM(%)

PCA 95.41 97.52 97.21 98.43 66.40

2DPCA 96.12 98.14 97.25 98.12 67.54

Dia-PCA 94.30 96.12 96.30 96.02 66.98

FLD 95.21 95.78 95.87 96.04 62.80

2DFLD 95.05 94.65 96.10 95.45 64.11

Table 2. Performance comparison with different NN classifiers under various noise conditions

Noise           Subspace Methods   RBF(%)   GRNN(%)   PNN(%)   LVQ(%)   SoM(%)
Beta            PCA                72.1     92.2      97.76    64.1     55.6
                2D-PCA             73.2     92.0      98.1     64.9     55.6
                Dia-PCA            70.2     90.3      96.5     64.0     55.0
                FLD                74.2     93.4      96.78    84.5     56.0
                2D-FLD             72.9     93.6      96.01    81.4     50.4
Weibull         PCA                66.6     64.27     74.05    60.0     52.4
                2D-PCA             66.9     66.8      78.4     64.2     55.7
                Dia-PCA            64.2     61.7      73.6     59.9     49.9
                FLD                70.2     62.24     72.07    66.06    52.1
                2D-FLD             69.8     62.0      70.5     65.7     52.1
Exponential     PCA                67.2     95.8      98.29    59.5     52.8
                2D-PCA             68.4     95.6      98.26    60.7     55.4
                Dia-PCA            67.3     94.8      97.4     60.1     57.2
                FLD                65.6     96.71     98.62    78.6     54.5
                2D-FLD             62.9     95.6      98.6     75.6     52.3
Gaussian        PCA                70.2     68.23     77.78    60.2     50.6
                2D-PCA             71.5     70.5      78.8     61.8     51.2
                Dia-PCA            71.2     66.5      76.2     59.8     49.6
                FLD                70.2     70.24     79.76    60.0     52.6
                2D-FLD             68.6     68.9      78.6     58.8     52.1
Salt & pepper   PCA                71.0     94.62     99.68    61.7     49.4
                2D-PCA             71.4     96.8      99.68    64.8     58.6
                Dia-PCA            71.5     92.5      98.5     61.0     55.2
                FLD                73.6     94.25     99.66    82.2     50.1
                2D-FLD             73.1     94.0      99.42    81.5     51.0


TWO-LEVEL SEARCH ALGORITHM FOR MOTION ESTIMATION

Deepak J. Jayaswal1 and Mukesh A. Zaveri2

1Electronics and Telecommunication Department, St. Francis Institute of Technology, Borivali, Mumbai 400103, India

2Computer Engineering Department, S.V. National Institute of Technology, Surat 395007, Gujarat, India
Email: [email protected], [email protected]

ABSTRACT

Motion estimation is the process of estimating the other frames of a video sequence from its I-frames. These I-frames carry the largest amount of information in the video sequence and are repeated in the sequence of I, P and B frames at a fixed interval that depends on the type of video being transmitted. Many algorithms have been reported in the literature for estimating the successive frames from these I-frames. These algorithms are optimal, sub-optimal or non-sub-optimal in nature. The optimal algorithm performs best but is computationally very expensive. The sub-optimal algorithms are fast in terms of computation but slightly distorted in terms of reconstruction quality, whereas the non-sub-optimal algorithms are slow in terms of computation but good in terms of reconstruction quality. In this paper, we propose an algorithm, namely Two-Level Search (2LS), that exploits the best of the sub-optimal and non-sub-optimal algorithms. In the 2LS algorithm, points are selected around the center of the search window of the current block at a step size of one. The mean at each of these selected points is calculated and compared with the mean of the reference block. The few search points, namely the 'q' points with the closest means, are retained, and the macro-block at each of these 'q' points is compared with the reference block to find the best match.

1. INTRODUCTION

Motion estimation examines the movement of objects in an image sequence and obtains the vectors representing the estimated motion. A motion video is defined as a time-ordered sequence of images called frames. Each successive frame is correlated along the temporal dimension with the previous frame; this strong interframe correlation can be exploited to predict the current frame from the previous (reference) frame. Motion estimation generates the motion vectors that are used for reconstructing the current frame at the receiver end. Along with the encoded and transmitted motion vectors, the error frame between the regenerated current frame and the actual current frame is also sent, which reduces the number of bits needed to convey the information. The technique used for motion estimation is based on block matching. The most accurate block matching algorithm (BMA) is the exhaustive full-search (FS) method, which evaluates all possible macro-blocks (MBs) over a predetermined search window of size (2p+1) x (2p+1) to find the best match. The estimated motion vector corresponds to the best match achieved for a predefined block distortion measure. The only disadvantage of this method, and perhaps its biggest flaw, is the high computational cost associated with it. To reduce the computations, a number of algorithms have been proposed, such as the successive elimination algorithm (SEA) [1], three-step search (TSS) [2], new three-step search [3], four-step search (4SS) [4], efficient four-step search [5], unrestricted center-biased diamond search (UCBDS) [6] and cross search [7]. Among these, the SEA [1] is similar to the full search except that it eliminates certain search points based on Minkowski's inequality. A further reduction in the number of search points is achieved by the TSS [2], which starts with a step having nine uniformly spaced search points that get closer after every step until the step size reduces to 1. The best candidate search point in the previous step becomes the center of the current step.


In this paper, an algorithm is proposed for video coding that aims to reduce the computational burden of motion estimation and to find a good trade-off between the quality obtained, close to that of the FS algorithm, and the reduction in computational load achieved by fast methods such as fast full search [8], complexity-bounded motion estimation [9], the new fast algorithm for the estimation of block motion vectors [10], the dynamic search window algorithm [11], predictive coding based on efficient motion estimation [12], displacement measurement applied to interframe image coding [13], a new efficient block-matching algorithm for motion estimation [14] and fast multiresolution motion estimation for wavelet-based scalable video coding [15]. The proposed algorithm, 2LS, has been designed and optimized for environments, e.g., mobile communication, in which low power consumption is mandatory. The paper is organized as follows: Section II introduces the 2-Level Search pattern and explains the algorithm in detail. Section III presents simulation results, where the proposed algorithm is also compared with other algorithms such as FS, SEA, TSS, 4SS and UCBDS.

2. 2-LEVEL SEARCH (2LS)

In a block-based motion estimation algorithm, each frame is first divided into MBs of size N x N pixels. Block sizes of 16 x 16 are used in MPEG-1, MPEG-2, H.261 and H.263. For a selected block in one image, the algorithm tries to find a similar block of the same size in the second image. The search for each block match is usually performed over an M x M search area, requiring M x M candidate search points per block when full search is used. As mentioned earlier, this is computationally too expensive. Hence, our main objective is to choose a suitable subset of these M x M points for a sub-optimal version of the search algorithm. The proposed algorithm uses a center-biased search pattern with (2p+1) x (2p+1) checking points as the first-level search. At this level, the mean values are compared at the (2p+1) x (2p+1) locations, and the first 'q' points with the minimum mean-value difference are chosen for the next-level search. In the second-level search, the mean absolute difference (MAD) forms the objective cost function for evaluating the 'q' points. The point with the minimum block distortion measure (BDM) among the 'q' checking points is found, and the motion vector is taken in the direction of this minimum-BDM point.

The detailed stepwise description of the algorithm is given below.

Step 1: Compute the mean value m(i, j) of the N x N MB anchored at each pixel location (i, j) of the reference frame I:

Mean I = [ m(i, j) ],  i = 1, …, Row − MB size + 1;  j = 1, …, col − MB size + 1

Step 2: The mean value of each MB in the target frame is compared with the (2p+1)^2 locations in the reference frame with a step size of one, and the first 'q' points with the minimum mean-value difference are chosen for the next-level search (Fig. 1-(a) and Fig. 1-(b)).

Fig. 1-(a) Level One Search Sequence

Mean T =
m(1,1)        m(1,9)        …   m(1, col − MB size + 1)
m(9,1)        m(9,9)        …   m(9, col − MB size + 1)
…
m(Row − MB size + 1, 1)     …   m(Row − MB size + 1, col − MB size + 1)



Step 3: The direction of the overall motion vector is taken as that of the minimum-BDM point among the 'q' search points.
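To make the two-level procedure concrete, the sketch below restates Steps 1-3 in NumPy for a single macro-block. This is an illustration only, not the authors' implementation; the function and variable names are hypothetical, and the Step-1 means are recomputed inside the loop for clarity rather than read from the precomputed Mean I / Mean T matrices.

import numpy as np

def two_level_search(ref_frame, cur_frame, top, left, N=16, p=7, q=5):
    """Two-level search (2LS) sketch for the NxN macro-block of cur_frame at (top, left)."""
    cur_block = cur_frame[top:top+N, left:left+N].astype(float)
    cur_mean = cur_block.mean()

    # Level 1: over the (2p+1)x(2p+1) candidate positions, keep the q candidates
    # whose block means are closest to the current block's mean.
    candidates = []
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + N > ref_frame.shape[0] or x + N > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            ref_block = ref_frame[y:y+N, x:x+N].astype(float)
            candidates.append((abs(ref_block.mean() - cur_mean), dy, dx))
    candidates.sort()
    shortlist = candidates[:q]

    # Level 2: evaluate the MAD (the BDM) only at the q shortlisted points.
    best_cost, best_mv = np.inf, (0, 0)
    for _, dy, dx in shortlist:
        y, x = top + dy, left + dx
        ref_block = ref_frame[y:y+N, x:x+N].astype(float)
        cost = np.mean(np.abs(cur_block - ref_block))
        if cost < best_cost:
            best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost

In the paper's formulation the block means are computed once per frame in Step 1, so the first level reduces to comparisons of stored values; only the second level touches pixel data, which is what keeps the number of full block comparisons down to 'q'.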

Fig. 1-(b) Level Two Search Sequence

3. SIMULATION AND RESULTS

We have evaluated our algorithm using a large number of standard video clips of different nature. The simulation results for some of these clips are described here. The first video clip used for the experiments is 'Miss America', which shows a person presenting a report. This video has only subtle lateral movement of the person, with a stationary background. The second video is the 'Tennis' sequence, which shows a person playing table tennis; in the clip the person serves by tossing the ball in the air off his racquet before serving. The second clip has a larger amount of motion than the first, and the effects of camera zooming and panning are also present. The third clip is the 'Flower Garden' sequence, in which a camera moves laterally while recording a garden; the amount of motion is extremely large, and a large number of new objects appear in successive frames. The 2LS algorithm is simulated using the luminance component of the first 80 frames of the 'Miss America', 'Tennis' and 'Flower Garden' sequences. The size of each frame is 352 x 240 pixels, quantized uniformly to 8 bits. The mean absolute difference (MAD) distortion function is used as the BDM. The performance is evaluated using three parameters: PSNR, the average number of search points and the mean square error (MSE). The proposed algorithm, 2LS, is compared with the FS, SEA, TSS, 4SS and UCBDS algorithms.

The MSE indicates how well the search points are selected throughout the sequence and also reflects the quality of the reconstructed video; naturally, the minimum MSE is expected from the best algorithm. The average number of search points indicates the computational complexity, i.e., the fewer the search points, the lower the computational complexity. The best algorithm is therefore the one that gives the minimum MSE with very few search points. Table I lists the average number of search points required for motion estimation. From Table I it is clear that the proposed algorithm, 2LS, performs the best in comparison with the other algorithms. It is found experimentally that 2LS needs only about (2p+1) search points for the computation of the motion vectors, which is much less than that of the other algorithms.

Table I: Average no. of search points per motion vector for the first 80 frames

Algorithm   Miss America   Flower Garden   Tennis
FS          274            274             274
SEA         105.31         152.53          175.62
TSS         26             26              26
4SS         27.22          26.06           27.34
UCBDS       24.74          26.72           23.97
2LS         10             10              10

The average MSE is obtained from the estimated and original frames over the first 80 frames. These values are given in Table II and Table III for 8 x 8 and 16 x 16 MB sizes, respectively. It can be observed that the MSE of the proposed algorithm is lower than that of the TSS, 4SS and UCBDS algorithms and higher than that of the FS and SEA algorithms, except for the 'Miss America' clip. It is observed that 2LS performs better when there is large motion in the video clip.

Table II: Average MSE of the first 80 frames (8 x 8 block)

Algorithm   Miss America   Flower Garden   Tennis
FS          18.37          2330.99         1331.30
SEA         18.37          2330.99         1331.30
2LS         26.81          2544.18         1345.60
TSS         24.90          2685.68         1439.86
4SS         28.08          2840.31         1489.99
UCBDS       28.30          2887.68         1498.06

These results are expected since, as mentioned earlier, the 2LS algorithm is proposed to reduce the



computational complexity without much degradation in the quality of the video. The FS and SEA algorithms give better video quality at the cost of computational complexity, while the TSS, 4SS and UCBDS algorithms reduce the computational complexity at the cost of video quality. It is important to note that the proposed algorithm, 2LS, reduces the computations to a large extent, well below that of the optimal FS algorithm. Though the MSE of the 2LS algorithm is higher than that of FS, its performance is much better than that of the sub-optimal algorithms (TSS, 4SS and UCBDS). From these tables it is concluded that the 2LS algorithm reduces the computational complexity without much degradation in the quality of the video.

Table III: Average MSE of the first 80 frames (16 x 16 block)

Algorithm   Miss America   Flower Garden   Tennis
FS          18.99          2419.21         1286.91
SEA         18.99          2419.21         1286.91
2LS         51.38          3241.69         1589.89
TSS         35.25          3372.11         1618.63
4SS         36.43          3477.62         1660.13
UCBDS       36.05          3502.61         1660.63

Another criterion used for comparison is the PSNR. Fig. 3, Fig. 4 and Fig. 5 compare the PSNR of the different search algorithms for the various clips. From these figures it is clear that the proposed 2LS algorithm performs as well as FS and SEA and much better than TSS. The 2LS also has the advantage of a very compact search pattern, which allows a reduction of up to about 97% in the number of search points (for q = 5-10) relative to FS.

Fig. 3 PSNR comparison of 2LS with FS, SEA & TSS algorithms for MISS AMERICA sequence.

4. CONCLUSION

A new fast algorithm, 2LS, has been presented in this paper for motion estimation. Simulation results show that 2LS achieves better estimation accuracy than TSS, 4SS and UCBDS with much lower computational complexity. In addition, 2LS is more robust than these algorithms, and its performance is consistent even for image sequences that contain complex movements such as camera zooming and fast motion. The simulation results demonstrate that the proposed algorithm, 2LS, is very suitable for applications that require both very low bit rates and good coding quality.

Fig. 4 PSNR comparison of 2LS with FS, SEA & TSS algorithms for Tennis sequence.

Fig. 5 PSNR comparison of 2LS with FS, SEA & TSS algorithms for Flower Garden sequence.


REFERENCES

[1] W. Li and E. Salari, "Successive elimination algorithm for motion estimation," IEEE Transactions on Image Processing, vol. 4, no. 1, Jan. 1995.

[2] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," in Proc. National Telecommunications Conf., Nov. 29-Dec. 3, 1981, pp. G.5.3.1-G.5.3.5.

[3] R. X. Li, B. Zeng, and M. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 438-442, Aug. 1994.

[4] Lai-Man Po and Wing-Chung Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, June 1996.

[5] Kuan-Tsang Wang, "Efficient four-step search," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS '98), vol. 4, May 31-June 3, 1998.

[6] Jo Yew Tham, Surendra Ranganath, Maitreya Ranganath, and Ashraf Ali Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, Aug. 1998.

[7] M. Ghanbari, "The cross-search algorithm for motion estimation," IEEE Trans. Commun., vol. 38, no. 7, pp. 950-953, July 1990.

[8] Jong-Nam Kim, Sung-Cheal Byun, Yong-Hoon Kim, and Byung-Ha Ahn, "Fast full search motion estimation algorithm using early detection of impossible candidate vectors," IEEE Transactions on Signal Processing, vol. 50, no. 9, Sept. 2002.

[9] Antonio Chimienti, Claudia Ferraris, and Danilo Pau, "A complexity-bounded motion estimation algorithm," IEEE Transactions on Image Processing, vol. 11, no. 4, Apr. 2002.

[10] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors," IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp. 148-157, Apr. 1993.

[11] L. W. Lee, J. F. Wang, J. Y. Lee, and J. D. Shie, "Dynamic search-window adjustment and interlaced search for block-matching algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp. 85-87, Feb. 1993.

[12] R. Srinivasan and K. R. Rao, "Predictive coding based on efficient motion estimation," IEEE Trans. Commun., vol. 33, pp. 1011-1015, Sept. 1985.

[13] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., vol. COM-29, pp. 1799-1808, Dec. 1981.

[14] Hanan, Suneer, Mohsen, and Magdy Bayoumi, "A new efficient block-matching algorithm for motion estimation," Journal of VLSI Signal Processing Systems, vol. 42, no. 1, Jan. 2006.

[15] Yu Liu and King Ngi Ngan, "Fast multiresolution motion estimation algorithms for wavelet-based scalable video coding," Signal Processing: Image Communication, vol. 22, Mar. 2007.


Speech & Audio Processing


FOUR-WAY CLASSIFICATION OF PLACE OF ARTICULATION OF MARATHI UNVOICED STOPS FROM BURST SPECTRA

Veena Karjigi, Preeti Rao

Department of Electrical Engineering, Indian Institute of Technology, Bombay, India

{veena, prao}@ee.iitb.ac.in

ABSTRACT

Acoustic features computed from the release burst spectrum are evaluated for the classification of Marathi unvoiced and unaspirated stops characterized by four places of articulation. The burst onset spectra are found to provide significant information on place of articulation, as determined by feature evaluation measures and classification experiments on a database of Marathi word-initial stops. Classification accuracies are compared with those of cepstral coefficients computed from the same analysis data.

Index Terms — acoustic features, unvoiced stops, four-way place of articulation

1. INTRODUCTION

Close but contrasting phonemes in a language can be distinguished from each other in terms of linguistically motivated distinctive features, characterized by corresponding acoustic correlates in the speech signal. Present day speech recognizers typically employ raw spectral information in the form of cepstral coefficients. Acoustic-phonetic features, being more directly related to underlying distinctive articulatory properties, are expected to be more robust to variations at the phone level due to context, regional accent or vocal tract differences due to gender or age.

In this work, acoustic attributes for unvoiced, unaspirated stops of Marathi which differ in their place of articulation (PoA) are investigated. Marathi is the official language of the Indian state of Maharashtra, with roughly 70 million native speakers. It distinguishes four places of articulation for stop consonants, in contrast to the three used in English. The four places of articulation are labial [p], dental [t̪], retroflex [ʈ] and velar [k], each of which can occur as unvoiced-unaspirated, unvoiced-aspirated, voiced-unaspirated and voiced-aspirated [1]. The dental consonant is produced by making a constriction of the vocal tract with the tongue blade, immediately behind the upper front teeth. The retroflex consonant is produced by curling the tip of the tongue upwards towards the hard palate to make a constriction behind the alveolar arch. Phonologically, dental and retroflex stops are clubbed in the same class (coronal) and distinguished from labials as well as velars [2]. The retroflex and dental stop consonants would typically both be categorized as alveolar [t] by an English listener. Considering their contrastive role in Marathi (as also in some other Indian languages), it is of interest to investigate the acoustic correlates of these different PoA.

Finding acoustic attributes for the classification of the stops based on PoA has been a subject of active research. The spectral shape of the release burst and the formant transitions in adjacent vowels have been widely investigated. Burst shape (and amplitude) attributes are found to be relatively invariant to vowel context [3]. Early work [4] using burst spectra showed that labials and alveolars had diffuse spectra as compared with the peaked spectra of velars. The labials are further distinguished from alveolars by the spectral location of the major energy concentration. Zue [5] computed burst spectra by linear prediction (LP) using the first 10 to 15 ms of the waveform after the burst release. These LP spectra were further smoothed, and the location of the biggest peak in the resultant spectrum was measured as the burst frequency. The labials showed large variations in the burst frequency and did not exhibit prominent peaks. Blumstein and Stevens [6] derived a 14th-order LP spectrum by placing a modified raised-cosine window of length 26 ms at the burst onset. They proposed three templates, diffuse flat or falling for labials, diffuse rising for alveolars and compact for velars, using which a high classification accuracy was obtained on word-initial stops. Several experiments have been reported on the importance of the first 10 ms of the waveform after the burst onset [5, 6]. Suchato [7] used average power spectra for measuring 15 acoustic attributes of American English stops, of which 3 were extracted purely from the burst spectrum. There has been a very limited amount of work on extending the above acoustic attributes to the classification of stops in Indian languages, several of which have more than three PoA in their inventory. Lahiri et al. [8] investigated the known acoustic attributes for the labial, dental and alveolar stops of Malayalam. It was observed that the dental stops could not be reliably separated from labials on the basis of burst spectra alone. Features based on the change in spectral energy concentration from burst onset to


voicing onset were found to perform better in terms of grouping the Malayalam dentals with alveolars.

The present work is restricted to acoustic-phonetic features extracted from the release burst segment for PoA classification of Marathi unvoiced and unaspirated stops. Acoustic features related to the release burst proposed in the literature for English unvoiced plosives are tested on a database of Marathi plosives and improvements are proposed. The acoustic features are compared with more general raw spectrum features (such as the widely used MFCC) computed on the same data via stop classification experiments.

2. DATABASE AND ANALYSIS

2.1. Database

Marathi words with one of the four stops {p, t̪, ʈ, k} in the word-initial position followed by one of the eight vowels and two diphthongs of the language were used for the analysis. Two distinct words for each stop-vowel combination were chosen from the dictionary to obtain 80 words. The words were each embedded in two different carrier phrases (one statement and one question). Five male and five female speakers of standard Marathi [1] were selected for the study. This led to a data set of 80 x 10 x 2 = 1600 tokens (or 400 per stop consonant), recorded at a sampling rate of 16 kHz in quiet conditions. The time locations of the release burst and the voicing onset were manually labeled. The burst onset was marked as the time instant after the closure silence at which a rapid change in the waveform amplitude sets in. The first negative-to-positive-going zero crossing in the first cycle of the periodic waveform following the burst was labeled as the voicing onset.

2.2. Acoustic Analysis

Since the goal is to extract acoustic-phonetic features from the release burst spectrum of the unvoiced plosive, the analysis data is restricted to the region between the labeled burst and voicing onsets. This duration is known as the voicing onset time (VOT) of the unvoiced stop. A statistical study of the VOT measured across the Marathi word data set is summarized in Table 1. The observations are consistent with articulatory properties [9]. Retroflexes exhibit the lowest VOT due to the relatively fast movement of the active articulator involved (the tongue tip), which offsets the effect of the more posterior PoA. Further, although the place of constriction for the dental is posterior to that of the labial, their VOTs are comparable. A possible explanation is the occasional presence of aspiration observed in the word-initial [p]. In Marathi, the aspirated [p] has linguistically evolved to be replaced by [f], due to which there is a spread in the allophonic varieties of [p].

Table 1. VOT mean and std. dev. for the 4 PoA

Place of articulation   Mean VOT (ms)   Std. dev. (ms)
Labial                  17.0            7.7
Dental                  15.2            6.1
Retroflex               9.9             4.0
Velar                   28.5            10.6

Based on the observations, the data extracted for burst

spectrum analysis was limited to either a fixed 10 ms or the VOT, whichever was lower.

2.2.1. Computation of average power spectrum

A smooth power spectrum was obtained, following the method of [10, 7], by averaging the power spectra of a series of windowed data segments, each of duration 6.4 ms. The Hanning data window was shifted every 1 ms, starting from a center value of 7.5 ms before the burst onset to 7.5 ms after the burst onset. If the VOT was found to be less than 7.5 ms, the last window was centered at 3.2 ms before the voicing onset so as not to encroach on the voiced region. Thus the maximum number of spectra averaged was 16. The time averaging of power spectra serves to compensate for possible errors in the manual labeling of the burst onset.
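A minimal sketch of this time-averaged spectrum is given below, assuming a 16 kHz signal x, a burst-onset sample index and the VOT in samples. The window and hop values follow the text; the FFT length and the helper name are choices made here, not taken from the paper.

import numpy as np

def average_power_spectrum(x, burst, vot_samples, fs=16000, n_fft=512):
    win_len = int(round(0.0064 * fs))         # 6.4 ms Hanning window
    hop = int(round(0.001 * fs))              # shifted every 1 ms
    win = np.hanning(win_len)
    first = burst - int(round(0.0075 * fs))   # first centre: 7.5 ms before the onset
    last = burst + int(round(0.0075 * fs))    # last centre: 7.5 ms after the onset
    if vot_samples < int(round(0.0075 * fs)): # short VOT: stop 3.2 ms before voicing
        last = burst + vot_samples - int(round(0.0032 * fs))
    spectra = []
    for centre in range(first, last + 1, hop):
        start = centre - win_len // 2
        if start < 0 or start + win_len > len(x):
            continue                          # skip windows that run off the signal
        seg = x[start:start + win_len]
        spectra.append(np.abs(np.fft.rfft(seg * win, n_fft)) ** 2)
    return np.mean(spectra, axis=0)           # smooth, time-averaged power spectrum

The corresponding frequency axis is np.fft.rfftfreq(n_fft, 1.0 / fs), which is what the band-energy features of Sec. 3 operate on.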

2.2.2. Burst spectrum characteristics

Stops in Marathi contrast in four PoA as opposed to the three of English. Hence it is important to characterize differences in the spectra obtained from our database with that of the stops in English as noted in the literature. Figure 1 shows typical average power spectra of the four stops from the data of a female speaker. Similar to [4,6], we find Marathi labials showing diffuse falling spectra but with a higher roll-off in the low frequency region (0-750 Hz) compared to the rest of the frequency band. The velars show a compact peak near F2 of the following vowel.

However, the burst spectral characteristics of the English alveolar [t] (diffuse, rising) do not completely describe the observed spectra of the two Marathi coronals. While the three coronal stops share the diffuseness characteristic [11], it is seen that the dental [t̪] has a diffuse flat spectrum, and the retroflex [ʈ] exhibits a slightly more compact and high-energy spectrum up to 4 kHz with an abrupt decrease in energy beyond that. The observed spectral characteristics are generally consistent with articulatory properties corresponding to the resonances of the vocal tract volume downstream from the place of constriction. Labials do not exhibit clearly defined peaks in the spectrum because of the absence of the anterior cavity, while velars exhibit a clear low frequency peak due to


the long anterior cavity [12]. The lower frequency concentration of retroflex stops relative to dentals is explained by the longer anterior cavity for retroflex articulation. The increased posterior cavity for labials, dentals and retroflexes increases acoustic losses, thereby giving rise to diffuse peaks. That an abrupt decrease in energy with frequency distinguishes apicals (such as the retroflex) from laminals (such as the dental) has been noted previously [11].

Figure 1: Average power spectra of the four stop bursts ([p], [t̪], [ʈ], [k]; magnitude in dB versus frequency, 0-8 kHz) from a female speaker.

3. ACOUSTIC FEATURES AND EVALUATION

We see above that the gross shape of the burst spectrum and the locations of the spectral prominences distinguish the average power spectra of the four places of articulation. Acoustic features that capture these characteristics could be effective in the automatic recognition of the unvoiced plosives. We measure the effectiveness of the individual features with respect to the classification problem using the information-theoretic mutual information (MI) [13]. The MI has been used for feature selection in an HMM-based phone recognizer and shown to correlate well with recognition scores for the ranking of features [14].

We start with acoustic features previously proposed for three-way classification of English stops and evaluate these for the three-way classification of the Marathi stops with the two coronal stops clubbed in one class to be distinguished from labial and velar. Next, we propose modifications to the feature definitions considering the observed burst spectrum characteristics of the Marathi plosives discussed in Sec. 2.2.2, and evaluate their effectiveness by the same measures.
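For reference, a simple histogram-based estimate of the MI between one scalar feature and the PoA class label can be sketched as below. The bin count and the mapping of these estimates to the "Normalized MI" values of Table 2 are assumptions; the authors' exact estimator may differ.

import numpy as np

def mutual_information(feature, labels, n_bins=20):
    """Histogram estimate (in nats) of the MI between a scalar feature and the class label."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f_bin = np.clip(np.digitize(feature, edges) - 1, 0, n_bins - 1)
    classes, c_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    for fb, cb in zip(f_bin, c_idx):
        joint[fb, cb] += 1                      # joint counts over (feature bin, class)
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)       # marginal over feature bins
    pc = joint.sum(axis=0, keepdims=True)       # marginal over classes
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pf @ pc)[nz])))

Normalizing the resulting values (for example, by the largest MI in the feature set) yields a ranking comparable to the one reported later; the normalization used in the paper is not spelled out, so that step is left open here.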

Suchato [7, 3] proposed several acoustic attributes relating to the burst spectrum and formant transitions for stop consonant place of articulation classification. Of these, three attributes relate purely to the burst spectrum shape and

are computed from the average power spectrum described in Sec. 2.2.1. These three acoustic features and the associated parameters are detailed below.

1. Energy difference:

   Ediff = 10 log10 (E1 / E2)      (1)

where E1 = E[3500:8000] (the energy of the burst spectrum in the range 3500-8000 Hz) and E2 = E[1250:3000]. Ediff was shown to achieve reasonable separation of the three PoA of English stops [7].

2. Amplitude difference:

   Adiff = 20 log10 (A1 / A2)      (2)

where A1 is the amplitude of the biggest peak of the burst spectrum in the range 3500-8000 Hz and A2 is the average peak amplitude of the burst spectrum in the range 1250-3000 Hz. Adiff was shown to separate alveolar stops from English labials and velars [7].

3. Center of gravity in frequency ("cgFa"): computed from the average power spectrum obtained from the data between the burst and voicing onsets, over the frequency region 0-8 kHz.
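These attributes can be computed from the averaged power spectrum of Sec. 2.2.1 roughly as follows. This is a sketch under the band limits of Eqs. (1)-(2); avg_spec and freqs are assumed arrays (spectrum values and their frequencies in Hz), and the peak-picking details are a choice made here rather than the authors' exact procedure.

import numpy as np
from scipy.signal import find_peaks

def band_energy(avg_spec, freqs, lo, hi):
    sel = (freqs >= lo) & (freqs <= hi)
    return np.sum(avg_spec[sel])

def e_diff(avg_spec, freqs):
    # Eq. (1): Ediff = 10 log10( E[3500:8000] / E[1250:3000] )
    return 10 * np.log10(band_energy(avg_spec, freqs, 3500, 8000) /
                         band_energy(avg_spec, freqs, 1250, 3000))

def a_diff(avg_spec, freqs):
    # Eq. (2): biggest peak in 3500-8000 Hz against the average peak level in 1250-3000 Hz;
    # the (linear) averaged spectrum is used as the "amplitude" here.
    hi_sel = (freqs >= 3500) & (freqs <= 8000)
    lo_sel = (freqs >= 1250) & (freqs <= 3000)
    a1 = np.max(avg_spec[hi_sel])
    peaks, _ = find_peaks(avg_spec[lo_sel])
    a2 = np.mean(avg_spec[lo_sel][peaks]) if len(peaks) else np.mean(avg_spec[lo_sel])
    return 20 * np.log10(a1 / a2)

def cg_fa(avg_spec, freqs, f_max=8000):
    # Spectral centre of gravity over 0 - f_max Hz
    sel = freqs <= f_max
    return np.sum(freqs[sel] * avg_spec[sel]) / np.sum(avg_spec[sel])

The modified features of Sec. 3.1-3.3 (Eml, Ahl, Ehm, Ehl, cgB1-cgB4) follow the same pattern with different band limits.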

The above features were tested on the Marathi unvoiced stops data for three-way classification (labial, dental+retroflex, velar). Table 2 shows the MI for the above mentioned three features. We see that while the cgFa shows reasonable discrimination ability for the three-way classification, the remaining two features, Ediff and Adiff

perform poorly on the Marathi data. (The quantitative measures are supported by visual inspection of the probability distributions of the feature values, which overlap significantly across the classes.) Based on the study of the Marathi consonant burst spectra in Sec. 2.2.2, the parameter values of the energy and amplitude ratio attributes of Eqs. (1) and (2) were modified as described next. Further, new features were explored for the two-way classification of the coronal stops [ʈ] and [t̪]. Later, a few features are added to account for the four-way distinction.

3.1. Features for three-way classification

Ediff is modified for the three-way classification to take into account the steep spectral roll-off of labials in the low frequency region, which distinguishes them from the relatively flat spectral shape of retroflexes and dentals. The modified feature (Eml) is given by Eq. (1) with E1 = E[750:2500] and E2 = E[0:750]. Next, Adiff was modified based on the observation that burst spectra become increasingly diffuse proceeding from velar toward labial PoA. Labials exhibit insignificant peaks beyond 500 Hz, whereas retroflexes and dentals show larger spectral peaks and velars exhibit relatively high and narrow peaks. The amplitude ratio feature was thus modified to Ahl, given by Eq. (2), where A1 is the amplitude of the biggest peak in the region 500-7500 Hz and A2 is the average amplitude of the burst spectrum in the region 0-


500 Hz. Drawing on the same spectral characteristic, a new feature (Slf) is defined as the spectral slope obtained by fitting a straight line to the burst spectrum in the region 0-1.5 kHz using linear regression. All three of the above features are expected to take positive values for velars, negative values for retroflexes and dentals, and large negative values for labials. Further, cgFa was recomputed from the average power spectrum of Sec. 2.2.1 over the modified frequency range of 0-7 kHz to account for the microphone characteristics.
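The slope features (Slf here, and Smf of Sec. 3.2 over 2-6 kHz) amount to a least-squares line fit to the spectrum level; a small sketch follows, where the conversion of the averaged power spectrum to dB and the dB-per-kHz units are assumptions.

import numpy as np

def spectral_slope(avg_spec, freqs, lo=0.0, hi=1500.0):
    """Slope (dB per kHz) of a straight line fitted to the burst spectrum in [lo, hi] Hz."""
    sel = (freqs >= lo) & (freqs <= hi)
    level_db = 10 * np.log10(avg_spec[sel])              # power spectrum in dB (assumption)
    slope, _intercept = np.polyfit(freqs[sel] / 1000.0, level_db, 1)
    return slope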

The feature evaluation measures obtained on the Marathi database for the four new features appear in Table 2. We see that the modified energy and amplitude ratio features improve significantly upon the features of Suchato [7].

3.2. Features for two-way classification of coronals

From the discussion of Sec. 2.2.2 comparing the burst spectra of the two coronal stops, we note that [ʈ] is characterized by an abrupt fall in spectral energy, while [t̪] exhibits a more gradual decrease in energy across frequency. To capture this distinction, an energy ratio (Ehm) is defined as in Eq. (1) in the two frequency bands E1 = E[5000:7000] and E2 = E[1500:3500]. The energy variation with frequency is also captured by the spectral slope in the region 2-6 kHz: a slope feature (Smf) is defined over this frequency region, derived in the same way as Slf. Both Ehm and Smf are expected to be highly negative for [ʈ] and less so for [t̪].

3.3. Additional features

To increase the separation between labials and dentals, a new feature Ehl is defined as in Eq. (1), with E1 = E[5000:7000] and E2 = E[0:750]. Because of the prominent energy in the low frequency region for labials and in the high frequency region for dentals, Ehl is expected to show large negative values for labials and larger positive values for dentals. In addition, it showed a relatively high four-way distinction.

Further, spectral prominences in each of the four sub bands (0-750, 750-2500, 2500-5000 Hz and 5000-7000 Hz) were computed (similar to cgFa) and named as cgB1, cgB2, cgB3 and cgB4. Features are ranked using a greedy

algorithm [13]. The 11 features in the order of ranking are: Ahl, Smf, cgB1, cgB4, Ehl, cgB3, cgB2, Ehm, Slf, cgFa and Eml.

4. CLASSIFICATION EXPERIMENTS

Based on the feature evaluation of Sec. 3, two feature vectors were formed: one with the 11 acoustic features and the other with only the 8 best features. These two feature vectors are tested in a GMM classifier framework for the four-way classification of PoA. A diagonal-covariance GMM classifier was trained using the EM algorithm with 1, 3, 5 and 8 mixtures per class. Also tested in the same framework were MFCC vectors of two different dimensions (the first 8 and the first 13 coefficients), with the MFCCs obtained from 20 ms Hamming-windowed data centered at the burst onset (i.e. extending 10 ms beyond the burst onset). Two different classification tasks are defined. (a) Task 1: The training set comprised the tokens (800) of three male and two female speakers and the testing set comprised the tokens (800) of the remaining two male and three female speakers, and vice-versa; hence there were 2 training-testing sets in the round-robin (1600 test items). Classification results in % accuracy (percentage of the 1600 test tokens identified correctly) are given in Table 3.

                     No. of GMM mixtures
Feature set          1        3        5        8
8 AP features        78.63    77.44    78.88    77.69
11 AP features       78.63    77.50    77.38    77.00
First 8 MFCCs        70.81    74.94    76.75    74.81
First 13 MFCCs       71.50    74.25    75.25    77.81

Table 3. Classification results: Trained and tested with different speaker sets, each including males and females

(b) Task 2: The training set comprised the tokens (800) of the female speakers and the testing set comprised the tokens (800) of the male speakers, and vice-versa; hence there were 2 training-testing sets in the round-robin (1600 test items). Classification results in % accuracy are given in Table 4.

                     No. of GMM mixtures
Feature set          1        3        5        8
8 AP features        77.38    77.94    78.00    77.13
11 AP features       78.13    76.06    76.19    76.00
First 8 MFCCs        69.13    71.94    74.13    74.31
First 13 MFCCs       67.75    71.50    71.13    72.06

Table 4. Classification results: Trained with male and tested with female speakers and vice-versa

Table 2. Feature evaluation for the three-way classification

Suchato’s features   Normalized MI      Modified features   Normalized MI
Ediff                0.1913             Eml                 0.8777
Adiff                0.2675             Ahl                 1
cgFa                 0.5430             cgFa                0.7283
                                        Slf                 0.6101
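A per-class, diagonal-covariance GMM classifier of the kind used in these experiments can be sketched with scikit-learn as below. This is an illustration under assumed array shapes, not the authors' code; accuracy is then simply the fraction of the 1600 round-robin test tokens classified correctly.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(X_train, y_train, n_mix=5):
    """Train one diagonal-covariance GMM per PoA class with the EM algorithm."""
    gmms = {}
    for c in np.unique(y_train):
        g = GaussianMixture(n_components=n_mix, covariance_type='diag', max_iter=200)
        gmms[c] = g.fit(X_train[y_train == c])
    return gmms

def classify(gmms, X_test):
    """Assign each test vector to the class whose GMM gives the highest log-likelihood."""
    classes = sorted(gmms)
    scores = np.column_stack([gmms[c].score_samples(X_test) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]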


5. DISCUSSION

Burst onset spectra are found to provide significant information on place of articulation, as demonstrated by acoustic feature evaluation and classification results on Marathi unvoiced stops. While previously proposed articulatory-acoustic features for the classification of English unvoiced plosives were found inadequate for the three-way separation of Marathi stops, suitably modified features performed significantly better. The proposed set of 11 as well as the best 8 AP features, derived from a study of burst spectrum characteristics across the four PoA of Marathi stops, compare favorably in classification accuracy with the 13-MFCC and 8-MFCC vectors extracted from frames aligned with the burst onset. In the speaker-independent classification task, both the AP feature sets obtain a maximum classification accuracy similar to that of the 13-MFCC and 8-MFCC vectors. Moving to the cross-gender classification, the performance of the AP features decreases only slightly compared with the steep reduction in accuracy recorded with the MFCCs. The lower dimension MFCC vector fares slightly better than the full 13-MFCC in the cross-gender task, indicating that the essential (PoA-specific) shape of the burst spectrum is captured by the 8-MFCC, and the finer spectral detail in the 13-MFCC may reduce robustness to irrelevant variations. In summary, the results of the present work provide support for the notion that acoustic features have the potential to capture essential phonetic distinctions in a robust manner. Wider testing conditions, including variations in dialect, speaking rate and recording conditions, would be useful to further validate this. The proposed acoustic feature set has not been systematically optimized for parameter settings (e.g. frequency ranges). A more efficient set of features could result from the fine-tuning of the individual features combined with feature selection to reduce redundancy. Finally, the present work was restricted to the release burst spectral shape of the unvoiced stop. Important acoustic cues to PoA are known to lie in the transition and voicing onset regions. Future work will be directed towards improving classification accuracy by extending the analysis data to include more of the speech waveform for the place detection of Marathi unvoiced stops.

6. ACKNOWLEDGEMENT

The authors wish to acknowledge useful discussions with K. Samudravijaya.

7. REFERENCES

[1] “Marathi language”,

http://en.wikipedia.org/wiki/Marathi_language

[2] Ladefoged P. and Maddieson I., The Sounds of the World’s Languages, Blackwell, 1996.

[3] Suchato A. and Punyabukkana P., “Factors in classification of stop place of articulation”, Proc. ICSLP, pp. 2969-2972, Sep. 2005.

[4] Halle M., Hughes G.W. and Radley J.P.A., “Acoustic properties of stop consonants”, J. Acoust. Soc Am., vol. 29, no. 1, pp.107-116, Jan. 1957.

[5] Zue V.W., “Acoustic characteristics of stop consonants: A controlled study”, Sc.D. Thesis, MIT, May, 1976.

[6] Blumstein S.E. and Stevens K.N., “Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants”, J. Acoust. Soc. Am., vol. 66, pp. 1001- 1017, Oct., 1979.

[7] Suchato A., “Classification of stop place of articulation”, Ph.D. Thesis, MIT, Jun., 2004.

[8] Lahiri A., Gewirth L. and Blumstein S.E., “A reconsideration of acoustic invariance for place of articulation in diffuse stop consonants: Evidence from a cross-language study”, J. Acoust. Soc. Am., vol. 76, pp. 391-404, Aug., 1984.

[9] Cho T. and Ladefoged P., “Variation and universals in VOT: evidence from 18 languages”, J. Phonetics., vol. 27, pp. 207-229, Jul., 1999.

[10] Stevens K.N., Manuel S.Y. and Matthies M., “Revisiting place of articulation measures for stop consonants: Implications for models of consonant production”, Proc. ICPhS, pp. 1117-1120, Aug., 1999.

[11] Hamann S.R., “The phonetics and phonology of retroflexes”, Ph.D. Thesis, Netherlands Graduate School of Linguistics, Jun., 2003.

[12] Stevens K.N., Acoustic Phonetics, MIT Press, 2000.

[13] Battiti R., “Using mutual information for selecting features in supervised neural net learning”, IEEE Trans. on Neural Networks, vol. 5, no. 4, pp. 537-550, Jul. 1994.

[14] Omar M.K., Chen K., Hasegawa-Johnson M. and Brandman Y., “An evaluation of mutual information for selection of acoustic features of phonemes for speech recognition”, Proc. ICSLP, pp. 2129-2132, Sep., 2002.


EXPLICIT SEGMENTATION OF SPEECH SIGNALS USING BACH FILTER-BANKS

Ranjani H G, Ananthakrishnan G, A G Ramakrishnan

Medical Intelligence and Language Engineering Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore, INDIA - 560 012
{ranjani,ramkiag}@ee.iisc.ernet.in, [email protected]

ABSTRACT

For synthesizing high quality speech, a concatenative Text-To-Speech system requires a large number of well-annotated segments at the phone level. Manual segmentation, though reliable, is tedious, time consuming and can be inconsistent. This correspondence presents an automated phone segmentation algorithm that force-aligns the phonetic transcriptions with the utterances of the corresponding Indian-language sentences. The algorithm uses the distance function obtained from the output of the recently proposed Bach scale filter bank and the statistical knowledge of the lengths of the phones to force-align the boundaries between successive stop consonants. Preliminary results for the Hindi database show that 85.2% of the boundaries detected by the algorithm are well within 20 ms of the manually segmented boundaries. The misclassified frames (20 ms) per sentence, or the Frame Error Rate, is 20.4%.

1. INTRODUCTION

Accurate time markers indicating the beginning and ending times of a speech sound (phone) in a spoken sentence are crucial for building a high quality Text to Speech (TTS) system. Segmentation is the process of getting these time markers. Such segmented and labeled phones are used to create a new unit inventory meant for concatenative speech synthesis and also for prosody modeling. The quality of segmentation is critical because an error in either segmentation or labeling can give rise to an audible error in the synthesized speech.

Manual segmentation is a conventional technique for segmenting speech. However, it turns out to be monotonous, time consuming and at times inconsistent. To circumvent these drawbacks, it becomes necessary to automate the process of segmentation.

To develop a concatenative TTS system for any Indian language, a speech corpus is created by recording, from a single speaker, the utterances of a large number of sentences covering various acoustic and phonetic contexts. The phonetic transcription is obtained by mapping the graphemes to the corresponding phonemes using a

grapheme to phoneme (G2P) converter. Thus, the phonetic labels of segmented speech are obtained from the phonetic transcription. The task therefore is to align these phonetic transcriptions to the actual boundaries in the corresponding speech utterances. This is an explicit segmentation problem, which differs from implicit automated segmentation where there is no a priori knowledge of the phonetic transcription, thus potentially increasing the number of “inserted” and “deleted” boundaries.

As far as work on automated explicit segmentation is concerned, Abhinav et al. [1] proposed refining context-dependent phone-based HMMs (CDHMMs), giving good boundary accuracy. Neural network trees with a known number of sub-word units have also been used for segmentation [2]. However, the need to develop TTS in multiple Indian languages and the non-availability of large speech corpora for Indian languages are the major constraints, which limit the use of these training-based segmentation techniques.

A recent segmentation work using the Bach scale filter bank has the advantage of being language independent and training free [3], [4]. We have extended the ideas of this work for our explicit segmentation algorithm.

2. SEGMENTATION USING BACH FILTER BANK

In this method, the speech signal is treated as non-stationary. A constant-Q filter bank is formulated, motivated by the perception of music. This bank has 12 filters in every octave, wherein the centers of successive filters are separated by a ratio of 2^(1/12).

The speech signal, sampled at 16 kHz, is passed through this filter bank, and the set of outputs of the bank at any instant of time is treated as the feature vector. Here, speech is not presented to the filter bank as short segments (frames), as in the usual quasi-stationary framework. Rather, we get feature vectors for every instant of time. Now, the mean of the log of the feature vectors in each 15 ms window is taken and the Euclidean distance between successive means is calculated. Seen as a 2-class problem, the distance between the means should peak if the feature vectors in the adjacent windows belong to different phoneme classes. The


distance measure used is referred to as the Euclidean Distance between Mean Log (EDML) feature vectors.
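Given the filter-bank outputs as a matrix F with one feature vector per sample instant, the EDML contour can be sketched as follows. The Bach-scale filter bank itself is assumed to be available; the hop of the 15 ms comparison windows and the small flooring constant inside the log are choices made here, not taken from the paper.

import numpy as np

def edml_contour(F, fs=16000, win_ms=15, hop=16):
    """Euclidean distance between mean-log feature vectors of adjacent 15 ms windows."""
    win = int(round(win_ms * 1e-3 * fs))
    logF = np.log(F + 1e-12)                       # log of the (positive) filter outputs
    starts = range(0, F.shape[0] - 2 * win + 1, hop)
    edml = [np.linalg.norm(logF[s:s + win].mean(axis=0) -
                           logF[s + win:s + 2 * win].mean(axis=0))
            for s in starts]
    return np.asarray(edml)                        # peaks suggest phone boundaries

Peaks of this contour are the boundary candidates used in the explicit segmentation of Sec. 3.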

Figure 1 displays the plot of the values of the feature EDML as a function of time for part of a Hindi word utterance. We can see that the peaks of this function either coincide with or are close to the manually marked phone boundaries of the uttered word.

Figure 1. The plot of EDML against time for a portion of Hindi utterance (“satypar”). The vertical lines denote the manually segmented phone boundaries

This method gives 86.4% accuracy (an automated boundary within 20 ms of a manual boundary), 21.4% insertions and 3.2% deletions for the Hindi database; 81.9% accuracy, 15.3% deletions and 23.7% insertions for the Tamil database; and 82.5% accuracy, 22.3% deletions and 18.9% insertions for the TIMIT database [5].

3. EXPLICIT SEGMENTATION

The proposed algorithm makes use of the statistical knowledge of the durations of the phones. The major disadvantage of forcing boundary alignments on the entire speech waveform is that the boundary error gets accumulated. To avoid propagating boundary errors from the start of the sentence to its end, we can force boundaries for the phones between two instances of an anchoring phone class at a time. Armed with the phonetic transcription, the first stage of the algorithm detects such a phone class, while remaining training free.

In order to be detected, fricatives, vowels, nasals, diphthongs, nasal vowels and glides need some stored form of features. Also, different phones in each class require different features. However, as described below, stop consonants can be detected without storing any features. Thus, we follow hierarchical segmentation, where the stop consonants in a sentence are first located, and then the phones occurring between successive stop consonants are segmented.

The first frame (10 ms) of any speech sentence is predominantly silence. However, stop consonants can either

be voiced or unvoiced. To remove the low frequency components that are present in the closure region of a voiced stop consonant, the speech signal is high-pass filtered with a Bessel filter with a lower cutoff frequency of 400 Hz (the voice bar of voiced stops extends roughly up to 400 Hz). Now, the MFCCs of all the frames of the filtered speech are calculated. The Euclidean distance is computed between the MFCC of the first frame of the sentence and the MFCC of every other frame. If this distance drops below a threshold value for a minimum of 3 consecutive frames (the minimum duration of a stop consonant is assumed to be roughly 30 ms), then the corresponding region may contain the silence part of a stop consonant, a silence region of speech, or a combination of both. The frame within this region having the minimum distance from the first silence frame is surely a stop consonant (or silence, or both) frame. Preliminary tests on 100 sentences from the Hindi database give a stop consonant detection accuracy of 87% with 20% insertions. The number of regions involving actual silence between words and the closure regions of the stop consonants can be known from the phonetic transcription. Using this, the number of silence regions to be detected can be forced. In this case, the equal error rate (i.e., the number of insertions equals the number of deletions) is 11.3% for 100 sentences in Hindi and 15% for 50 sentences of the TIMIT database. This performance is of the same order as the stop detection accuracy reported in [6], [7]. Figure 2 illustrates the stop consonants (silence regions) detected by the above algorithm in a portion of a Hindi utterance.

Figure 2. Portion of a Hindi speech utterance – “satyapardriRh”. The utterance has a silence region at the start of the sentence and 3 stop consonants (/t/, /p/ and /d/). The vertical lines denote the start of the frame classified as sure stop consonants by the proposed algorithm.
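A sketch of this closure/silence detection is given below: high-pass filtering above the 400 Hz voice bar, a per-frame MFCC distance to the first frame, and a run of at least three 10 ms frames below a threshold. The MFCC routine (librosa here), the filter order and the threshold value are assumptions, not the authors' settings.

import numpy as np
import librosa
from scipy.signal import bessel, sosfilt

def detect_closures(x, fs=16000, frame_ms=10, thresh=25.0, min_frames=3):
    # Remove the low-frequency voice bar of voiced stops (roughly below 400 Hz)
    sos = bessel(4, 400.0, btype='highpass', fs=fs, output='sos')
    xf = sosfilt(sos, x)
    hop = int(frame_ms * 1e-3 * fs)
    mfcc = librosa.feature.mfcc(y=xf, sr=fs, n_mfcc=13,
                                n_fft=2 * hop, hop_length=hop).T   # one row per frame
    dist = np.linalg.norm(mfcc - mfcc[0], axis=1)  # distance to the first (silence) frame
    below = dist < thresh
    regions, start = [], None
    for i, b in enumerate(np.append(below, False)):
        if b and start is None:
            start = i
        elif not b and start is not None:
            if i - start >= min_frames:            # at least ~30 ms below the threshold
                regions.append((start, i))         # candidate closure / silence region
            start = None
    return regions, dist

Within each returned region, np.argmin(dist[start:end]) identifies the frame most confidently labelled as a stop closure or silence, as described in the text.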

The rest of the discussion in this work assumes error-free stop consonant detection and attempts to segment


the individual phones between 2 successive correctly detected stop consonants.

Using the Bach scale filter bank, the EDML function is calculated for the speech signal between every successive pair of stop consonants. By incorporating the knowledge of the regions of stop consonants, the end of the first stop consonant (b1) and the start of the next stop consonant (b2) can be found out using the energy change in the signal. This is illustrated for a portion of Hindi speech waveform in Figure 3.

Figure 3. Identification of the region between the end of one stop consonant and the beginning of the next, in a portion of a Hindi speech waveform for the phone sequence /k/,/a/,/nl/,/a/,/r/,/ph/.

Consider a rectangular window whose width is α times the standard deviation of the duration of the next phone, centered at b1 + x, where x is the mean duration of the next phone. Within this window, the maximum of the EDML function, EDMLmax, is computed and all the peaks greater than β·EDMLmax are detected. If the number of such peaks exceeds n1 or is less than n2 (where n1 > n2), then the best n1 possible peaks within that window are chosen. Again, another rectangular window, of width α times the standard deviation of the duration of the phone after that, is centered at b1 + x + y, where y is the mean duration of that phone. The same peak-finding process is repeated for all the phones between the successive stop consonants. Hence there are between n2 and n1 possible choices for every boundary to be detected between two successive stop consonants.

Figure 4 shows the EDML contour for the speech waveform in Figure 3. Also shown are the peaks detected when the number of candidate peaks per boundary is chosen as 5.

Assuming a Gaussian PDF for the duration of the phones, the probability of transition to the next boundary is found out for each of these possible choices.

Now, the problem can be stated as: Find the best possible boundaries such that the product of the transition probabilities in that path is maximized. Equivalently, the sum of the negative log of the transition probabilities is minimized.

We have employed a graph theoretic approach to the problem, wherein each possible choice for a boundary is a

node and the transition probability is the weight of an edge. This is illustrated in Figure 5.

Figure 4. Contour of EDML values of the speech waveform shown in Figure 3. A window is centered at the point away from b1 by the mean duration of phone /a/. The circles indicate the peaks chosen. It can be seen that some peaks are common choices for both the successive phones.

Now, the best path that minimizes the cost of transition can be found. Since the start and end nodes (b1 and b2) are known, we can use Dijkstra's greedy algorithm. The best choices of nodes obtained from this algorithm are taken as the best possible boundaries between the two stop consonants. Figure 6 shows the best possible boundaries obtained using the above algorithm against the manual boundaries.

Figure 5. Nodes with transition probabilities. The first and last nodes are b1 and b2. Each choice for a boundary is a node, and the edge weights are the negative log of the transition probabilities.
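The selection can be viewed as a shortest path through the layers of candidate peaks, with edge costs equal to the negative log of the Gaussian duration probability. The paper uses Dijkstra's algorithm; because the graph is a layered DAG, the dynamic-programming pass sketched below (with hypothetical names) finds the same minimum-cost path.

import numpy as np

def neg_log_gauss(d, mean, std):
    # Negative log of a Gaussian duration PDF, used as an additive edge cost
    return 0.5 * ((d - mean) / std) ** 2 + np.log(std * np.sqrt(2 * np.pi))

def best_boundaries(b1, b2, layers, means, stds):
    """layers[k]: candidate times for the k-th boundary between the stops at b1 and b2;
    means/stds: duration statistics of the phones between the two stops
    (len(layers) + 1 entries, one per phone)."""
    nodes = [[b1]] + [list(l) for l in layers] + [[b2]]
    cost, back = [np.zeros(1)], []
    for k in range(1, len(nodes)):
        prev_t = np.array(nodes[k - 1], dtype=float)
        cur_t = np.array(nodes[k], dtype=float)
        # Cost of reaching each current node from each previous node
        trans = neg_log_gauss(cur_t[None, :] - prev_t[:, None], means[k - 1], stds[k - 1])
        total = cost[-1][:, None] + trans
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # Backtrack from the single end node b2
    path, j = [], 0
    for k in range(len(back) - 1, -1, -1):
        j = back[k][j]
        path.append(nodes[k][j])
    return path[::-1][1:]        # chosen boundary times between b1 and b2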

The experiments were conducted on the Hindi database using the statistics of phone durations computed from the manually segmented database. The best results were obtained for the parameters α = 8, β = 0.1, n1 = 5 and n2 = 2. Performance on 30 sentences from the Hindi database is 21.4% FER with a frame size of 20 ms and 20.4% FER with


a frame size of 25 ms. The experiments were repeated with statistics obtained from the TIMIT database, and the corresponding FERs for the utterances of a single speaker are 29.5% and 28.4%. The boundary error rate is 14.6% for the Hindi database and 19.4% for the TIMIT database.

Figure 6. Boundaries between stop consonants /k/ and /ph/. The thick vertical lines are the manually marked boundaries and the thin vertical lines are the boundaries identified by the proposed algorithm.

4. CONCLUSION AND FUTURE WORK

The proposed method yields good segmentation provided the statistics of the phone durations are known. A final round of manual intervention is still required; however, this intervention is now less tedious and less time-consuming.

The mean durations of the phones are normalized to the speaker's rate of speech between the two stop consonants. Also, it was found that the frame error rate between the manual segmentations carried out independently by two trained segmenters is around 9%.

Future work can attempt to use the statistics of phone durations of one language to segment speech of another language. Also, a stop-consonant detection method with much higher accuracy needs to be developed.



TEMPORAL AND SPECTRAL PROCESSING FOR ENHANCEMENT OF NOISY SPEECH

P. Krishnamoorthy and S. R. M. Prasanna

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

Email:{pkm,prasanna}@iitg.ernet.in

ABSTRACT
This paper proposes an approach for enhancement of noisy speech by temporal and spectral processing. Temporal processing involves identification and enhancement of high Signal to Noise Ratio (SNR) regions in the temporal domain. Spectral processing involves identification and enhancement of high SNR regions in the spectral domain. The processed speech signal is found to be enhanced significantly compared to the degraded version as well as to the signals obtained by the individual temporal and spectral processing methods.
Index Terms: speech enhancement, temporal processing, spectral processing, temporal and spectral processing.

1. INTRODUCTION

Speech signals collected from uncontrolled environments will have degradation components along with the required speech components. The degradation components include background noise, reverberation and speech from other speakers. A degraded speech signal will give poor performance when used in speech processing systems like speech recognition and speaker recognition. Also, such a degraded speech signal will be uncomfortable for listening [1]. Hence degraded speech signals need to be processed for enhancement of the speech components. This work proposes an approach for enhancing high Signal to Noise Ratio (SNR) regions in the degraded speech signal by processing it in the temporal and spectral domains.

Speech enhancement has been an active research area for many decades [2–8]. As a result, many approaches have been proposed in the literature for processing degraded speech. All these methods may be broadly grouped into two categories, namely, temporal processing methods and spectral processing methods. The temporal processing methods are based on the principle of identification and enhancement of high SNR speech regions in the temporal domain [7]. The main merits of these methods are their effectiveness in enhancing high SNR regions and the fact that they do not require explicit modeling of the degradation present, which is known to be a difficult task. The demerit of this approach may be its ineffectiveness in removing the degrading component. Alternatively, the spectral processing based methods are based on the principle of estimation and removal of the degrading component in the spectral domain [2]. The main merit of these methods is that they are effective in eliminating the degrading component, since it is explicitly estimated. Their demerit is the need for explicit modeling of the degradation component.

As mentioned above, the temporal and spectral processing methods have their own merits and demerits. Therefore, approaches may be developed to combine the two effectively, so that their relative merits lead to new methods that provide better performance than either of them individually. The demerits of each approach may also be minimized, if not eliminated, with the help of the other. For instance, in the case of noisy speech enhancement, initial temporal processing helps not only in temporal enhancement but also in identifying non-speech regions. This knowledge is useful for effectively estimating non-stationary noise for spectral processing. This aspect is experimentally demonstrated in this work.

The basis for the proposed approach is that human beings perceive speech by capturing features present in the high SNR regions of the spectral and temporal domains, and then extrapolating these features into the low SNR regions [7]. The proposed temporal and spectral processing of speech involves the following steps: the degraded speech signal is processed to extract information about the high SNR regions at the gross and fine temporal levels. This information is used to enhance the high SNR temporal regions. The temporally processed speech signal is then subjected to spectral processing, which involves estimation and subtraction of noise and then enhancement of the high SNR spectral peaks.

The rest of the paper is organized as follows: Section 2 describes the basic principles of temporal and spectral processing methods for enhancement of noisy speech. Section 3 proposes a method for enhancement of noisy speech by temporal and spectral processing. The experimental results are discussed in Section 4. The summary of the present work and the scope for future work are given in Section 5.

2. ENHANCEMENT OF NOISY SPEECH

In the case of noisy speech, the degradation will be predominantly background noise [1].


Fig. 1: Gross weight function determination: (a) degraded speech, (b) LP residual, (c) inverse spectral flatness, (d) smoothed inverse spectral flatness and (e) nonlinearly mapped inverse spectral flatness.

The characteristics of the degradation are like those of random noise, and it will be uncorrelated with the speech [2]. Several methods have been proposed for the enhancement of noisy speech [2,3,5,7,8]. Among these, we describe the following methods, which are selected to develop the proposed temporal and spectral processing method.

2.1. Temporal Processing of Noisy Speech

The methods for temporal processing of speech mainly involve identification and enhancement of high SNR regions in the speech signal [7, 9]. We briefly describe the method proposed in [7] for the enhancement of noisy speech. In this method, enhancement is achieved through the following three steps: (i) identification and enhancement of high SNR regions at the gross level, (ii) identification and enhancement of high SNR regions at the fine level and (iii) enhancement of spectral peaks over valleys. The gross level identification of high SNR regions is done using the inverse spectral flatness parameter computed from the noisy speech signal. This parameter is the ratio of the speech energy to the Linear Prediction (LP) residual energy; it will be equal to unity in low SNR regions and will have significantly high values in high SNR regions [7]. In the present work we also use this parameter for identifying high SNR regions. The fine level identification of high SNR regions in [7] is based on the Frobenius norm of the Toeplitz matrix constructed from the noisy speech signal, which has the advantage of exploiting the envelope information in the noisy speech waveform [7]. Alternatively, in this work we use the Hilbert envelope of the LP residual proposed in [9] to identify high SNR regions at the fine level. The identification and enhancement of spectral peaks proposed in [7] is not performed, since it is taken care of by the spectral processing of speech described next.
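As an illustration, the gross-level cue can be computed as sketched below: an autocorrelation-method LP residual followed by the ratio of speech energy to residual energy over non-overlapping 2 ms frames. The frame sizes and LP order follow the values quoted later in the paper; the function names and the whole-signal (rather than frame-wise) LP analysis are simplifications of ours.

import numpy as np
from scipy.linalg import solve_toeplitz

def lp_residual(x, order=10):
    # Autocorrelation-method LPC over the whole signal (frame-wise in practice).
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])       # predictor coefficients
    pred = np.convolve(x, np.concatenate(([0.0], a)))[:len(x)]
    return x - pred                                     # LP residual e(n)

def inverse_spectral_flatness(x, residual, fs=8000, frame_ms=2):
    # Ratio of speech energy to LP residual energy per non-overlapping frame.
    n = int(fs * frame_ms / 1000)
    num_frames = len(x) // n
    isf = np.empty(num_frames)
    for k in range(num_frames):
        s = slice(k * n, (k + 1) * n)
        isf[k] = np.sum(x[s] ** 2) / (np.sum(residual[s] ** 2) + 1e-12)
    return isf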

2.2. Spectral Processing of Noisy Speech

The area of enhancement of noisy speech is dominated by spectral processing methods [2–4, 6, 8].


Fig. 2: Weight function determination: (a) gross weight function, (b) smoothed Hilbert envelope of LP residual, (c) fine weight function and (d) final weight function.

The basis for these methods is mainly the estimation and subtraction of the noise components from the noisy speech [2]. All these methods are based on the classical spectral subtraction method proposed in [2]. The spectral subtraction approach is effective in eliminating the noise components, but suffers from introducing a tone-like noise into the perceived speech signal, termed musical noise [6]. Most of the later methods are aimed at reducing the musical noise and at extending the capabilities of the basic spectral subtraction method to make it suitable for non-stationary environments. The present work also uses the spectral subtraction method proposed in [2] and modifies it to make it suitable for non-stationary environments. The modifications are aimed at better estimation of the noise components, enhancement of spectral peaks and reduction of musical noise.

3. TEMPORAL AND SPECTRAL PROCESSING OF NOISY SPEECH

The proposed temporal and spectral processing method for enhancement of noisy speech involves the following steps. The noisy speech is processed by LP analysis to extract the inverse spectral flatness parameter. This parameter is further smoothed and non-linearly mapped as described in [7] to derive a weight function for identifying the high SNR regions at the gross level. The noisy speech is then processed to extract the Hilbert envelope of the LP residual as described in [9] to derive a weight function for identifying the high SNR regions at the fine level. Voiced speech is produced as a result of the excitation of the vocal tract system by a periodic train of impulses. The significant excitation within a pitch period occurs around the instants of Glottal Closure (GC), and this significant excitation of the vocal tract system is indicated by a large error in the LP residual. The LP residual cannot be used directly for identifying GC regions due to its bipolar nature [10]. This limitation is overcome by computing the Hilbert envelope of the LP residual [10]. The Hilbert envelope of the LP residual e(n) is defined as [11]


h_e(n) = sqrt( e^2(n) + e_h^2(n) )    (1)

where e_h(n) is the Hilbert transform of e(n), and is given by

e_h(n) = IDFT[ E_h(k) ]    (2)

where

E_h(k) = -j E(k),  k = 0, 1, ..., (N/2) - 1
E_h(k) =  j E(k),  k = N/2, (N/2) + 1, ..., N - 1

where IDFT denotes the Inverse Discrete Fourier Transform and E(k) is computed as the Discrete Fourier Transform of e(n).
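A direct transcription of equations (1)-(2) is sketched below for an LP residual stored as a 1-D numpy array; the even-length assumption and the function name are ours.

import numpy as np

def hilbert_envelope(e):
    # Hilbert envelope of the LP residual via Eqs. (1)-(2); N assumed even.
    N = len(e)
    E = np.fft.fft(e)
    H = np.empty(N, dtype=complex)
    H[:N // 2] = -1j * E[:N // 2]      # E_h(k) = -j E(k),  k < N/2
    H[N // 2:] = 1j * E[N // 2:]       # E_h(k) = +j E(k),  k >= N/2
    eh = np.real(np.fft.ifft(H))       # Hilbert transform e_h(n)
    return np.sqrt(e ** 2 + eh ** 2)   # h_e(n)

Essentially the same envelope is obtained as np.abs(scipy.signal.hilbert(e)).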

The weight functions at the two levels are combined to obtain a single weight function. The combined weight function is multiplied with the LP residual of the noisy speech signal to enhance the high SNR regions. The multiplication is done with the LP residual rather than with the speech signal directly because the LP residual samples are less correlated, so weighting may lead to less perceptual distortion [7]. The modified residual is then used to excite the time-varying all-pole filter derived from the noisy speech signal. The resulting speech signal is termed the temporally processed signal.

The steps involved in temporal processing of speech aid the spectral processing in the following way: the gross level weight function can be viewed as a speech/non-speech detection step. It thus enables the noise estimate to be updated from the most recent noise region, which gives spectral subtraction the capability to handle non-stationary environments. In the proposed method, the spectral processing involves subtraction of the noise components, estimated from the knowledge of speech/non-speech regions, followed by identification and enhancement of the spectral components corresponding to formant, pitch and harmonic locations using sinusoidal analysis. In the linear speech production model, the continuous time speech waveform s(t) is assumed to be the output of passing an excitation waveform e(t) through a linear time-varying filter with impulse response h(t, τ) that models the characteristics of the vocal tract system [12]. Mathematically, the speech signal is expressed as

s(t) = \int_0^t h(t, t - τ) e(τ) dτ    (3)

where the excitation is convolved with a different impulse response at each time t. McAulay and Quatieri [13] proposed a sinusoidal model to represent speech signals. This model represents the excitation source signal e(t) in terms of a sum of sinusoids of arbitrary amplitudes, frequencies and phases. Mathematically, the excitation source signal is expressed as

e(t) = Re \sum_{k=1}^{K(t)} a_k(t) exp( j [ \int_0^t ω_k(σ) dσ + φ_k ] )    (4)

where K(t) is the number of sine-wave components at time t, a_k(t) and ω_k(σ) represent the time-varying amplitude and frequency, and φ_k is the fixed phase offset which accounts for the fact that the sine waves will generally not be in phase. The sine-wave parameters are estimated by applying the Short-Time Fourier Transform (STFT) to a quasi-stationary part of the speech signal. The STFT of speech has peaks occurring at all pitch harmonics and formants; therefore the frequencies of the underlying sine waves correspond to the peaks of the STFT. The amplitudes and phases are estimated at the peaks of the high resolution STFT using a simple peak picking algorithm [13]. In our study, 32 sinusoidal components are selected from the largest 32 peaks of the STFT spectrum, mainly because they cover the entire frequency range, so that almost all formant, pitch and harmonic locations are included.

The spectral peaks corresponding to the sinusoidal components are sampled from the spectral-subtracted speech spectrum using a window function of the type

w_d(k) = e^{-a · sign(k)},   -2 ≤ k ≤ 2    (5)

where a is selected as 1 in this study. The sampled spectrum is added to the spectral-subtracted speech spectrum. The resultant speech spectrum is recombined with the original noisy speech phase spectrum and converted back to the time domain by an IDFT operation.

The proposed spectral processing method is illustrated in Fig. 3 and Fig. 4. Fig. 3(a) shows a frame of the voiced portion of the temporally processed speech, and the corresponding STFT magnitude spectrum and log-magnitude spectrum are shown in Fig. 3(b) and (c), respectively. The selected sinusoidal component locations are indicated by a "*" symbol in the log-magnitude spectrum. Fig. 3(d) and (e) show the spectral-subtracted magnitude spectrum and the window function used for sampling the spectral-subtracted spectrum derived from the sinusoidal analysis. The sampled spectrum is added to the spectral-subtracted speech spectrum and is shown in Fig. 3(f). Fig. 4 illustrates the spectral processing steps for a frame of an unvoiced portion of the speech signal.

The steps involved in the proposed temporal and spectral processing method for enhancement of noisy speech are summarized in Table 1.

4. EXPERIMENTAL RESULTS

A speech signal spoken by a female speaker is selected from the TIMIT database, played back and re-recorded in a noisy environment with an average SNR of about 6 dB; it is shown in Fig. 1(a). The speech signal is processed by LP analysis using a frame size of 20 ms, a frame shift of 10 ms and a 10th order LP analysis to estimate the LPCs and the LP residual [7]; the residual signal is given in Fig. 1(b). The inverse spectral flatness is computed with a non-overlapping frame of 2 ms and smoothed using a 17-point Hamming window.


Table 1: Temporal and Spectral Processing Algorithm

Temporal Processing:

• Compute the LP residual of the noisy speech using a frame size of 20 ms, a shift of 10 ms and 10th order LP analysis.
• Compute the inverse spectral flatness for each non-overlapping 2 ms frame.
• Smooth the inverse spectral flatness using a 17-point Hamming window.
• Nonlinearly map the smoothed inverse spectral flatness values.
• Obtain the gross weight function by repeating each nonlinearly mapped value over its 2 ms interval and smoothing with a 2 ms mean smoothing filter.
• Compute the Hilbert envelope of the LP residual and smooth it with a 1 ms mean smoothing filter.
• Obtain the fine weight function by nonlinearly mapping the smoothed Hilbert envelope.
• Multiply the two weight functions (gross weight function and fine weight function) to generate the overall weight function.
• Multiply the LP residual signal of the noisy speech by the overall weight function.
• Excite the time-varying all-pole filter with the weighted residual to obtain the temporally processed speech.

Spectral Processing:

• Compute the DFT magnitude and phase spectra for the speech regions of the temporally enhanced speech using a Hamming windowed speech frame of size 20 ms, a frame shift of 10 ms and a 1024-point DFT.
• Update the noise magnitude spectrum if 5 consecutive frames are detected as non-speech regions.
• Subtract the most recent noise magnitude spectrum from the noisy speech spectrum.
• Determine the sinusoidal component locations by picking the largest 32 peaks in the DFT magnitude spectrum.
• Enhance the sinusoidal component locations of the spectral-subtracted speech.
• Reconstruct the enhanced speech signal using the IDFT.
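The spectral-processing steps of Table 1 can be sketched for a single 20 ms frame as below. The noise-estimate handling, the peak selection by simple magnitude sorting and the absence of overlap-add resynthesis are simplifications, and all names are our own; the 1024-point DFT, the 32 peaks and the window of Eq. (5) follow the values stated in the paper.

import numpy as np

NFFT = 1024

def enhance_frame(frame, noise_mag, a=1.0, n_peaks=32):
    # noise_mag: magnitude spectrum of the current noise estimate (length NFFT).
    spec = np.fft.fft(frame * np.hamming(len(frame)), NFFT)
    mag, phase = np.abs(spec), np.angle(spec)
    # Basic magnitude spectral subtraction with the most recent noise estimate.
    sub = np.maximum(mag - noise_mag, 0.0)
    # Sinusoidal component locations: the n_peaks largest magnitude bins.
    locs = np.argsort(mag[:NFFT // 2])[-n_peaks:]
    wd = np.exp(-a * np.sign(np.arange(-2, 3)))   # window of Eq. (5), as printed
    boost = np.zeros_like(sub)
    for p in locs:
        k = np.arange(p - 2, p + 3) % NFFT
        boost[k] += wd * sub[k]                   # sample the subtracted spectrum
    enhanced = sub + boost                        # add the sampled peaks back
    # Recombine with the noisy phase and return to the time domain.
    return np.real(np.fft.ifft(enhanced * np.exp(1j * phase)))[:len(frame)]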

The inverse spectral flatness and its mean smoothed values are given in Fig. 1(c) and (d), respectively. The smoothed inverse spectral flatness values are non-linearly mapped to enhance the contrast between the high SNR and low SNR regions. Fig. 1(e) shows the non-linearly mapped inverse spectral flatness values obtained using the mapping function given by [7]

x_k^m = ((x_f^m - x_i^m) / 2) · tanh( α_g π (x_k - x_o) ) + ((x_f^m + x_i^m) / 2)

where x_k^m is the non-linearly mapped value of x_k, x_f^m (= 1) is the maximum mapped value, x_i^m (= 0) is the minimum mapped value, α_g (= 2) is a positive constant which decides the slope of the mapping function, and x_o is the inverse spectral flatness value about which the tanh function is anti-symmetric; it is experimentally found to be 0.75 times the average value of the mean smoothed inverse spectral flatness.
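The mapping can be written directly as the small helper below; the defaults correspond to the gross-level values quoted above, and the function name is ours.

import numpy as np

def nonlinear_map(xs, x_max=1.0, x_min=0.0, alpha_g=2.0, x0=None):
    # tanh mapping of the smoothed parameter values into the range (x_min, x_max).
    if x0 is None:
        x0 = 0.75 * np.mean(xs)        # gross level: 0.75 times the mean smoothed ISF
    return (0.5 * (x_max - x_min) * np.tanh(alpha_g * np.pi * (xs - x0))
            + 0.5 * (x_max + x_min))

For the fine weight function the same mapping is reused with alpha_g = 10 and x0 set to the average of the smoothed Hilbert envelope, as stated later in the experimental section.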

The gross weight function is obtained by repeating each mapped value over its 2 ms interval and smoothing the result with a 2 ms mean smoothing filter.


Fig. 3: Speech components enhancement: (a) a frame of voiced speech, (b) magnitude spectrum, (c) log-magnitude spectrum and peak locations, (d) spectral-subtracted speech spectrum, (e) window function for sampling, and (f) enhanced spectrum.

For deriving the fine weight function, first the Hilbert envelope of the LP residual is computed and mean smoothed by a 1 ms mean smoothing filter. The mean smoothed values are nonlinearly mapped using the same mapping function that is used for deriving the gross weight function, with x_f^m = 1, x_i^m = 0, α_g = 10 and x_o equal to the average value of the mean smoothed Hilbert envelope of the LP residual. The nonlinearly mapped values are termed the fine weight function. The final weight function for the LP residual of the noisy speech is obtained by multiplying the gross and fine weight functions. Fig. 2(a)-(d) show the gross weight function, the smoothed Hilbert envelope of the LP residual, the fine weight function and the final weight function for the speech signal shown in Fig. 1(a). The minimum value of the final weight function is kept at 0.4 to reduce perceptual distortion. The LP residual of the noisy speech is weighted by the weight function and the modified residual excites the time-varying all-pole filter derived from the noisy speech to generate the temporally processed speech.

The temporally processed speech is subjected to basic spectral subtraction and spectral enhancement. The spectral enhancement is achieved by enhancing the regions around the sinusoidal components of the basic spectral-subtracted speech. The resultant speech spectrum is combined with the original noisy speech phase information to reconstruct the enhanced speech signal in the time domain using the IDFT. The enhancement results for the noisy speech shown in Fig. 1(a) are given in Fig. 5. Fig. 5(a)-(d) show the degraded speech, the speech processed by temporal processing, by the combined temporal and spectral processing, and by the conventional spectral subtraction method, respectively; the respective spectrograms are given in Fig. 5(e)-(h).



Fig. 4: Speech components enhancement: (a) a frame of unvoiced speech, (b) magnitude spectrum, (c) log-magnitude spectrum and peak locations, (d) spectral-subtracted speech spectrum, (e) window function for sampling, and (f) enhanced spectrum.

Fig. 6 shows the enhancement results for a male speaker taken from the NOIZEUS database [14], to which 5 dB airport noise is added. The speech signal enhanced by the proposed method is found to be perceptually better than that obtained by temporal or spectral processing alone. From the spectrograms it can also be observed that the spectrogram of the signal processed by the proposed method shows a noticeable improvement over the degraded speech and has fewer random peaks than the conventional spectral subtraction method. The perceptual evaluation of speech quality (PESQ) [15], log-likelihood ratio (LLR) and cepstral distance (CD) [16] measures were used for objective evaluation of the proposed method. The PESQ and LLR measures have been found to yield much stronger correlations with subjective quality ratings [14]. PESQ is able to predict subjective quality with good correlation over a very wide range of conditions, which may include coding distortions, errors, noise, filtering, delay, and variable delay. For evaluating objective performance, ten different male and female speakers are selected from the TIMIT and NOIZEUS databases. Tables 2, 3 and 4 show the PESQ, LLR and CD values for the speech signals taken from the TIMIT and NOIZEUS databases for various SNR levels. From the tables it can be observed that the combined method gives better performance than temporal or spectral processing alone.

5. SUMMARY AND CONCLUSIONS

In this paper we have proposed a method for enhancement of noisy speech based on temporal and spectral processing. The proposed method involves temporal processing of the noisy speech to identify and enhance the high SNR regions in the temporal domain.


Fig. 5: Results of enhancement of noisy speech of a female voice: (a) degraded speech, (b) speech processed by temporal processing, (c) speech processed by temporal and spectral processing, (d) speech processed by the spectral subtraction method and (e)-(h) spectrograms of the respective signals shown in (a)-(d).


Fig. 6: Results of enhancement of noisy speech of a male voice: (a) degraded speech, (b) speech processed by temporal processing, (c) speech processed by temporal and spectral processing, (d) speech processed by the spectral subtraction method and (e)-(h) spectrograms of the respective signals shown in (a)-(d).

The temporally processed speech signal is further subjected to spectral processing to eliminate the noise and also to enhance the peaks in the spectral domain. The speech processed by the proposed method is found to be enhanced better than by temporal or spectral processing alone.

The proposed method is illustrated using existing features/parameters at each stage. A rigorous study may be done to evaluate the significance of each of the parameters.


Table 2: PESQ measure for different speech signals for the examples collected from the TIMIT and NOIZEUS databases. In the tables, the abbreviations SNR, DEG, TP, SP and TPSP refer to Signal to Noise Ratio, degraded speech, temporal processing, spectral subtraction and the proposed temporal and spectral processing, respectively.

              SNR     DEG    TP     SP     TPSP
TIMIT DATA    3 dB    1.72   1.79   2.08   2.22
              6 dB    1.90   1.99   2.33   2.44
              9 dB    2.09   2.19   2.56   2.62
NOIZEUS DATA  5 dB    1.78   1.77   2.24   2.33
              10 dB   2.08   2.11   2.40   2.63
              15 dB   2.46   2.49   2.72   3.09

Table 3: LLR distance measure for different speech signals for the examples collected from the TIMIT and NOIZEUS databases.

              SNR     DEG    TP     SP     TPSP
TIMIT DATA    3 dB    0.98   0.95   0.73   0.66
              6 dB    0.83   0.81   0.59   0.55
              9 dB    0.70   0.67   0.50   0.44
NOIZEUS DATA  5 dB    1.18   1.16   0.86   0.83
              10 dB   0.97   0.94   0.66   0.64
              15 dB   0.73   0.71   0.49   0.47

Table 4: CD measure for different speech signals for the examples collected from the TIMIT and NOIZEUS databases.

              SNR     DEG    TP     SP     TPSP
TIMIT DATA    3 dB    0.52   0.51   0.51   0.47
              6 dB    0.49   0.48   0.49   0.44
              9 dB    0.46   0.45   0.47   0.42
NOIZEUS DATA  5 dB    0.90   0.89   0.78   0.72
              10 dB   0.86   0.84   0.72   0.69
              15 dB   0.78   0.76   0.65   0.61

The proposed approach may be explored for other types of degradations, such as reverberation and multi-speaker speech.

6. REFERENCES

[1] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proc. IEEE, vol. 80, pp. 1526–1555, Oct. 1992.

[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 113–120, April 1979.

[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109–1121, Dec. 1984.

[4] O. Cappe, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech Audio Processing, vol. 2, pp. 345–349, April 1994.

[5] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 3, pp. 251–266, July 1995.

[6] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. Speech Audio Processing, vol. 7, pp. 126–137, March 1999.

[7] B. Yegnanarayana, Carlos Avendano, Hynek Hermansky, and P. Satyanarayana Murthy, "Speech enhancement using linear prediction residual," Speech Communication, vol. 28, pp. 25–42, May 1999.

[8] Yasser Ghanbari and Mohammad Reza Karami Mollaei, "A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets," Speech Communication, vol. 48, pp. 927–940, August 2006.

[9] S. R. Mahadeva Prasanna, Event Based Analysis of Speech, Ph.D. thesis, Indian Institute of Technology Madras, Department of Computer Science and Engg., Chennai, India, March 2004.

[10] T. V. Ananthapadmanabha and B. Yegnanarayana, "Epoch extraction from linear prediction residual for identification of closed glottis interval," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 309–319, Aug. 1979.

[11] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice Hall, 1975.

[12] D. O'Shaughnessy, Speech Communications: Human and Machine, IEEE Press, second edition, 1999.

[13] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 744–754, Aug. 1986.

[14] Yi Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement," in Proc. INTERSPEECH-2006, Philadelphia, PA, Sept. 2006.

[15] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2001, pp. 749–752.

[16] A. Gray and J. Markel, "Distance measures for speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 380–391, Oct. 1976.


PARTIAL ENCRYPTION OF GSM CODED SPEECH USING SPEECH-SPECIFIC KNOWLEDGE

V. Anil Kumar, A. Mitra and S. R. M. Prasanna

Department of Electronics and Communication Engineering, Indian Institute of Technology (IIT) Guwahati, Guwahati-781039, India.

Email: (v.anil, a.mitra, prasanna)@iitg.ernet.in.

ABSTRACT

Encryption of the entire speech signal might be computationally expensive in certain cases, such as providing security against casual listeners. As the speech bitstream can be characterized by non-uniform perceptual importance, encryption of only some important features can reduce the system complexity as well as the encryption delay. In this paper, we present such a partial encryption scheme for global system for mobile communications (GSM) compressed speech. Speech compressed by the most widely used regular pulse excitation long term prediction (RPE-LTP) GSM coder is classified into two divisions: first, the most perceptually relevant, to be encrypted; and the next, perceptually less significant, to be left unprotected. The perceptually important features are encrypted with pseudo noise (PN) sequences. For this purpose, the effectiveness of different PN sequences is analyzed using mean square correlation measures. Results show the effectiveness of large Kasami sequences among many PN sequences. Two objective speech quality measures are used to compare the performance of the proposed partial encryption scheme with the full encryption scheme.

Index Terms: Mean square correlation measures, Partial encryption, Pseudo noise sequences, RPE-LTP coder, Objective quality measures.

1. INTRODUCTION

While transmitting information (speech, image or other data) through insecure channels, there might be unwanted disclosure as well as unauthorized modification of the data if it is not properly secured. Certain mechanisms are therefore needed to protect the information within the insecure channel. One way to provide such protection is to convert the intelligible data into an unintelligible form prior to transmission, and such a process of conversion with a key is called encryption [1]-[3]. At the receiver side, the encrypted message is converted back to the original intelligible form by the reverse process, called decryption. Cryptographic techniques are mainly classified into private key cryptography and public key cryptography. In private key cryptography, also called symmetric key cryptography, the same key is used for both encryption and decryption. In public key cryptography, on the other hand, different keys are used for encryption and decryption, and it is thus called asymmetric key cryptography. The main disadvantage of the private key technique, since the same key is used for both encryption and decryption, is the distribution of the key to authorized users, but it is faster than public key cryptography. In private key cryptography, encryption of the entire speech signal might be computationally expensive in certain cases, such as providing security against casual listeners. Traditional speech encryption techniques [1] treat speech as data only and encrypt the entire speech signal, which increases the encryption delay of the system. The encryption delay and the system complexity can be reduced by using partial encryption schemes. The partial encryption technique for compressed speech was first introduced in [4]. In these schemes, instead of encrypting the entire speech signal, the compressed speech bits are partitioned into perceptually significant and less significant bits, and only the perceptually important speech bits are encrypted while keeping the other bits unprotected. In this paper, we introduce a partial encryption scheme for speech compressed by the most widely used RPE-LTP coder. The perceptually important GSM coder parameters are encrypted with pseudo noise (PN) sequences by an XOR operation. The performance of the partial encryption scheme depends on the perceptual classification of the bitstream and on the PN sequence used for encryption. For this purpose, we analyze the effectiveness of different PN sequences for speech encryption using correlation measures. The mean square correlation measures are used to test the randomness of the PN sequences; a PN sequence with lower correlation values has more noise-like characteristics and is better suited for speech encryption. The perceptually important speech parameters are encrypted by the XOR operation with the selected PN sequence.

The rest of the paper is organized as follows. In Section 2, we deal with the fundamentals of PN sequences and the mean square correlation measures. Section 3 describes the functionality of the RPE-LTP coder and the sensitivity of the different GSM parameters. In Section 4, we discuss the proposed partial encryption technique. The objective quality measures used to evaluate the performance of the partial encryption scheme are described in Section 5. Results of the experiment are discussed in Section 6.

2. FUNDAMENTALS OF PN SEQUENCES

PN sequences are sequences of 1's and 0's whose elements appear statistically independent and uniformly distributed.


[Block diagram: pre-processing, short term (LPC) analysis, long term (LTP) analysis and RPE encoding sections, producing the Log-Area Ratios (36 bits), the LTP parameters (36 bits) and the RPE parameters (188 bits).]

Fig. 1. Simplified block diagram of RPE−LTP encoder.

Statistically, a PN sequence nearly satisfies the requirements of a random binary sequence [5]-[7]. PN sequences have the following noise-like properties: (i) the balance property, (ii) the run property, and (iii) the auto-correlation property [8]. These three properties make PN sequences efficient for speech encryption. Moreover, due to the third property, the correlation between adjacent chips is considerably low, which makes PN sequences more effective for speech encryption than for data encryption, given the high adjacent-sample correlation present in speech signals. Therefore, PN sequences that are useful for speech encryption must have very good auto-correlation and cross-correlation properties while also maintaining randomness properties [9]. Among the many PN sequences, those with the lowest correlation values are identified using mean square correlation measures. The different PN sequences that we have investigated are: (i) maximal length sequences, (ii) Gold sequences, (iii) Gold-like sequences, (iv) Barker sequences, (v) Barker-like sequences, (vi) Kasami sequences, (vii) Walsh Hadamard sequences, (viii) modified Walsh Hadamard sequences, and (ix) orthogonal Gold codes. A brief introduction to these codes can be found in [8]-[9].
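As a small illustration, a 63-chip maximal length sequence can be generated with a Fibonacci LFSR; the degree-6 primitive polynomial x^6 + x + 1 used here is one standard choice, and the helper below is only a sketch, not the generator used by the authors.

import numpy as np

def m_sequence(taps=(6, 1), seed=None):
    # Fibonacci LFSR; taps are 1-indexed stage numbers of the feedback polynomial.
    degree = max(taps)
    state = list(seed) if seed is not None else [1] * degree
    out = []
    for _ in range(2 ** degree - 1):
        out.append(state[-1])                  # output bit
        fb = 0
        for t in taps:
            fb ^= state[t - 1]                 # XOR of the tapped stages
        state = [fb] + state[:-1]              # shift and insert the feedback bit
    return np.array(out, dtype=np.uint8)

seq = m_sequence()                # 63 chips for a non-zero seed
bipolar = 1 - 2 * seq.astype(int) # map {0,1} to {+1,-1} before correlating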

2.1. Mean Square Correlation Measures

The performance of the different PN sequences is evaluated by the mean square aperiodic auto-correlation (MSAAC), R_AC, and mean square aperiodic cross-correlation (MSACC), R_CC, measures. These correlation measures were introduced by Oppermann and Vucetic [10]. If c_i(n) represents the non-delayed i-th sequence, c_j(n + τ) represents the j-th sequence delayed by τ units, and N is the length of the sequence c_i, then the discrete aperiodic correlation function is defined as

r_{i,j}(τ) = (1/N) Σ_n c_i(n) c_j(n + τ),   1 - N ≤ τ ≤ N - 1.    (1)

The mean square aperiodic auto-correlation value for a code set containing M sequences is given by

R_AC = (1/M) Σ_{i=1}^{M} Σ_{τ=1-N, τ≠0}^{N-1} |r_{i,i}(τ)|^2    (2)

and a similar measure for the mean square aperiodic cross-correlation value is given by

R_CC = (1/(M(M - 1))) Σ_{i=1}^{M} Σ_{j=1, j≠i}^{M} Σ_{τ=1-N}^{N-1} |r_{i,j}(τ)|^2.    (3)

These two measures have been used as the basis for comparing the sequence sets. Sequences with good auto-correlation properties have poorer cross-correlation properties, and vice-versa, and sequences with good auto-correlation have a wide and flat frequency spectrum.

2.1.1. Figure of Merit

As has been mentioned, the price for being able to select good cross-correlation properties is a degradation in the auto-correlation properties of the set of sequences. A degradation of the auto-correlation properties has a direct effect on the frequency spectrum of the sequences in the set. If the R_AC values are poor, the spectrum of the sequence will not be wide-band and flat. In order to determine quantitatively how significant this degradation is for a given set of sequences, a figure of merit (FoM) [10] is required to judge the suitability of the frequency characteristics of the sequences. Sequences with a low FoM do not have a wide, flat spectrum and are not suitable for speech encryption. The FoM for a sequence c_i(n) of length N having the auto-correlation function r_{i,i}(τ) is given as:

F_x = r_{i,i}^2(0) / Σ_{τ≠0} |r_{i,i}(τ)|^2 = N^2 / ( 2 Σ_{τ=1}^{N-1} |r_{i,i}(τ)|^2 ).    (4)
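Equations (1)-(4) translate directly into the following sketch for a set of bipolar (+1/-1) sequences stored as rows of a 2-D numpy array; the 1/N normalisation of Eq. (1) is retained, so the FoM below is evaluated in its ratio form.

import numpy as np

def aperiodic_corr(ci, cj):
    # r_{i,j}(tau), tau = 1-N ... N-1, normalised by the sequence length N.
    return np.correlate(ci, cj, mode="full") / len(ci)

def msaac(C):
    # Eq. (2): mean square aperiodic auto-correlation over the code set.
    total = 0.0
    for c in C:
        r = aperiodic_corr(c, c)
        r[len(c) - 1] = 0.0                    # exclude the tau = 0 term
        total += np.sum(np.abs(r) ** 2)
    return total / len(C)

def msacc(C):
    # Eq. (3): mean square aperiodic cross-correlation over all pairs i != j.
    M = len(C)
    total = 0.0
    for i in range(M):
        for j in range(M):
            if i != j:
                total += np.sum(np.abs(aperiodic_corr(C[i], C[j])) ** 2)
    return total / (M * (M - 1))

def fom(c):
    # Eq. (4): figure of merit as r^2(0) / sum over tau != 0 of |r(tau)|^2.
    r = aperiodic_corr(c, c)
    r0 = r[len(c) - 1]
    return (np.abs(r0) ** 2) / (np.sum(np.abs(r) ** 2) - np.abs(r0) ** 2)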

3. RPE−LTP CODER

The RPE-LTP coder, also called the full-rate GSM coder, is the most widely used speech coder for mobile communications [11]. It operates on blocks of 160 samples, or 20 ms, to produce an encoded block of 260 bits, so the bit rate of this coder is 13 kbps for a speech signal sampled at 8 kHz. A simplified block diagram of the RPE-LTP encoder [12] is shown in Fig. 1. In this diagram the coding and quantization functions are not shown explicitly. The functional parts of the coder are: (a) pre-processing, (b) short term prediction (STP) analysis filtering, (c) long term prediction (LTP) analysis filtering, and (d) RPE computation.

The input speech signal is segmented into frames of 160 samples, and each speech frame is first pre-processed to produce an offset-free signal, which is then subjected to a first order pre-emphasis filter to boost the high frequency, low power part of the spectrum. The 160 pre-processed speech samples are then short term analyzed (linear prediction (LP) analysis) [13] to predict and remove the short term redundancy present in the speech signal. The resultant LP filter parameters, termed reflection coefficients, are transformed to log-area ratios (LARs), since LARs have better quantization properties, and encoded with 36 bits. The resulting 160-sample short term residual (LP error) is sub-segmented into 4 sub-frames of 40 samples (5 ms) each. The long term redundancy present in each sub-segment is predicted and removed by the long term prediction analysis section [11]. The LTP prediction error is minimized by the LTP delay D, which maximizes the correlation between the current residual r_STP(n) and its previously received and buffered history at delay D, that is, r_STP(n-D). The gain factor G is the normalized cross-correlation found at delay D. The obtained LTP parameters (gain and delay) are encoded with 36 bits.

The resulting block of 40 long term residual samples is fed to the Regular Pulse Excitation (RPE) analysis section, which performs the basic compression function of the algorithm. As a result of the RPE analysis, the block of 40 input long term residual samples is represented by one of 3 candidate sub-sequences of 13 pulses each. The selected sub-sequence is identified by the RPE grid position. The 13 RPE pulses (beta values) are encoded using adaptive pulse code modulation (APCM), with an estimate of the sub-block amplitude transmitted to the decoder as side information.

3.1. Bit Sensitivity of the RPE-LTP Frame

The different parameters of the encoded speech and their individual bits have unequal importance with respect to subjective quality [11].


Fig. 2. Bit sensitivity in the 260-bit, 20 ms RPE-LTP GSM speech frame.

The sensitivity of each bit in the 260-bit, 20 ms frame is characterized in Fig. 2. This figure provides an overview of the segmental SNR degradation inflicted by consistently corrupting one bit of the 260 bits of each frame, while keeping the others in the frame intact. The repetitive structure in the figure reflects the periodicity due to the four 5 ms, 40-sample excitation optimization sub-segments, while the left hand side section corresponds to the 36 LAR coefficient bits. A high amplitude indicates greater sensitivity to bit errors, and those bits are important for the reconstruction of the speech signal. From Fig. 2, the block maxima (Vm) have the highest sensitivity to bit errors, and the LARs, gain and delay also have high sensitivity to bit errors.

4. PARTIAL ENCRYPTION TECHNIQUE

From the knowledge of the perceptual significance of each parameter and the bit sensitivity of the GSM frame, the 260-bit frame is divided into two sets of bits: a larger, high-protection set, aimed at offering very strong protection against eavesdropping; and a smaller set intended to receive low protection, or none, depending on the application. The larger set contains the LARs, gain, delay and block maxima, since the bit sensitivity of these bits is high and these parameters have high perceptual significance. The 260-bit GSM frame structure and the bits subjected to encryption are shown in Fig. 3. The LARs correspond to the spectral envelope of the speech signal and play an essential role in intelligibility; small changes in these values result in large changes in the poles of the STP synthesis filter, and the system may become unstable. The gain plays the main role in discriminating voiced from unvoiced speech and speech from silence. The delay indicates where the presently processed sub-segment has maximum correlation with the previous samples; changing this value by encryption results in significant distortion in the reconstructed speech signal. If these parameters are made unavailable by encrypting them, reconstruction of the speech signal is difficult. The perceptually less significant bits, i.e. the grid position (GP) and the beta values, are included in the smaller set. The larger set consists of 96 bits out of 260, and the remaining bits are included in the smaller set. The bitstream selected for protection is encrypted with a PN sequence by the XOR operation to obtain the encrypted bitstream. The time domain encrypted speech signal is obtained by decoding the encrypted bits, along with the other bits that remain unprotected, using the RPE-LTP decoder.


[Frame layout: bits 1-36 carry the LARs; the four sub-frames end at bits 92, 148, 204 and 260, each carrying G (2 bits), D (7 bits), GP (2 bits), Vm (6 bits) and the beta values (39 bits). The LARs, G, D and Vm fields are encrypted; GP and the beta values are not.]

Fig. 3. GSM frame structure and the bits subjected to encryption in the 260 bits.

At the receiver side, the original GSM coded speech signal is obtained by decoding the decrypted larger set of bits along with the smaller set of bits.

While encrypting the speech parameters, all the possible PN sequences are stored in a table and a key is selected at random from the PN sequence table using a pseudo random index generator [14]; the speech parameters are then encrypted by the XOR operation with the randomly chosen PN sequence. The advantage of doing this is that the residual intelligibility of the encrypted speech signal is lower when each parameter is encrypted with a different PN sequence instead of the same sequence. Also, as the number of sequences increases, the security offered by the encryption system increases.
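The XOR step itself is simple; the toy sketch below assumes the 260 frame bits are held in a uint8 numpy array, that the indices of the 96 protected bits are known, and that the key is drawn from a pre-computed table of PN sequences. The key-selection rule shown is only illustrative; in the actual scheme the index generator must be synchronised with the receiver.

import numpy as np

rng = np.random.default_rng()

def encrypt_frame(frame_bits, protected_idx, pn_table):
    """frame_bits: uint8 array of 260 bits; protected_idx: indices of the
    perceptually important bits; pn_table: 2-D uint8 array of PN sequences."""
    key = pn_table[rng.integers(len(pn_table))][:len(protected_idx)]
    out = frame_bits.copy()
    out[protected_idx] ^= key            # XOR only the protected bits
    return out

# Decryption is identical: XORing again with the same key restores the bits.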

5. OBJECTIVE QUALITY MEASURES

The performance of the partial encryption scheme was evaluated by signal inspection in both the time and frequency domains and by objective quality measures. In particular, we employ segSNR and mel-cepstral distance measures to evaluate the intelligibility remaining in the encrypted speech signal [15]. The larger the distance between the encrypted speech and the original speech signal, the better the scheme.

5.1. Segmental SNR

Segmental SNR (segSNR) is the signal to noise ratio calculated over short intervals (frames) of speech. The segmental SNR is considered a better method than the traditional SNR for evaluating speech signal quality, since segSNR corresponds more closely to MOS than ordinary SNR. The encryption scheme with the lower segSNR encrypts the speech signal better, and the residual intelligibility of the encrypted speech signal is lower. If x(n) is the original speech signal and y(n) is the encrypted speech signal, then the segSNR value is calculated as:

segSNR = σ_x^2 / σ_e^2 = (1/M) Σ_{k=1}^{M} [ Σ_{n=1}^{N} x^2(n, k) / Σ_{n=1}^{N} e^2(n, k) ]    (5)

where e(n) = x(n) - y(n), M is the number of speech segments and N is the length of each segment.
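Eq. (5) can be evaluated as sketched below; the frame length of 160 samples corresponds to 20 ms at 8 kHz, and conversion to decibels (10·log10) can be applied if a dB figure such as in Table 3 is required. The function name and the small guard constant are ours.

import numpy as np

def seg_snr(x, y, frame_len=160):
    # Frame-wise ratio of signal energy to the energy of the error x - y.
    M = min(len(x), len(y)) // frame_len
    ratios = []
    for k in range(M):
        s = slice(k * frame_len, (k + 1) * frame_len)
        e = x[s] - y[s]
        ratios.append(np.sum(x[s] ** 2) / (np.sum(e ** 2) + 1e-12))
    return float(np.mean(ratios))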

5.2. Mel-Cepstral Distance

Cepstrum-based speech parameters are attractive for use in objective speech quality measures because of the inherent separation of the vocal tract and excitation components of speech. However, the simple CD measure does not account for the well known variation of the ear's critical bandwidth as a function of frequency. This effect can be included by using coefficients derived from the mel-cepstrum instead of ordinary cepstral coefficients. The mean-square mel-cepstral distance is computed as:

MCD = (1/M) Σ_k Σ_i ( MC_x(i, k) - MC_y(i, k) )^2    (6)

where MC_x(i, k) and MC_y(i, k) are the i-th mel-cepstral coefficients for the k-th frame of the original speech signal and the encrypted speech signal, respectively.
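A sketch of Eq. (6) is given below, assuming the mel-cepstral coefficients of the two signals are already available as arrays of shape (frames, coefficients); how those coefficients are computed (for instance, a mel filter bank followed by a DCT) is outside this sketch, and the squared difference follows the mean-square wording above.

import numpy as np

def mel_cepstral_distance(mc_x, mc_y):
    # Mean over frames of the summed squared mel-cepstral differences.
    M = min(len(mc_x), len(mc_y))
    diff = mc_x[:M] - mc_y[:M]
    return float(np.mean(np.sum(diff ** 2, axis=1)))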

6. RESULTS AND DISCUSSIONS

6.1. Correlation Measures

The above techniques for generation of PN sequences and the MSAAC and MSACC measures are implemented in MATLAB. PN sequences of the desired length are generated and the MSAAC and MSACC measures are computed for the code sets. Table 1 shows the correlation measures for PN sequences of length 63 bits. From the results, among all PN sequences the m-sequence and the Barker sequence have low MSAAC values, since these sequences have a single-peak autocorrelation function. The FoM of these sequences is also high, but they are not suitable for speech encryption, since there is only one possible sequence for a given LFSR length and primitive polynomial, so the security provided by these sequences is low. The Gold sequences have a four-valued auto-correlation function and a three-valued cross-correlation function. The FoM of the Gold-like sequences is higher than that of the Gold sequences and the number of codes that can be generated is similar, so these sequences are preferable to Gold sequences. The correlation values of the Barker-like sequences depend on the upper bound placed on the peak correlation function. For Barker-like sequences of length N = 63 and a maximum autocorrelation side lobe amplitude of 15, the obtained MSAAC and MSACC values are shown in Table 1. These sequences have lower auto-correlation values and hence a high FoM value. As mentioned earlier, as the autocorrelation values decrease the crosscorrelation values increase, and the maximum cross-correlation value of these Barker-like sequences is also high. The Barker-like sequences offer high security, but their generation is complex compared with the other PN sequences. Small Kasami sequences have lower autocorrelation values and hence higher crosscorrelation values, but the number of sequences that can be generated is smaller, so the security provided by these sequences is less than that of the Barker-like sequences. The small set of Kasami sequences has a low MSAAC value, so its FoM is high. The correlation values of the large set of Kasami sequences are similar to those of the Gold-like sequences. The maximum cross-correlation value of the small and large sets of Kasami sequences is the same as that of the Gold sequences.


Table 1. Aperiodic correlation measures for PN sequences of length 63 bits

Sequence                   MSAAC    MSACC    FoM       Max-CC
m-sequence                 0.4429   -        2.2577    -
Gold Sequences             0.9750   0.9849   1.0256    0.3333
Gold-like Sequences        0.9227   0.9859   1.0838    0.2857
Barker Sequence (13 bit)   0.0710   -        14.0833   -
Barker-Like Sequences      0.6547   1.0546   1.5274    0.9841
Small Kasami Sequences     0.7604   0.9098   1.3151    0.2222
Large Kasami Sequences     0.9148   0.9979   1.0932    0.9524

Table 2. Aperiodic correlation measures for orthogonal codes of length 64 bits

Sequence                MSAAC     MSACC    FoM      Max-CC
Walsh Codes             10.3906   0.8531   0.0962   0.9844
MWH Codes               5.3281    0.9154   0.1877   0.9531
Orthogonal Gold Codes   0.9739    0.9848   1.0268   0.3438

Among all the sequences, the m-sequence and the Barker sequence have the lowest MSAAC values and the highest FoM, so their frequency spectra are flat and wide.

Table 2 shows the correlation measures for orthogonal codes of length 64 bits. Orthogonal codes have zero cross-correlation when there is no time shift between two sequences, but the correlation values are high when there is a shift between the sequences. The correlation values of the orthogonal codes are high compared to those of the PN sequences. The auto-correlation values of the WH codes are very high and the FoM value is low, so the spectrum of these sequences is not wide and flat. The MWH codes have lower auto-correlation values than the WH codes, which makes their cross-correlation values higher. The orthogonal Gold codes have correlation values similar to those of the original Gold codes. Among all the orthogonal codes, the orthogonal Gold codes have the lowest auto-correlation values, so their FoM is also the highest.

Among all the PN sequences investigated, the large Kasami sequences are best suited for speech encryption. They have low correlation values and a high FoM value, so these sequences have a wide and flat power spectrum, i.e., they have more noise-like properties. Also, the number of possible sequences is large, and therefore the security offered by these sequences is high. Therefore, large Kasami sequences are used for the partial encryption of GSM coded speech.

6.2. Partial Encryption of GSM Speech

The bit stream selected for protection is encrypted by the XOR operation with large Kasami sequences, since they have more noise-like properties and are better suited for speech encryption. Fig. 4(a) shows the time domain representation of the speech signal /vande mataram/ sampled at 8 kHz. This speech signal is compressed with the RPE-LTP encoder to get an encoded frame of 260 bits for each 20 ms speech frame.

Fig. 4. Partial encryption of the speech signal corresponding to the utterance /vande mataram/: (a) time domain representation of the original speech signal, (b) time domain representation of the partially encrypted speech signal, (c) frequency spectrum of a 30 ms voiced segment of the original speech, and (d) frequency spectrum of a 30 ms voiced segment of the partially encrypted speech.

Table 3. Objective distortion measures for fully encrypted and partially encrypted speech signals

Measure                  Full Encryption    Partial Encryption
Segmental SNR            -10.08 dB          -7.93 dB
Mel-Cepstral Distance    47.33 dB           44.7 dB

The speech signal encrypted using the proposed partial encryption technique is shown in Fig. 4(b). From the figure, the encrypted speech signal is heavily distorted, the periodicity present in the speech signal is removed, and the signal appears as random noise. Fig. 4(c) shows the frequency spectrum, computed with a 512-point FFT, of a 30 ms voiced segment taken from the original speech signal, and Fig. 4(d) is the frequency spectrum of the corresponding encrypted speech segment. From Fig. 4(d), the formant structure present in the speech signal is destroyed and the encrypted speech signal carries little information about the original speech signal.

6.3. Objective Quality Measures

Table 3 gives the objective speech quality measures for the fully encrypted and the partially encrypted speech signals. While computing these measures, the length of the speech frame is taken as 20 ms with no overlap. From the table, the security achieved by the partial encryption scheme is equivalent to that of full encryption of the speech.
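For reference, a minimal sketch of the segmental SNR computation with 20 ms non-overlapping frames is given below. It is not the authors' implementation: the precise variant (frame-level clamping, exclusion of silent frames, the mel-cepstral distance) is not specified in this excerpt, so only the basic per-frame SNR average is shown.

```python
import numpy as np

def segmental_snr(reference, test, fs=8000, frame_ms=20):
    """Average over non-overlapping frames of 10*log10(sum(x^2) / sum((x - y)^2)) in dB."""
    n = int(fs * frame_ms / 1000)
    snrs = []
    for start in range(0, len(reference) - n + 1, n):
        x = np.asarray(reference[start:start + n], dtype=float)
        y = np.asarray(test[start:start + n], dtype=float)
        noise = np.sum((x - y) ** 2)
        if noise > 0 and np.sum(x ** 2) > 0:        # skip degenerate frames
            snrs.append(10.0 * np.log10(np.sum(x ** 2) / noise))
    return float(np.mean(snrs))
```

Applied to the original and the decoded encrypted signals, this yields the kind of values reported in Table 3 (more negative values indicating stronger distortion).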


7. CONCLUSION

In this paper, we have investigated different PN sequences that are appropriate for speech encryption using correlation measures, which test the randomness of these PN sequences. The large Kasami sequence set has good correlation values and a high FoM value, so these sequences have a wide, flat spectrum, making them better suited for speech encryption. For a given shift register length, the number of possible large-set Kasami sequences is also large, so the security offered by these sequences is high. The generation of these sequences is also less complex, which further favors their use for speech encryption. We have proposed a partial encryption scheme for GSM coded speech. In this partial encryption, instead of encrypting all the parameters, only the perceptually important parameters are encrypted, which keeps the encryption delay of the system low. With this partial encryption scheme, 96 bits out of 260 are encrypted, i.e., 37% of the total bitstream, making the encryption delay very small. Signal inspection in the time and frequency domains and objective speech quality measures show that the content protection achieved by this partial encryption scheme is equivalent to that of the full encryption scheme.

8. REFERENCES

[1] H. J. Beker and F. C. Piper, Secure Speech Communications, London: Academic Press, 1985.

[2] W. Stallings, Cryptography and Network Security, Englewood Cliffs, New Jersey: Prentice Hall, 2003.

[3] W. Diffie and M. E. Hellman, "New directions in cryptography," IEEE Trans. Inform. Theory, vol. 22, pp. 644-654, Nov. 1976.

[4] A. Servetti and J. C. Carlos, "Perception based partial encryption of compressed speech," IEEE Trans. Speech, Audio Process., vol. 10, no. 8, pp. 637-643, Nov. 2002.

[5] R. L. Pickholtz, D. L. Schilling and L. B. Milstein, "Theory of spread spectrum communications - A tutorial," IEEE Trans. Commun., vol. COM-30, no. 5, May 1982.

[6] E. H. Dinan and B. Jabbari, "Spreading codes for direct sequence CDMA and wideband CDMA cellular networks," IEEE Commun. Magazine, vol. 36, no. 4, pp. 48-54, Sep. 1998.

[7] D. V. Sarwate and M. B. Pursley, "Correlation properties of pseudo random and related sequences," Proc. of IEEE, vol. 68, no. 5, pp. 593-619, May 1980.

[8] A. Mitra, "On pseudo-random and orthogonal spreading sequences," Int. J. Info. Tech., vol. 4, no. 2, pp. 137-144, Sept. 2007.

[9] V. Anil Kumar, A. Mitra and S. R. M. Prasanna, "On the effectivity of different pseudo-noise and orthogonal sequences for speech encryption from correlation properties," Int. J. Info. Tech., vol. 4, no. 2, pp. 145-152, Sept. 2007.

[10] I. Oppermann and B. S. Vucetic, "Complex spreading sequences with a wide range of correlation properties," IEEE Trans. Commun., vol. COM-45, pp. 365-375, March 1997.

[11] L. Hanzo, F. C. A. Somerville and J. P. Woodard, Voice Compression and Communications, series editor: John B. Anderson, Wiley-IEEE Press, 2001.

[12] European Telecommunication Standards Institute, "European digital telecommunications system (Phase 2+): Full rate speech transcoding (GSM 06.10)," ETSI, 1999.

[13] T. F. Quatieri, Discrete Speech Signal Processing: Principles and Practice, Pearson Education, 2000.

[14] L. T. Wang and E. J. McCluskey, "Linear feedback shift register design using cyclic codes," IEEE Trans. Comput., vol. 37, pp. 1302-1306, Oct. 1988.

[15] S. Quackenbush, T. Barnwell and M. Clements, Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs, NJ, 1988.


EXPLORING SUPRASEGMENTAL FEATURES USING LP RESIDUAL FOR AUDIO CLIP CLASSIFICATION

Anvita Bajpai (1), B. Yegnanarayana (2)

(1) Applied Research Group, Satyam Computer Services Ltd., Bangalore
(2) International Institute of Information Technology, Hyderabad

[email protected], [email protected]

ABSTRACT

This paper demonstrates the presence of audio-specific suprasegmental information in the Linear Prediction (LP) residual signal, obtained after removing the predictable part of the audio signal. This information, if added to existing audio classification systems, can result in better performing systems. The audio-specific suprasegmental information can not only be perceived by listening to the residual, but can also be seen in the form of excitation peaks in the residual waveform. However, the challenge lies in capturing this information from the residual signal. Higher order correlations among samples of the residual are not known to be captured using standard signal processing and statistical techniques. The Hilbert envelope of the residual is shown to further enhance the excitation peaks present in the residual signal. A pattern specific to an audio class is also observed in the autocorrelation sequence of the Hilbert envelope. Statistics of this autocorrelation sequence, when plotted in the variance space, are shown to form clusters specific to audio classes. This indicates the presence of audio-specific suprasegmental information in the residual signal.

1. INTRODUCTION

A large volume of multimedia data is in use today for various applications. Statistics show that the volume of multimedia data in use is about 10^5 times that of text data (http://www.sims.berkeley.edu/research/projects/how-much-info/summary.html). Content-based analysis of the multimedia data [1] is important for targeting and personalization of applications. However, the opaque nature of this data (data stored in the form of bits and a sampling frequency) does not convey any significant information about its contents to the user. These facts make the task more challenging. Audio plays an important role in handling multimedia data, as it is easier to process than video data, and the audio data also contains perceptually significant information. Audio indexing is the task of dealing with analysis, storage and retrieval of multimedia data based on its audio content [2]. Classification of the audio data into different categories is one important step in building an audio indexing system.

The information about the audio category is contained in the excitation source (subsegmental), system/physiological (segmental) and behavioral (suprasegmental) characteristics of the audio data. For humans, the information about the audio category is perceived by listening to a longer segment of the audio signal. This other level of information contained in the audio signal is the suprasegmental information, that is, the variation of the signal over a long duration (typically 50 ms to 200 ms in the case of speech). These behavioral characteristics of the audio data are perceived in the Linear Prediction (LP) residual [3] of the audio signal as well. Sometimes this difference may not be noticed in the waveform, but it can be perceived while listening to the residual signal. The residual of a signal may be less affected by channel degradations than the spectral information [4]. Hence, it is worthwhile to explore these features for the audio clip classification task. This paper emphasizes the importance of the suprasegmental information present in the LP residual of audio signals. An audio classification system based on suprasegmental features, if combined [5] with existing systems based on segmental [6, 7] and subsegmental [8, 9] features, can give a better performing audio classification system.

The classes considered for the study are advertisement, cartoon, cricket, football and news. The behavioral characteristics of the audio are also perceived in the form of the sequence of excitation peaks in the LP residual of the signal for the five audio classes considered. The excitation peaks can further be enhanced using the Hilbert envelope [10] of the residual signal. The gap between the excitation peaks corresponds to the pitch period in the case of speech. The pitch period varies for different audio signals. The categories considered for the study are combinations of various audio components, like speech and music [8]. Hence, the patterns in the excitation peaks in the Hilbert envelope for different audio categories are also combinations of periodicities of the audio components, which vary for different categories. The pattern in these excitation peaks leads to a pattern in the peaks of the autocorrelation sequences of the Hilbert envelope for the five audio categories. It further leads to different statistical distributions of autocorrelation peaks for different audio categories.



This emphasizes the presence of audio-specific suprasegmental information in the LP residual signal. However, the use of this information for the classification task is still to be explored.

Section 2 discusses the presence of the suprasegmental information in the LP residual signal. Section 3 discusses the Hilbert envelope. The methods to extract suprasegmental information from the Hilbert envelope are discussed in Section 4. Section 5 concludes the paper.

2. SUPRASEGMENTAL FEATURES IN THE LP RESIDUAL SIGNAL

2.1. Computation of the LP Residual from the Audio Signal

The first step is to extract the LP residual from the audio signal using linear prediction (LP) analysis [3]. In LP analysis, each sample is predicted as a linear weighted sum of the past p samples, where p represents the order of prediction. If s(n) is the present sample, then it is predicted from the past p samples as

s'(n) = -\sum_{k=1}^{p} a_k s(n-k)    (1)

The difference between the actual and the predicted sample value is termed the prediction error or residual, given by

e(n) = s(n) - s'(n) = s(n) + \sum_{k=1}^{p} a_k s(n-k)    (2)

The linear prediction coefficients {a_k} are determined by minimizing the mean squared error over an analysis frame.
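A minimal sketch of this computation (not the authors' implementation) is shown below; it uses the autocorrelation method with the sign convention of Eqs. (1)-(2), and assumes it is applied to one windowed analysis frame at a time with an LP order chosen by the user.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(s, p=10):
    """LP residual of one frame via the autocorrelation method:
    e(n) = s(n) + sum_k a_k s(n-k), Eqs. (1)-(2)."""
    s = np.asarray(s, dtype=float)
    # biased autocorrelation estimates r[0..p]
    r = np.array([np.dot(s[:len(s) - m], s[m:]) for m in range(p + 1)])
    a = solve_toeplitz(r[:p], -r[1:p + 1])          # Yule-Walker system: R a = -r
    e = lfilter(np.concatenate(([1.0], a)), [1.0], s)   # inverse filter 1 + sum a_k z^-k
    return e, a
```

For a whole clip, the residual is obtained by running this frame by frame (the frame length and LP order used by the authors are not specified here) and concatenating or overlap-adding the per-frame residuals.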

2.2. Suprasegmental Features in the LP Residual Signal

The behavioral characteristics of one audio category differ from those of another. In Fig. 1, the LP residual signals for the five audio categories are shown, and one may notice some differences in the patterns of the residual signals of the five categories. Sometimes this difference may not be noticed in the waveform, but it could be perceived while listening to the residual signal. Hence, it is worthwhile to explore these features for the audio clip classification task. The patterns in the LP residual signal are in the form of a sequence of excitation peaks. These excitation peaks can be considered as event markers. The sequence of these events contains important perceptual information about the source of excitation and the behavioral characteristics of the audio. By listening to the residuals of different types of audio clips, one can distinguish between speakers, music, instruments, etc. The excitation peaks can further be enhanced by taking the Hilbert envelope of the residual signal. The Hilbert envelope computation removes the phase information present in the residual, thereby leading to better identification of the excitation peaks.

Fig. 1. The LP residual for segments of audio clips belonging to the five audio categories (advertisement, cartoon, cricket, football, news); x-axis: number of residual samples, y-axis: residual amplitude.

3. THE HILBERT ENVELOPE OF THE LP RESIDUAL SIGNAL

3.1. Computation of the Hilbert Envelope from the Residual Signal

The residual signal is used to compute the Hilbert envelope, where the excitation peaks show up prominently. The Hilbert envelope is defined as

h_e(n) = \sqrt{e^2(n) + h^2(n)}    (3)

where h_e(n) is the Hilbert envelope, e(n) is the LP residual and h(n) is the Hilbert transform of the residual. The Hilbert transform of a signal is the 90-degree phase-shifted version of the original signal. Therefore, the Hilbert envelope represents the magnitude of the analytic signal

x(n) = e(n) + i h(n)    (4)

where x(n) is the analytic signal, e(n) is the residual and h(n) is the Hilbert transform of the residual.

The Hilbert envelope computation removes the phase information present in the residual. This leads to an emphasis of the excitation peaks. The excitation peaks are further emphasized by using the neighborhood information of each sample in the Hilbert envelope. The modified Hilbert envelope is computed as

h_{em}(n) = \frac{h_e^2(n)}{\frac{1}{2l+1}\sum_{k=n-l}^{n+l} h_e(k)}    (5)

where h_{em}(n) is the modified Hilbert envelope, h_e(n) is the Hilbert envelope and l is the number of samples on either side of the neighborhood of the current sample n.
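A minimal sketch of Eqs. (3)-(5) (not the authors' code) is given below; the neighborhood half-width l is not specified in this paper, so the default value here is only a placeholder.

```python
import numpy as np
from scipy.signal import hilbert

def modified_hilbert_envelope(residual, l=25):
    """Hilbert envelope of the LP residual (Eq. 3) and its modified version (Eq. 5),
    where each squared envelope sample is divided by its local mean over 2l+1 samples."""
    e = np.asarray(residual, dtype=float)
    h = np.imag(hilbert(e))                      # Hilbert transform of the residual
    he = np.sqrt(e ** 2 + h ** 2)                # Hilbert envelope, Eq. (3)
    kernel = np.ones(2 * l + 1) / (2 * l + 1)
    local_mean = np.convolve(he, kernel, mode="same")
    hem = he ** 2 / np.maximum(local_mean, 1e-12)   # Eq. (5), guarded against division by zero
    return he, hem
```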

3.2. Presence of Suprasegmental Features in the Hilbert Envelope

Fig. 2 shows the residual, the Hilbert envelope and the corresponding modified Hilbert envelope for a noisy audio


(speech) segment. It can be noticed that the excitation peaks are clearly visible in the modified Hilbert envelope (for this study the modified Hilbert envelope is used, so in the following text "Hilbert envelope" refers to the modified Hilbert envelope). The gap between excitation peaks is the pitch period in the case of a speech signal, which varies for different audio signals. The categories considered for the study are combinations of various audio components. Hence the patterns in the excitation peaks in the Hilbert envelope for different audio categories are also a combination of events of the audio components. As shown

Fig. 2. The waveform, LP residual, Hilbert envelope and modified Hilbert envelope of the residual signal for a noisy audio segment (amplitude vs. samples).

Fig. 3. The Hilbert envelope of the residual signal for segments of audio clips belonging to the five audio categories (advertisement, cartoon, cricket, football, news); x-axis: number of Hilbert envelope samples, y-axis: Hilbert envelope amplitude.

in Fig. 3, the pattern in the excitation peaks over a longer segment duration is different for different audio categories; hence it can be utilized for the audio clip classification task.

4. SUPRASEGMENTAL FEATURES IN THE HILBERT ENVELOPE

As discussed in the previous section, there is audio-specific information at the suprasegmental level in the LP residual signal, which can be perceived by listening to the signal, and it can also be observed in the Hilbert envelope of the LP residual signal, as shown in Fig. 3. One method to capture this pattern in the Hilbert envelope is to take the autocorrelation of the Hilbert envelope. For a 100 ms segment of the Hilbert envelope, the autocorrelation sequence is calculated. The reason for choosing a 100 ms window size for the autocorrelation is that the long-term characteristics of the audio signal are of interest. The window is then shifted by 50 ms, and the autocorrelation calculation is repeated until the whole length of the audio clip is covered. These autocorrelation sequences (from the 3rd sample after the center peak to the 400th sample, normalized with respect to the central peak) are plotted in Fig. 4 for one clip from each of the five audio categories considered.

Fig. 4. The autocorrelation sequences of the Hilbert envelope of the residual signal for the five audio categories (advertisement, cartoon, cricket, football, news); x-axis: autocorrelation samples, y-axis: frame sequence and peak strength.

It can be seen in Fig. 4 that for news audio there is a sharp peak in the autocorrelation sequence around the 60th sample, and the pattern is relatively uniform, while for cartoon audio the peaks occur at an interval of around 30 samples. No clear peak distribution is found in football, advertisement, and noisy regions of cricket audio clips, but the peak strengths in the autocorrelation sequences differ for these three categories. In clear commentary regions of cricket audio, the pattern is similar to that of a news clip. Hence the distribution of these peaks gives evidence of the suprasegmental characteristics of the different audio categories.

The mean, variance and skewness of the autocorrelation sequences are calculated along the frame sequence axis for each of the 3rd to 203rd samples. The variance of each of these statistics for a clip is plotted as a point in three dimensions. These statistics of the autocorrelation sequences of the Hilbert envelope are plotted for all 1359 test clips (data collected across TV broadcast channels; number of clips per category: advertisement 226, cartoon 208, cricket 318, football 300, news 306) in Fig. 5.


It can be seen that clips belonging to different categories occupy different but overlapping regions in this plot. The overlapping area in the plot implies that the knowledge of the audio components plays a major role in this study: (parts of) audio clips having similar audio components have been observed to have the same kind of pattern in the autocorrelation sequence and to occupy the overlapping area of the 3-dimensional plot. However, in spite of the overlapping region, the clips belonging to different audio categories can be observed to occupy different regions of the 3-dimensional plot, as in Fig. 5. It can be seen in Fig. 5 that the clips belonging to the news category lie in the right-hand region of the plot (shown using 'x'), and clips belonging to the football category lie in the central region of the plot (shown using 'o'). Clips of the cartoon category lie in the left-hand region of the plot (shown using '<>'), and clips of the cricket category lie in the central-most region of the plot (shown using '*'). A significant overlap can be noticed between the cricket and football clips, in spite of a non-overlapping area occupied by the clips of the two categories. Advertisement clips lie in the central bottom-most area of the plot (shown using '.'). This emphasizes the presence of audio-specific suprasegmental information in the LP residual signal.
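A minimal sketch of this feature extraction (not the authors' code) is given below; the sampling rate is an assumption, while the 100 ms/50 ms windowing and the 3rd-203rd lag range follow the description above.

```python
import numpy as np
from scipy.stats import skew

def clip_features(hem, fs=8000, win_ms=100, shift_ms=50, lo=3, hi=203):
    """3-D feature for one clip: per-lag mean, variance and skewness of the lag-normalized
    autocorrelation of the (modified) Hilbert envelope across windows, each reduced to its
    variance over the lags 3..203."""
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    rows = []
    for start in range(0, len(hem) - win + 1, shift):
        seg = np.asarray(hem[start:start + win], dtype=float)
        ac = np.correlate(seg, seg, mode="full")[win - 1:]   # lags 0 .. win-1
        rows.append(ac[lo:hi + 1] / ac[0])                   # lags 3..203, normalized by lag 0
    rows = np.array(rows)                                    # shape: frames x lags
    per_lag_mean = rows.mean(axis=0)
    per_lag_var = rows.var(axis=0)
    per_lag_skew = skew(rows, axis=0)
    return np.array([per_lag_mean.var(), per_lag_var.var(), per_lag_skew.var()])
```

Each clip thus maps to one point in the three-dimensional space shown in Fig. 5.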

5. CONCLUSIONS AND FUTURE WORK

The information present in an audio signal can be categorized at three levels: subsegmental, segmental and suprasegmental. In this paper, the presence of audio-specific suprasegmental features in the LP residual signal is discussed. The pattern in the excitation peaks in the LP residual for different audio categories was enhanced by taking the Hilbert envelope of the residual signal. The statistics of the peak distribution in the autocorrelation sequences of the Hilbert envelopes were noticed to be different for the five audio categories. This paper has demonstrated the presence of audio-specific suprasegmental information in the LP residual signal, which could be useful for the classification task. However, the use of the suprasegmental information as additional evidence for the audio clip classification task is still to be explored.

6. REFERENCES

[1] B. Feiten and S. Gunzel, "Automatic Indexing of a Sound Database using Self-organizing Neural Nets," Computer Music Journal, vol. 18(3), pp. 53-65, 1994.

[2] N. V. Patel and I. K. Sethi, "Audio Characterization for Video Indexing," in Storage and Retrieval for Image and Video Databases (SPIE), San Jose, USA, Feb. 1996.

[3] J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, no. 4, pp. 561-580, Apr. 1975.

Fig. 5. The statistics of the autocorrelation sequences of the Hilbert envelope for the five audio categories: variance of the skewness of the autocorrelation peaks plotted against the variance of the mean and the variance of the variance of the autocorrelation peaks of the Hilbert envelope (legend: '.' advertisement, '<>' cartoon, '*' cricket, 'o' football, 'x' news).

[4] B. Yegnanarayana, S. R. M. Prasanna and K. S. Rao, "Speech Enhancement using Excitation Source Information," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, Orlando, FL, USA, May 2002.

[5] B. Yegnanarayana, S. R. M. Prasanna, J. M. Zachariah and C. S. Gupta, "Combining Evidence from Source, Suprasegmental and Spectral Features for a Fixed-Text Speaker Verification System," IEEE Trans. Speech and Audio Processing, vol. 13, no. 4, July 2005.

[6] G. Aggarwal, A. Bajpai, A. N. Khan and B. Yegnanarayana, "Exploring Features for Audio Indexing," in Inter-Research Institute Student Seminar, IISc Bangalore, India, Mar. 2002.

[7] G. Guo and S. Z. Li, "Content-based Audio Classification and Retrieval by Support Vector Machines," IEEE Trans. on Neural Networks, vol. 14, no. 1, pp. 209-215, Jan. 2003.

[8] Anvita Bajpai and B. Yegnanarayana, "Audio Clip Classification using LP Residual and Neural Networks Models," in Proc. European Signal and Image Processing Conference, Vienna, Austria, Sep. 2004.

[9] Anvita Bajpai and B. Yegnanarayana, "Exploring Features for Audio Clip Classification using LP Residual and Neural Networks Models," in Proc. Int. Conf. Intelligent Signal and Image Processing, Chennai, India, Jan. 2004.

[10] T. V. Ananthapadmanabha and B. Yegnanarayana, "Epoch Extraction from Linear Prediction Residual for Identification of Closed Glottis Interval," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, no. 4, pp. 309-319, Aug. 1979.


SPECTRAL ESTIMATION USING CLP MODEL AND ITS APPLICATIONS IN SPEECH PROCESSING

Gupteswar Sahu and S. Dandapat

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

Email:{gupteswar,samaren}@iitg.ernet.in

ABSTRACT

In this work, a new method of linear prediction (LP) spectrum estimation is proposed. It shall be called constrained linear prediction (CLP) spectrum estimation. The CLP model is based on constraining one of the model parameters of the linear prediction model. The value of the constrained parameter can be used for improving the spectral resolution between two close peaks in the spectrum. The benefits of the CLP model are demonstrated by comparing its performance with other methods. The performance is studied from the viewpoint of resolution for different data lengths and signal-to-noise ratios. The CLP model is also tested for formant frequency estimation of speech signals embedded in noise. Results obtained using vowels embedded in +5 dB S/N white Gaussian noise indicate that the proposed model produces formant frequencies which are comparable to those estimated by the LP method.

Index Terms — linear prediction model, spectral estimation, constrained linear prediction model, formant frequency.

1. INTRODUCTION

The traditional, or nonparametric, spectrum estimation methods such as the periodogram and Blackman-Tukey methods rely on the fast Fourier transform (FFT) for the computation of the power spectrum [1]. The FFT based methods introduce a considerable amount of frequency leakage into the computed spectrum because of the presence of side lobes when windowing the data. In addition to this problem, the resolution of the estimated spectrum becomes poorer as the data length decreases. In real time applications such as speech processing, the signal is assumed to be stationary only for short data lengths. To overcome the limitations of these FFT methods, parametric spectrum estimation methods have been proposed. These methods do not assume that the data is zero outside the sampled interval. Rather, a reasonable model is assumed which can generate the data, and the parameters of the assumed model are calculated from the available data.

Linear prediction assumes that a signal can be modeled as a linear combination of previous samples. For the detection of narrow band harmonics in the presence of noise, linear prediction has proven to be an effective and computationally efficient choice for power spectrum estimation. The main objective of formant analysis is to determine the complex natural frequencies of the vocal tract as they change during speech production [2]. Estimation of formant frequencies is generally a difficult task. The problem is that formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal rather than just measured. The spectral shape of the vocal tract excitation strongly influences the observed spectral envelope. The dominant method of formant frequency estimation is based on modeling the speech signal as if it were generated by a particular kind of source and filter. Apart from a variety of formant tracking approaches, considerable attention has been paid to methods based on linear prediction (LP) analysis [3][4]. However, detection and estimation of formants from noisy speech is a difficult task, because the accuracy of root-finding algorithms based on LPC is sensitive to the noise level [5][6].

In linear prediction, the speech waveform is represented by a set of parameters of an all-pole model, called linear predictive coefficients (LPC), which are closely related to the speech production transfer function. The LPC analysis essentially attempts to find an optimal fit to the envelope of the speech spectrum for a given sequence of speech samples. The performance of the LPC technique, which is equivalent to autoregressive (AR) modeling of the speech signal, however, degrades significantly in the presence of noise [7]. The least-squares estimates of the LPC parameters from a noise corrupted sequence using an all-pole speech model become biased.

The present study proposes a constrained linear prediction (CLP) model. This model is based on constraining one of the model parameters of a linear prediction model, which helps obtain a modified or desired LP spectrum. The value of the constrained parameter can be used for improving the spectral resolution between the peaks in the spectrum. The data assumed consist of two sinusoids of equal amplitude embedded in white Gaussian noise. Also, using the CLP model, the formant frequencies of synthetic vowels are estimated and compared with the formants estimated by the LP method in the presence of noise.


2. THE PROPOSED CONSTRAINED LINEAR PREDICTION (CLP) MODEL

The AR model is defined as [8]

x[n] = \sum_{k=1}^{M} a_k x[n-k] + G u[n]    (1)

where x[n] is a random signal, M is the model order, the a_k's are the model parameters, u[n] is the input to the model and G is the gain of the model. The corresponding linear prediction model is given as

\hat{x}[n] = \sum_{k=1}^{M} a_k x[n-k]    (2)

where \hat{x}[n] is the estimated signal. The linear prediction error is given as

e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1}^{M} a_k x[n-k] = G u[n]    (3)

The above relation implies that if the signal obeys the AR model exactly, then e[n] = G u[n], or the input to the AR model is the prediction error signal. The proposed M-th order constrained autoregressive model for a stationary signal is given as

x[n] = \sum_{k=1, k \neq p}^{M} a_k^c x[n-k] + a_p^c x[n-p] + G u[n]    (4)

A constant value is assigned to the p-th coefficient or parameter, a_p^c. The constrained AR model is the same as the AR model if the a_p^c parameter is not preassigned any value. The corresponding linear prediction model, called the constrained linear prediction (CLP) model, is given as

\hat{x}[n] = \sum_{k=1, k \neq p}^{M} a_k^c x[n-k] + a_p^c x[n-p]    (5)

The error signal for this model is

e[n] = x[n] - \hat{x}[n] = x[n] - \sum_{k=1, k \neq p}^{M} a_k^c x[n-k] - a_p^c x[n-p]    (6)

For estimating the model parameters a_k^c, an error index EI is defined as

EI = \sum_{n=0}^{N-1} e^2[n]    (7)

where N is the number of signal samples. Constraining the p-th parameter a_p^c and minimizing this error index, the Yule-Walker equations for the CLP model are

\sum_{k=1, k \neq p}^{M} a_k^c R_{XX}[j-k] = R_{XX}[j] - a_p^c R_{XX}[j-p]    (8)

where j = 1, 2, ..., M and j \neq p, and R_{XX}[j] is the autocorrelation estimate at lag j. There are M-1 unknowns, as one of the M parameters, a_p^c, is assigned a value. These M-1 parameters can be obtained by solving the above M-1 linear equations. The transfer function of the model is given as

H(z) = \frac{1}{1 - \sum_{k=1, k \neq p}^{M} a_k^c z^{-k} - a_p^c z^{-p}}    (9)

The CLP power spectral density is estimated as

P_{CLP}(z) = \frac{1}{\left| 1 - \sum_{k=1}^{M} a_k^c z^{-k} \right|^2}    (10)
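A minimal sketch (not the authors' code) of solving the constrained Yule-Walker system of Eq. (8) and evaluating the spectrum of Eq. (10) is given below; it assumes biased autocorrelation estimates and, as in the experiments that follow, constrains the last (M-th) coefficient.

```python
import numpy as np
from scipy.linalg import solve

def clp_coefficients(x, M=4, p=None, acp=-0.99):
    """Fix the p-th coefficient to acp and solve the remaining M-1 equations of Eq. (8)."""
    if p is None:
        p = M                                   # constrain the last parameter
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(M + 1)])  # R_xx[0..M]
    R = lambda lag: r[abs(lag)]
    ks = [k for k in range(1, M + 1) if k != p]
    A = np.array([[R(j - k) for k in ks] for j in ks])
    b = np.array([R(j) - acp * R(j - p) for j in ks])
    a = np.zeros(M + 1)                         # a[1..M]; a[0] unused
    for idx, k in enumerate(ks):
        a[k] = solve(A, b)[idx]
    a[p] = acp
    return a[1:]

def clp_psd(a, nfft=512):
    """CLP power spectral density, Eq. (10): 1 / |1 - sum_k a_k e^{-jwk}|^2."""
    denom = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), n=nfft)
    return 1.0 / np.abs(denom) ** 2

# Two sinusoids (0.45*pi and 0.5*pi rad/sample) in white Gaussian noise, N = 16
n = np.arange(16)
rng = np.random.default_rng(1)
x = np.cos(0.45 * np.pi * n) + np.cos(0.5 * np.pi * n) + 0.1 * rng.standard_normal(16)
a = clp_coefficients(x, M=4, acp=-0.99)
psd = clp_psd(a)
# Poles of the model: with acp = -0.99 they lie close to the unit circle,
# which produces the sharper spectral peaks discussed in Section 3.
poles = np.roots(np.concatenate(([1.0], -a)))
```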

3. RESULTS AND DISCUSSIONS

3.1. Power spectrum of two sinusoidal signals embedded in white Gaussian noise

To compare the performance of the different methods, two sinusoidal signals of different frequencies (f1 = 0.45π and f2 = 0.5π) with equal amplitudes, in white Gaussian noise, are used. The SNR is varied from 40 to 5 dB, and the data lengths used are N = 8, 16 and 32. The model order M is kept at 4 for all the methods. For comparison purposes, the spectrum is also computed using the LP and Burg methods [1]. The proposed model increases the frequency resolution in the power spectrum. This is achieved by constraining the a_M^c parameter of the model. The value of the constrained fourth parameter, a_M^c, is -0.99.

Figs. 1, 2 and 3 show the plots of power spectral density (PSD) for the LP, CLP and Burg methods; the signal strength P(f) at frequency f is plotted in decibels (dB) relative to the maximum spectral strength Pmax over all frequencies, computed as 10 log10 [P(f)/Pmax]. The LP method is unable to resolve the two frequency components in all the cases, whereas the Burg method is able to resolve the two frequency components only for high SNR values. The CLP model spectrum can resolve the two close peaks that are unresolved by the normal LP and Burg methods, even when the SNR is 5 dB and N = 8. The CLP method has better resolution capability than the other two methods. The poles of the CLP model are close to the unit circle for a_4^c = -0.99, which signifies that there should be sharp peaks at the corresponding frequencies in the power spectrum. This increased sharpness is attributed to the increased resolution between close frequencies.


Fig. 1: Plots of power spectral density (LP, CLP and Burg methods; normalized magnitude in dB vs. normalized frequency) for N = 8: (a) SNR = 40 dB, (b) SNR = 30 dB, (c) SNR = 20 dB, (d) SNR = 5 dB.

Figure 4 shows the pole plots obtained from a 4th-order normal LP model and a 4th-order CLP model with a_4^c = -0.99. The poles of the CLP model are denoted by '+' while those of the normal LP model are denoted by 'x'. The poles of the CLP model are close to the unit circle, whereas the poles of the normal LP model lie relatively further inside the unit circle. The poles are shifted towards the unit circle by choosing a lower value for the a_M^c parameter in the CLP model.

Fig. 2: Plots of power spectral density (LP, CLP and Burg methods; normalized magnitude in dB vs. normalized frequency) for N = 16: (a) SNR = 40 dB, (b) SNR = 30 dB, (c) SNR = 20 dB, (d) SNR = 5 dB.

This results in sharper peaks at the corresponding frequencies in the magnitude spectrum. As a result, two closely spaced peaks are expected to be resolved which otherwise may not be resolved if the poles are located away from the unit circle.


Fig. 3: Plots of power spectral density (LP, CLP and Burg methods; normalized magnitude in dB vs. normalized frequency) for N = 32: (a) SNR = 40 dB, (b) SNR = 30 dB, (c) SNR = 20 dB, (d) SNR = 5 dB.

3.2. Results of formant frequency estimation

The performance of the proposed method was evaluated using real and synthetic vowels. Three synthetic vowels /a/, /o/ and /u/ from voices produced by a male subject (pitch of 120 Hz), corrupted by white Gaussian noise at +5 dB S/N, were used for evaluation. The signal was sampled at 8000 Hz. A window length of 200 samples is used for the analysis.

Fig. 4: Pole locations for the LP and CLP models for two sinusoids embedded in noise (real and imaginary parts of the poles; CLP poles '+', LP poles 'x').

The model order is best selected to be twice the assumed number of formants contained in the data. In this paper, 6th-order linear prediction models generated by the autocorrelation method are compared experimentally. Once the prediction polynomial has been calculated, the formant parameters are determined by applying a peak-picking algorithm to the filter response curve. Each pair of complex roots is used to calculate the corresponding formant frequency. The constrained M-th parameter is -0.99. The vowels were synthesized using the Klatt synthesizer [13].
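A minimal sketch of the peak-picking step (not the authors' implementation) is given below; it locates local maxima of the all-pole amplitude spectrum, i.e., points where the derivative changes sign from positive to negative, and assumes the LP/CLP coefficients follow the sign convention of Eq. (9).

```python
import numpy as np

def pick_formants(a, fs=8000, nfft=1024, n_formants=3):
    """Pick formant frequencies as the strongest local maxima of 1/|A(e^{jw})|."""
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), n=nfft)
    spec = 1.0 / np.abs(A)                         # all-pole amplitude spectrum
    d = np.diff(spec)
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1   # derivative sign change + to -
    freqs = peaks * fs / nfft                      # convert bin index to Hz
    order = np.argsort(spec[peaks])[::-1][:n_formants]     # keep the strongest peaks
    return np.sort(freqs[order])
```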

Table 1: Standard deviation (Hz) of formant frequency errors for synthetic vowels using the LP and CLP models. The vowels are embedded in +5 dB white noise.

          /a/               /o/               /u/
      LP      CLP       LP      CLP       LP       CLP
F1    158.6   7.6       81      57.5      37.1     84.3
F2    799.2   545.3     921.5   734.8     1033.3   691.4
F3    364.1   562.5     565     538.3     1185     1000.6

Fig. 5 shows the spectra of the different speech signals embedded in white noise. It can be seen that the LP method has low resolution capability and cannot resolve the three peaks in the spectrum, whereas for the same model order the CLP model resolves the peaks nicely. The results are tabulated in Table 1. For comparison, we also estimated the standard deviation of the formant frequencies of the same vowels using the LP model. Standard deviations were measured of the difference between the true formant frequencies and the estimated formant frequencies. A formant was identified as a local maximum in the amplitude spectrum, i.e., the frequency at which the derivative of the all-pole amplitude spectrum changes its sign from positive to negative. Each test consists of 25 trials. The results indicate that the estimation of F1 was more accurate than the estimation of the F2 and F3 frequencies.


Fig. 5: Formant frequency estimation using the LP and CLP models for synthetic vowels embedded in +5 dB white noise (magnitude in dB vs. frequency in Hz): (a) /a/, (b) /o/, (c) /u/.

4. CONCLUSION

A new linear prediction model, called the constrained linear prediction model, based on constraining one of the model parameters of the linear prediction model, has been proposed, and spectral estimation has been performed using this model. Simulation examples show the higher resolution capability of the proposed method when compared with LP spectrum estimation. The CLP model was also tested for formant frequency estimation of vowels embedded in +5 dB white Gaussian noise. It was observed that the estimation of F1 was more accurate than the estimation of F2 and F3.

5. REFERENCES

[1] Steven M. Kay, Modern Spectral Estimation, PTR Prentice Hall Signal Processing Series, 1988.

[2] B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," J. Acoust. Soc. Amer., vol. 50, pp. 637-655, April 1972.

[3] S. McCandless, "An Algorithm for Automatic Formant Extraction using Linear Prediction Spectra," IEEE Trans. Acoust., Speech and Signal Processing, vol. ASSP-22, pp. 135-141, 1974.

[4] R. C. Snell and F. Milinazzo, "Formant Location from LPC Analysis Data," IEEE Trans. Speech, Audio Processing, vol. 1, pp. 129-134, April 1974.

[5] J. Hernando and C. Nadeu, "Linear prediction of one-sided autocorrelation sequence for noisy speech recognition," IEEE Trans. Speech and Audio Processing, vol. 5, no. 1, pp. 80-84, 1997.

[6] C. H. Lee, "On robust linear prediction of speech," IEEE Trans. ASSP, vol. 36, pp. 642-650, 1988.

[7] Akshya K. Swain and Walled Abdulla, "Estimation of LPC Parameters of Speech Signals in Noisy Environment," IEEE TENCON Conference 2004, December 2004, pp. 139-142.

[8] J. Makhoul, "Linear Prediction: A Tutorial Review," IEEE Proceedings, vol. 63, no. 4, pp. 561-580, April 1975.

[9] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, NJ: Prentice-Hall, 1978.

[10] Bin Chen and Philipos C. Loizou, "Formant Frequency Estimation in Noise," IEEE Int. Conf. Acoustics, Speech, and Signal Processing 2004, vol. 1, May 2004, pp. 581-584.

[11] N. Jain and S. Dandapat, "Constrained autoregressive (CAR) model," IEEE INDICON Conference 2005, December 2005, pp. 255-257.

[12] K. M. M. Prabhu and K. Bhoopathy Bagan, "Resolution capability of nonlinear spectral-estimation methods for short data lengths," IEE Proceedings, vol. 136, no. 3, pp. 135-142, June 1989.

[13] D. H. Klatt, "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Amer., vol. 67, pp. 970-995, Mar. 1980.


Speech Synthesis


TEXT TO SPEECH SYNTHESIS SYSTEM FOR MOBILE APPLICATIONS

K. Partha Sarathy, A. G. Ramakrishnan

Department of Electrical Engineering, Indian Institute of Science, Bangalore
[email protected], [email protected]

ABSTRACT

This paper discusses a Text-To-Speech (TTS) synthesis system embedded in a mobile. The TTS system used is a unit selection based concatenative speech synthesizer, where a speech unit is selected from the database based on its phonetic and prosodic context. The speech unit considered in the synthesis is larger than a phone, diphone or syllable; usually the unit is a word or a phrase. While the quality of the synthesized speech has improved significantly with corpus-based TTS technology, there is a practical problem regarding the trade-off between database size and quality of the synthetic speech, especially in the mobile environment. Several speech compression schemes currently used in mobiles today are applied to the database. Speech is synthesized from the input text using the compressed speech in the database, and the intelligibility and naturalness of the synthesized speech are studied. Mobiles contain a speech codec as one of the modules in the baseband processing. The idea of this paper is to propose a methodology to use the speech codec already available in the mobile and read an SMS aloud to the listener, when TTS is embedded in a mobile. Experimental results show the clear feasibility of our idea.

Index Terms— corpus based concatenative TTS, RPE-LTP, CELP, ACELP, Automated book reader

1. INTRODUCTION

State of the art corpus-based text-to-speech (TTS) engines generate speech by concatenating speech segments selected from a large database. In the past few decades, TTS systems have been developed for desktop systems and other platforms, where enough hardware resources are available. Efforts have also been made to build TTS engines for embedded applications [1]. Corpus based TTS systems require more hardware resources for storing the large amount of recorded data, and the synthesized speech quality is better if more data is available during the training phase. However, a TTS integrated in a mobile has limitations on processing and memory, so there is a trade-off between the amount of recorded data to be used and the quality of synthetic speech in the mobile environment.

Our idea is that, given the limited resources available in the mobile environment, the TTS should fit in it without interfering with the other modules. TTS in a mobile can be treated as a Value Added Service (VAS) for the user and is usually given the least preference in terms of processing. This engine is not used periodically and hence need not continuously wait for input in the mobile. As and when an SMS arrives, the user can enable TTS to speak out the text in the message. Care should be taken that the other functionalities in the mobile are not disturbed by the TTS integration. We have applied some of the commercially used mobile speech codecs to the database and found that the TTS output quality is intelligible and natural. It is also worth mentioning Flite [11], developed by CMU for research purposes. We are currently working on integrating our TTS in a commercial GSM mobile, which needs to be tested in real time.

2. OVERVIEW OF OUR TAMIL TTS SYSTEM

We have used TTS for Tamil, a south Indian language, for experimentation. A user or an Optical Character Recognition (OCR) system can give the input text to the TTS engine. TTS involves two modules: a Natural Language Processing (NLP) module [2] and a Signal Processing module. The NLP module involves several tasks, which are related to a diverse set of disciplines. The NLP module takes input in the form of text and outputs a symbolic linguistic representation to the Signal Processing module. The latter outputs the synthesized speech based on its phonetic and prosodic context. The basic unit for speech synthesis is usually a phone, half-phone, diphone, syllable, word, sentence or any such unit. The naturalness of the synthesized speech depends on the unit selection criteria [3]. These criteria are defined in terms of two costs, namely:

- Target cost, which estimates the difference between the specifications of the database unit and the predicted input unit, based on good phonetic and prosodic models.


- Join cost or concatenation cost, which estimates the acoustic discontinuity between two successive joining units.

The unit sequence selected from the database is the one with the lowest overall cost [4]. In our TTS engine, the basic unit is a word. The selection criterion uses only the concatenation cost, since target cost calculation requires a prosody model, which we have not yet developed.

2.1 Database

The database used for our synthesis contains 1027 phonetically rich Tamil sentences with 5149 unique words, giving very good coverage of the units. The size of these wave files is about 295 MB on the hard disk. These sentences were recorded at a 16 kHz sampling rate from a professional Tamil male speaker, and the corresponding wave files have been segmented manually using PRAAT [10], a speech analysis program. Some sample wave files synthesized using the above database are available on the Medical Intelligence and Language Engineering Lab (MILE) website at IISc at the URL http://ragashri.ee.iisc.ernet.in/MILE/index_files/research%20area.html. As we increase the number of phonetically rich sentences in the database, the database size becomes larger and renders the TTS system impractical for embedded applications.

3. SPEECH COMPRESSION SCHEMES

The idea of this work is to use the speech codec already available in the mobile for compressing the above database. Speech codecs used in today's mobiles produce communication quality speech at low decoding complexity. The scheme used in mobiles is lossy compression, and hence the decompressed speech is not exactly the same as the original; however, it is clearly perceivable and hence is described as communications quality speech. In mobile communications, the speech that reaches a mobile is compressed, and the decoder in the mobile decompresses it and plays it out on the speaker. Compressed speech is transmitted over the air interface to reduce the bandwidth of the signal. The human ear cannot perceive compressed speech directly, so the speech must be decoded for understanding. This means that speech codecs are readily available in mobiles, and we want to use these codecs for synthesizing speech from the text in an SMS. Today, in India, more people are using GSM mobiles and hence our research concentrates mostly on GSM codecs. A GSM mobile uses the GSM-FR codec [7], which uses the RPE-LTP compression scheme. GSM-FR compresses a 20 ms speech frame sampled at 16 kHz (320 16-bit samples) to 260 bits, with a bit rate of 13 kbps. FR stands for Full Rate. The other low bit-rate commercial speech codecs used in GSM mobiles are GSM-EFR, AMR 12.2, AMR 10.2, AMR 7.95, AMR 7.4, AMR 6.7, AMR 5.9, AMR 5.15 and AMR 4.75, with bit rates of 12.2, 12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15 and 4.75 kbps, respectively. EFR and AMR stand for Enhanced Full Rate [5] and Adaptive Multi Rate [6], respectively. The GSM-EFR and GSM-AMR codecs are based on code-excited linear prediction (CELP) and Algebraic CELP, respectively. All the above-mentioned codecs are used in our experiments.

4. SPEECH SYNTHESIS

The input to our TTS engine can be the direct text from an SMS. In our synthesis, units are searched based on their left and right contexts in the database. For any unit, many instances may be present in the database. If there are n units in the input text and m1, m2, ..., mn instances of each unit in the database, then m1 x m2 x ... x mn combinations are possible. The join cost is calculated between instances of unit 1 and those of unit 2, instances of unit 2 and those of unit 3, and so on. The best path is selected using a Viterbi search based on the lowest total join cost. The final set of units obtained after the Viterbi search is joined at appropriate positions, such that there is less mismatch at the points of concatenation. Optimum coupling [8] is used for coupling two successive units. The mismatch between two frames (the last frame of the first unit and the first frame of the second unit) is taken as the Euclidean distance of 13 mel-scale cepstral coefficients.
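A minimal sketch of this concatenation-cost Viterbi search (not the authors' code) is shown below. The unit representation is an assumption for illustration: each candidate unit is a dict holding a precomputed 13-coefficient MFCC matrix, which could be obtained, for example, with librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=13).

```python
import numpy as np

def join_cost(unit_a, unit_b):
    """Euclidean distance between the last MFCC frame of unit_a and the first of unit_b."""
    return float(np.linalg.norm(unit_a["mfcc"][-1] - unit_b["mfcc"][0]))

def select_units(candidates):
    """Viterbi search over candidate instances of each word, minimizing total join cost
    (no target cost, as described above). candidates[i] is the list of candidate units
    for the i-th word of the input text; returns the chosen instance index per word."""
    n = len(candidates)
    cost = [np.zeros(len(candidates[0]))]
    back = []
    for i in range(1, n):
        prev, cur = candidates[i - 1], candidates[i]
        c = np.empty(len(cur))
        b = np.empty(len(cur), dtype=int)
        for j, u in enumerate(cur):
            trans = [cost[-1][k] + join_cost(p, u) for k, p in enumerate(prev)]
            b[j] = int(np.argmin(trans))
            c[j] = trans[b[j]]
        cost.append(c)
        back.append(b)
    path = [int(np.argmin(cost[-1]))]        # backtrack the lowest-cost path
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return list(reversed(path))
```

The selected instances are then decoded (if compressed), optimally coupled and smoothed at the concatenation points, as described in the following subsections.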

4.1 Database compression using speech codecs

We applied the compression schemes mentioned in Section 3 to the database of 1027 wave files. The sizes of the database after the different compression schemes are listed in Table 1.

Table 1. Database size for various compression schemes

Compression Scheme Applied    Database size for fs = 16 kHz    Database size for fs = 8 kHz
No compression                295 Mbytes                       152 Mbytes
FR 13                         36 Mbytes                        21 Mbytes
EFR 12.2                      35 Mbytes                        20 Mbytes
AMR 12.2                      35 Mbytes                        20 Mbytes
AMR 10.2                      30 Mbytes                        18 Mbytes
AMR 7.95                      24 Mbytes                        16 Mbytes
AMR 7.4                       24 Mbytes                        15 Mbytes
AMR 6.7                       22 Mbytes                        14 Mbytes
AMR 5.9                       20 Mbytes                        13 Mbytes
AMR 5.15                      18 Mbytes                        12 Mbytes
AMR 4.75                      17 Mbytes                        12 Mbytes


4.2 Speech synthesis using compressed database

The procedure for text-to-speech synthesis remains the same as discussed earlier, with the exception that the units selected from the database are now compressed units. During the final stage of waveform generation, the compressed speech units are decompressed, coupled and smoothed at the concatenation points, and then played out on the speaker. Exact compressed speech frame boundaries must be respected during synthesis; even a single byte of mismatch results in complete degradation of the synthesized speech. This is because all the compression schemes use the basic linear prediction principle, where correlation exists between frames and the filter memories need to be updated continuously. To decode the initial frame of a compressed speech unit picked from the database, the decoding algorithm resets the filter memories; because of this, the error propagates to all frames of the decoded speech unit. To decode this initial frame of the compressed speech unit correctly, the decoder needs the filter memories updated during the decoding of the previous frame; this then results in error-free decoding of the compressed speech unit. A block diagram of the synthesis using the compressed database is shown in Fig. 1.
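The sketch below illustrates this point only conceptually. 'FrameDecoder' is a hypothetical stand-in for whatever GSM FR/EFR/AMR decoder implementation is available (not a real library API); the idea is simply to warm up the decoder's filter memories on the frames that precede the selected unit in its source sentence, instead of decoding the unit from a reset state.

```python
# Hypothetical sketch: FrameDecoder is a placeholder class with a decode(frame) method
# returning PCM samples; it does not correspond to a specific real codec library.
def decode_unit(sentence_frames, start, end, FrameDecoder):
    """Decode frames [start, end) of a stored compressed sentence while first running
    the decoder over the preceding frames, so the unit's first frame is not decoded
    from reset filter memories."""
    dec = FrameDecoder()                   # fresh decoder state (filter memories reset)
    for f in sentence_frames[:start]:      # warm-up: update state only
        dec.decode(f)
    pcm = []
    for f in sentence_frames[start:end]:   # decode the selected unit itself
        pcm.extend(dec.decode(f))
    return pcm
```

In practice, per-frame decoder states could also be stored offline alongside the compressed database so that this warm-up loop is not needed at synthesis time.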

Figure 1. Block diagram of text-to-speech in a mobile (blocks: input text, NLP, signal processing with decoding and waveform generation using the compressed database, output speech).

5. PERCEPTION EXPERIMENTS

The performance of any TTS engine is usually evaluated based on the intelligibility and naturalness of the synthesized speech. We use the mean opinion score (MOS) to assess the quality of our TTS output. Ten Tamil sentences, which are distinct from the 1027 sentences in the database, were synthesized. Wave files sampled at 16 and 8 kHz form the uncompressed databases. The sentences are synthesized using the uncompressed as well as the compressed wave file databases. Each of the 11 compression schemes listed in Table 1 is individually used for synthesis. In total, 220 wave files are synthesized. Five native Tamil listeners were asked to rate these wave files for intelligibility and naturalness. The MOS rating is defined as follows: 5 - Excellent, 4 - Good, 3 - Fair, 2 - Poor, 1 - Bad.

6. RESULTS AND DISCUSSION

The schemes used for experimentation are listed in Table 2. The mean of the ratings given by the five listeners is listed in Table 3 for each sentence.

Table 2. List of compression schemes used in our study.

Scheme A    16 kHz    No compression
Scheme B    8 kHz     No compression
Scheme C    16 kHz    FR
Scheme D    8 kHz     FR
Scheme E    16 kHz    EFR
Scheme F    8 kHz     EFR
Scheme G    16 kHz    AMR 12.2
Scheme H    8 kHz     AMR 12.2
Scheme I    16 kHz    AMR 10.2
Scheme J    8 kHz     AMR 10.2
Scheme K    16 kHz    AMR 7.95
Scheme L    8 kHz     AMR 7.95
Scheme M    16 kHz    AMR 7.4
Scheme N    8 kHz     AMR 7.4
Scheme O    16 kHz    AMR 6.7
Scheme P    8 kHz     AMR 6.7
Scheme Q    16 kHz    AMR 5.9
Scheme R    8 kHz     AMR 5.9
Scheme S    16 kHz    AMR 5.15
Scheme T    8 kHz     AMR 5.15
Scheme U    16 kHz    AMR 4.75
Scheme V    8 kHz     AMR 4.75

Table 3. MOS ratings of the synthesized sentences (mean of 5 listeners).

No/Scheme    A     B     C     D     E     F     G     H
1          3.3   4     4.3   4.7   4.3   4.3   3     3.7
2          3.3   4     4     4     4.3   4     3.7   3.7
3          4     4     4     4     4.3   4     4.3   4
4          3     3     3.3   3.3   3.3   3     3.3   3.3
5          4.7   4.7   4.7   4.7   4.7   4.7   4.7   4.7
6          4     4     3.7   4     4.3   4.7   4.7   4.7
7          3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7
8          4     3.7   3.7   3.7   4     4     4     4
9          4     4     4     4     4.3   4.3   4.3   4.3
10         3.3   3.3   3     3.3   3.7   3.7   3.7   3.7
Average    3.7   3.8   3.8   3.9   4     4     3.9   4

No/Scheme    I     J     K     L     M     N     O     P
1          3     3     3     3.7   4     3.3   3.7   3.7
2          4     3.3   4     4.7   4     4     4     3.7
3          4     4.3   3.7   4.3   4     4     4     3.7
4          3.3   3     3.3   4     4     3.7   3.7   3.3
5          4.7   4.3   4.3   4.7   4.7   4.7   4.7   4.7
6          4.7   4.7   4.7   5     5     5     5     4.7
7          3.7   3.7   3.7   3.7   3.7   3.7   3.7   3.7
8          4.3   4.3   4.3   4.3   4.3   4.7   4.3   4.3
9          4.3   4.3   4.3   4.3   4.3   4.3   4.3   4.3
10         3.7   3.7   3.7   3.7   3.7   3.7   3.7   4
Average    4     3.9   3.9   4.2   4.2   4.1   4.1   4

[Figure 1 block labels: Input, NLP, Signal Processing, Compressed Database, Decode, Waveform generation, Speech]


Table 3 (continued).

No/Scheme    Q     R     S     T     U     V
1          3.3   3.7   4.3   4     4     4
2          4     4     4.7   4     3.7   3.7
3          3.7   3.7   3.7   3.7   3.7   3.7
4          3.7   3.7   4     4     3.7   3.7
5          4.7   4.7   4.7   4.7   5     4.7
6          5     5     5     5     5     5
7          3.7   3.7   3.7   3.7   3.7   3.7
8          4     4     4     4     4     4.3
9          4.3   4.3   4.3   4.3   4.3   4.3
10         4     4     4     4     4     4
Average    4     4.1   4.2   4.1   4.1   4.1

The perception experiments show that the TTS engine produces high quality synthetic speech, even with a highly compressed database. This holds promise for the future, where messages in Indian languages can be read out on the mobile. Good scores have been obtained for GSM FR and EFR, which means that optimized code can readily fit into GSM mobiles. This has a high market potential, since the synthesis quality is very good. However, AMR schemes are usually used when the communication channel is taken into consideration. We have used AMR in our experiments to understand the effect on synthesized speech quality at very high compression rates. Very high compression rates lead to a very low memory requirement for the database. It is interesting to note that the MOS scores of synthesis using compressed data are very high compared to the case when uncompressed data is used. Listeners felt that the synthesized signal generated using compressed data has a relatively smoother envelope. However, listeners also felt that the pitch variation of units is locally good and needs some more modification in the global sense. Local refers to the word level and global refers to a complete sentence or a phrase. Pauses need to be effectively modeled for improving the naturalness of the synthetic speech globally. The combination of a high quality database and robust unit selection has resulted in the good quality of our synthesis.

6. CONCLUSIONS AND FUTURE WORK

When a TTS is integrated in a mobile, ideally there are no constraints on the encoding process. However, the decoding complexity should be minimal, and all commercial codecs used in mobiles satisfy this criterion. A new codec can be designed only for TTS applications in embedded devices [9]. Our intention is not to use additional memory in mobiles for a different codec meant only for TTS. A memory crunch always exists in embedded environments, especially in mobiles. The code size for TTS in our design is of the order of a few kilobytes. The challenge lies in selecting phonetically rich sentences for the database, fitting them into the mobile within the memory requirements and

producing intelligible and natural speech. Efforts are under way to embed our TTS in a mobile and test it in real time. We are also progressing towards making a handheld embedded device called 'Automated Book Reader for the Visually Challenged' for Indian languages, which reads aloud pages from a printed book.

7. REFERENCES

[1] Nobuo Nukaga, Ryota Kamoshida, Kenji Nagamatsu and Yoshinori Kitahara, "Scalable implementation of unit selection based text-to-speech system for embedded solutions", Hitachi Ltd. Central Research Laboratory, Japan.
[2] A. G. Ramakrishnan, Lakshmish N Kaushik, Laxmi Narayana M, "Natural Language Processing for Tamil TTS", Proc. 3rd Language and Technology Conference, Poznan, Poland, October 5-7, 2007.
[3] A. Black and N. Campbell, "Optimizing selection of units from speech databases for concatenative synthesis", Proc. Eurospeech, pp. 581-584, 1995.
[4] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database", Proc. ICASSP, pp. 373-376, 1996.
[5] Digital cellular telecommunications system (Phase 2+) (GSM); Enhanced Full Rate (EFR) speech transcoding (GSM 06.60 version 8.0.1 Release 1999).
[6] Digital cellular telecommunications system (Phase 2+) (GSM); Adaptive Multi-Rate (AMR); Speech processing functions; General description (GSM 06.71 version 7.0.2 Release 1998).
[7] Digital cellular telecommunications system (Phase 2); Full rate speech; Part 2: transcoding (GSM 06.10 version 4.3.0 GSM Phase 2).
[8] S. Isard and A. D. Conkie, "Optimum coupling of diphones", chapter in Progress in Speech Synthesis, Wiley, 2002.
[9] Chang-Heon Lee, Sung-Kyo Jung and Hong-Goo Kang, "Applying a Speaker-Dependent Speech Compression Technique to Concatenative TTS Synthesizers", IEEE Trans. Audio, Speech Lang. Proc., Vol. 15, No. 2, Feb. 2007.
[10] PRAAT: a tool for phonetic analyses and sound manipulations, by Boersma and Weenink, 1992-2001. www.praat.org
[11] "Flite: a small, fast speech synthesis engine", Edition 1.3, for Flite version 1.3, by Alan W. Black and Kevin A. Lenzo, Speech Group at Carnegie Mellon University.


SINUSOIDAL ANALYSIS AND MUSIC INSPIRED FILTER BANK FOR TRAINING FREE SPEECH SEGMENTATION FOR TTS

Ranjani H. G.1, Ananthakrishnan G.2, A. G. Ramakrishnan3

Dept. of Electrical Engineering, Indian Institute of Science, Bangalore – 560012

ABSTRACT

The major overhead involved in porting a TTS system from one Indian language to another is in the manual segmentation phase. Manual segmentation of any huge speech corpus is time consuming, tedious and dependent on the person who is segmenting. This correspondence aims at reducing this overhead by automating the segmentation process. A 3-stage, explicit segmentation algorithm is proposed, which uses Quatieri's sinusoidal model of speech in conjunction with a distance function obtained from the Bach scale filter bank, to force align the boundaries of phonemes between 2 stop consonants. Preliminary results for Hindi and Tamil sentences show that the misclassified frames (25 ms) per sentence, or the Frame Error Rate (FER), is 26.7% and 32.6% respectively.

Index Terms— TTS, Phonetic transcription, Sinusoidal model of speech, Bach scale filter bank, Frame Error Rate

1. INTRODUCTION

Many areas of speech research and development take advantage of automatic learning techniques that rely on large segmented and labeled corpora. In speech synthesis, segmentation and annotation of the speech at the phonemic level has become a standard requirement. These corpora are required to train models for recognition as well as to synthesize speech. Text to speech (TTS) systems, in particular, use this synthesis process to create a new concatenation based unit inventory and also for prosody modeling. Hence, the availability of speech data annotated at the phonemic level is crucial in the field of speech technology.

TTS systems for Indian languages create standard speech corpora by recording utterances of a large number of sentences covering various acoustic and phonetic contexts from a single speaker. The phonetic transcription is available in a TTS corpus as the output of a grapheme to phoneme (G2P) converter. Thus, the phonetic labels of the speech to be segmented are obtained from a priori knowledge of the phonetic transcription. It is only required to align the boundaries of these phonetic transcriptions to their corresponding speech utterances. It is to be noted that this is an automated explicit segmentation technique which has a priori knowledge of the phonetic transcription and hence will not result in "inserted" and "deleted" boundaries.

Automated explicit segmentation using context dependent phone based HMMs (CDHMM) gives good boundary accuracy [8]. Neural network trees with a known number of sub-word units are used for segmentation [9]. However, the need to develop TTS in multiple languages and the non-availability of large speech corpora for Indian languages are the major constraints which limit the use of segmentation techniques that are training based. The proposed method is a training free method and gives accuracy comparable with that of the methods described in the literature.

Segmentation using the Bach scale filter bank has the advantage of being a language independent and training free algorithm [1], [2]. Section 2 summarizes the above algorithm. Section 3 proposes Quatieri’s model for segmenting voiced and unvoiced regions of speech. Section 4 shows the results obtained by combining the above two methods to force align boundaries for segmenting speech at phoneme level. Section 5 concludes the paper and discusses future work.

2. SEGMENTATION USING BACH SCALE FILTER BANK

In this method, speech is treated as a non-stationary signal. A constant Q filter bank motivated by the perception of music is formulated. This filter bank has 12 filters in an octave, with the centers of successive filters separated by a ratio of 2^(1/12).

The speech signal (Fs = 16 kHz) is passed through this filter bank and the output of the banks at an instant of time is treated as a feature vector. Here, speech is not passed to the filter bank as short segments (frames), as is done under the quasi-stationary assumption, and thus we get feature vectors for every instant of time. Now, the mean of the log of the feature vectors in successive 15 ms windows is taken and the Euclidean distance between them is calculated. This is referred to as the Euclidean Distance of Mean of Log of feature vectors, or EDML [3]. Viewed as a 2-class problem, the distance between the means is maximum when the feature vectors in the adjacent windows belong to different phoneme classes. Fig. 1 shows the contour of the Euclidean distance and the actual boundaries for a part of a speech utterance. Table 1 gives the performance of the method for the English, Hindi and Tamil databases.
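Below is a rough sketch of the EDML computation under stated assumptions: the Bach scale filter bank is approximated here with second order Butterworth band-pass filters spaced at a ratio of 2^(1/12), and the frequency range is an assumption of this sketch rather than the authors' exact filter design.

import numpy as np
from scipy.signal import butter, lfilter

def bach_scale_centres(f_low=110.0, f_high=7000.0):
    # centre frequencies spaced by a ratio of 2**(1/12), as described above
    centres, f = [], f_low
    while f < f_high:
        centres.append(f)
        f *= 2 ** (1.0 / 12.0)
    return np.array(centres)

def filter_bank_outputs(x, fs):
    # per-sample magnitude outputs of the constant-Q filter bank approximation
    feats = []
    for fc in bach_scale_centres():
        lo, hi = fc * 2 ** (-1.0 / 24.0), fc * 2 ** (1.0 / 24.0)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        feats.append(np.abs(lfilter(b, a, x)))
    return np.array(feats).T              # shape: (num_samples, num_filters)

def edml(x, fs, win_ms=15):
    # Euclidean Distance of Mean of Log of feature vectors (EDML)
    feats = np.log(filter_bank_outputs(x, fs) + 1e-10)
    win = int(fs * win_ms / 1000)
    means = np.array([feats[i:i + win].mean(axis=0)
                      for i in range(0, len(feats) - win, win)])
    # distance between means of successive windows; peaks suggest phoneme boundaries
    return np.linalg.norm(np.diff(means, axis=0), axis=1)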

A Matched Phoneme Boundary (MPB) is an automated phoneme boundary which falls within 20 ms of a manual boundary. If more than one automated segment boundary falls within ±20 ms of a manual boundary, or no manual boundary is found within ±40 ms of an automated boundary, then such boundaries are considered to be 'insertions'. Similarly, if no automated boundary is found within ±40 ms of a manual boundary, then it is considered as a 'deletion'.
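The sketch below is one possible reading of the MPB/insertion/deletion rules above; boundary times are in seconds, and ties are resolved in the simplest way, which may differ from the authors' scoring script.

def score_boundaries(manual, automated, tol_match=0.020, tol_miss=0.040):
    matched = insertions = deletions = 0
    for m in manual:
        near = [a for a in automated if abs(a - m) <= tol_match]
        if len(near) >= 1:
            matched += 1
            insertions += len(near) - 1   # extra automated boundaries near one manual boundary
        elif not any(abs(a - m) <= tol_miss for a in automated):
            deletions += 1                # no automated boundary within 40 ms
    for a in automated:
        if not any(abs(a - m) <= tol_miss for m in manual):
            insertions += 1               # automated boundary far from every manual boundary
    return matched, insertions, deletions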

{ranjani1,ramkiag3}@ee.iisc.ernet.in, ggananth[email protected]


Fig. 1. The plot of EDML against time for a portion of Hindi utterance (“kannarphilm”). The vertical lines denote the actual phone boundaries.

TABLE 1. Segmentation performance of EDML using Bach scale filter-bank for various languages.

Language           %MPB   %Deletions   %Insertions
English (TIMIT)    82.5   22.3         18.9
Hindi              86.6    3.2         21.4
Tamil              81.9   15.3         23.7

3. QUATIERI’S SINUSOIDAL MODEL FOR SPEECH SEGMENTATION

Quatieri proposed a sinusoidal model for speech analysis/synthesis [4]. As per this model, speech is composed of a collection of sinusoidal components of arbitrary amplitudes, frequencies and phases, and it results in an analysis/synthesis system based on explicit sine wave estimation. The first part involves a sinusoidal analyzer, in which the amplitude, phase and frequency parameters are extracted from the speech signal using the STFT. The sinusoidal model of the speech signal is expressed as follows:

$$ s(n) = \mathrm{Re}\left\{ \sum_{k=1}^{K} A_k(n)\, e^{j\theta_k(n)} \right\} $$

where $A_k(n)$ and $\theta_k(n)$ denote the time-varying amplitude and phase of the k-th sine wave.

The next stage involves associating the parameters obtained by the analyzer on a frame-to-frame basis so as to define a set of sine waves that continuously evolve in time. To account for the unequal number of sine components from frame to frame, the birth and death of these frequency tracks can be noted. Voiced regions give longer-duration frequency tracks, and it can be seen that these sinusoids are harmonic in nature, while unvoiced regions give short-duration frequency tracks that fluctuate rapidly. This can be explained by Karhunen-Loève analysis, according to which unvoiced signals can only be sufficiently modeled by a very large number of sinusoids. Peaks in unvoiced regions result in many short-duration, rapidly fluctuating tracks. Applying the sinusoidal model to a speech signal, we find that the long-duration frequency tracks indicate the start and end of the voiced components of speech. This is in line with Quatieri's model and is shown in Fig. 2.

Sainath and Hazen [5] use this sinusoidal model for a segment-based speech recognizer. The word error rate (WER) using this model was shown to degrade gracefully in the presence of noise.

Fig. 2. A portion of TIMIT speech (“She had her dark”) waveform and its sinusoidal model

By giving weights to the number of births and deaths of sinusoidal components at every frame, along with the duration of each track, the voiced and unvoiced regions can be segmented easily.
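A simplified sketch of this birth/death bookkeeping is given below. The peak picking, the frequency deviation allowed for continuing a track, and the relative height threshold are illustrative assumptions, not the model parameters used in the paper.

import numpy as np
from scipy.signal import stft, find_peaks

def sine_tracks(x, fs, max_dev_hz=50.0):
    f, t, Z = stft(x, fs, nperseg=512, noverlap=384)
    mag = np.abs(Z)
    tracks = []          # each track: list of (frame_index, frequency)
    active = []          # tracks still alive at the previous frame
    for j in range(mag.shape[1]):
        peaks, _ = find_peaks(mag[:, j], height=mag[:, j].max() * 0.05)
        freqs = list(f[peaks])
        new_active = []
        for tr in active:
            last_f = tr[-1][1]
            cand = [fr for fr in freqs if abs(fr - last_f) < max_dev_hz]
            if cand:                      # continue the track into this frame
                fr = min(cand, key=lambda v: abs(v - last_f))
                tr.append((j, fr))
                freqs.remove(fr)
                new_active.append(tr)
            else:
                tracks.append(tr)         # track dies
        for fr in freqs:                  # remaining peaks start new tracks (births)
            new_active.append([(j, fr)])
        active = new_active
    # long tracks suggest voiced frames; many short tracks suggest unvoiced frames
    return tracks + active, t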

Results obtained by segmenting speech using the sinusoidal model are tabulated in Table 2.

TABLE 2. Comparison of performance of sinusoidal model for different languages

Language           Matched v/uv boundaries (40 ms) %   Deleted boundaries %   Inserted boundaries %
English (TIMIT)    60.2                                39.9                   44.7
Hindi              68.8                                31.2                   50.5
Tamil              65.07                               34.92                  48.34

4. FORCED ALIGNMENT USING THE ABOVE ALGORITHMS

A 3-stage segmentation technique is proposed. The first stage detects stop consonants. The next stage segments the speech waveform between 2 stop consonants into voiced or unvoiced regions using the sinusoidal model. The final stage further segments these voiced (or unvoiced) regions at the phonemic level using the EDML of the Bach scale filter bank. A block diagram of the 3-stage algorithm is given in Fig. 3.

The major disadvantage of forcing boundary alignments on the entire speech waveform is that the boundary error gets accumulated. Hence, boundaries can instead be forced for all the phonemes between 2 occurrences of a phoneme class, so that a boundary error occurring at the start of the speech waveform does not propagate to the end of it. The first stage therefore involves detection of one phoneme class, armed with the phonetic transcription, but constrained to be a training free algorithm.

Fricatives, vowels, nasals, diphthongs, nasal vowels and glides need some stored form of features to be able to detect them. Also, different phonemes in each class require different features. However, stop consonants can be efficiently detected without storing any features, as described below.


Fig. 3. Block diagram of the 3 stage algorithm for speech segmentation

Predominantly, the first frame (10 ms) of any speech sentence is silence. The speech signal is first high pass filtered with a cutoff frequency of 400 Hz (the voice bar of voiced stops extends till roughly 400 Hz). Now, to detect stop consonants, MFCCs of all the frames in the sentence are computed. Then the Euclidean distance between the first frame and every frame is taken. If this distance drops below a threshold value for a minimum of 3 frames (the minimum duration of a stop consonant can be taken to be roughly 30 ms), then the algorithm decides that the region below the threshold approximately contains a stop consonant, a silence region, or a combination of both. The frame within this region having the minimum distance from the first silence frame is a sure stop consonant (or silence, or both) frame. Experiments give a stop consonant detection accuracy of 87% with 20% insertions for 100 sentences from the Hindi database.
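A hedged sketch of this stop/silence detection step follows. The MFCC settings, the distance threshold and the use of librosa are assumptions of this sketch; only the overall procedure (400 Hz high-pass filtering, distance from the first frame, a run of at least 3 frames below the threshold, and the minimum-distance frame as the sure stop/silence frame) follows the description above.

import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def stop_candidates(x, fs=16000, threshold=25.0, min_frames=3):
    b, a = butter(4, 400.0 / (fs / 2), btype="high")
    xh = filtfilt(b, a, x)
    # 10 ms hop to match the frame size quoted in the text
    mfcc = librosa.feature.mfcc(y=xh, sr=fs, n_mfcc=13,
                                n_fft=int(0.025 * fs),
                                hop_length=int(0.010 * fs)).T
    d = np.linalg.norm(mfcc - mfcc[0], axis=1)        # distance from the first (silent) frame
    below = d < threshold
    regions, start = [], None
    for i, flag in enumerate(np.append(below, False)):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_frames:
                seg = np.arange(start, i)
                regions.append(int(seg[np.argmin(d[seg])]))  # surest stop/silence frame
            start = None
    return regions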

Fig. 4. Portion of a Hindi speech utterance – “or UnkEtatkal”. The utterance has 2 silence regions (at the start and end of “or”) and 4 stop consonants (of which 2 are contiguous i.e., ‘tk’). The vertical lines denote the start of the frame classified as sure stop consonant by the stop consonant detection algorithm.

The number of silence regions, comprising actual silence regions (between words) and the closure regions of the stop consonants, can be known from the phonetic transcription. Using this, the number of silence regions to be detected can be forced. In this case, the equal error rate (i.e., the point at which the number of insertions equals the number of deletions) is 11.3% for 100 files in Hindi. For 50 sentences of the TIMIT database, the equal error rate is 15%. This is comparable to the stop detection accuracies reported in the literature [6], [7].

The rest of the discussion assumes error free stop consonant detection and attempts segmentation of individual phonemes between 2 stop consonants.

[Figure 3 block labels: Speech signal, Phonetic transcription, Stop consonant detection, Algorithm using sinusoidal model for V/UV region boundaries between 2 stop consonants, Algorithm using Bach-scale filter bank for V-V and UV-UV boundaries, Automated boundaries]

Fig. 5. Waveform of Hindi speech utterance /D/,/m/,/En/,/p/ and its corresponding boundaries.

The phonemes are classified as either voiced or unvoiced phonemes. Unvoiced phonemes are stop consonants and fricatives, while all the other phonemes are considered voiced. So, there are regions of unvoiced and voiced speech. Between 2 stop consonants, the number of voiced to unvoiced transitions and vice versa can be counted, and the required number of transitions can be detected using Quatieri's speech model as explained in section 3. The output of this stage is the set of boundaries of voiced and unvoiced regions between 2 stop consonant frames. Fig. 6 shows the output of stage 2 for the waveform in Fig. 5.

Fig. 6. Sinusoidal analysis of the speech waveform. The vertical lines show the automated boundaries for transitions from unvoiced to voiced regions and vice versa.

The final stage further segments the voiced (unvoiced) regions into the required number of phonemes (for example, a voiced region may have a vowel followed by a glide and then a nasal). So, the number (N) of required boundaries is known, and the best N such boundaries are chosen from the EDML function of the Bach scale filter bank.

Fig. 7 shows the EDML contour for a voiced portion (which contains a nasal followed by a nasal vowel). It can be seen that the distance function peaks at the nasal to nasal vowel transition.

All the above boundaries are combined to form the automated boundaries, shown in Fig 8.

5. RESULTS

The algorithm has been tested on 50 sentences each from the TIMIT, Hindi and Tamil databases. The performance of the proposed algorithm, i.e., the percentage of boundaries which lie within a tolerance region of 25 ms for 50 sentences of the TIMIT database, is compared with other methods proposed in the literature in Table 4.

Another performance measure for segmenting by force aligning boundaries is the Frame Error Rate (FER). It is the ratio


of the number of misclassified frames to the total number of frames in a sentence. Table 5 tabulates the results for the proposed method, assuming error free detection of stop consonants.
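A small sketch of the FER computation as defined above; it assumes each segmentation is given as phone start times plus a common label sequence (reasonable here, since the transcription is known a priori), which is an assumption of this sketch.

def frame_error_rate(manual_bounds, auto_bounds, labels, total_dur, frame=0.025):
    def label_at(t, bounds):
        # bounds are the start times of successive phones, in seconds
        idx = sum(1 for b in bounds if b <= t) - 1
        return labels[max(idx, 0)]
    n_frames = int(total_dur / frame)
    errors = sum(
        label_at(i * frame, manual_bounds) != label_at(i * frame, auto_bounds)
        for i in range(n_frames)
    )
    return errors / n_frames   # fraction of 25 ms frames whose phone labels disagree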

Fig. 7. The voiced portion of speech waveform containing /m/ (nasal) followed by /En/ (nasal vowel) with its corresponding EDML function. The vertical line denotes the automated boundary.

Fig. 8. Speech waveform with the manual boundaries and automated boundaries.

TABLE 4. Comparison of segmentation performances on TIMIT database: Algorithms in the literature as against the proposed one.

Segmentation algorithm       Matched phoneme boundary
NTN [9]                      66.6%
HMM [8]                      65.7%
Dynamic Programming [8]      70.9%
Proposed method              62.4%

TABLE 5. Frame Error Rate between 2 actual stop consonants of the proposed method for different languages

Language           FER
English (TIMIT)    40.7%
Hindi              26.75%
Tamil              32.6%

6. CONCLUSIONS

The proposed method is language independent and is also training free. Its accuracy is at least comparable to, if not better than, those of the methods that use training algorithms. A final round of manual intervention is still required. However, this manual intervention is now less tedious and less time consuming.

It can be seen that the accuracy of the sinusoidal model is very low. The sinusoidal analysis can also be viewed as a narrow band spectrogram; hence the large window size results in poor time resolution, accounting for the low time accuracy of the boundaries. Also, the sinusoidal model, though language independent, needs to be fine tuned for different languages. In other words, the threshold for the starting and ending of a sinusoidal track, the maximum magnitude increase before the birth of a new track, and the minimum length for a track to exist need to be changed for effective results. The low accuracy of the method for the TIMIT database is explained by the fact that the database comprises different speakers and the model needs to be fine tuned.

Future work can focus on improving the stop consonant detection algorithm for higher accuracy and a lower insertion rate. Also, depending on the phoneme transition to be detected, the effect of a variable size window for the Euclidean distance using the Bach scale filter bank can be tested. The intuitive notion that detection of some more phoneme classes can improve the accuracy of automated segmentation needs to be verified.

7. REFERENCES

[1] G. Ananthakrishnan, H. G. Ranjani, and A. G. Ramakrishnan, "Language Independent Automated Segmentation of Speech using Bach scale filter-banks", Proc. ICISIP, Dec. 2006, pp. 115-120.

[2] G. Ananthakrishnan, H. G. Ranjani, and A. G. Ramakrishnan, "Comparative Study of Filter-Bank Mean-Energy Distance for Automated Segmentation of Speech Signals", Proc. ICSCN, Feb. 2007, pp. 06-10.

[3] G. Ananthakrishnan, "Music and Speech Analysis Using the 'Bach' Scale Filter-bank", M.Sc (Engg) thesis, Indian Institute of Science, Apr. 2007.

[4] R. J. McAulay and T. F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation", IEEE Trans. ASSP, vol. 34, issue 4, Aug. 1986, pp. 744-754.

[5] T. N. Sainath and T. J. Hazen, "A sinusoidal model approach to acoustic landmark detection and segmentation for robust segment-based speech recognition", Proc. ICASSP 2006.

[6] F. Malbos, M. Baudry and S. Montresor, "Detection of stop consonants with the wavelet transform", Proc. IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, Oct. 1994, pp. 612-615.

[7] P. Niyogi and M. M. Sondhi, "Detecting stop consonants in continuous speech", JASA, Feb. 2002, pp. 1063-1075.

[8] Abhinav Sethy, Shrikanth Narayanan, "Refined Speech Segmentation for Concatenative Speech Synthesis", Proc. ICSLP-2002, pp. 149-152.

[9] M. Sharma, R. Mammone, "Automatic speech segmentation using neural tree networks", Neural Networks for Signal Processing, Proc. IEEE Workshop, Sep. 1995, pp. 282-290.

[10] L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition", Pearson Education Press, 1993 edition (AT&T).


NOVEL SCHEME TO HANDLE THE MISSING PHONETIC CONTEXTS IN SPEECH SYNTHESIS, BASED ON HUMAN PERCEPTION AND SPEECH

Laxmi Narayana M and A G Ramakrishnan,

Department of Electrical Engineering, Indian Institute of Science, Bangalore.

ABSTRACT

We report our efforts in handling situations in text to speech synthesis where a particular phonemic or syllabic context is not available in the corpus. The idea is to replace the missing phone with another phone that is perceived as similar in that context, exploiting the inability of listeners to distinguish them when placed in a particular context. Such phones were found linguistically in two south Indian languages, Tamil and Telugu, by performing listening tests, and acoustically, through a phone classification experiment with Mel Frequency Cepstral Coefficients as features. A maximum likelihood classifier is used to find the most misrecognized phones. Both frame level and phone level classifications are performed to find out such phones. The classification experiments are performed on a Tamil corpus of 1027 sentences. The natural variability in human speech is studied by analyzing utterances of the same speech at different times. We observe that, not only do the characteristics of phones change when the same sentence is spoken at different times, but the phones also get replaced by other phones, for the same speaker.

1. INTRODUCTION

Text to Speech (TTS) synthesis is an automated encoding process which converts a sequence of symbols (text) conveying linguistic information into an acoustic waveform (speech). A concatenative speech synthesis system uses actual human speech as the source material for synthesizing speech. One of the characteristics based on which a TTS system is evaluated is its ability to produce intelligible speech. The intelligibility of the synthetic speech depends on the selection of relevant syllables for concatenation which match the target context. Even though the speech corpus covers all the phones in the language under consideration, it may not have all the phonetic contexts. Using individual mono-phones for concatenation results in discontinuities of pitch and energy and a lack of coarticulation, leading to unnatural speech. Speech synthesis based on syllables seems to be a good possibility to enhance the quality of synthesized speech compared to mono-phone or diphone based synthesizers. This consideration is based both on the fact that more coarticulation aspects are included in syllable segments compared to diphone units, and on the fact that the main prosodic parameters (pitch, duration, amplitude) are closely connected to syllables [3]. So, not only is the presence of a phone in the database important, but the syllable in which the phone is present and the context in which the phone or syllable occurs are also important. Mono-phones are considered for concatenation only in the worst case.

2. GOAL OF THE WORK

The goal is to identify the phones whose perception is more or less similar, i.e., a phone which, when replaced by another phone in that particular context, should not make much difference in perception; the listener should not be able to distinguish them. The knowledge of these phones can be used in synthesis. Section 3 further presents our motivations for conducting this kind of experiment. Section 4 describes the phone perception experiments carried out over the telephone in the languages Tamil, Telugu and English and the corresponding results. Section 5 describes the frame and phone level classification experiments performed on the Tamil database. Mel Frequency Cepstral Coefficients (MFCC) are used as features with a Maximum Likelihood (ML) classifier for classification. Results are discussed in Section 6. Section 7 presents the conclusion.

3. MOTIVATION

There are 12 vowels and 18 consonants in Tamil. There are five other phones introduced for representing Sanskrit. The language has certain well defined rules which introduce seven other phones depending on the presence of consonants with respect to the vowels or the other consonants. Hence there are 42 phones in the language. If we consider phonetic contexts, any one of the 42 phones could occur between any two phones, so there are 42^3 contexts for each phone. If we take the combination of a vowel and a consonant as a syllable (for example), then we get around 216 syllables, each of which can occur between any two syllables. So for a syllable, there are 216^3 possible contexts for its occurrence. All of them may not be valid, but the issue is that, practically, for any corpus, it is not possible to cover all such phonetic contexts. So, during synthesis, if a 'syllable in a particular phonetic context' is not available in the inventory, another

[email protected] , [email protected]


syllable, by whose substitution the listener may not notice any difference in perception, can be used for concatenation.

In continuous speech, a listener may not pay attention to each and every phone the speaker speaks. While speaking on the telephone, sometimes the person on the other side, who naturally never listens to each and every phone, may not exactly recognize all the words we speak. Sometimes his prior knowledge of the words and the context makes him understand our speech, or we might have to repeat some words or syllables, even though the phone conversation takes place in a less noisy environment. Further, when a speaker utters the same sentence at different times, sometimes some phones may be missed or replaced by other phones; but still the listener can make out the sentence. The knowledge of such natural replacement of phones in human speech can also be used to handle the missing phonetic contexts. The present paper reports the perception experiments and the phone classification experiments conducted to find out such phones. The results are used in our Tamil TTS system. The Tamil database under consideration contains 1027 sentences from a single male speaker, sampled at 16 kHz, which are segmented and labeled manually using the Praat software. Though the database is phonetically rich, as mentioned before, it may not contain all the contexts, and this is the motivation for this experiment.

4. PERCEPTION EXPERIMENTS

Listening experiments are conducted over the telephone to capture the confusable phones in the Tamil language. One person calls the other person and pronounces a list of 152 phones/syllables (combinations of consonant and vowel) in Tamil, and the person on the other side writes down whatever phones she/he hears the first time. Repetition of phones by the speaker is not allowed. Individual phones are chosen to find the exact confusion between phones; if words are chosen, a listener who has prior knowledge of the word writes the word correctly even though he may not have listened properly or the word is not pronounced properly. This does not serve the purpose.

The experiment is conducted with 10 pairs of native Tamil people. On an average, 30% of the phones are wrongly identified as other phones. Another set of experiments is conducted over 2 pairs on the misrecognized phones. Not much improvement in recognition accuracy is identified. A consistency is found in the misidentification over the speaker-listener pairs. Most of the nasals are wrongly identified as other nasals. Many of the long vowels (deergha phones, like /A/ as in 'call') are identified as the corresponding short vowels (hrasva phones, like /a/). There are two kinds of /r/ phones in Tamil, /r/ and /R/, and they are misrecognized for each other. The three types of /l/, namely /l/, /L/ and /zh/, are confused among themselves. Many times, vowels like /i/, /u/ are identified as a combination of a

consonant and a vowel, such as /yi/ and /wu/. However, misrecognition across the broad classes of vowels, nasals, fricatives and glides is relatively less compared to the confusions within each class. The consistently misrecognized phones are treated as perceptually similar phones which can be replaced by one another. The entire set of phones and the consistently and frequently misrecognized phones are listed in Figure 1 and Table 1 respectively. Table 2 gives one example Tamil word each for the uncommon phones in Table 1.


Figure 1: List of Tamil Phones used

Table 1: Most confused phones in Tamil. The pairs shown in bold are common misrecognitions between Telugu and Tamil. /ng/, /ny/, /N/, /n/ are the respective nasals of the /k/, /ch/, /T/, /t/ groups.

ng - n     A - a     i - yi     R - r
ny - n     I - i     u - wu     ng - ny
N - n      U - u     L - l      S - s
L - zh

Table 2: Example Tamil words (in Tamil script) for the uncommon phones /ng/, /ny/, /n/, /N/, /r/, /R/, /L/ and /zh/ of Table 1.

Table 3: Phones identified consistently and frequently for one another in Telugu over 8 speaker-listener pairs.

d - dh     R - r        ng - ny
s - S      Ksha - cha   T - p

A similar experiment was conducted on the phones of Telugu over 8 speaker-listener pairs with 52 phones/syllables: 5 pairs over the telephone and 3 pairs sitting some distance apart. In the latter case, one person sits at one place and loudly utters the phone/syllable and the other one listens and writes. This did not show any difference in the recognition accuracy. It was found that the recognition depends on the clarity of pronunciation of the speaker and the attentiveness of the listener. The most frequently and consistently misidentified phones in Telugu are shown in Table 3. The misidentification rate in Telugu is less compared to Tamil. On an average, 10 phones out of 52 were misrecognized, which comes to about 20% misidentification, whereas it is around 30% in Tamil. This is because the number of phones in Telugu is more compared to that of Tamil; Telugu people are habituated to pronouncing and listening to a larger number of phones, which is not the case with Tamil people. Nevertheless, some similarity is found between the misidentified phones in the two languages. The common misidentifications are shown in bold in Tables 1 and 3.


5. NATURAL VARIATIONS IN SPEECH

Data collection: Ten sentences from the Tamil corpus are selected and 8 native Tamil people are asked to speak the 10 sentences at 10 different times over a period of 3 days. The time gap between two recordings of one speaker is at least 3 hours. The sampling rate is 16 kHz, and the recordings are manually segmented and labeled using the Praat software.

Variation in phones: When a speaker speaks the same sentence/phrase/word several times, we also observe variations in the phones uttered. Sometimes some phones are missed or some phones are replaced by other phones. The words of one sentence spoken by one speaker at 10 different times are listed in Figure 2, with the changed phones shown in bold. According to the rules of the Tamil language, the two words shown in Figure 2 should be spoken as n i ml a d i y A g a w u m and p A d u g A pl u D a n u m. But as can be seen from the figure, sometimes /g/ is spoken as /h/ and sometimes as /k/.

So, this observation suggests that replacing the unavailable phones in some contexts with the corresponding confused phones not only avoids discontinuities, but also induces naturalness in the synthetic speech.

nimladiyAgawum # pAdugApluDanum
nimladiyAhawum # pAdugApluDanum
nimladiyAkawum # pAdugApluDanum
nimladiyAkawum # pAdukApluDanum
nimladiyAgawum # pAdugApluDanum
nimladiyAgawum # pAdugApluDanum
nimladiyAgawum # pAdugApluDanum
nimladiyAgawum # pAdugApluDanum
nimladiyAgawum # pAdukApluDanum
nimladiyAgawum # pAdugApluDanum

Figure 2: Variation in pronunciation of phones when same wordsare spoken 10 times.

6. PHONE CLASSIFICATION EXPERIMENT

After identifying the phones which are wrongly recognized as other phones, we took the next step of classifying the phones using a Maximum Likelihood classifier. The Tamil phones from our Tamil corpus are classified.

6.1. Feature Extraction - Training & Testing

The traditional filter-bank approach [4] is followed for extracting Mel Frequency Cepstral Coefficients (MFCCs) from the speech signal. The process is very briefly presented here. Each 20 ms frame is represented by a 12-dimensional acoustic vector. The training data is converted to frame level data, and .feat files which store the MFCC vectors of all the frames of the corresponding phone are created. The mean and covariance are obtained for all the .feat files.

Two types of classification are performed: frame level and phone level. In the former case, a single 20 ms frame (a 12-dimensional acoustic vector) is classified to one of the 48 (Tamil) phone classes using the ML classifier [1]. In the latter case, the mean vector of all the MFCC vectors belonging to one phone is classified using the ML classifier. The idea behind doing this is to represent a phone with a single acoustic vector. In the case of frame level classification, a single frame does not represent a phone. Of course, frame level classification is the method mentioned in the literature to test the efficiency of a classifier, but the focus of the present experiment is not to design a robust classifier, but to find the confusability of phones so that the phones can be used interchangeably in some contexts. So phone level classification is also done.
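A minimal sketch of this ML classification, assuming one full-covariance Gaussian per phone class fitted to the 12-dimensional MFCC vectors; scipy is used here for convenience and is not implied by the paper.

import numpy as np
from scipy.stats import multivariate_normal

def train_ml(features_by_class):
    # features_by_class: dict mapping phone -> (N, 12) array of MFCC vectors
    models = {}
    for phone, X in features_by_class.items():
        models[phone] = multivariate_normal(mean=X.mean(axis=0),
                                            cov=np.cov(X, rowvar=False),
                                            allow_singular=True)
    return models

def classify(models, x):
    # x: a 12-dim MFCC vector (frame level) or the mean vector of a phone (phone level)
    return max(models, key=lambda phone: models[phone].logpdf(x))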

7. RESULTS AND DISCUSSION

Phones are classified using the full covariance matrix and the diagonal covariance matrix. The classification accuracy obtained in the former case is better compared to that of the latter. The experiments are carried out for different sizes of training and test data, and the phones that are misclassified are noted down. The results of the phone level classification are presented in Tables 4 and 5.

Table 4: Phone level classification results on the Tamil corpus with full covariance matrix. Number of sentences used for testing: 100. Variable: average number of feature vectors (FV) per class.

S.No   Training data size   Avg. No. of FV per class   BCCA   Accuracy
1      100                  6295                       61%    47%
2      200                  6938                       65%    49%
3      400                  8648                       72%    53%
4      700                  10603                      74%    53%

Table 5: Phone level classification results on the Tamil corpus with full covariance matrix. Number of sentences used for training: 700. Variable: number of sentences used for testing.

S.No   Test data size   BCCA     Accuracy
1      50               73.5%    52%
2      200              72.6%    51.2%
3      400              71.73%   50.5%

7.1. Classification Accuracy

There are 6 broad categories of phones in Tamil: Vowels (a, A, i, I, u, U, e, E, ae, o, O), Semivowels & Glides (y, r, R, l, ll, wl, Ll, L, zh, w, yl), Stops (k, T, t, p, b, kl, Tl, tl, pl, g, D, d, TR), Affricates (cl, j), Fricatives (S, s, h) and Nasals (m, n, ng, ny, N, nl, ml, Nl). BCCA (Broad Class Classification Accuracy) is the accuracy of correctly classifying a phone to its major category. For example, if a vowel is identified as a vowel, a nasal as a nasal and so on, the classification is considered to be accurate. The overall accuracy in the fifth column of Table 4 is the accuracy of classifying a phone to its true class. Both of them are found to increase with the training data size. When the training data size is kept constant and the test data size is varied, a slight decline in the accuracies with increasing test data size is observed.
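The two accuracies can be contrasted with a small sketch like the one below; the broad-class map follows the categories listed above, while the helper names are assumptions of this sketch.

BROAD = {}
for cls, phones in {
    "vowel": ["a", "A", "i", "I", "u", "U", "e", "E", "ae", "o", "O"],
    "semivowel_glide": ["y", "r", "R", "l", "ll", "wl", "Ll", "L", "zh", "w", "yl"],
    "stop": ["k", "T", "t", "p", "b", "kl", "Tl", "tl", "pl", "g", "D", "d", "TR"],
    "affricate": ["cl", "j"],
    "fricative": ["S", "s", "h"],
    "nasal": ["m", "n", "ng", "ny", "N", "nl", "ml", "Nl"],
}.items():
    BROAD.update({p: cls for p in phones})

def accuracies(true_phones, predicted_phones):
    pairs = list(zip(true_phones, predicted_phones))
    overall = sum(t == p for t, p in pairs) / len(pairs)        # exact class correct
    bcca = sum(BROAD[t] == BROAD[p] for t, p in pairs) / len(pairs)  # broad class correct
    return bcca, overall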


7.2. Confusion Matrix

A confusion matrix of the Tamil data for the significant mismatches is shown in Table 6. The classification accuracy of the phone /a/ is relatively high compared to that of the other phones, in both Tamil and English. Consistently, across all the training and test data sizes, /A/ and /a/ are confused with each other, and around 72% of the /I/ phones are classified as /i/. This is not so prominent with the other vowels. So, if a deergha syllable ([consonant A/I] or [A/I consonant]) is not available in the corpus in a particular context, it can be replaced with the hrasva syllable ([consonant a/i] or [a/i consonant]). This is a major finding. The confusion between the /u/ and /U/ pair is frequent in the listening tests but not so significant in the classification test. The following results are for the case where the training data size is 700 and the test data size is 400: 40.9% of /ae/s are classified to /i/ while only 29.8% of /ae/s are correctly classified to the /ae/ class; 9% of /ae/s are classified to /yl/ (geminate of /y/); 38% of /yl/s are classified to /i/; and 11.4% of /yl/s are classified to /ae/. There is thus considerable misclassification among the three phone classes /i/, /yl/ and /ae/. 44.44% of /ll/s are classified to /Ll/.

Table 6: Confusion matrix of most confused Tamil phones (rows: assigned class; columns: true class).

Assigned\True     a      A      i      I     ae      l     ll     yl
a              3164    166    180     3     65      6     33     74
A              1112   1461      0     0      0      4      2      0
i               228      0   1962   110    419      2      6    407
I                 1      0      9     7      1      0      0      1
ae              112      0    220    11    305      0      0    122
l                 0      0      0     0      0      0      0      0
ll                0      0      0     0      0      0     12      0
yl               61      0    130     7     92      0      0    378
Total          5788   1633   2909   148   1023     83    369   1069

7.3. Application to TTS

The knowledge of the phones usually misidentified is used in speech synthesis. Blind listening tests are conducted with 4 native Tamil people. The listeners are asked to listen to a set of 11 synthesized sentences which are generated by our Tamil TTS system. The same 11 sentences are also synthesized with some phones in some words replaced by the corresponding confused phones found above. Many words had a single phone replacement and some of them also had 2 to 3 phone replacements. The original phones and the phones with which they are replaced are shown in Table 7. The listeners are asked to write down the synthetic sentences of both the sets separately. The results are checked to find the validity of the phone replacement. 75% of the words for which phone replacement is done are recognized as the regular words by all the listeners. They could get the original word even though some of the phones were replaced by other phones in those words. 3 listeners did not notice a change in 50% of the remaining 25% (phone replaced) words. The phone replacements which the listeners could make out are shown in bold in Table 7.

ae → ey    n → N     l → zh
m → n      I → i     u → wu
tl → Tl    e → E     i → yi
L → l      R → r     p → w
b → p

Table 7: Phone replacements done during synthesis.

Original word                   Word after phoneme replacement

a g n i                         a k N i
E w u g a n ae                  e u g a N ae
u l a g a ng g a L ae y u m     w u l a g a m g a L ae y u m
m u kl i y a                    m u k y a
A y w u kl U D a m              A y u kl U D a m

Table 8: Tamil words, before and after phone replacement.
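As an illustration only (not the actual unit selection code of the TTS), the replacement strategy can be realised as a fallback lookup over a confusion map drawn from Table 7; the inventory structure and function names are assumptions of this sketch.

CONFUSABLE = {          # subset of Table 7, direction: missing phone -> substitute
    "n": "N", "l": "zh", "m": "n", "I": "i", "u": "wu",
    "tl": "Tl", "e": "E", "i": "yi", "L": "l", "R": "r",
}

def select_unit(inventory, phone, left, right):
    # inventory maps (left, phone, right) context triples to unit ids
    unit = inventory.get((left, phone, right))
    if unit is None and phone in CONFUSABLE:
        # context missing: retry with a perceptually confusable phone
        unit = inventory.get((left, CONFUSABLE[phone], right))
    return unit         # caller falls back to mono-phone units if still None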

8. CONCLUSION AND FUTURE SCOPE

A novel way of systematic replacement of missing phones has been proposed for speech synthesis. The most confused Tamil phones, which can be replaced by one another in specific contexts at the time of synthesis if they are not available in the corpus, are found. The confused phones in Tamil are identified by conducting listening tests over the telephone and also by the phone classification experiment using the ML classifier. The confused phones in Telugu are also found by perception tests. The common confused phones across the two Indian languages are identified. The natural replacement of phones by other phones in human speech is also observed. This gives hope that the proposed phone replacement strategy also makes the synthetic speech close to natural speech. The collected data can be analyzed for the variability in the characteristics of phones, and in future this knowledge can be incorporated in TTS to induce naturalness in the synthetic speech. The knowledge of the confused phones is incorporated in Tamil text to speech synthesis, and experiments show that the proposed phone replacement strategy is fairly successful.

9. REFERENCES

[1] Duda, Hart, Stork, Pattern Classification, Second Edition, John Wiley & Sons, 2001.
[2] http://en.wikipedia.org/wiki/Tamil_script.
[3] Kopecek, I., Pala, K., "Prosody Modelling for Syllable-Based Speech Synthesis", Proceedings of the IASTED Conference on AI and Soft Computing, 1998, pp. 134-137.
[4] Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Salt Lake City, UT, Jun. 2001, pp. 73-76.



DETECTION OF POSITIONS OF PACKET LOSS IN RECONSTRUCTED VOIP SPEECH

A. Mandal1, K. R. Prasanna Kumar1, G. Athithan1, C. Chandra Sekhar2

ABSTRACT

Detection of the positions of packet loss serves a useful purpose in the recognition of PCM speech reconstructed from VoIP packets. The information about the loss positions can be used to skip unreliable observation vectors around these positions so as to achieve better recognition of PCM speech. VoIP receivers apply packet loss concealment (PLC) schemes to fill up the lost segments. These schemes do not deal with the problem of discontinuities adequately. In our study, we compare the performance of the distance measures based on two standard speech features to detect the positions of loss in the presence of commonly used PLC algorithms. The performance of each of them is analyzed in terms of the probability of correct detection and false alarms. An approach based on auditory representation is suggested to overcome the limitations of the measures based on standard features. Results indicate that the approach based on auditory representation gives a significantly lower false alarm rate.

Index Terms: VoIP, PLC, loss point detection, auditory representation

1. INTRODUCTION

A well known problem in Voice over Internet Protocol (VoIP) communication is that of packet losses, which occur due to network related impairments such as excessive delay, congestion or errors during transmission, which cause packets to be dropped midway. These losses can occur either in isolation or in bursts spanning several consecutive packets. Consequently, these lost packets produce missing segments in the reconstructed pulse code modulated (PCM) speech, which not only degrade the perceptual quality of the received speech, but also affect the performance of automatic speech recognition (ASR). The amount of degradation depends both on the overall packet loss and on burst losses, and it increases with the length of consecutive losses. This is because the codecs used in VoIP communication exploit redundancies in the speech signal during compression, mostly by predicting the current speech samples from the previous ones. Hence the loss of packets causes the propagation of error across several frames of decoded speech. Various packet loss concealment

(PLC) schemes are used to fill the missing segments. These concealment techniques cannot perform a perfect recovery of the missing segments, but can only predict the values of the current segment samples from the past and future correctly received segments. Even if the perceptual quality improves, the content of the filled data in the missing segments may differ significantly from the data that was originally present. This happens when the loss spans several consecutive packets, making it extremely difficult to predict values of the missing segment close to the original values. But when the number of consecutively lost packets is small, most of the PLC techniques are able to mask the perceptual discontinuities and fill the lost segments with data close to their original values. Such properly masked lossy segments do not degrade the ASR performance significantly and can be treated as non-lossy segments, unlike those that result in prominent perceptual discontinuities even after the application of PLC.

Though the PLC techniques provide the needed perceptual continuity, for improving the performance of the speech recognizer based on Hidden Markov Models (HMM), the knowledge of the positions of loss can be used to skip the observation vectors from the places of unreliable speech data.

A receiver which implements the PLC scheme has access to the real time protocol timestamps and packet sequence numbers and uses this information for detecting the positions of loss while performing concealment. A speech recognizer, which takes only PCM speech as input, has no access to such information and has to derive it from the reconstructed speech. This task is difficult in the absence of the knowledge of the PLC scheme that is implemented in the receiver.

The rest of the paper is organized as follows. In Section 2, the commonly used packet concealment techniques for PCM speech are discussed. Section 3 presents the two approaches for detecting the positions of missing packets in PCM speech reconstructed using different PLC techniques. In Section 4, we propose an approach based on an auditory representation of sounds to detect the subtle discontinuities arising out of packet loss that remain even after concealment. We conclude the paper in Section 5 along with directions for future work.

1 Information Security Division, Centre for Artificial Intelligence & Robotics, Bangalore, India
2 Department of CSE, Indian Institute of Technology Madras, Chennai, India
[email protected]

2. PLC TECHNIQUES IN VOIP SPEECH

The PLC techniques are generally classified as sender-based and receiver-based schemes. A survey of these techniques can be found in [1]. The sender driven techniques require active participation from the sender and are based on retransmission, forward error correction (FEC) or interleaving. Because our interest is restricted to speech recognition on decoded PCM speech from VoIP packets without any change to the existing VoIP communication framework, the sender driven techniques are not suitable. On the other hand, receiver driven PLC techniques, which try to estimate and compensate the loss in the signal without any assistance from the sender, are of relevance to us. The receiver based schemes are again classified into insertion, interpolation and regenerative schemes, in the order of increasing improvement in terms of speech quality. The simplest insertion based approach is to insert silence during the frame loss period. Another popular concealment technique in this category repeats the previous correctly received packet. This technique performs better than silence substitution. Interpolation based techniques attempt to use the data from neighbouring packets and create a substitute for a packet that was lost. The ITU-T has standardized in G.711 (Appendix A) [2] a high quality, low complexity interpolation based concealment scheme which depends on pitch waveform substitution. The regenerative techniques exploit some pre-fetched knowledge of the algorithms used in the respective codecs in order to derive the codec parameters and use them eventually to synthesize speech in the missing portions. An example of the regenerative techniques is the concealment standard presented in ANSI T1.521-2000 (Appendix B) [3]. The technique extracts the residual signal of the previously received speech by linear prediction analysis, uses periodic replication to generate an approximation of the excitation signal of the missing speech, and eventually generates the synthesized speech using this excitation. Our study and analysis are based on these four concealment algorithms, namely, silence insertion (S), previous packet repetition (P), G.711 Appendix A (A) and ANSI T1.521-2000 Appendix B (B).

A set of 100 utterances from the TIMIT database was used for the study. The packet duration was 10 milliseconds, and frames of one packet size were removed at fixed and random positions from the utterances to create lossy speech. This would test the performance of the methods in detecting the positions of loss of packets of the minimum possible length. The loss rate was kept at 5 percent. Subsequently, the loss intervals were filled by the concealment algorithms discussed above, and the concealed speech data was generated.

3. APPROACHES TO DETECT THE POSITIONS OF PACKET LOSS

Our approach to detection of positions of lost packets is based on the distance measures between the feature vectors of two

adjacent frames of speech. The frame size was kept at half the packet size with a frame rate of 125 frames/sec. The speech frames constructed from successive packets which do not have packet loss in between are expected to exhibit high correlation. Consequently, the distance d between the feature vectors of such frames is expected to be low. The lossy regions typically have a high value of distance d compared to that of non-lossy regions.

In this study the following standard distance metrics were considered:

• Euclidean distance on Mel scale filter bank outputs [4], but skipping the DCT computation step

• Euclidean distance on power spectra computed from the FFT

• Itakura-Saito distortion [5] computed on the power spectrum

The normalized Euclidean distance measure d between two feature vectors x and y is defined as below:

$$ d^2 = \frac{\| x - y \|^2}{\| x \|^2 \, \| y \|^2} \qquad (1) $$

The Itakura-Saito distortion measure d_IS(S, S') between two spectra S(ω) and S'(ω) is defined as follows:

$$ d_{IS}(S, S') = \int_{-\pi}^{\pi} \frac{S(\omega)}{S'(\omega)} \, \frac{d\omega}{2\pi} \;-\; \log\!\left(\frac{\sigma_{\infty}^{2}}{\sigma_{\infty}'^{2}}\right) \;-\; 1 \qquad (2) $$

where $\sigma_{\infty}^{2}$ and $\sigma_{\infty}'^{2}$ are the gains, or one-step prediction errors, of S(ω) and S'(ω) respectively. For an utterance, the waveform of the speech signal without packet loss is plotted in Fig. 1(a). The distance measure d for lossless speech is plotted in Fig. 1(b), and d for the speech reconstructed using different PLC schemes is plotted in Figs. 1(c)-(f). The points at which the value of d is greater than a preset threshold are hypothesized as positions of packet loss. A lossy point is said to be detected correctly if the hypothesized loss position lies within ±2 frames of the corresponding actual loss position. From the plots, it can be observed that the distance measure d is approximately the same for the lossless and the lossy regions for the last three PLC schemes. This gives rise to a large number of false alarms when the threshold is applied to distinguish the lossy and lossless regions.
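A compact sketch of the two frame-to-frame distances discussed above. The normalised Euclidean distance follows Eq. (1) directly; for the Itakura-Saito distortion, the usual discrete form over FFT bins is shown, which stands in for the integral and gain terms of Eq. (2) and is an approximation made by this sketch.

import numpy as np

def normalised_euclidean_sq(x, y):
    # d^2 of Eq. (1) between the feature vectors of two adjacent frames
    return np.sum((x - y) ** 2) / (np.sum(x ** 2) * np.sum(y ** 2))

def itakura_saito(S, S_prime, eps=1e-12):
    # discrete Itakura-Saito divergence between two power spectra
    r = (S + eps) / (S_prime + eps)
    return np.mean(r - np.log(r) - 1.0)

def frame_distances(frames, metric):
    # distance between every pair of adjacent frames; values above a preset
    # threshold are hypothesised as packet-loss positions
    return np.array([metric(frames[i], frames[i + 1]) for i in range(len(frames) - 1)])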

The performance has been evaluated with respect to the probability of correct detection $P_d$ and the probability of false alarm $P_f$, defined as follows:

$$ P_d = \frac{\text{Number of loss points correctly detected}}{\text{Number of loss points actually present}} \qquad (3) $$

$$ P_f = \frac{\text{Number of false hypotheses of loss points}}{\text{Total number of hypotheses of loss points}} \qquad (4) $$
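A short sketch of this scoring, with the ±2 frame tolerance mentioned above; loss positions are frame indices, and the function name is an assumption of this sketch.

def detection_rates(actual, hypothesised, tol=2):
    # P_d per Eq. (3): actual loss points with a hypothesis within the tolerance
    correct = sum(any(abs(h - a) <= tol for h in hypothesised) for a in actual)
    # P_f per Eq. (4): hypotheses with no actual loss point within the tolerance
    false_hyp = sum(not any(abs(h - a) <= tol for a in actual) for h in hypothesised)
    p_d = correct / len(actual) if actual else 0.0
    p_f = false_hyp / len(hypothesised) if hypothesised else 0.0
    return p_d, p_f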


The receiver operating characteristic (ROC) curves for the different representations and for different PLC schemes are plotted in Fig. 2. For each case, the best performance, specified as the highest Pd and the lowest Pf, is determined. The best performance in detecting the packet loss positions for different representations and for different PLC schemes is given in Table 1.

Table 1. Performance (η) in detection of packet loss positions for different representations and for different PLC schemes.

Representation                    η     S       P       A       B
Mel scale filter bank outputs     Pd    0.995   0.977   0.968   0.96
                                  Pf    0.0     0.608   0.621   0.6858
Power spectrum (Euclidean)        Pd    0.972   0.2     0.094   0.11
                                  Pf    0.0     0.65    0.858   0.86
Power spectrum (Itakura-Saito)    Pd    0.946   0.128   0.21    0.1013
                                  Pf    0.033   0.728   0.608   0.775

(PLC schemes: S - silence insertion, P - previous packet repetition, A - G.711 Appendix A, B - ANSI TI.521-2000 Annex B)

It is observed that all the three representations give a very high value of Pd and a very low value of Pf for the silence insertion based PLC scheme. It indicates that the packet loss positions can be detected with a high accuracy and with no false alarms for this PLC scheme. For the other three PLC schemes, the Mel scale filter bank outputs based representation gives a very high value of Pd and a high value of Pf simultaneously. This indicates that though the Mel scale filter bank output based representation is effective in detection of the packet loss positions, the false alarm rate is quite high, which significantly degrades the overall performance. The Euclidean distance metric on the power spectrum based representation gives a low value of Pd and a very high value of Pf, indicating a poor performance. The Itakura-Saito distortion measure is marginally better than the Euclidean distance in terms of false alarms, though it too gives a high false alarm rate. The power spectrum of speech has undesirable harmonic fine structures at multiples of the fundamental frequencies, leading to spurious hypotheses. This aspect is reduced in the Mel representation, where the neighbouring components are grouped to form frequency bands in a non-uniform way, resulting in fewer false alarms. Most of the PLC techniques use spectral smoothing to mask spectral discontinuities and try to achieve perceptual smoothing, which is not always achieved perfectly. The Mel scale representation, being a perceptual measure, is able to detect the discontinuities which a power spectrum based distance measure may not detect, resulting in higher accuracy for the former.

Perceptual experiments have shown that the human ear can detect discontinuities resulting from improperly concealed lossy segments in the reconstructed speech with a very low false alarm rate. The discontinuities that remain prominent even after the application of PLC cause significant degradation of ASR performance and need to be detected accurately. In the next section, we propose a method based on auditory representation for detection of packet loss positions. A similar approach has been used in [6] for solving auditory scene analysis problems.

4. AUDITORY REPRESENTATION BASED APPROACH

The simulation of sound processing in the human ear takes place in two stages. In the first stage of peripheral auditory processing, the activity of the cochlea is simulated through a cochleogram. This is implemented with a bank of 128 gammatone filters [7]. The gain of the filters is chosen to reflect the transfer function of the outer and middle ears. The output of each filter is processed by a model of hair cell transduction [8]. The mid-level auditory processing is simulated by computing the correlogram [9] at 8 millisecond intervals. A correlogram is formed by computing a running autocorrelation of the hair cell output. At a given time step t, the normalized autocorrelation A(i, t, \tau) for the ith filter with a time lag \tau is given by

A(i, t, \tau) = \sum_{k=0}^{K-1} h(i, t-k) \, h(i, t-k-\tau) \, w(k),    (5)

where h is the output of the hair cell model and w is a rectangular window of width K time steps. It has been demonstrated in earlier studies [10] that the correlogram channels which lie close to a formant share a similar pattern of periodicity. Therefore, the adjacent channels lying close to a formant exhibit a high degree of correlation, whereas the channels belonging to different formants show a low degree of cross-channel correlation. This kind of segregation of formants is very prominent below 1 kHz. The measure of the similarity between adjacent channels within a correlogram is given by the cross-channel correlation metric defined as

C(i, t) = \frac{1}{L} \sum_{\tau=0}^{L-1} A(i, t, \tau) \, A(i+1, t, \tau),    (6)

where A is the normalized autocorrelation function and L is the maximum autocorrelation lag. The cross-channel correlations for the first 25 channels are used, which represent the desired range of frequencies. Also, the segregated structure is preserved across the next few correlograms, because a formant extends for a period of time and its frequency changes smoothly with time. This property is known as temporal continuity. Both the properties can be observed distinctly in a gray scale image plot of cross-channel correlation for the lossless speech shown in Fig. 3(a).
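The two measures can be sketched as follows (our illustration, not the authors' code; the gammatone filterbank [7] and hair cell model [8] that produce h are assumed to be available elsewhere and are not implemented here, and the exact normalization of A is our guess, since the paper does not spell it out).

import numpy as np

def correlogram(h, t, K, max_lag):
    """Eq. (5): running autocorrelation of the hair cell output h
    (channels x samples) over a rectangular window of K samples,
    evaluated at time step t (requires t >= K - 1 + max_lag)."""
    n_ch = h.shape[0]
    A = np.zeros((n_ch, max_lag))
    for i in range(n_ch):
        for tau in range(max_lag):
            seg = h[i, t - K + 1: t + 1]                 # h(i, t - k), k = 0..K-1
            lagged = h[i, t - tau - K + 1: t - tau + 1]  # h(i, t - k - tau)
            A[i, tau] = np.sum(seg * lagged)             # w(k) = 1 (rectangular window)
    # normalize each channel by its zero-lag value (one plausible reading of
    # "normalized autocorrelation")
    return A / (A[:, :1] + 1e-12)

def cross_channel_correlation(A, n_channels=25):
    """Eq. (6): similarity between adjacent correlogram channels; only the
    first 25 channels (roughly the range below 1 kHz) are used, as in the text."""
    L = A.shape[1]
    return np.array([np.dot(A[i], A[i + 1]) / L for i in range(n_channels)])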

From the analysis of the gray scale image representation of cross-channel correlations, it is observed that for speech without any packet loss (perceptually clean speech), the regions of high correlation represented by darker shades are separated from regions of low correlation represented by lighter shades.


This happens for the entire frequency range (almost all the filters) below 1 kHz. Also, the change in the gray level intensities across time is smooth, which can be attributed to the temporal continuity property. A deviation from this behavior suggests an event of perceptual discontinuity caused due to loss of frames or imperfect concealment. Vertical lines of constant intensity throughout the entire frequency range that are distinct from their surrounding gray levels are indicators of potential discontinuities. By examining the gray scale images, we could detect positions of perceptual discontinuities caused due to loss of frames. The gray scale image plot of cross-channel correlation of the speech reconstructed using the previous packet repetition based PLC is shown in Fig. 3(b). The manually detected positions of packet loss are marked with dots. Based on manual detection from the images, the values of Pd and Pf for the auditory representation based approach for different PLC schemes are given in Table 2.
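In this study the loss positions were detected manually from such images (automation is listed as future work in Section 5). Purely as an illustration of the vertical-line cue just described, and not as the paper's method, a column of the cross-channel correlation image could be flagged when its values are nearly constant across channels yet differ markedly from the neighbouring columns, e.g.:

import numpy as np

def candidate_discontinuities(C_img, var_thresh=1e-3, dev_thresh=0.2):
    """C_img: (channels x time) cross-channel correlation image.
    The thresholds are illustrative only, not values from the paper."""
    col_var = C_img.var(axis=0)    # low variance -> near-constant intensity column
    col_mean = C_img.mean(axis=0)
    hits = []
    for t in range(1, C_img.shape[1] - 1):
        neighbour = 0.5 * (col_mean[t - 1] + col_mean[t + 1])
        if col_var[t] < var_thresh and abs(col_mean[t] - neighbour) > dev_thresh:
            hits.append(t)
    return hits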


Fig. 3. (a) Gray scale image of cross-channel correlation for clean speech and (b) for the reconstructed speech obtained using the previous packet repetition based PLC.

Table 2. Performance in manual detection of the positions of packet loss using the auditory representation based measures for different PLC schemes.

η     P        A       B
Pd    0.2      0.2357  0.057
Pf    0.0667   0.083   0.02

(PLC schemes: P - previous packet repetition, A - G.711 Appendix A, B - ANSI TI.521-2000 Annex B)

It is seen that the major advantage is that the false alarm rate is very low. The low detection rate may be due to the better concealment by the PLC schemes, which could restore the speech signal to a reasonably good perceptual quality.

5. SUMMARY AND CONCLUSION

In this study, we have compared the performance of the Euclidean distance measure on the Mel filter bank output based representation and on the power spectrum based representation, and of the Itakura-Saito distortion, for the detection of the positions of packet loss in reconstructed VoIP speech. All of them perform well for the silence insertion based PLC scheme. For the other PLC schemes, though the Mel scale filter bank output based representation gives a very high detection rate, it also has a high false alarm rate. The other two distance measures perform poorly. An auditory representation based approach is proposed for manual detection of the positions of packet loss corresponding to the perceptual discontinuities that remain even after the application of PLC. These can be detected manually from the gray scale images of the cross-channel correlation plot. The proposed method gives a very low false alarm rate; its detection is comparable to that of the human ear and it could accurately detect the discontinuities that remained after concealment. Methods for automating the auditory representation based approach have to be developed. A combination of representations may be explored to develop a system for detection of packet loss positions with a high detection rate and a low false alarm rate. The effect of the detection of packet loss positions on automatic recognition of VoIP speech has to be studied.

6. REFERENCES

[1] C. Perkins, O. Hodson, and V. Hardman, "A packet loss recovery technique for streaming audio," IEEE Network Mag., pp. 40-48, 1998.
[2] International Telecommunication Union, Appendix A: A high quality low complexity algorithm for packet loss concealment with G.711, ITU-T Recommendation, November 2000.
[3] ANSI, Packet loss concealment algorithm for use with ITU-T Recommendation G.711, ANSI Recommendation TI.521-2000 (Annex B), July 2000.
[4] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, PTR Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[5] R. Gray, A. Buzo, A. Gray Jr., and Y. Matsuyama, "Distortion measures for speech processing," IEEE Trans. Acoust., Speech and Signal Processing, vol. 28, no. 4, pp. 367-376, Aug. 1980.
[6] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Networks, vol. 10, no. 3, pp. 684-697, 1999.
[7] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, APU Report 2341: An Efficient Auditory Filterbank Based on the Gammatone Function, Applied Psychology Unit, Cambridge, 1988.
[8] R. Meddis and L. O'Mard, "A unitary model of pitch perception," J. Acoust. Soc. Am., vol. 102, pp. 1811-1820, 1997.
[9] D. P. W. Ellis and D. Rosenthal, "Midlevel Representations of Computational Auditory Scene Analysis," Lawrence Erlbaum, Mahwah, New Jersey, 1998.
[10] S. A. Shamma, "Speech processing in the auditory system: The representation of speech sounds in the responses of the auditory nerve," J. Acoust. Soc. Am., vol. 78, pp. 1613-1621, 1985.


Fig. 1. (a) Waveform of a speech signal for an utterance without packet loss. (b) Distance measure for lossless speech. Distance measure for speech reconstructed using (c) silence insertion based PLC, (d) previous packet repetition based PLC, (e) G.711 Appendix A based PLC, and (f) ANSI TI-521-2000 Appendix B based PLC. The left column corresponds to the Euclidean distance on the Mel scale filter bank output based representation, the middle column to the Euclidean distance on the power spectrum based representation, and the right column to the Itakura-Saito distortion measure on the power spectrum based representation.


Fig. 2. ROC curves for detection of the positions of packet loss in speech reconstructed using (a) silence insertion based PLC, (b) previous packet repetition based PLC, (c) G.711 Appendix A based PLC, and (d) ANSI TI-521-2000 Appendix B based PLC. The left column corresponds to the Euclidean distance on the Mel scale filter bank output based representation, the middle column to the Euclidean distance on the power spectrum based representation, and the right column to the Itakura-Saito distortion on the power spectrum based representation.


DEFINING SYLLABLES AND THEIR STRESS LABELS IN TAMIL TTS CORPUS

Laxmi Narayana M and A G Ramakrishnan

Department of Electrical Engineering, Indian Institute of Science, Bangalore 560 012.

ABSTRACT

We report our work on stress labeling of syllables in the speech corpus of Tamil Text to Speech Synthesis. A syllable is the minimum possible speech segment which can be spoken independently of the adjacent phones. Keeping this in mind, a new syllabification strategy, which preserves the coarticulation effects of the phones present in the identified syllables, is proposed for the Tamil language. The syllables are stress labeled based on the combination of their Pitch, Energy and Duration (PED). The byproduct of the stress labeling of the corpus is the prosodic knowledge that the first syllable in Tamil is stressed (in most cases, except when there is some emphasis on a particular syllable in a word) and the second syllable is stressed if the first syllable is/has a short vowel. Based on this, a rudimentary prosody model is developed for Tamil TTS.

Index Terms— Syllable, Stress labeling, Duration, Coarticulation, Pitch, Energy, prosody model, speech synthesis.

1. INTRODUCTION

Text to Speech (TTS) synthesis is an automated encoding process which converts a sequence of symbols (text) conveying linguistic information into an acoustic waveform (speech). The existing TTS systems for European languages like English, German and French have achieved a good level of intelligibility and naturalness. Though a number of research prototypes of Indian language TTS systems have been developed, none of these are of a quality that can be compared to commercial grade TTS systems in the languages mentioned above. Further, much work has not been carried out on prosody models or on increasing naturalness in synthetic speech for Indian languages.

Speech synthesis based on syllables seems to be a good possibility to enhance the quality of synthesized speech compared to mono-phone or diphone-based synthesizers. This consideration is based both on the fact that more coarticulation aspects are included in syllable segments compared to diphone units and on the fact that the main prosodic parameters (pitch, duration, amplitude) are closely connected to syllables [3]. Section 2 describes the goal of the paper. Section 3 gives the definition of syllable and stress and the reason for associating the parameter stress (only) with the syllable (only). Section 4 discusses the proposed structure of the Tamil syllable. Section 5 presents different cues of stress as mentioned in the literature and the attributes used to quantify stress in the current work, and discusses the novelty of the work. Section 6 describes the proposed method for stress labeling of syllables and the corresponding results. Section 7 describes the prosodic knowledge acquired from stress labeling and the developed preliminary prosody model for Tamil TTS.

2. MOTIVATION FOR THE WORK

Predicting the characteristics of speech to be synthesized is known as prosody modeling. A prosody model needs to predict the characteristics like pitch, amplitude and duration of a speech segment to be concatenated. Although changing the characteristics of the available speech segments in the corpus is possible and in practice also, this is considered to be secondary to the availability of speech segments with characteristics matching the target context. One may not have a unit exactly matching the target context. However, if there is an interface which can clearly distinguish the characteristics of speech units in the corpus, it reduces to a great extent the complexity of the unit selection process and the exhaustive calculation of join cost for different combinations of units. This also reduces the burden of modifying the characteristics of the units during concatenation. So, it is useful to organize the database in such a way that the stressed and the unstressed syllables are distinguished. This is the motivation for our work. Deciding whether the syllable to be synthesized is stressed or not is dictated by the prosody model (Section 7). To our knowledge, there is no prior work done on this aspect of any Indian language so far.

3. WHY ONLY SYLLABLE? WHY ONLY STRESS?

As far as the production of speech is concerned, a syllable is the minimum possible speech segment which can be spoken in isolation; i.e., without the help of its adjacent phones. A syllable contains only one vowel. The format of a syllable can be V, VC, CV, CVC (Ex: /a/, /ap/, /ra/, /ram/, respectively), etc. Section 4.1 elaborates on this.

[email protected] , [email protected]


The second question, 'why only stress?', may be answered as follows. It is good if the synthesizer can speak with different expressions: sadness, happiness, anger. But these expressions need not be considered while reading the text. Further, emotional expressions can be the next level of sophistication, once a basic quality of synthetic speech is achieved. But some syllables need to be emphasized (stressed) while reading. The immediate question that arises is why stress should be associated with syllables and not with smaller units. Stress is the relative emphasis that may be given to certain syllables in a word. The attribute 'stress' cannot be associated with a mono-phone, since a speaker cannot stress a phone without stressing the adjacent phones within a syllable. So, we hypothesize that the syllable is the minimum possible unit with which the stress parameters can be associated.

4. SYLLABLE STRUCTURE

The problem of defining a syllable has always plagued the linguists. While it is true that native speakers can in most cases consistently say how many syllables a given utterance has, the format of a syllable cannot be globalized. Different languages have different formats of syllable in their grammar. The classical syllable structure of Tamil is defined for poetry and deals more with the written aspect of the language. As far as stress labeling of a spoken syllable is concerned, this definition of syllable is observed to be not so relevant.

4.1 Proposed Syllable structure for Tamil

The idea of syllabification lies in exactly defining the syllable segments (V, CV, VC, CVC, CCV or CVCC as the case may be) that are used for speech production and can be picked up from the database, so that good coarticulation of adjacent segments is preserved in the produced speech. These segments need not always be syllables in the traditional sense, but boundaries between them can well be determined from a phonetic point of view, enabling automatic generation of the syllable database.

The following example shows a Tamil sentence (O), its phonetic transcription (P) and syllabic transcription (S).

O: [Tamil sentence in Tamil script]

P: n A y a n a kk A r a n # m e ll a # n A y a n a tt ae # u d a TT i l # w ae tt u # p I # p I # e n R u # s a tt a m # p A r tt A n

S: nA ya nak kA ran mell lla nA ya natl tlae udaT Til waet tu pI pI en Ru sat tam pArt tAn

The syllables identified in the above example are of the format CV, CVC, V, VC and CVCC. Identifying V, CV, VC, and to some extent CVC, is trivial. But in cases where genitives are present, they are included in both the adjacent

syllables. In the above example, in the first word 'nAyanakkAran', the genitive /kk/ is included in both the adjacent syllables 'nak' and 'kA', so that the coarticulation between /a/ & /k/ and /k/ & /A/ is preserved. Similar is the case with other genitives.

Since a syllable contains only one vowel, the number of syllables identified in a word is equal to the number of vowels in it. In general, when a word starts with a vowel (say VCVCVC, Ex: u d a TT i l), the first vowel is identified as a syllable if the next-to-next phone is a vowel, and CV syllables are identified unless there is a genitive in the sequence of phones, in which case the genitive manifests in both the adjacent syllables. In a word of the form VCCV, VC and CV syllables are identified (Ex: e n R u). If a word is of the form VCCVC, VC and CVC syllables are identified.
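A much-simplified sketch of this grouping idea follows (our code; is_vowel is a hypothetical helper, and the full rule set above, e.g. word-initial vowels and CVCC codas, is not reproduced). One syllable is formed per vowel, and the first half of a geminate also closes the preceding syllable, which reproduces the 'nak kA' split of the example.

def syllabify(phones, is_vowel):
    syllables, onset = [], []
    for i, ph in enumerate(phones):
        if is_vowel(ph):
            syllables.append(onset + [ph])      # onset consonants + the vowel
            onset = []
        elif syllables and i + 1 < len(phones) and phones[i + 1] == ph:
            syllables[-1].append(ph)            # first half of a geminate closes the
                                                # previous syllable (genitive in both)
        else:
            onset.append(ph)
    if onset and syllables:
        syllables[-1].extend(onset)             # trailing consonants form the coda
    return [''.join(s) for s in syllables]

# e.g. syllabify(list('nAyanakkAran'), lambda p: p in 'aAeEiIoOuU')
#      -> ['nA', 'ya', 'nak', 'kA', 'ran']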

5. ATTRIBUTES TO QUANTIFY NATURAL STRESS IN NORMATIVE SPEECH

Stress and its manifestation in the acoustic signal have been the subject of many studies. Researchers have attempted to determine the reliable indicators of stress by analyzing variables such as fundamental frequency (F0), amplitude, concentration of spectral energy, duration and others. Higher intensity, greater duration and higher fundamental frequency are believed to be the primary acoustic cues for stressed syllables, although how the three factors work together to make a syllable more prominent than the surrounding ones is still not very clear [2]. Stressed syllables are usually indicated by high sonorant energy, long syllable or vowel duration, and high and rising F0 [3]. The cues of stress mentioned above are based on studies carried out for stress detection of syllables in English and Dutch for the applications of speaker recognition or speech recognition. Further, the speech corpus analyzed, in most of the cases, was a biased one. For example, emotional speech was recorded (with happiness, anger, sadness) and used for stress analysis. It is not that only emotional speech has stress; normal speech too has some stress content in it, although it may not be very prominent. Otherwise, the pitch contour of a speech signal spoken in a neutral speaking style would have been flat. Much interest has not been shown in stress analysis of normal speech corpora as applied to TTS, that too in Indian languages. The present work deals with detecting such natural stress in normative speech.

For the present work, the attributes chosen to quantify stress are Pitch, Energy and Duration. Although another cue, amplitude, was thought of, since energy is a function of amplitude this attribute was not included. Hereafter, the combination of the above three parameters of a syllable will be called its PED. Syllables are stress labeled based on their PED values; saying that a syllable has a PED of (P, E, D) means that it has P Hz of pitch, the amount of energy it has is E relative units, and it exists for D seconds.


6. STRESS LABELING METHOD PROPOSED

Figure 1 shows the outline of the stress determination system. Features that illuminate stress information (PEDs) are obtained for the syllables as follows. The duration (D) of each syllable is obtained from the boundaries established by the speech segmentation process. The pitch value (P) of every syllable is the maximum value among the pitch values of its constituent frames, and the energy (E) is the RMS energy of the syllable. From the segmented and labeled database, the number of occurrences, duration and the left and right phonetic contexts of each phone are known. Hence, phone clusters are formed (based on their duration and context). In addition, syllable clusters are also formed, which are further clustered such that the PEDs of syllables in each sub-cluster fall within a specified range. The two major clusters are: 1. with stress (relatively higher PED), 2. without stress (relatively lower PED).

6.1 Stress Determination

First, the phonetic transcription of the sentences is converted into syllabic transcription according to the syllabification strategy proposed. For each syllable, the number of occurrences in the database is found such that no occurrence that is actually part of a larger syllable is counted; while searching for the instances of a syllable, the possibility of wrongly counting an occurrence that forms part of a larger syllable in another word is taken care of by searching for the syllable in accordance with the syllabic transcription of the text corpus. For all the occurrences of a particular syllable, the PED statistics are computed and each feature (P, E and D) is normalized with respect to the variance of the feature values. For the determination of stress, each syllable is associated with a three dimensional vector, the components being its pitch, energy and duration. These vectors are plotted in three-dimensional space and clustered using the k-means clustering algorithm. Figure 2 shows the clusters formed for the instances of a particular syllable by k-means clustering, after variance normalization of the features.
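This clustering step reduces to something like the following (our sketch; the paper does not specify an implementation, and scikit-learn's KMeans is only one possible choice):

import numpy as np
from sklearn.cluster import KMeans  # any k-means implementation would do

def cluster_ped(ped, k=3, seed=0):
    # ped: (n_occurrences x 3) array of [pitch, energy, duration] for one syllable
    normed = ped / (ped.var(axis=0) + 1e-12)   # normalization by the feature variance, as in the text
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(normed)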

6.2 Specially recorded Stressed Corpus

The Tamil corpus being used for the TTS system is in a normal speaking style and there is not much emphasis given, unlike in emotional speech. Nevertheless, some stress is naturally present in some syllables. To check the validity of the stress determination process, a new speech corpus was recorded specially from the same speaker, which can also be used for TTS along with the old corpus. The sentences in the new corpus are selected from a drama script and have a lot of syllables that can be stressed. The drama is recorded in two different styles: 1. with full emotion, 2. in a neutral speaking style. Stress labeling is carried out for the syllables in both recordings. Figure 3 shows the syllable clusters formed by k-means clustering, after variance normalization of features, for the instances of a particular syllable in the corpus recorded with emotion.

6.3 Performance Evaluation – Perception experiments

To check the validity of the results of k-means clustering, listening tests are conducted. Four native Tamil listeners are asked to listen to the different instances of a syllable and rate them. The listeners are asked to rate the syllables with 3 levels (3: most stressed syllables, 1: non-stressed syllables and 2: medium ones). The results are compared with the results of k-means clustering. The clustering performed on the PED values normalized with respect to the variance showed better correlation than the case of normalizing PEDs with respect to the maximum value.


Figure 2: Result of k-means clustering on the instances of a syllable. Features: PED. Clusters: 1 - stressed, 2 - moderately stressed, 3 - unstressed.


Figure 3: Result of k-means clustering on the instances of a syllable from the corpus recorded with emotion. Features: PED. Cluster 1: unstressed, Cluster 2: moderately stressed, Cluster 3: stressed.

Figure 1: Outline of the stress labeling system (phonetic transcription and speech signal → Syllable Identification → syllabic transcription → Feature Extraction → PED → Clustering → stress labeled syllables).


6.4 New labeling

It is found that many of the syllables with relatively higher values of PEDs are clustered into one class by the k-means clustering. So, a new type of clustering is performed to identify the stressed syllables. A reference vector is constructed from the maximum values of each feature across the syllables. The Euclidean distance from each syllable vector to the reference vector is computed. The syllables with distance below a threshold t1 are designated as stressed (stress rating 3), and the syllables with a distance above another threshold t2 are marked as unstressed (stress rating 1), with the remaining syllables marked as normal syllables (stress is not too high or absent; stress rating 2).
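A sketch of this labeling rule (our code; the thresholds t1 < t2 are not given in the paper):

import numpy as np

def label_by_distance_to_reference(ped_normed, t1, t2):
    ref = ped_normed.max(axis=0)                     # per-feature maxima form the reference vector
    dist = np.linalg.norm(ped_normed - ref, axis=1)  # Euclidean distance to the reference
    rating = np.full(len(dist), 2)                   # default: normal (rating 2)
    rating[dist < t1] = 3                            # close to the reference -> stressed
    rating[dist > t2] = 1                            # far from the reference -> unstressed
    return rating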

7. PROSODY MODELING

After labeling the syllables with the stress assignment strategy proposed and comparing with the stress rating given by the perception experiments, the location of a stressed syllable in the text corpus is found. Figures 4 and 5 show the syllabic transcription of some of the sentences in which the syllables 'nA' and 'rA' occur (the symbol '#' marks the boundary between two words). The stressed syllables, for which the rating is 3, are shown in bold black letters and the unstressed ones, with a rating of 1, in bold grey color.

201 > ya nak kA ran # mel la #
220 > muk ku Ru Ni # wi nA ya har # kaR pa ha # wi nA ya har # mAm ba zha # wi nA ya har # mu da li ya # wi nA ya har # kO yil haL # nIng ga lA ha # til lae # na ha ra meng gum # nart ta na # wi nA ya har # kO wil # na ra mu ha # wi nA ya gar #
236 > siT Tuk ku ru wi haL # nA rA #
273 > ti ruk ku RaL # nA Da hat tin #
276 > # nA ra dan # kan dar wa ku lat taec cErn da # ku mA ra nA ha #
284 > me ylo li haL # toN Dae # nA win <

Figure 4: Sentences in which stressed and unstressed occurrences of the syllable 'nA' appear in words. nA: Stressed (rating 3); nA: Unstressed (rating 1).

It is observed that the stressed syllables found by the labeling method described above and confirmed by the listening tests occurred as the first syllable in most of the cases. In the examples shown, all the stressed occurrences of 'nA' appear as the first syllable of a word, whereas the unstressed ones occur in the middle of the word. Some of the remaining occurrences of 'nA' have a stress rating of 2 and some are part of another syllable. In the case of the syllable 'rA', the frequency of its occurrence as the first syllable of a word is less. As observed from Figure 5, the frequency of the second syllable of the word being stressed increases when the first syllable is/has a short vowel (hrasva).

The prosodic knowledge acquired from this analysis is that the first syllable in Tamil is stressed and, if the first syllable is/has a short vowel (hrasva), the second syllable is stressed. However, this pattern may not be found always, because the speaker was specifically asked to constrain his speech to a normative style, with as minimal explicit stress as possible. We believe that the strategy of selecting syllables as the basic units, selecting them according to the prosody model mentioned above and then modifying their characteristics if required might give better synthesis with computational savings in unit selection. This hypothesis needs to be verified experimentally.

202 > sol la ga rA di # a ha ra mu da li #
219 > wI Dum # aL ip pa wa rA hi ya #
220 > kO wil # i rA sa # wi nA ya har #
221 > i rA sa sa bae # na Da rA sap pe ru mAn #
234 > pu zhak kat tiR ku wa rA da daR kA na
236 > paL Lik kuc cen Ra # nA rA #
258 > sA lae grA mam # a ra wa mang ga lam
269 > i wae # gi rA mi ya mA na wae #
279 > in da # rA gam # ka ru nI la #
280 > i da nae # ae da rA bAd #
285 > kan na Da # nAT Til # rA ma #
289 > bi rA ma Nar ka Luk ku <
293 > pEr # bi rA ma Nar haL <

Figure 5: Sentences in which stressed and unstressed occurrences of the syllable 'rA' appear in words. rA: Stressed (rating 3); rA: Unstressed (rating 1).

8. CONCLUSION

A new method for dynamically choosing the syllable structure is proposed for the Tamil language. The proposed syllabification strategy preserves the coarticulation effects of all the phones present in the identified syllables. A method of stress labeling of syllables based on their PEDs is proposed. Stress labeling of syllables in the Tamil corpus is performed. Based on the location of stressed syllables in a word, a rudimentary prosody model has been developed for Tamil TTS. The acquired prosodic knowledge is that the first syllable in Tamil is stressed in general and the second syllable is stressed if the first syllable is/has a short vowel. This confirms the earlier studies of some linguists. The fact has also been verified acoustically.

9. REFERENCES

[1] Min Lai, Yining Chen, Min Chu, Yong Zhao, and Fangyu Hu, "A hierarchical approach to automatic stress detection in English," Proc. ICASSP 2006, pp. 753-756.
[2] C. W. Wightman and M. Ostendorf, "Automatic labeling of prosodic patterns," IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 469-481, 1994.
[3] Chao Wang and Stephanie Seneff, "Lexical stress modeling for improved speech recognition of spontaneous telephone speech in the JUPITER domain," EUROSPEECH 2001, Sept. 2001, Aalborg, Denmark.
[4] I. Kopecek and K. Pala, "Prosody modelling for syllable-based speech synthesis," Proc. IASTED Conf. AI and Soft Computing, 1998, pp. 134-137.
[5] D. A. Cairns and J. H. L. Hansen, "Nonlinear analysis and classification of speech under stressed conditions," JASA, vol. 96, no. 6, pp. 3392-3400, Dec. 1994.


GRAPHEME TO PHONEME CONVERSION FOR TAMIL SPEECH SYNTHESIS

A. G. Ramakrishnan and Laxmi Narayana. M

Department of Electrical Engineering, Indian Institute of Science, Bangalore 560 012.

ABSTRACT

The input to a TTS system may contain acronyms, abbreviations and non-standard words, which must first be converted to the corresponding Tamil graphemic form. A text normalization module is developed for accomplishing this task. We have developed a quality grapheme to phoneme (G2P) conversion module, which converts the normalized orthographic text input into its phonetic form. The phonetic transcription helps in identifying and analysing the basic units such as mono-phones, diphones and syllables, and also in segmenting the speech corpus. A character to phoneme mapping interface is developed to map the Tamil graphemic text to the corresponding phonetic representation in Roman script. A rule base is created which contains the inter- and intra-word rules for changing the default character-phone mapping wherever necessary. A proper noun lexicon as well as a foreign word lexicon is also incorporated for dealing with cases where the G2P rules fail. The G2P module designed is to be used for the MILE Tamil TTS synthesizer on both the Windows platform and in the Festival (Linux) environment.

Index Terms— G2P, Tamil TTS, normalization, lexicon

1. MOTIVATION

Over 65 million people worldwide speak Tamil, the official language of the south Indian state of Tamil Nadu, and also of Singapore, Sri Lanka and Mauritius. In addition to the above countries, it is spoken in Bahrain, Malaysia, Qatar, Thailand, United Arab Emirates and United Kingdom. Tamil is a syllabic language which contains 12 vowels and 18 consonants. There are 5 other phones introduced for representing some consonants of Sanskrit. The language has well defined rules, which introduce seven other phones based upon the relative positions of consonants with respect to vowels or other consonants. Hence there are 42 phones in the language.

Text to Speech (TTS) synthesis is an automated encoding process, which converts a sequence of symbols (text) conveying linguistic information, into an acoustic waveform (speech). The two major components of a TTS synthesizer are - Natural Language Processing Module (NLP), which is capable of producing a phonetic transcription of the given text and Digital Signal Processing module (DSP), which transforms this phonetic transcription into speech [2]. One of the characteristics, based on which a TTS system is evaluated, is its intelligibility i.e., its ability to accurately synthesize the input text before naturalness and expression. The G2P

module is responsible for the determination of the phonetic transcription of the incoming text. This involves normalizing the input text and mapping the graphemic representation to a corresponding phonetic representation. Since the orthographic representation and pronunciation do not match in some cases, the default mapping needs to be changed wherever necessary. Section 2 describes the need for Text Normalization and how the non-standard words like acronyms and abbreviations in the input text are dealt with in NLP. Section 3 describes the process of converting a pure word sequence into its phonetic equivalent, the inter and intra word rules which are made use of, for G2P conversion, and the creation of Foreign word lexicon. The results are presented in section 4.

2. NORMALIZATION OF NON-STANDARD WORDS

The text input to the TTS system may not be pure Tamil text. It may contain some non-standard words like acronyms, abbreviations, proper names derived from other languages or clutters, phone numbers, decimal numbers, fractions, ordinary numbers, sequences of numbers, money, dates, measures, titles, times and symbols. The Natural Language Processing module should be able to handle such non-standard words. Standard words are those whose pronunciation can be obtained from the G2P rules. A G2P converter maps a word to a sequence of phones. All the non-standard words must be expanded into the corresponding Tamil graphemic form before being sent to the G2P module for phonetic expansion. This module should also take a decision on how a non-standard word is to be pronounced. For example, a phone number must be read with each digit treated as a single number and read in isolation.

The corresponding Tamil graphemic representations of possible non-standard words, English words and Tamil short forms are written in the 'iLEAP' format. iLEAP is a software tool in which one can type in many Indian languages. The ISCII (Indian Script Code for Information Interchange) file is exported as an ASCII (text) file and this file is used by the Text normalization module. The format of the text normalization file is shown in Figure 2.

The input text file is searched for abbreviations or acronyms (can be in Tamil or English). They are replaced by the corresponding expansion (graphemic form) in Tamil in the output (ISCII) file. This is illustrated in Figure 1.

Ex: aug is replaced as Ýèv† (normalization); august is also replaced as Ýèv† (Tamil transcription).

[email protected] [email protected],


Further, there are a number of words, used regularly in Tamil, which are originally Sanskrit or English words. The G2P fails to give the accurate phonetic transcription in the case of such words. Hence, we have created a lexicon of foreign words.

Expansion of the numbers is a 'special' case in normalization, because a decision needs to be taken whether the encountered number is an ordinary number, a phone number, a date, a time or currency. If it is currency, it decides whether it is rupee or dollar or pound or yen. The input number is considered as a string. An ordinary number is expanded according to the length of the string. A module is written to expand a 3-digit number. The control chooses different paths according to the length of the string. If the number is 4 or 5 digits long, then it must be in thousands or ten thousands. In this case, this module is called twice, first to convert the number of thousands to words (1 or 2 digits) and next to convert the remaining 3-digit number to words. For example, if the number is 12345, in the first call the number 12 is processed and in the next call 345 is processed. If it is a 6 or 7 digit number, it must be in lakhs or 10 lakhs, and if the number has 8 or 9 digits, then it must be in crores or 10 crores; then this module is called thrice or four times, respectively, and so on. If the number of digits is less than or equal to 3, then the module is called only once.
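The grouping logic just described can be sketched as follows (our code; 'three_digit_words' stands in for the 3-digit expansion module of the text, and the transliterated scale words are placeholders of ours, not the module's actual Tamil output):

def expand_ordinary_number(num_str, three_digit_words):
    # Indian grouping: the last three digits, then groups of two
    # (hundreds | thousands | lakhs | crores), for numbers of up to 9 digits
    scales = ['', ' Ayiram', ' laTcham', ' kODi']   # thousand, lakh, crore (our transliterations)
    s, parts = num_str, []
    parts.append(s[-3:]); s = s[:-3]
    while s:
        parts.append(s[-2:]); s = s[:-2]
    words = []
    for i in range(len(parts) - 1, -1, -1):         # most significant group first
        if int(parts[i]) != 0:
            words.append(three_digit_words(parts[i]) + scales[i])
    return ' '.join(words)

# e.g. for '12345' the 3-digit module is called twice, on '12' and then on '345'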

If the number string is not an ordinary number, a parameter (a number corresponding to the decision taken) is set according to the type of the number string. If the number string is a decimal number (Ex: 23.8756), the number before the dot (.) is treated as one number and the digits after the dot are spoken in isolation. If the number string is a date, the delimiters can be '/' or '-' (Ex: 25-10-1999 or 25/10/1999). All the three values (date, month and year) are extracted from the input string and processed separately. Similarly the different types of number strings, namely currency, range of numbers, arithmetic, phone numbers and time, are identified by the delimiters present and expanded accordingly. Example:

Input: ... a.d 1999 Ý‹ ݇´ aug ñ£î‹ .....
Output: … A.H. ÝJóˆ¶ ªî£œ÷£Jóˆ¶ ªî£‡ÈŸÁ å¡ð¶ Ý‹ ݇´ Ýèv† ñ£î‹ ......
Figure 1: Text Normalization using a look up table.

ph.no ªî£¬ô«ðC â‡
jan üùõK
rs Ïð£Œèœ
I.e I™L e†ì˜
Figure 2: Format of the Normalization file.

3. DESIGN OF THE G2P MODULE

In a TTS system, the G2P module converts the normalized orthographic text input into the underlying linguistic and phonetic representation. G2P conversion therefore is the fundamental step in a TTS system [1]. The text normalization module inputs a word sequence to the G2P module. The Grapheme to Phoneme conversion of the word sequence can be done using Letter to Sound Rules. The rules are based on a pronunciation dictionary, in which a mapping of the spelling of a word into a sequence of phones can be found. For example, consider an English word, "speech"; the pronunciation dictionary converts this word to the phone sequence S P IY CH. Traditional orthography in some languages, particularly French and English, often does not coincide with pronunciation. However, in other languages such as Spanish and Italian, there is a consistent relationship between orthography and pronunciation (Link).

If there is no pronunciation dictionary, a simple set of rules to convert the graphemic form of a word into the corresponding phonemic form is used. Using such rules is more relevant for Indian languages. In many cases, there is a direct correspondence between what is written and what is spoken. For example, consider a Tamil word, "asiriyar"; the corresponding phonetic transcription is /A/ /s/ /i/ /r/ /i/ /y/ /a/ /r/, which is very similar to the word.

3.1. Mapping from Tamil Graphemes to Phonemes

The grapheme to phoneme (G2P) conversion module receives the sequence of Tamil words from the Text normalization module, which are then converted to a phonetic transcription represented in Roman script. This is obtained by using a character to phone mapping, which gives the corresponding phonemic representation of Tamil graphemes in Roman script. The Roman character (or sometimes, a combination of letters in the case of diphthongs like /ae/ or /au/ or genitives like /kk/ and /tt/) which represents a Tamil phonemic unit may not 'sound' exactly like the Tamil phoneme and is used only as an unambiguous representation. The .wav (speech) files in the inventory are labeled according to this Roman representation. The DSP module in the TTS engine picks up the corresponding speech units (which are labeled according to the mapping), given by the phonemic representation, for concatenation.

Words are converted one by one. Two kinds of rules, inter-word rules and intra-word rules, are applied to the input text during conversion. If the last character in a word is a halanth, and the last but one character is /k/ or /ch/ or /th/ or /p/ and the next word starts with /k/ or /ch/ or /th/ or /p/, respectively, the two words are concatenated. This inter-word rule is very relevant here because, while speaking such words, a speaker does not pronounce the phoneme /k/ or /ch/ or /th/ or /p/ two times. Such junctions of words combine into a single word in which those two phonemes at the junction manifest as a genitive, for example /p/ /p/ becomes /pp/.


The double letters are labeled with a suffix 'l' to the basic phoneme. This is illustrated in Figures 3(a) and 3(b). It was found that the duration of a double letter occurring in the middle of a word and that of one formed at the junction of two words are comparable.
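A sketch of this inter-word rule on already-phonetized words follows (our code, not the MILE implementation; phone symbols follow the paper's Roman notation, and the halanth check is simplified to "the previous word ends in a bare stop"):

def apply_inter_word_rule(words):
    # words: list of words, each a list of phone symbols
    STOPS = {'k', 'ch', 'th', 'p'}
    out = []
    for word in words:
        if out and out[-1][-1] in STOPS and word and word[0] == out[-1][-1]:
            # concatenate the two words; the doubled phone becomes a geminate,
            # labeled with the suffix 'l' (e.g. /p/ /p/ -> /pl/)
            out[-1] = out[-1][:-1] + [word[0] + 'l'] + list(word[1:])
        else:
            out.append(list(word))
    return out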

Tamil uses a syllabic script that is largely phonetic in nature. Thus, in most cases, there is a one-to-one mapping between graphemes and the corresponding phones. The architecture of the Natural Language Processing module is shown in Figure 4. Language specific information is fed into the system in the form of mapping and rules. The default character to phone mapping is defined in the mapping file. The format of the mapping is shown below and explained subsequently.

Format:  Character Type Class Phoneme
Example: Ü V VOW a    ¤ V VOW a

Character: The orthographic representation of the character.
Type: Three types of characters are identified: C-Consonant, V-Vowel and H-Halanth.
Class: The class to which the character belongs. These class labels can be effectively used to write a rule representing a broad set of characters.
Phoneme: The default phonetic representation of the character.

Type gives a broad classification of the characters, while Class mainly classifies the C type characters (consonants) into different clusters like KA, CA, TA, tA, PA and YA. Some examples of the default mapping are shown in Figure 5.

Figure 3: (a) Examples of inter-word rules in Tamil. (b) /p/ /p/ becoming /pp/ (pl).

3.2. Rule Base for Tamil G2P

The Rule Base contains a set of rules that modify the default mapping of the characters based on the context in which a particular phone occurs. Specific contexts are matched using rules. The system triggers the rule that best fits the current context. The rule format is given below.

Format:  *1 *2 .... *m { +1 +2 ... +n }
Example: VOW KA VOW { K:1:X R:2:g K:3:X }

*i: Class label of the ith character as defined in the character phone mapping. Together, these *i's represent the context that is being matched.
+j: The jth action specification node. Each such node has the form Action_Type:Pos:Phoneme_Str.
Action_Type: This field specifies the type of the action performed at this node. Possible values are K (Keep), R (Replace), I (Insert) and A (Append).
Pos: The index of the character being covered by the context of the rule (1 ≤ Pos ≤ m).
Phoneme_Str: The phoneme string output by this action node.

The example rule given above says that if the grapheme /k/ (/k/ belongs to class KA) appears between two vowels (VOW KA VOW), keep the first character (vowel) as it is (K:1:X), replace the second character (/k/) with /g/ (R:2:g), and keep the third character (vowel) as it is (K:3:X).
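A toy illustration of how such a rule could be applied (our code, not the actual MILE implementation; only the K, R and A actions are handled and the I action is omitted):

def apply_rule(classes, phones, rule):
    # classes: class label of each grapheme (e.g. ['VOW', 'KA', 'VOW'])
    # phones:  default phone of each grapheme; rule: (context, actions)
    context, actions = rule
    m, out = len(context), list(phones)
    for start in range(len(classes) - m + 1):
        if classes[start:start + m] == context:
            for a_type, pos, ph in actions:
                idx = start + pos - 1
                if a_type == 'R':            # Replace the default phone
                    out[idx] = ph
                elif a_type == 'A':          # Append to the default phone (e.g. geminate 'l')
                    out[idx] = out[idx] + ph
                # 'K' (Keep, phoneme string 'X') leaves the default phone untouched
    return out

# e.g. apply_rule(['VOW', 'KA', 'VOW'], ['a', 'k', 'a'],
#                 (['VOW', 'KA', 'VOW'], [('K', 1, 'X'), ('R', 2, 'g'), ('K', 3, 'X')]))
#      -> ['a', 'g', 'a']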

Figure 4: Natural Language Processing (NLP) module (Input Tamil text → Text Normalizer, with a Look Up Table → Normalized Tamil text → Grapheme to Phoneme Converter, with the Character Phone Mapping, the G2P Rule Base and the Foreign word Lexicon → Transcribed Text).

Figure 5: Mapping examples: Ü a   è k   å o   Ý A   î t   æ O   ì T   â e


If the same consonant occurs twice consecutively, the genitive is represented by replacing the second grapheme by ‘l’. Ex: TA HAL TA { A:1:l } (T T -> Tl) The reason for using this kind of representation for double letters is that the speech files in the inventory are labeled accordingly and if ‘T T’ is kept as it is, then the DSP module selects two ‘T’ segments instead of selecting a single ‘Tl’ segment. Also, linguistically, TT is a single phone (genitive), not two.

Graphemes exist in Tamil script only for the phonemes /k/, /ch/, /T/, /th/, /p/. But these five graphemes would manifest as /g/, /j/, /D/, /dh/, /b/, respectively, if they occur between two vowels or are prefixed by nasals.

Ex: NAS1 HAL KA { K:1:X K:2:X R:3:g } VOW PA VOW { K:1:X R:2:b K:3:X }

However, there is an exception for /ch/. If it occurs between two vowels, it becomes /s/. Also, /ch/ becomes /s/ when it comes in the beginning of a word. There are some more rules, which are not listed here. Figure 6 gives some sample input Tamil sentences, the normalized Text and the phonetised text.

3.3. Lexicon for handling Foreign Words

The process of corpora phonetization or the development of phonetic lexicons for the western languages is traditionally done by linguists. These lexicons are subject to constant refinement and modification. But the phonetic nature of Indian scripts reduces the effort to building mere mapping tables and rules for the phonetic representation. These rules and the mapping tables together comprise the phonetizers or the Grapheme to Phoneme converters [1].

The G2P rule base cannot be generalized to handle all the words in the input text, especially the proper nouns derived from other languages like Sanskrit and Urdu. For example, the word 'Buddha' is written in Tamil as '¹ˆî£' (putla); the grapheme /p/ in the beginning of the word should be pronounced as /b/ and the /tl/ should be pronounced as /dl/. But there are no such rules in the rule base, since it is basically a Sanskrit word used as it is in Tamil. If a rule is introduced for this purpose, it will affect Tamil words. For example, the Tamil word '¹ˆîè‹' (putlakam) is pronounced as 'putlagam' only. The initial /p/ does not change to /b/, nor does the genitive /tl/ change to /dl/.

To cater to such exception words and proper nouns, a lexicon has been created. The lexicon dictates the phonetic composition, or the pronunciation of each entry in the list. The G2P first looks in the lexicon file for each input word. If the word is present in the lexicon file, its phonetic transcription is taken from the lexicon itself. Otherwise, the G2P applies mapping and the rules to produce the phonetic transcription.
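The lookup-then-fallback behaviour just described reduces to something like the following (our sketch; 'lexicon' maps orthographic words to phone strings and 'g2p_rules' stands in for the mapping-plus-rule-base path):

def phonetize(word, lexicon, g2p_rules):
    if word in lexicon:
        return lexicon[word]     # pronunciation dictated by the lexicon entry
    return g2p_rules(word)       # otherwise apply the default mapping and the rules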

The phonetic transcription generated by the G2P converter can be used for segmentation of the speech corpus. The phonetic transcription can be aligned with the speech waveform and the phone boundaries can be adjusted manually or by automatic speech segmentation algorithms.

4. CONCLUSION

The Tamil grapheme to phoneme conversion module has been developed effectively. Efficient rules have been designed, which cover most of the contexts in which the default mapping needs to be changed. A foreign word lexicon handles cases where the general G2P rules do not give the exact phonetic transcription. This has been used in the Tamil TTS developed around Festival, as well as in an independent, stand-alone TTS. The developed C code for the G2P module is designed to be used for Tamil TTS on both Windows and Linux platforms.

(a) FK«õEJ¡ Hø‰î «îF september 1, 1928. ðô pages ªè£‡ì 21 ï£õ™èœ, 41 CÁè¬îèœ ÜìƒAò 3 ªî£°Š¹èœ, è¡ùì Þô‚Aòˆ¶‚° Üõ¼¬ìò ðƒèOŠ¹èœ.

(b) FK«õEJ¡ Hø‰î «îF ªúŠªì‹ð˜ å¡Á / ÝJóˆ¶ ªî£œ÷£Jóˆ¶ Þ¼ðˆ¶ â†´ ðô ð‚èƒèœ ªè£‡ì Þ¼ðˆ¶ å¡Á ï£õ™èœ, ï£Ÿðˆ¶ å¡Á CÁè¬îèœ ÜìƒAò Í¡Á ªî£°Š¹èœ, è¡ùì Þô‚Aòˆ¶‚° Üõ¼¬ìò ðƒèOŠ¹èœ.

(c) > t i r i w E N i y i n # p i R a n d a # t E d i # s e p T e m b a r # o n R u # # A y i r a tl u # t o Ll A y i r a tl u # i r u b a tl u # e Tl u # # p a l a # p a kl a ng g a L # k o N D a # i r u b a tl u # o n R u # n A w a l g a L $ # n A R p a tl u # o n R u # s i R u g a d ae g a L # a D a ng g i y a # m U n R u # t o g u pl u g a L $ # k a nl a D a # i l a kl i y a tl u kl u # a w a r u D ae y a # p a ng g a L i pl u g a L <

Figure 6: (a) Input Tamil text in ISCII format, (b) Tamil text after normalization, (c) Transcribed output of the G2P converter.

5. REFERENCES

[1] A. Gopalakrishna, Rahul C, Sachin J, Rohit K, Satinder Singh, R. N. V. Sitaram and S. P. Kishore, "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems," Proc. International Conf. Speech and Computer (SPECOM), Patras, Greece, Oct. 2005.
Link: http://en.wikipedia.org/wiki/Phonetic_transcription
[2] Thierry Dutoit, "High-quality Text-To-Speech synthesis: an overview," Journal of Electrical and Electronics Engineering, Australia: Special Issue on Speech Recognition and Synthesis, vol. 17, no. 1, pp. 25-37, 1997.


Speech & Speaker Recognition


��������������� ������������������������������������� ������ ����

��������������������������� �����

 ����������'������~��������~��������������_�������������������� �¡�'¢~~_�£�����������������¤�����~�������������������������������������~������~���������������� ��

~~�������������}���������~����¢¥� ¦§����

��������

�~�� ���������� �� ����������� ����� ���������������������� ��� `�������� ������ ��!�� ���� ��� �������� ������������ <�������� ������ �������������� ��� *������� �����������������������������!����������������������������������� ��������� ��� ���������� *�¢����� ¡�� ����������������������������� ���� ������£� ��������������������!���������������¡������������������������������������ ������� ���������£� ���� �������� ���� ��������<��� ���� �� ��� ����� ��� ������� ���������� ��� �����!����� � ��� ������ ����� ��� �� �������� ����� ��������� �� ���� ��¢��¢�¢��� ����� ��� �� ��� *����`��� �_������_���������¡*�__£������¤��� ������������ ����� �� � � ��� �� �� ���� ��� �� ��������������*�� ������ ��� ������ ��!��� ��� ������������ ���� ��!�������� ��`��� � ������� ������ �� � � ������� ��������������������� �����������!������������<���

������� ����©� /���� ��������������� *�� �������

�������� ������� ����� ��!�� ������� �����������!�����������

� ������������������ ������ ��� ����� �������������� ���� �� ������ ����������� ��� �� ������ �������� ��� �� ��������� ��������<��� ����� ��� ���� �������� ���������� ������������� ������� ���� ���� �������� <��� ���� �� �����!�� ������� �� � ����<�� ����� ¡����� ����� �����������ª�£�� /���� ��������� ������ �� � �� ������ ��� ����������������������������������������� ������������������������������������������������������ ����������������������/����� �������� ��� �� ��������� �������� <��� ����� ��!���������� ��������¢������������������������������������� ������ ��� ���� ����������� ������ ��� �� ������ ��������������<���« ��������������������<���«���������������� ���� �� ����� ��!� ����������� ��� ���� �������� ����� ��� <��� «�� ��� ��� �� <��� « �� �����  � ������ ���������������� �� �� ������ ���������� ¡������ ������ ������ ���� �� !�"��� �������#������ ���� ���� �� ��� ����� ���������������� �<���������£������� ���������������

$������ ���� *��������� <����� ~� ��� ����� ��� ��� ����������� �������� ��������� ��� ���� ���������� ��������������������<������

�%�& � ��� /��� ������� ���� ��� ��������� ������������ �� ��*���������������������������� ���������������¡�£�$����������¡�£�*�����������!�������¦ª�����¦¬�����!� �������� �'����(����(')���)*'��(+���)%%���)��'��

��!��������*�__��������!� ���������'/>��������������������������­�¨�

 £��~�� *�__�� �� ��������� ��� ������� ���� �����������������������`��� ����������������������������� ��������������������� ����������

�£���� ���������� ��� ���������� ��������� �`����������������������������¡�����`��� �������£����������������� ����� ������ ¡��� ��� ������£� ������ �������������� � ����!� ��� �� ���� ��� ��!��!� �� ������������������

¦£������ �������� �� ������� ���� ����� ��������� ���������� ��� ��� ���� ��`��� � ��� ���� ��� *�__�� ��������� ������� ��� ��¢��`��� ���������� ���������������������!� ���������*�__®�������������������!����� ���¢<����� !�������� �� ������ ����� ���������������� ��������������� ��

�� ��,������������������������� ����� ��~�� ���������¢���� ����!�� ��������� ��� ������ ��� ������������������������� ¡ £� � ���������!�������� � ¡ £� � �� ����

[email protected], [email protected]


��������������������������/__�����������

/__�

/����������� �

����������¡�£�

;��������*��

�������������¡��£� �_��

Q�������}���������

;�¢�������

������ �� ������ ��������� ��� ������ ¡� � £� ���� ��������¡� � £�� ��� ��� � ��!�� ��� ������ ��� ����������������!���� <��� �!���� �¥�� ���� ���� ��� � ��� ���

�¡ £ ¡ £� �� � � ¡���������Q����� ��������`���� �������������� ����� ��� � ��������� ���� ����� ������ ������� ���������������£���

� � � �¯� �¡ � £ ¡ £� ¡ £� � � ��� � � � � � � � � ��� �

��

�� ®��� ������������������¡ £�

���� � ¡ £� �

� ����

� � �� � �

� ��

��� �����¢��� ���������� ��� ���������¢�����!�� ����������¡_}�£���!��� �`�� ¡ £�� �������!��� ������� *������� ��������� ���������� ��� ���������¢�����!�� ����� ���� �� �*�� ¡��������� *������ ����£����������� �������������������������������������������������������`��� �������������� �������������`��� � ������� ���� ������������� ����������� ����� �������������������`��� ����������¡��������������� �����������*������£����������!�����������������!���������������������������������������������������¡������`��� �����£����������������������¡�������`��� �����£�� �� ��`��� � ��!������ ��� ���� ������� ����� �����������������������¥��������������!�����������!��������� ��� ������� � ����� ��������};������ � �� ������� ���� ������� ��� �� �� �� ������ � � ����� � �������������������� � ����������������!����������������������� ���������� � ��� � � ���������� ��¥���

� ¡ £ ¡ £ ¡ � £

� �� ���

�� � � � ���

��

���

�� ������ �    ¡ £ ¡ £ ¡ � £�

� �� ���

�� � � � ���

����

�� �� �

��

'�� � �¡ � £� �� �� ��

��

���������������� �

 ¡ £ ¡ £� ¡ � £� � ���� � � � �� �� � ��

��

����� �   ¡ £ ¡ £� ¡ � £ �� � �

��� � � � �� ��� � �

����'�� ������ � � �����`� ���� ���� �¬��� �� ������ ������������� ����������� ������!��� �������������������� ������ ������ ������ ���� �� ����� ���������� ��� ����`��� ��������������������������������������� ������������� ���������� ������ ����� �� *�� ����� �¦������ ��������������� ����� ��� ������� ��� ������ ��������¦��°�������������������/__��������������������*�__������������������������������������������������� ���������� Q������� ����������� �¢������� ���������� � ���������� �� ���� ���� ���������� ��!������ �������� ��� �������� �� �������<�� ������������ � ¡�� ��!� `���� ������� ��� ���� ���¢����£� ���������� ��������������¢������������ ��������_�������

 

¡ §�¬£¡ £ ��� ¡ £ ��� �  ��������� �� �!�"" ! � #�

���

�� � � �� � � � �

� ���������������������

��±����������������������};���� �# ±����������/__��

¡ £!�"" ±� ���/__�� ¡ £! � ±��������<��������������� �

�������¡ � £

¡ £ �

�� � ! �

#�� �� � �

��

�# ±��������!���������������������������������

���$%�&���'���()$*)+&���,�&�%��)�%�%+���~����������`���������������!��������������������/__� ����������� ~�� � � �� �

-��

� ��� �� �`���� ��� ������

�������� ��� � ¡ £� � � ��� ������������� ������������� ����������   �� �- -� � �~������������ � �  �

  §� � �� - -�� � � �/����

� �� ��� ���

��

� ��� ��� ����������� ������ ��� §- �� �� �����������

���� � � � �  ��� �

� � � � �� ���

��

� � �� �� �

�¡_��!������£����������������������¡�£�

����� � � � �  � ����� � � �� �� � � �� �

¡����������������������£�

�`�¡�£� ��� �����������������`���������� �� ������������������ � �� �����������������������������`���� � �� � �����������������������������������������*������������~����������������¡�£����������������

� � � � � � � � � � �� �

� � ��

. .�

� � � ��

�� � �

��������������������¡¦£�

��� ������������ ��!�� � ��� � ���� �� ������� ������������� ��¥���

�����

�����¡ £� �

¡ £� �

����������­¢����������!��������


��� � �  � ��

/ � �� � � � � � � �

� � � ����������������������������������¡­£�

���'� ���� ����� ��� ���� � ��� �� ���������� ���   � ����!����������� §� � �� �

0 �����������~�����������   �� ���� �

���!�������� §� ���� ���������~������������������

� � � � �  ��� �

�� �

�� !��

�� � �� �

� � � �� �� �� �

®�  §

���� ���

!"

# $��������������¡¬£�

������¡¬£����¡¦£������

������������� � � � �� �

� �� �

 � � ���

� ���� � �

�.

� � !��� �

�� �� �� �

� � �� � �� �� ��

'��������������� ��

� � � �  � ��� � �

� �. !�� �� �� �� � �� � � �

������������������!������������!��������������� ���������� � � � �/ � � % �¡���� � �/ � ��������<���� §� £��������������������� §�� ����������������������� � � � � �� ��� � � �/ � �� � !� � � �

���� ��������� ��������� ����������!�� �����!��� �¡­£��

� � � � � � � ����� ­    �

� � ­ ­ ­�

��/

� � !� � �� �

�� � � �� ��� � � � � � �� � � � �� � � � � �� � � �

¡ª£�

��~��������������!��!�����������������!�������������������������   � �������� � �  ­�� ! � �����`������������� ��!��� �� ��!� � ���� ���� ¡¬£� ���� ¡ª£��������

� � � � ¦��� �

�� �

�� ���

� � �� �� � � �

� �� �� ��� � � � � ­

��� ­�

­ ­ �� ��� ��

� �� �� � � � �

� �� �� ��

� � � � � � ¦ ¦  � ��� � �� �� �1 �

. 1� 1. 1�

�� ��

� �� �� �� �& � �� � �� � � ��

���>��������«��������������������������������������������

����� � � � �§� �¬�§�¦¥¬�§�¦¥¬�§� �¬�

� �'

� ¡¥�£����������

� � � � � � � �

� �

  � ��� � � ���

¡ £ §�¬�§�¬�

/ 1 1/ � �

� �

� � !��

'

� �� � � � � � �

� �

� �

����������¡¥�£�

������!��������������������������������������������� /__� ���������� ���� ���� ������ �� ������ ������� ��������������`���������������!��������`��¡¥£����������� ��� � �� � � ���� ¡ £� � � ��� � ���������� ������������¢� ������ !�� ����� � � �~>� ������ ��!���� ������� ������������� ­�� ������ �� ������� ����� ���� ���� ���� ¡����������!£� ���� � ������ ������ ���� �*�� ��!�� ������������� ����� ­�� ������ �� ��`��� � ������ ��� *�� ��������� ������*�__��~��������������������������������� ��`��� � ������ ��� `�������� ����� ������ ��� !� ������� ��� ������� �� ��� ��� *�__� ���� ���� ������������� �� ������ ���� � ���� ����� !����� ��� ��������<��� ���� ���� ������ ���� ����� ��������������� ����� ¬������� ��!�� ¡ £�� � ���� �������� �������� ¡ £�� ���� ����������������������¢�����������¢��������������

2��%²,%*&3%�+)��*%�(�+�������������� �ª�������¡­�����������<��¢���¢���������� ����������������������������������������� ������������ ����£� ��� ����� ���� �� ����� ������� ������������������������������~��������������������!������

�����2���*�¢����������������¡'�����!�������*��������¦��£��

����������`��� ����������*������ ��������� ���������������`����������������!���

������������������������!�����������������������������������������`��� ��������


������ ���������������������� ��������������������������������������������������������¡����������������������� �� �������� ������������ �� �� ���������� ������� ���� ���� �� �������� ����������� �����!�� ��� � �£����� �������������������� �������������� ��������� ��� ������������������������!������������������������������� ������������ � ��� ����� ���� ���� ������������ ������ ���¦������ �������� ����� ���� ��� �!����� ��� ¬§³� �������� � �Q������� ���������� ���� ��¢�������� ��������� ������������ ��� �� ����� !���� ��� ��� ���  �� ���� ���� ����������� '!���� ������� ���� ¡�!�� ����� �������������� ���  ��� ¦��� ¬��� ¥���  §��  ���� ����  ¬�£� ���� ��� ������� �������� <���� ��� ������ ��� ������  � ���� �������!� ���������������������¡�>£����������¡ª§����´§������� �§��£�������������¡�/£��������*�__�����/�__�¡¢/£�¡���������������������`����������������!�£���

�/������������!��������������������������������¨� £�'!���� ������� ���� ������� ���� �� ������� ���

�������� ����� ��������� ���� ���� ���� ���������������� ��� ������� ����� �� � ������� ���������������� ����������������

�£�����!�����������������������������������������<���� ��� ���������� ����� �� �� ¦�� ����� ��� `��� �������������������������<�����������������������������¤��� ������ ���� ��������� ��� �� �������� � ��� ����������������������������������������*���������

¦£�;��������������/�__�¡¢/£�����������*�__���������� ���� �� ����� ��� �������� �������� ���� ��� �!����¤�������¦��­��³�����¦ �§ª�³�����������¦���������<����������!� ��

­£���� ���������������� ����� ������������� ����*�__� ��� ��¤��� � ��� �� ������ ����� �� ��� ��� �� �������������������

¬£������ ��� ����� �� �!������ ��� ������ ¦� ������������`����������������!�������� ���������������������~>� ���� ���� ������ ��� *�� ���������� /���� ������� ����������������������������������������������¡��~��������£������¢��������������������������������� ���� �!�� ��� *�� ��������� ��������� �� ������������������������������������¡����������¢���������� £�� ����� �� �����¢�������������!�� ����������

��� �� ����!� �� �������� <��� �������� ������������� ���������� !����������������������������������¡����� ������ ����������� ������ �� �������������� �������� ���� �� <���� ������ �� �� ���� ������������� ��� ������ ��� ������ ��� �����¢�������� �������� � <��¢�������� ��� ���£� ������ ��� �������*�__���� �������� ���� ����� ��� �� ��`��� ¢������� � ������� ���� ���������� ������ ��`��� � ������� ��� *��������������������������������!��������<��¢��������������������������� ���������������������� �!������� ������� ��� *�� ��`��� � �������� �������*�__�����������������

�����"##$%µ�$&'�()&(*"�+)&��

�~�� ������������������������������������������������ �������� ���� �� ������� ��� ������ ��������������� ������������� ��� ��� � �������� ����� �� ���� ������������*�__���������������!� �����!������¤��� ������������ ��� ������� ���������� ��� /�__� �������� ��*�__���������¤����������������������¢���������� �����~>� ���� ���� ������ ���!�� ����� `�������� ������ ��!������� ��!��������� ������� ��� ��� ���� ����� ���� �� ��������������������������,�����������!���������������������������������������$(�&)�*-'�-#-&.���������������������������������������������������������'¢~~_��/���������������~~��0���������������!��������������������� ����������������������� ������������������ �����;��������1������� ������� ~~~��Q ���������������!�������������������������������������������

%-�-%-&(-���� �� }�� *�� _�������� 0�� ��� '������ ���� _�� _�� ������� ^/�����

����������������� ������������������{������������������������� � ! ���2�����������!���� §�����­�������§¬¢� ���*� ��§§���

���� >��>��_��������1��* ������*��3��}�����������^}�!������ ������������������������{���������������� � ���������������������4������� ��������������!!�5����������� ��������� ¬¦¢ ¥°�� ´´���

�¦�� /�� ��� ��!��� ���� ;�� *�������� ^_���������� ��� �������������������� ���� ����� ������� ����� ���������� ��� ���������� ������� ������{� ����� ������� ��!����� ������� �� � �������2�����������!����'//;¢�°�����­��'��� ´°§��

�­�� ��<���� '�� ��� _��� ����1��1��������� ^/������� ���� ���� ���� ����������� ���������� ��� �� ������ ��� ���� �����{� ��� 2����� �,� �����"��,����� ��!�������������� ��������2������������" ��2674��!��� ������­ ¥¢­�§�� ´´¬��

�¬�� ��� �����`� ���� /�� ����� ^*�� ����¢���� ���������� ��!�� ������������ ���� ����� ����������{� ����� ������� 2���������� ���������!����°������¥����� ��§§ ��

�ª�� ��/��0�����^3�������� ~����������{��#��!����!���� ´ª������­°ª �� �¬¦¢ �¬¥�������´�� ´ª���

�¥�� /�� *������ 8 � �������� ��!�� �,� ������� 2����������9� �� � ��������'�������;����� ´´´��

�°�� >�� /����� ��� ��� ��� ;����� ���� ��� Q�� �� Q������ ^}�!�� ��������������������������������������������������������{������#�� ����������2�����������:�����!������° ¢°­������ ´´°��

.����;��'!���/������>��������� ������'������������������<�����

���������������>��������/� ¦§/� ª§/� ´§/�

*�__� ¬§�§§� ª´�§­� ¥ª�¬¦�/�__¡¢/£� 5<�=;� 5<�5�� 5<��<�

.�������'!���/������>��������� ������'����������������¦�<�����

���������������>��������/� ¦§/� ª§/� ´§/�

*�__� ¦´�ª°� ­§�¬°� ­¥�¦´�/�__¡¢/£� =5�>�� <��?�� <��5=�


SPEAKER RECOGNITION IN A MULTISPEAKER ENVIRONMENT

R. Kumara Swamy 1, B. Yegnanarayana2

1 Dept. of Electronics and Communication Engg., Siddaganga Institute of Technology, Tumkur, India
2 International Institute of Information Technology, Gachibowli, Hyderabad, India

ABSTRACT

In this paper, the problem of speaker recognition in a multispeaker environment is studied. A method for preprocessing multispeaker speech signals is suggested prior to extracting speaker-specific features for recognition. Speech produced by an individual speaker is enhanced using the characteristics of the excitation source of the speech production mechanism. The enhanced speech signal is used in a neural network based speaker recognition system. The effectiveness of the proposed preprocessing method is demonstrated using the relative improvement in the recognition performance on speech data collected in a noisy reverberant multispeaker environment.

Index terms: Multispeaker speech, time-delay, instants of significant excitation, speech enhancement, speaker recognition

1. INTRODUCTION

Voice is a key biometric, and speaker recognition is a critical task in forensic, mobile services and banking applications, especially for distant authentication. Most speaker recognition studies address the issues of mismatched channels and speaker variability between the training and testing phases by providing channel compensation or by using robust features [1–5]. Speaker recognition in a multichannel environment has been proposed in [6], where speech data recorded simultaneously through a mobile phone, PDA, telephone and microphone was used for recognition studies. In most of the above studies the emphasis was on addressing the issues of channel variation and intra-speaker and inter-speaker variability. In real applications various human and environmental factors contribute to recognition errors, namely, the presence of competing speakers, channel mismatch, the emotional state of the speaker and noise effects. The performance of speaker recognition systems degrades substantially for a mixture of speech signals in a multispeaker environment.

One possible solution to this problem of degradation in speech is to use front-end preprocessing prior to feature extraction for recognition. In this paper, a preprocessing method is proposed to enhance the speech produced by the desired speaker from a multispeaker speech signal in a noisy reverberant environment. The effectiveness of using source as well as system features for speaker recognition was shown in [7]. In this study the improvement due to preprocessing in the relative performance of speaker recognition in a multispeaker environment is shown using both the source and system features. A neural network based speaker recognition system [8] is used to test the effectiveness of the preprocessing method.

The paper is organized as follows: Section 2 discusses the setup for collecting data for this study. The proposed preprocessing method is described in Section 3. Section 4 describes the speaker recognition systems using source and system features. Experimental results are given in Section 5. A summary of the paper is given in Section 6.

Fig. 1. Experimental setup used for collecting multispeaker speech data (two microphones and four speaker positions).

2. EXPERIMENTAL SETUP FOR DATA COLLECTION

The speech data for this study was collected using the TIMIT database. Ten sentences spoken by 35 speakers were chosen for this study. The training data for each speaker consists of 8 sentences, and the remaining 2 sentences are used for recognition. The duration of the training data is around 25 sec and that of the testing data is around 5 sec. The training data was played through a loudspeaker, and the speech data was collected by a pair of microphones in a laboratory environment having an average reverberation of about 500 ms. The speech data was sampled at 16000 Hz and stored as 16-bit samples. A set of 60 speakers different from the training set were chosen from the TIMIT database as interfering speakers. The test data was collected using a pair of microphones by playing the desired speaker speech and the competing speaker's speech through different channels. The setup used for collecting the data is shown in Fig. 1. A total of 700 test utterances were recorded for testing against each speaker model.

3. PROPOSED PREPROCESSING METHOD

The mixed speech signals collected by a pair of microphones consist of the direct component of speech from different speakers, the reflected component of speech and the background noise. The performance of the recognition system will be poor if we extract the features directly from the mixed speech signal. Instead, we can enhance the speech of the desired speaker by preprocessing the collected speech signal so that the extracted features provide a better representation of speaker-specific information.

[email protected],[email protected]


Fig. 2. Block diagram of the proposed preprocessing method: time-delay computation, time-delay compensation, and weight function computation and enhancement applied to the Mic-1 and Mic-2 signals before feature extraction.

The spatial separation of the speakers relative to the microphones can be used to extract speaker-specific information from the collected speech signal. There is a time-delay in the arrival of the speech signals produced by a speaker at a pair of microphones, and this delay is different for different speakers. It is assumed that the speakers are stationary and are not positioned along the perpendicular bisector of the line joining the microphones. Estimates of the time-delays, along with the characteristics of the excitation source of the speech production mechanism, are used to enhance the speech due to the desired speaker relative to the competing speakers. The block diagram of the proposed method is shown in Fig. 2. The delays are computed using the cross-correlation of the Hilbert Envelopes (HE) derived from the Linear Prediction (LP) residuals of the two microphone signals [9]. Fig. 3 shows the plot of the number of frames (histogram) as a function of the delay.
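A rough Python/NumPy sketch of this delay-estimation step is given below. It assumes the LP residuals of the two microphone signals are already available (LP analysis is not shown), computes their Hilbert envelopes, and, frame by frame, picks the lag that maximizes the cross-correlation; the frame length and lag search range are illustrative values, not the ones used in the paper.

```python
# Sketch: frame-wise time-delay estimation from the Hilbert envelopes (HE)
# of the LP residuals of two microphone signals. LP residual computation is
# assumed to be done elsewhere; frame length and lag range are illustrative.
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Magnitude of the analytic signal."""
    return np.abs(hilbert(x))

def frame_delays(residual1, residual2, fs, frame_ms=50, max_delay_ms=2.0):
    """Estimated delay (in ms) for each analysis frame."""
    h1, h2 = hilbert_envelope(residual1), hilbert_envelope(residual2)
    frame = int(fs * frame_ms / 1000)
    max_lag = int(fs * max_delay_ms / 1000)
    delays = []
    for start in range(0, min(len(h1), len(h2)) - frame, frame):
        a = h1[start:start + frame] - h1[start:start + frame].mean()
        b = h2[start:start + frame] - h2[start:start + frame].mean()
        corr = np.correlate(a, b, mode="full")       # all lags -(N-1)..(N-1)
        lags = np.arange(-frame + 1, frame)
        keep = np.abs(lags) <= max_lag               # restrict the search range
        best = lags[keep][np.argmax(corr[keep])]
        delays.append(1000.0 * best / fs)
    return np.array(delays)

# A histogram of the returned delays (as in Fig. 3) shows one prominent
# peak per active speaker.
```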

Fig. 3. Delay plot (percentage of frames versus delay in ms) showing the number of speakers.

The number of prominent delays indicates the number of speakers, as is clearly evident from Fig. 3.

Fig. 4. HE of (a) mic-1 speech signal, (b) mic-2 speech signal after time-alignment, and (c) sample minima of (a) and (b) emphasizing instants of speaker-1.

The knowledge of the time-delays associated with the individual speakers relative to the microphones can be used to emphasize the instants of significant excitation of a speaker relative to the others [9]. For a three-speaker case, this is achieved using

g1(n) = min(h1(n), h2(n − τ1)),   (1a)

g2(n) = min(h1(n), h2(n − τ2)),   (1b)

g3(n) = min(h1(n), h2(n − τ3)),   (1c)

where h1(n) and h2(n) are the HEs derived from the two microphone signals, and τ1, τ2 and τ3 are the time-delays corresponding to speaker-1, speaker-2 and speaker-3, respectively. The sample minima of the HEs computed in Eqn. (1a) help in reducing the effect of spurious peaks, while preserving the genuine instants of speaker-1, as shown in Fig. 4(c). The excitation component of the speech produced by the desired speaker can be further reinforced by deriving a weight function using the knowledge of the delays. The difference e(n) between the HEs g1(n) and g2(n) emphasizes the instants of excitation of the desired speaker-1 relative to the instants of speaker-2:

e(n) = g1(n) − g2(n).   (2)

Similarly, the difference between the HEs g1(n) and g3(n) emphasizes the instants of excitation of the desired speaker-1 relative to the instants of speaker-3. The difference of the minima shows the instants of the desired speaker as positive and the instants of the competing speakers as negative instants (Fig. 5(a)). The nonlinear weight function is computed as

w(n) = 1 / (1 + e^(−α(e(n) − β))),   (3)

where α and β determine the slope and the threshold of the weight function. This function is used to reinforce the instants of the desired speaker while suppressing the instants corresponding to the competing speakers. The plots of the smoothed difference signal and the weight function are shown in Fig. 5(b) and (c), respectively. The weight function can be used to enhance the speech produced by the target speaker by weighing the mixed speech signal with the weight function. This signal carries more desired-speaker-specific information than the collected speech signal, as illustrated in the studies given in the subsequent sections.
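The NumPy sketch below strings Eqs. (1)–(3) together for the desired speaker against one competing speaker: time-aligned HEs are combined by a sample-wise minimum, their difference is smoothed, passed through the sigmoid weight function and applied to the mixed signal. The smoothing window, the α and β values and the circular-shift alignment are illustrative simplifications, not the paper's exact settings.

```python
# Sketch of the enhancement built from Eqs. (1)-(3) for one competing
# speaker: h1, h2 are the HEs of the LP residuals of the two microphone
# signals, tau1/tau2 are the per-speaker delays in samples, and `mixed`
# is the mixed microphone signal. Parameter values are illustrative.
import numpy as np

def align(x, tau):
    """Delay-compensate x by tau samples (circular shift, for brevity)."""
    return np.roll(x, -tau)

def enhance_desired(mixed, h1, h2, tau1, tau2, alpha=10.0, beta=0.05, win=80):
    g1 = np.minimum(h1, align(h2, tau1))          # Eq. (1a): desired speaker
    g2 = np.minimum(h1, align(h2, tau2))          # Eq. (1b): competing speaker
    e = g1 - g2                                   # Eq. (2): difference of minima
    e_smooth = np.convolve(e, np.ones(win) / win, mode="same")  # smoothing
    w = 1.0 / (1.0 + np.exp(-alpha * (e_smooth - beta)))        # Eq. (3)
    n = min(len(w), len(mixed))
    return w[:n] * mixed[:n]                      # weighted (enhanced) signal
```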


Fig. 5. (a) Difference of minima emphasizing the desired speaker, (b) smoothed difference signal, and (c) weight function emphasizing the excitation source component of the desired speaker.

4. SPEAKER RECOGNITION STUDIES

The preprocessed speech signal is used for capturing speaker-specific features for recognition. Speech data for training is processed in the manner explained in the previous section, and then it is used for extracting both source and system features. This ensures that the network is trained and tested with the preprocessed data. The neural network based speaker recognition system described in [8] is used for the recognition studies. Both the source and system features are derived using linear prediction analysis [10]. A 14th order linear prediction analysis is performed for every frame (20 ms) on preemphasized (differenced) speech. 19 linearly weighted cepstral coefficients are derived from the 14 LPCs to represent the short-time spectral envelope [5, 11]. The 19-dimensional weighted cepstral coefficient feature vector is used to represent the vocal tract system characteristics. The linear prediction residual is used to derive the source characteristics. The hypothesis is that the source characteristics of the speaker may be present in the higher (> 2) order correlations among the samples, which are difficult to extract. The speaker-specific information in the excitation of voiced speech, as well as the system features, are captured by autoassociative neural network (AANN) models. A five layer AANN model with nonlinear hidden layers and a linear output layer is used. The structure of the AANN model used for source features is 40L48N12N48N40L, where L (linear) and N (nonlinear) represent the type of activation function used by the nodes in that layer. A tanh(.) function is used for the nonlinear function. The structure used with the system features is 19L38N6N38N19L. The training error for each epoch typically shows a decreasing trend as shown in Fig. 6, signifying the learning ability of the network. The AANN model now characterizes the subsegmental excitation features present in the LP residual signal or the distribution of the weighted LPCC vectors. One model is built for each speaker. The error curves for AANNs trained with source and system features are shown in Figs. 6 and 7, respectively.
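For illustration only, a PyTorch sketch of the five-layer AANN topology described above (the 19L38N6N38N19L structure used with system features) is given below; the optimizer and learning rate are assumptions, and the authors' actual training setup is not reproduced here.

```python
# Sketch of the five-layer autoassociative network 19L 38N 6N 38N 19L:
# linear input/output layers, tanh hidden layers, trained to reconstruct
# its own input with mean squared error. One model is built per speaker.
import torch
import torch.nn as nn

class AANN(nn.Module):
    def __init__(self, dims=(19, 38, 6, 38, 19)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dims[0], dims[1]), nn.Tanh(),   # 19L -> 38N
            nn.Linear(dims[1], dims[2]), nn.Tanh(),   # 38N -> 6N  (compression)
            nn.Linear(dims[2], dims[3]), nn.Tanh(),   # 6N  -> 38N
            nn.Linear(dims[3], dims[4]),              # 38N -> 19L (linear output)
        )

    def forward(self, x):
        return self.net(x)

model = AANN()                # use dims=(40, 48, 12, 48, 40) for source features
criterion = nn.MSELoss()      # reconstruction (training) error
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed settings
```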

The trained AANN model is used to find the similarity or dissimilarity of the characteristics of any given test signal with respect to those used while training. Both source and system features extracted from the preprocessed test data are used for testing a given speaker utterance.

Fig. 6. AANN training error curves for source features extracted from different data: (a) collected data, (b) processed data, (c) TIMIT data.

The mean square error obtained for each frame is used as a measure of dissimilarity: the larger the error, the farther the frame characteristics are from those used for training. This frame error e(n) is converted into a normalized (0 to 1) similarity measure, known as the confidence score, given by c(n) = exp(−e²(n)). The average frame confidence, Cavg = (1/Nt) Σ_{n=1}^{Nt} c(n), gives the similarity between the training sequence and the test sequence, where Nt is the number of frames in the test signal.
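In NumPy, the confidence score c(n) = exp(−e²(n)) and the average frame confidence Cavg follow directly from the per-frame errors:

```python
# Confidence scoring from per-frame reconstruction errors e(n):
# c(n) = exp(-e^2(n)), Cavg = mean of c(n) over the Nt test frames.
import numpy as np

def average_confidence(frame_errors):
    c = np.exp(-np.asarray(frame_errors, dtype=float) ** 2)
    return float(np.mean(c))
```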

5. RESULTS OF SPEAKER RECOGNITION

The recognition studies were carried out on the recorded one-speaker speech data with and without preprocessing. The results are given in Table 1. This forms the baseline system for our study. It can be observed that the recognition performance of the system with the data collected in a real environment is poor compared to the performance with the clean TIMIT data. The degradation may be due to the effect of background noise and reverberation. There is an improvement in the recognition performance due to preprocessing. The test data of 700 (20 × 35) utterances is tested against the models of all the 35 speakers in the set. The average confidence value for each of the 35 models for the given test data is computed (35 × 700). There are 35 × 680 imposter utterances and 700 genuine speaker utterances. The confidence value is used to rank the speaker models. Ideally a genuine speaker utterance should have the highest confidence score, and thus have rank one. The top five ranks were considered in evaluating the recognition performance. With preprocessing, the speech produced by the speaker is enhanced, and hence the recognition performance improves. This trend is evident in Table 2. As the number of competing speakers increases, the difficulty in extracting speaker-specific information from the mixed signal increases. This decreases the overall performance of the system, as illustrated by the results. The relative improvement in the performance of speaker recognition with preprocessing is clearly seen from the results.

6. SUMMARY

In this study we have examined the significance of preprocessing in the context of speaker recognition in a multispeaker environment.


Fig. 7. AANN training error curves for system features extracted from different data: (a) collected data, (b) processed data, (c) TIMIT data.

Table 1. Performance of the baseline system.

Speech           Source Features   System Features
TIMIT Data       94.3 %            100 %
Recorded Data    83.0 %            77.0 %
Processed Data   88.0 %            89.8 %

The enhancement of the speech produced by the target speaker prior to feature extraction is shown to improve the recognition performance. The speaker recognition studies using both source and system features on the same database substantiate this point. The recognition performance can be further improved by tuning the parameters of the recognition system. In the present study, the emphasis was on evaluating the enhancement technique in a speaker recognition application in a multispeaker environment. We have not made any attempt to optimize the parameters of the model used for feature extraction, or the decision making stage. The average confidence scores of all the blocks were used for ranking the test utterance. A better approach would be to determine a suitable weighting for each block, depending on the nature of the signal in that block. Further, one can have multiple models and multiple test segments, providing multiple evidences for taking a decision about the speaker. This may improve the performance for a given amount of data.

7. REFERENCES

[1] B. S. Atal, “Automatic recognition of speakers from their voices,” Proc. IEEE, vol. 64, pp. 460–475, Apr. 1976.

[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker recognition using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, 2000.

[3] S. R. M. Prasanna, C. S. Gupta and B. Yegnanarayana, “Extraction of speaker-specific excitation information from linear prediction residual of speech,” Speech Communication, vol. 48, no. 10, pp. 1243–1261, 2006.

[4] J. P. Campbell, “Speaker recognition: A tutorial,” Proc. IEEE, vol. 85, pp. 1436–1462, 1997.

Table 2. Results of speaker recognition.

            Source Features                   System Features
Case        Collected Data   Processed Data   Collected Data   Processed Data
2 spk       31.2 %           59.8 %           49.3 %           64.0 %
3 spk       29.5 %           52.1 %           33.3 %           56.2 %
4 spk       28.4 %           56.9 %           31.1 %           50.6 %

[5] S. Furui, An Overview of Speaker Recognition Technology, in Automatic Speech and Speaker Recognition, Kluwer Academic, Boston, U.S.A., 1996.

[6] Lifeng Sang, Zhaohui Wu, and Yingchun Yang, “Speaker recognition system in multi-channel environment,” IEEE Trans. Systems, Man, and Cybernetics, vol. 4, no. 10.

[7] B. Yegnanarayana, K. Sharat Reddy, and S. P. Kishore, “Source and system features for speaker recognition using AANN models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, May 2001, vol. 1, pp. 409–412.

[8] B. Yegnanarayana, S. R. M. Prasanna, Jinu M. Z., and C. S. Gupta, “IIT Madras speaker verification system,” in Proc. NIST Speaker Recognition Workshop, Vienna, U.S.A., May 2002.

[9] R. Kumara Swamy, K. Sri Rama Murty, and B. Yegnanarayana, “Determining the number of speakers from multispeaker speech signals using excitation source information,” IEEE Signal Processing Letters, vol. 14, no. 7, pp. 481–484, July 2007.

[10] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, pp. 561–580, Apr. 1975.

[11] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993.


SPEAKER RECOGNITION IN LIMITED DATA CONDITION

H. S. Jayanna and S. R. Mahadeva Prasanna

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

Email:{h.jayanna,prasanna}@iitg.ernet.in

ABSTRACT

In this paper we mention the significance of speaker recognition in limited data conditions and then highlight the issues involved in the development of such speaker recognition systems. The possible directions for the development of such a speaker recognition system are mentioned. Finally, the usefulness of variable segmental analysis in the development of a speaker recognition system under limited data conditions is demonstrated.

Index Terms: Speaker recognition, limited data, FSA, VSA

1. INTRODUCTION

Speaker recognition refers to the task of recognizing speakers from their speech [1]. State-of-the-art speaker recognition systems are based on the assumption that enough speech data is available both during training and during testing. However, there are situations in real life where only a limited amount of speech data is available for speaker recognition. These situations may be termed non-cooperative scenarios. For instance, speech data may be collected from remote or border areas where the speaker (possibly a terrorist) speaks only for a few seconds, and the task here is to identify the person. Also, in a forensic investigation, we may have only a few seconds of speech data and the task is to validate the identity of the speaker. Further, there are recent applications like spoken document retrieval, where it is desirable to be able to detect and track the presence of a group of speakers [2]. In such applications it is desirable to model as well as detect speakers using a limited amount of speech data [2]. In the case of multi-speaker speech, speaker recognition will be easier if we are able to model and detect speakers using limited data. Thus all these applications benefit from speaker recognition technology that can give reliable performance under limited data conditions. In the present work, enough data symbolizes the case of having speech data of a few minutes (≥ one minute). Alternatively, limited data symbolizes the case of having speech data of a few seconds (≤ 10 seconds) [2].

Speaker recognition in limited data conditions refers to the task of recognizing speakers where both the training and the test speech data are available only for a few seconds. The different stages involved in the development of a speaker recognition system may be broadly grouped into four stages: analysis, feature extraction, training and testing. The analysis stage involves the proper choice of frame size and shift for feature extraction. The frame size and shift for speech analysis can be as low as 3-5 msec to extract speaker-specific excitation source features [3], of medium size like 10-30 msec to extract speaker-specific vocal tract information [4], and as high as 100-300 msec to extract speaker-specific suprasegmental information [5]. The feature extraction stage involves the proper choice of signal processing methods to extract relevant speaker-specific features from the speech signal. This may include non-linear models for extracting excitation source information [3], cepstral analysis for extracting vocal tract information [4, 6], and pitch and duration extraction methods for extracting suprasegmental information [7]. The training stage involves developing a reference model for each speaker from the set of feature vectors extracted from the training speech data. This stage may use Vector Quantization (VQ), Gaussian Mixture Models (GMM), Neural Networks (NN) or Hidden Markov Models (HMM) [6, 8–11]. Finally, the testing stage involves comparing the features extracted from the test speech data with each of the reference models to recognize the speaker of the test data.

In most state-of-the-art speaker recognition systems the analysis stage uses a frame size and shift in the range of 10-30 msec, Mel Frequency Cepstral Coefficients (MFCC) are extracted as feature vectors, and GMM is used for building the reference model for each speaker. Finally, a simple Euclidean distance measure is employed in the testing stage. This selection works well when enough data is available for training and testing. Alternatively, if the available data is only a few seconds, then the 10-30 msec choice provides few feature vectors, the GMM may not model the speaker well due to the sparse distribution of features, and as a result the performance of the speaker recognition system will be poor. Thus, to develop a speaker recognition system under limited data conditions which has practically significant performance, there may be two possibilities. The first one is to develop new methods in each of the four stages of the speaker recognition system. The second one is to improve the current method in each stage so that the overall performance of the system improves.


Even though the first option is interesting, coming up with new methods may be time consuming. The second approach is preferable because the existing methods at the various stages have been developed over many years and their intricacies are better understood. For instance, in recent work on speaker recognition in limited data conditions, the speakers are modelled using the concept of a Universal Background Model (UBM) for obtaining improved models [2]. Also, as will be demonstrated in this work, a variant of conventional segmental analysis [12–14], termed variable segmental analysis [15–17], is used in the analysis stage, which is found to improve the performance of speaker recognition in limited data conditions significantly.

The rest of the paper is organized as follows: The issues and possible directions for the development of speaker recognition in limited data conditions are discussed in Section 2. The development of a speaker recognition system using variable segmental analysis is described in Section 3. The speaker recognition experiments conducted and their results are discussed in Section 4. A summary of the present work and the scope for future work are given in Section 5.

2. SPEAKER RECOGNITION IN LIMITED DATA CONDITIONS

The main difficulty in the development of a speaker recognition system under limited data conditions is the availability of only a very small amount of speech data for training and testing. This leads to the generation of very few feature vectors in the feature extraction stage. For training, these few feature vectors will be spread sparsely in the feature space. Hence most of the models, which depend upon capturing the distribution (clustering) of feature vectors in the feature space, may not get trained well. Therefore, the training stage produces a poor speaker model. Also, during testing, the few vectors may be insufficient for comparison with the speaker models to obtain conclusive evidence about the speaker.

For the extraction of feature vectors, speech is mostly processed by segmental analysis, where the frame size and shift are chosen in the range of 10-30 msec. This is mainly based on the quasi-stationarity of the vocal tract information. However, this process leads to very few vectors under limited data conditions. One possibility is to reduce the frame size and shift. However, too small a frame size may lead to poor spectral resolution, and too small a frame shift may not capture any significant change in the speaker information. Hence, if we are interested in extracting spectral features representing speaker-specific information, then we may have to design a new spectral analysis technique which eliminates the above mentioned limitations and generates relatively more feature vectors. In the feature extraction stage, mostly cepstral features (i.e., MFCC) are extracted as feature vectors due to their consistently good performance when enough data is available. Alternatively, under limited data conditions, since they form a sparse distribution in the feature space, it may be better to look into other feature extraction methods, preferably capturing complementary speaker-specific information. For instance, we may extract feature vectors representing speaker-specific excitation information [3]. We may then have relatively more feature vectors for training as well as testing.

In the training stage, mostly models based on capturing the distribution of the feature vectors in the feature space are used. These include GMM, NN and VQ. Since the number of feature vectors is very small, the trained models may not best represent the speaker. Hence, it may be better to use new models which do not depend on the distribution capturing principle. Alternatively, multiple models which are based on different principles for distribution capturing may be developed for the same limited amount of training data.

In the testing stage, since the test speech data is limited, the number of feature vectors available for comparing that data with the reference speaker models will be small. Accordingly, the testing method may be designed to give consistently good decisions. For instance, a weighted spectral distance may be developed based on the frame energy, which emphasizes evidence from high energy frames and uses it as the evidence of multiple frames.
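A minimal sketch of such an energy-weighted combination of frame-level evidence is given below; the energy-proportional weighting is an illustrative choice, since the paper only outlines the idea.

```python
# Sketch: combine per-frame distances to a speaker model using frame
# energies as weights, so evidence from high-energy frames dominates.
# The energy-proportional weighting is an illustrative assumption.
import numpy as np

def energy_weighted_distance(frame_distances, frame_energies):
    d = np.asarray(frame_distances, dtype=float)
    w = np.asarray(frame_energies, dtype=float)
    w = w / (w.sum() + 1e-12)          # normalize weights to sum to one
    return float(np.sum(w * d))        # energy-weighted average distance
```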

3. SPEAKER RECOGNITION IN LIMITED DATA CONDITIONS USING VARIABLE SEGMENTAL ANALYSIS

In this work we demonstrate that by improving the analysis stage of the speaker recognition system, it is indeed possible to generate a relatively large number of feature vectors for training and testing under limited data conditions. This leads to better modelling during training, better comparison during testing and hence significantly improved performance. In speaker recognition systems based on spectral information, mostly representing the vocal tract, speech is analyzed in frames of 10-30 msec. For instance, the frame size may be 20 msec and the frame shift 10 msec. This is commonly termed segmental analysis. Once the frame size and frame shift are chosen, they are kept constant throughout the study. Hence in the present work we call this Fixed Segmental Analysis (FSA). Since it generates only a few feature vectors in limited data conditions, we are interested in modifying this segmental analysis such that more feature vectors are generated. For this, the frame size and shift are varied in the range 5-30 msec, and for each value of frame size and shift, feature vectors are extracted from the limited amount of data. It should be noted that, as mentioned earlier, since we are extracting spectral information corresponding to the vocal tract, the frame size and frame shift cannot be made too small, which leads to poor spectral resolution. Hence, the minimum frame size and shift is chosen as 5 msec.


Fig. 1: Features of three speakers for 50 msec of speech data (plotted in the space of cepstral coefficients c1, c2, c3): (a), (b) and (c) are features extracted with a 20 msec frame size and 10 msec frame shift, and (d), (e) and (f) are the corresponding VSA based features.

Similarly, the frame size and shift cannot be made too large due to the violation of the quasi-stationarity property. Hence the maximum frame size and shift is chosen as 30 msec. Since we are varying the frame size and shift during the analysis of the same speech data, we call this Variable Segmental Analysis (VSA). The proposed VSA is similar to the Variable Frame Size (VFS) and Variable Frame Rate (VFR) analysis employed for speech recognition and language identification tasks [15–17]. The main distinction is that in the present work we do not compute the spectral change, which is used as the control parameter for VFS and VFR analysis of speech [12–14]. This is mainly because in speaker recognition we are mainly interested in high energy voiced regions, which may not have much spectral change.
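A sketch of VSA-style feature pooling is given below: 13-dimensional MFCC vectors are computed for every frame size and shift from 5 to 30 msec (in 5 msec steps) and pooled. Whether all size/shift pairs or only matched pairs are used is not fully specified in the paper, so the sketch takes all pairs; librosa is used purely for convenience and is not what the authors used.

```python
# Sketch of Variable Segmental Analysis (VSA): extract 13-dimensional MFCCs
# repeatedly with frame sizes and shifts from 5 to 30 msec (5 msec steps)
# and pool all resulting vectors. librosa is used here only for illustration.
import numpy as np
import librosa

def vsa_mfcc(y, sr, n_mfcc=13, sizes_ms=range(5, 35, 5), shifts_ms=range(5, 35, 5)):
    features = []
    for size_ms in sizes_ms:
        for shift_ms in shifts_ms:
            win = int(sr * size_ms / 1000)
            hop = int(sr * shift_ms / 1000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                        win_length=win, hop_length=hop)
            features.append(mfcc.T)     # shape: (frames, n_mfcc)
    return np.vstack(features)          # pooled VSA feature set

# Fixed Segmental Analysis (FSA) corresponds to a single pair, e.g.
# vsa_mfcc(y, sr, sizes_ms=[20], shifts_ms=[10]).
```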

To illustrate how the proposed VSA technique provides a relatively large number of feature vectors compared to conventional FSA, a 50 msec speech segment of each of three speakers is taken separately. It is processed by FSA with a frame size of 20 msec and a shift of 10 msec; cepstral features are extracted for each frame of each speaker and plotted in Fig. 1(a), (b) and (c). Alternatively, the same speech segment of each speaker is processed by the proposed VSA; cepstral features are extracted for each frame and plotted in Fig. 1(d), (e) and (f). As can be observed pictorially, the feature space in Fig. 1(d), (e) and (f) is densely populated compared to Fig. 1(a), (b) and (c). This figure shows that even though the speech data is limited, the proposed VSA produces relatively more feature vectors. Further, the presence of the features at different places may represent speaker-specific information, and this may be quite different from the features obtained by mere interpolation. Therefore, the training stage using VSA may produce relatively better speaker models, and the testing stage may also yield better comparison results.

4. SPEAKER RECOGNITION STUDIES

The YOHO speaker recognition database is used for the present work [18]. The database contains speech of 138 speakers collected over telephone channels spread over a duration of about 3 months. Each speaker has about 96 speech data files, each of about 3 sec duration, for training, and 40 speech data files, each of about 3 sec, for testing. Among these, we have taken the first four files of every speaker to create the database for the speaker recognition study under limited data conditions. The frame size and shift are chosen in the range from 5 msec to 30 msec with a successive step of 5 msec. In the training case, the first speech file of about 3 sec duration from the training set is taken.


Fig. 2: Performance (%) of the speaker recognition system based on FSA and the proposed VSA analysis techniques, as a function of the amount of testing data (in sec), for codebook sizes 16, 32, 64 and 128, for 138 speakers.

MFCC feature vectors of dimension 13 are extracted from the different frames for both the FSA and VSA techniques. The models are trained using different codebook sizes from 16 to 128. For testing, the first speech file of about 3 sec duration from the testing set is taken, and feature vectors are extracted by the FSA and VSA techniques. The feature vectors are compared with the reference speaker models built during training. The same procedure is repeated for 2 and 4 speech data files. The performance of the speaker recognition system for different codebook sizes as well as different amounts of speech data is shown in Fig. 2. As can be observed, the proposed VSA analysis produces significantly better performance compared to FSA. It provides an average improvement of about 7.5%. It is interesting to note that the improvement in the performance is significantly better (about 18.5%) when the amount of data is less, that is, when only one speech file is used.
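A minimal sketch of the VQ modelling and matching implied above, using SciPy's vector-quantization utilities, is given below (one k-means codebook per speaker, test score = average distortion); this is an illustration under assumed details, not the authors' exact recipe.

```python
# Sketch: VQ speaker models. One k-means codebook per speaker (sizes 16-128);
# a test utterance is scored by its average distortion against each codebook,
# and the speaker with the smallest distortion is selected.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(train_features, codebook_size=32):
    codebook, _ = kmeans(np.asarray(train_features, dtype=float), codebook_size)
    return codebook

def score(test_features, codebook):
    _, distortions = vq(np.asarray(test_features, dtype=float), codebook)
    return float(np.mean(distortions))            # lower = better match

def identify(test_features, codebooks):
    """codebooks: dict mapping speaker id -> trained codebook."""
    scores = {spk: score(test_features, cb) for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)
```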

5. SUMMARY AND CONCLUSIONS

This work presented the task of speaker recognition in limited data conditions. It began by mentioning the significance of such a task. The issues involved in the development of such a system and possible directions for developing a speaker recognition system suitable for limited data conditions were mentioned. Finally, it was demonstrated that even by improving only the analysis method employed, we can improve the performance of a speaker recognition system in limited data conditions.

As mentioned earlier, similar improvements can be achieved at the other stages in the development of a speaker recognition system. This may lead to further improvement in the performance.

6. REFERENCES

[1] Bishnu S. Atal, “Automatic recognition of speakers from their voices,” Proc. IEEE, vol. 64, no. 4, pp. 460–475, Apr. 1976.

[2] Pongtep Angkititrakul and John H. L. Hansen, “Discriminative In-Set/Out-of-Set Speaker Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 498–508, 2007.

[3] S. R. Mahadeva Prasanna, Cheedella S. Gupta, and B. Yegnanarayana, “Extraction of speaker-specific excitation information from linear prediction residual of speech,” Speech Communication, vol. 48, pp. 1243–1261, 2006.

[4] John Deller, John Hansen and John Proakis, Discrete-Time Processing of Speech Signals, 1st ed., IEEE Press, 1993.

[5] G. Doddington, “Speaker Recognition based on Idiolectal Differences between Speakers,” presented at Eurospeech, pp. 2521–2524, Aalborg, Denmark, 2001.

[6] Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

[7] L. Mary, K. Sreenivasa Rao, Suryakanth V. Gangashetty and B. Yegnanarayana, “Neural Network Models for Capturing Duration and Intonation Knowledge for Language and Speaker Identification,” in Proc. Int. Conf. on Cognitive and Neural Systems, Boston, May 2004.

[8] D. O'Shaughnessy, Speech Communications: Human and Machine, 2nd ed., IEEE Press, 1999.

[9] B. Yegnanarayana and S. P. Kishore, “AANN: an alternative to GMM for pattern recognition,” Neural Networks, vol. 15, pp. 459–469, 2002.

[10] D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, Aug. 1995.

[11] Kelvin R. Farrell, Richard J. Mammone, and Khaled T. Assaleh, “Speaker recognition using neural networks and conventional classifiers,” IEEE Transactions on Acoustics Speech and Audio Processing, vol. 2, no. 1, pp. 194–205, Jan. 1994.

[12] S. Peeling and K. Ponting, “Variable frame rate analysis in the ARM continuous speech recognition system,” Speech Communication, vol. 10, pp. 169–179, 1996.

[13] K. Ponting and S. Peeling, “The use of variable frame rate analysis in speech recognition,” Computer Speech and Language, vol. 5, pp. 169–179, 1991.

[14] P. L. Cerf and D. V. Compernolle, “A new variable frame rate analysis method for speech recognition,” IEEE Signal Processing Letters, vol. 1, pp. 185–187, Dec. 1994.

[15] T. Nagarajan, “Implicit systems for spoken language identification,” Ph.D. dissertation, Indian Institute of Technology Madras, Chennai, India, 2004.

[16] G. L. Sarada, N. Hemalatha, T. Nagarajan and H. A. Murthy, “Automatic transcription of continuous speech using unsupervised and incremental training,” Int. Conf. on Spoken Language Processing (ICSLP), vol. 18(2), Korea, 2004.

[17] G. L. Sarada, T. Nagarajan and H. A. Murthy, “Multiple frame size and multiple frame rate feature extraction for speech recognition,” Int. Conf. on Signal Processing and Communication (SPCOM), IISc Bangalore, 2004.

[18] J. P. Campbell, Jr., “Testing with the YOHO CD-ROM voice verification corpus,” in Proc. IEEE Int. Conf. Acoust., Speech and Signal Processing (ICASSP), pp. 341–344, 1995.


EMOTION RECOGNITION USING MULTILEVEL PROSODIC INFORMATION

K. Sreenivasa Rao1, S. R. M. Prasanna2 and T. Vidya Sagar3

1School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur-721302, West Bengal
2Department of ECE, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

ABSTRACT

In this paper we propose a method for recognizing emotions using prosody models developed from speech at different levels. The basic idea is that the combined characteristics of prosody from different levels will provide an effective discrimination among the basic emotions. The emotions considered in this study are anger, happy, compassion and neutral. The prosodic features used for developing the models are duration patterns, intonation patterns and energy. The features are extracted from speech at different levels such as the utterance, word and syllable levels. The SUSE (speech under simulated emotions) database is used for developing the prosody models and performing the emotion recognition task. The results indicate that the recognition of emotions is better when using the combined prosodic knowledge derived from multiple levels compared to models derived from any one level.

Index Terms— Duration, Emotion, Emotion recognition, Energy, Intonation, Prosody, Utterance level, Word level and Syllable level.

1. INTRODUCTION

Human beings use emotions extensively for expressing their intentions through speech. It is observed that the same message (text) can be conveyed in different ways by using appropriate emotions. At the receiving end, the intended listener will interpret the message according to the emotions present in the speech. In general, the speech produced by human beings is embedded with emotions. Therefore, in developing speech systems (i.e., speech recognition, speaker recognition, speech synthesis and language identification), one should appropriately exploit the knowledge of emotions. But most of the existing systems do not use the knowledge of emotions while performing these tasks. This is due to the difficulty in modeling and characterizing the emotions present in speech [1, 2].

It is known that the speech signal carries all the information, such as the message, speaker characteristics, language characteristics and the emotions, that contributes to the intentions of the speaker. The challenge is how to characterize or model the specific information (i.e., the information specific to characterizing sound units, speaker identity, language identity and emotions), so that it can be used to perform the desired task. In this paper our focus is on the recognition of basic emotions present in speech. From the state of the art, it is observed that the characteristics of emotions can be seen at the source level (characteristics of the excitation signal and shape of the glottal pulse), the system level (shape of the vocal tract and nature of the movements of different articulators) and at the prosodic level [3–10]. Among the features from different levels for characterizing emotions, prosodic features have been attributed a major role both in the existing literature and from a perceptual point of view [5, 10]. Therefore in this paper we narrow our focus to the recognition of emotions using only prosodic features. The prosodic features considered in this study are (1) the duration patterns, (2) the average pitch, (3) the variation of pitch with respect to the mean (standard deviation (SD)), and (4) the average energy.

In our previous work, we analyzed the above prosodic features extracted at the utterance level, word level and syllable level [10]. It was observed that the prosodic features at the utterance level are highly overlapping and that their capability to discriminate emotions is limited [10], whereas the prosodic features at the word and syllable levels were observed to have better discrimination capability among the emotions. Therefore, in this paper we propose a method for the recognition of emotions using prosodic features extracted from the utterance, word and syllable levels. The proposed method is a fusion based approach, which combines the unique characteristics of prosodic features at the different levels (utterance, word and syllable levels). In this paper, we use a simple first order statistic (i.e., the mean of the distribution) as the feature for representing the individual prosodic characteristics. The main goal of this study is to demonstrate the discrimination characteristics of the basic emotions present in speech using the prosodic features derived from the utterance, word and syllable levels. Exploiting the unique prosodic characteristics present at multiple levels will be useful for developing a robust emotion recognition system.

The rest of the paper is organized as follows. In Section 2, we briefly discuss the analysis of prosodic features derived from the utterance, word and syllable levels. The classification strategy for emotions using the statistical analysis is discussed in Section 3. A summary of the contents of the paper and possible future extensions to the present work are given in Section 4.

2. ANALYSIS OF PROSODIC FEATURES

In this paper each emotion is analyzed and characterized by using prosodic features at the gross (utterance) and finer (word and syllable) levels. The prosodic features used in the analysis are duration, pitch (mean and standard deviation) and energy. The database used in the study was prepared by research students at IIT Guwahati and has been used for a stressed speech classification task [11]. The database consists of speech utterances corresponding to three basic emotions: anger, compassion and happy. In addition to these, speech utterances were also recorded in a neutral style (without emotion). The utterance considered for the analysis is India match odipoyinda from Telugu (an Indian language). Thirty native male Telugu speakers participated in the recording of the speech signals. For each emotion, the same speech utterance was recorded five times from each speaker. This resulted in a total of 150 speech data files for each emotion, and in total 600 speech data files for all four emotions, including neutral. The recorded speech signals are sampled at 8000 Hz and stored as 16-bit numbers. Of this speech data, 80% of the files are used for training (building the models) and the remaining 20% are used for testing.

In this study we use first order statistics (the mean of the distribution) for the analysis of the basic emotions. The average values of the prosodic parameters are determined at the different levels. In the first step, the mean values of the prosodic parameters at the utterance level are determined. The way the prosodic parameters are determined is described in brief as follows. The duration of each of the speech files is determined in seconds, and the mean of the durations is computed for each emotion category. The pitch values of each utterance are obtained from the autocorrelation of the Hilbert envelope of the LP residual [12]. As each utterance has a sequence of pitch values according to its intonation pattern, the analysis is carried out using the mean and SD of the pitch values. The energy of the speech signal is determined, and the normalized energy for each speech utterance is calculated by dividing the energy by the duration (number of samples) of the speech utterance. Here the mean values are derived from 120 utterances of each emotion. The mean values of the prosodic features at the utterance level are given in Table 1.
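As an illustration, a minimal sketch of this utterance-level feature computation is given below. This is not the authors' code: pitch is estimated with a plain autocorrelation tracker as a stand-in for the Hilbert-envelope-of-the-LP-residual method of [12], and the frame sizes, voicing threshold and helper names are assumptions.

import numpy as np

def frame_pitch(frame, fs, fmin=80.0, fmax=350.0):
    # Crude autocorrelation pitch estimate (Hz) for one voiced frame.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + np.argmax(ac[lo:hi]))

def utterance_features(x, fs=8000, frame_len=0.03, hop=0.01, voiced_thresh=1e-4):
    # Returns (duration in s, mean pitch, SD of pitch, energy per sample).
    x = np.asarray(x, dtype=float)
    duration = len(x) / fs
    energy = float(np.sum(x ** 2)) / len(x)
    n, h = int(frame_len * fs), int(hop * fs)
    pitches = [frame_pitch(x[i:i + n], fs)
               for i in range(0, len(x) - n, h)
               if np.mean(x[i:i + n] ** 2) > voiced_thresh]   # skip near-silent frames
    return duration, float(np.mean(pitches)), float(np.std(pitches)), energy

# Per-emotion means (as in Table 1) are then plain averages over the 120 training
# utterances of that emotion, e.g.
# mean_vector[emotion] = np.mean([utterance_features(x) for x in signals], axis=0)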

Table 1: Mean values of the prosodic parameters at the utterance level

Emotion      Mean duration (sec)   Pitch (Hz)   SD of pitch (Hz)   Energy
Anger        1.04                  223          50.57              0.032
Happy        1.25                  181          42.75              0.015
Compassion   1.32                  184          38.91              0.012
Neutral      1.24                  160          34.12              0.013

In Table 1, the first column indicates the basic emotions considered for the analysis, and columns 2-5 contain the average values of the prosodic features. From the utterance level features, it is observed that the prosodic features overlap for some emotions and are distinct for others. From the duration patterns, it is observed that the anger emotion has the minimum mean duration, whereas the other emotions have mean durations close to each other. Similarly, using the pitch feature, it is observed that happy and compassion have very close mean pitch values. From the SD of pitch, it is observed that all the basic emotions have distinct mean values. Based on energy, all the emotions except anger fall into one group.

For analyzing the prosodic features at the word level, the above mentioned features (duration, pitch and energy) are extracted from each word in the utterance. The utterance used in this analysis has three words: (1) word1 (W1): India, (2) word2 (W2): match, (3) word3 (W3): odipoyinda. The utterances are segmented into words manually, and the prosodic parameters are derived in a way similar to that for the utterances. W1, W2 and W3 are the words corresponding to the initial, medial and final positions in the utterance. The entries in Table 2 indicate the values of the prosodic features for the words with respect to the different emotions.

Table 2: Mean values of the prosodic parameters at the word level

Unit   Emotion      Mean duration (sec)   Pitch (Hz)   SD of pitch (Hz)   Energy
W1     Anger        0.223                 227          35.60              0.043
W1     Happy        0.291                 191          31.27              0.022
W1     Compassion   0.298                 189          29.39              0.018
W1     Neutral      0.272                 167          19.30              0.022
W2     Anger        0.226                 229          30.55              0.024
W2     Happy        0.254                 195          25.16              0.015
W2     Compassion   0.281                 197          33.85              0.010
W2     Neutral      0.276                 171          31.50              0.011
W3     Anger        0.446                 231          38.29              0.040
W3     Happy        0.595                 172          31.86              0.015
W3     Compassion   0.632                 183          29.66              0.015
W3     Neutral      0.564                 153          22.39              0.014

The first column in Table 2 indicates the basic units used in the analysis; here, the words are used as the basic units. The other columns in Table 2 contain information similar to that in Table 1. The prosodic values in Table 2 indicate that the nature of the prosodic features at the word level for the different emotions follows a trend similar to the features at the utterance level. However, a more detailed analysis shows that the variation in the prosodic features depends on the position of the word in the utterance. For instance, the change in the average duration for compassion is largest for the final word (W3) and least for the middle word (W2), whereas this kind of information is missing from the analysis at the utterance level. The overlapping pattern of mean pitch at the utterance level for happy and compassion can be resolved by the word level analysis by observing the mean pitch values for the final word (W3). Similarly, some of the other ambiguities at the utterance level can be resolved by using word level information.

The syllable is a natural and convenient unit for speech in Indian languages. In Indian scripts, characters generally correspond to syllables. A character in an Indian language script is typically in one of the following forms: V, CV, CCV, CCVC and CVCC, where C is a consonant and V is a vowel. Among the different forms of syllables, the most common form observed in Indian languages is CV. In this study the utterance has nine syllables: in, di, ya, match, O, di, pO, in, dA (S1 to S9). Since the middle word has only one syllable (S4), the prosodic features of that syllable are not shown in Table 3 (these features are the same as the word2 (W2) features in Table 2). The mean values of the prosodic features for the syllables S1 to S9 are given in Table 3.

The gross analysis of the prosodic features for the different emotions at the syllable level follows the word level and utterance level analysis to some extent. But careful observation indicates some unique characteristics of syllables, which differ from the prosodic characteristics at the word and utterance levels. It is also observed that the prosodic features at the syllable level depend on the syllable position (initial, medial and final) within the word. The prosody also depends on the nature of the syllable, the number of syllables that constitute the word (length of the word), the position of the word in the phrase and the linguistic context associated with the syllable (the nature of the preceding and following syllables). For example, for happy, the initial syllable in the first word and the final syllable in the last word have significantly larger durations. In the analysis of the other prosodic features at the syllable level, similarly unique behavior is observed. Some of the ambiguities at the word and utterance levels can be resolved using the information at the syllable level. For example, the energy at the utterance and word levels for the emotions happy, compassion and neutral is very close, so the discrimination using this feature is poor. In this context, by observing the syllable features (S2, S3, S7 and S9), one can discriminate these emotions effectively.

3. CLASSIFICATION STRATEGY USING THE STATISTICAL INFORMATION

In the previous section, we derived the statistical information of the prosodic parameters for each emotion. In this section, we discuss methods to exploit the derived statistical information for recognizing the emotions.

Table 3: Mean values of the prosodic parameters at the syllable level

Unit   Emotion      Mean duration (sec)   Pitch (Hz)   SD of pitch (Hz)   Energy
S1     Anger        0.076                 181          13.01              0.018
S1     Happy        0.115                 181          27.39              0.011
S1     Compassion   0.116                 163          17.79              0.010
S1     Neutral      0.099                 154          17.12              0.010
S2     Anger        0.057                 233          2.44               0.030
S2     Happy        0.064                 205          7.75               0.020
S2     Compassion   0.075                 214          17.48              0.015
S2     Neutral      0.064                 178          6.28               0.022
S3     Anger        0.092                 226          11.49              0.071
S3     Happy        0.103                 203          5.85               0.035
S3     Compassion   0.113                 192          4.48               0.029
S3     Neutral      0.108                 173          5.32               0.035
S5     Anger        0.083                 206          13.86              0.055
S5     Happy        0.093                 189          6.90               0.025
S5     Compassion   0.104                 166          9.40               0.027
S5     Neutral      0.102                 154          11.09              0.024
S6     Anger        0.068                 246          12.91              0.022
S6     Happy        0.079                 185          9.37               0.008
S6     Compassion   0.088                 178          17.55              0.009
S6     Neutral      0.082                 157          14.50              0.008
S7     Anger        0.075                 237          19.70              0.052
S7     Happy        0.075                 213          18.68              0.015
S7     Compassion   0.089                 192          25.55              0.021
S7     Neutral      0.092                 169          38.40              0.015
S8     Anger        0.108                 227          22.13              0.038
S8     Happy        0.145                 181          14.23              0.015
S8     Compassion   0.151                 178          16.30              0.017
S8     Neutral      0.144                 148          12.37              0.016
S9     Anger        0.106                 186          18.92              0.040
S9     Happy        0.195                 151          16.35              0.013
S9     Compassion   0.200                 160          18.96              0.011
S9     Neutral      0.137                 135          9.76               0.009

In this work we use a simple Euclidean distance measure for performing the Emotion Recognition (ER) task. The mean vector for each emotion is obtained from the analysis in the previous section. The mean vector consists of the average values of the prosodic features derived from the whole utterance. These mean vectors, one per emotion, represent the reference models for the basic emotions. To evaluate the performance of these utterance level features, the ER task is performed using the speech files from the test data set. A four dimensional feature vector is derived from each test utterance, and the Euclidean distance to each of the mean vectors (reference models), which represent the set of emotions, is determined. The given test speech utterance is classified into one of the basic emotion categories based on the minimum distance criterion. The classification performance on the test data using utterance level features is given in Table 4, which shows the confusion matrix. The diagonal entries correspond to the classification performance for each emotion. Since the prosodic features at the utterance level overlap across the emotions, the classification performance using these features alone is expected to be low.
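A minimal sketch of this minimum distance rule, assuming the 4-dimensional prosodic vectors have already been computed, could look as follows (illustrative only, not the authors' implementation).

import numpy as np

def build_reference_models(train_vectors):
    # train_vectors: dict emotion -> (N, 4) array of utterance-level prosodic vectors.
    return {emotion: np.asarray(vecs).mean(axis=0) for emotion, vecs in train_vectors.items()}

def classify_utterance(test_vector, models):
    # Assign the test vector to the emotion whose mean vector is nearest (Euclidean).
    return min(models, key=lambda emotion: np.linalg.norm(test_vector - models[emotion]))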

Table 4: Classification performance using utterance level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          62      29         2           7
Happy          21      48        19          12
Compassion      5      11        54          30
Neutral        10      18        15          57

The prosodic information at the word level is observed to be more discriminative than the utterance level information (see Table 2). Hence the performance of the Emotion Recognition System (ERS) may be improved by using the prosodic features at the word level. For this classifier, emotion models are prepared using the prosodic features at the word level. Since the utterance in the database has three words, each emotion is represented using three word models. Here, each word model is a four dimensional mean feature vector, similar to the utterance model, where each dimension represents the mean value of a particular prosodic feature at the word level. The ER performance of this classifier is evaluated with the test utterances. The prosodic parameters for each word are derived, and a 4-dimensional feature vector is formed to represent the word. The distances between the derived feature vector and the reference models of each emotion are estimated. The distances corresponding to each emotion are summed over all the words, and the minimum distance indicates the emotion category of the particular test utterance. The performance of the ER system using word level prosodic information is shown in Table 5. The overall performance of the ER system using word level features is slightly improved compared to the utterance level features.

Table 5: Classification performance using word level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          61      23         8           8
Happy          27      54        10           9
Compassion      7      16        52          25
Neutral         7      11        23          59

In a similar way, an ER system can be built using syllable level prosodic features. The performance of the system with syllable level features is given in Table 6. The overall classification performance is observed to be better than that of the word and utterance level systems. This is clearly evident from the statistical analysis performed in Section 2.

Table 6: Classification performance using syllable level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          60      25         8           7
Happy          27      52        11          10
Compassion     10      15        57          18
Neutral         9      13        18          60

From the above results it is observed that the recognition system using utterance level information performs better recognition for the anger emotion, while the ERS using word level and syllable level information performs better for happy and compassion respectively. This indicates that the individual classifiers are good at recognizing a specific emotion, compared to their overall recognition performance, and suggests the idea of using multiple classifiers for the recognition task. The same point is also observed in Section 2: in the statistical analysis of the prosodic features, it is observed that at each level there exist some unique and discriminative features for specific emotions as well as confusing features which lead to ambiguity. One can achieve better performance by carefully combining the features at the different levels. Hence we further developed ER systems using the prosodic information derived from the different levels. In this paper we propose four ER systems which use the prosodic information derived from the utterance, word and syllable levels: (1) a model using utterance and word level features, (2) a model using utterance and syllable level features, (3) a model using syllable and word level features and (4) a model using utterance, word and syllable level features.

In these models the methodology for classification is as follows: (1) the distance between the derived feature vectors and the reference models is obtained for each emotion using the prosodic information from the particular level; (2) these distances are summed with respect to each emotion, and the emotion category of the test utterance is recognized as the model which has the least distance. The recognition performance of these models, which exploit multilevel prosodic information, is observed to be better than that of the models using prosodic information from a single level. Among the four models, the model derived using the prosodic information from the utterance, word and syllable levels shows the best recognition performance (Table 10). The recognition performance of these models using multilevel prosodic information is shown in Tables 7, 8, 9 and 10.
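The fusion rule itself reduces to summing per-level distances, as the short sketch below illustrates (the dictionaries level_features and level_models are assumed data structures, not from the paper).

import numpy as np

def fused_classify(level_features, level_models, levels=("utterance", "word", "syllable")):
    # level_features[level]: list of 4-D vectors (one per utterance, word or syllable).
    # level_models[level][emotion]: 4-D mean reference vector for that level.
    emotions = list(next(iter(level_models.values())).keys())
    totals = {emotion: sum(np.linalg.norm(np.asarray(vec) - level_models[level][emotion])
                           for level in levels
                           for vec in level_features[level])
              for emotion in emotions}
    return min(totals, key=totals.get)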

Table 7: Classification performance using utterance and word level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          64      21         7           8
Happy          23      55        12          10
Compassion      9      16        56          19
Neutral         8      13        19          60


Table 8: Classification performance using utterance and syllable level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          65      20         8           7
Happy          24      53        11          12
Compassion      8      15        60          17
Neutral         8      12        18          62

Table 9: Classification performance using word and syllable level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          62      21         9           8
Happy          20      57        11          12
Compassion      9      15        59          17
Neutral         9      12        18          61

4. SUMMARY AND CONCLUSIONS

In this paper we proposed an emotion recognition system using prosodic features derived from the utterance, word and syllable levels. Four basic emotions, anger, happy, compassion and neutral, were considered for the ER task. Duration, average pitch, deviation in pitch and energy were used as the prosodic features. Statistical analysis of the prosodic parameters at the different levels showed that there exists some unique and discriminative information specific to the basic emotions; at the same time, the analysis also indicated overlapping of features across the emotions. In this paper a simple Euclidean distance measure is used for the ER task. Various ER systems were developed using the prosodic features extracted from the utterance, word and syllable levels. The performance of the ERS with syllable level features was shown to be better than with utterance and word level features. The ERS with features drawn from a specific level was shown to recognize a particular emotion well (Tables 4, 5 and 6). For capturing the unique characteristics of the different emotions, ER systems were developed by combining the prosodic information derived from the utterance, word and syllable levels. The performance of the ERS with combined features was shown to be better than that of the ERS with individual features. Among the models developed using combined features, the model derived by combining the prosodic features from all levels performed best (Table 10).

Table 10: Classification performance using utterance, word and syllable level prosodic features.

Emotion      Anger   Happy   Compassion   Neutral
Anger          64      22         7           7
Happy          20      59        11          10
Compassion     10      13        62          15
Neutral         8      11        17          64

In this paper the combined models are derived from a linear combination of the evidences from the individual models. We have not explored the significance of each model while combining the evidences; in future work, optimal weights can be derived for the individual models to achieve the best performance. From the statistical analysis, it is observed that there are some unique and discriminative features specific to the emotions at the different levels. The features specific to an emotion are related in a complex manner across the different levels. Nonlinear models are known to capture such relations better than the simple linear models used in this paper. Therefore the ER task can be explored using nonlinear models such as neural networks and support vector machines. In this paper we analyzed the ER task using a single utterance uttered by multiple male speakers, which corresponds to the text dependent and speaker independent case. One can extend this study to other cases such as (1) text independent, (2) speaker dependent, (3) gender dependent and (4) language independent recognition. Similarly, other features such as vocal tract system and excitation source information can be explored for the ER task.

5. REFERENCES

[1] L. ten Bosch, "Emotions, speech and the ASR framework," Speech Communication, vol. 40, pp. 213–225, 2003.

[2] A. Iida, A study on corpus-based speech synthesis with emotion. PhD thesis, Graduate School of Media and Governance, Keio University, Japan, Sept. 2002.

[3] L. Yang, "The expression and recognition of emotions through prosody," in Proc. Int. Conf. Spoken Language Processing, pp. 74–77, 2000.

[4] T. L. Nwe, S. W. Foo, and L. C. D. Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, pp. 603–623, Nov. 2003.

[5] A. Iida, N. Campbell, F. Higuchi, and M. Yasumura, "A corpus-based speech synthesis system with emotion," Speech Communication, vol. 40, pp. 161–187, Apr. 2003.

[6] I. R. Murray, J. L. Arnott, and E. A. Rohwer, "Emotional stress in synthetic speech: Progress and future directions," Speech Communication, vol. 20, pp. 85–91, Nov. 1996.

[7] R. Cowie and R. R. Cornelius, "Describing the emotional states that are expressed in speech," Speech Communication, vol. 40, pp. 5–32, Apr. 2003.


[8] C. M. Lee and S. Narayanan, "Toward detecting emotions in spoken dialogs," IEEE Trans. Speech and Audio Processing, vol. 13, pp. 293–303, March 2005.

[9] T. V. Sagar, K. S. Rao, S. R. M. Prasanna, and S. Dandapat, "Characterization and incorporation of emotions in speech," in Proc. IEEE INDICON, Delhi, India, Sept. 2006.

[10] T. V. Sagar, "Characterization and synthesis of emotions in speech using prosodic features," Master's thesis, Dept. of Electrical and Communications Engineering, Indian Institute of Technology Guwahati, Guwahati, India, May 2007.

[11] S. Ramamohan and S. Dandapat, "Sinusoidal model-based analysis and classification of stressed speech," IEEE Trans. Speech and Audio Processing, vol. 14, pp. 737–746, May 2006.

[12] S. R. M. Prasanna and B. Yegnanarayana, "Extraction of pitch in adverse conditions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Montreal, Canada, May 2004.


WAVELET BASED SPEAKER RECOGNITION UNDER STRESSED CONDITION

J. Rout and S. Dandapat

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, Assam, India

Email:{janmejaya,samaren}@iitg.ernet.in

ABSTRACT

In this work, we propose a text-independent speaker recognition method for stressed conditions using the wavelet transform. Linear frequency cepstral coefficients estimated from the speech signal are used for speaker recognition. A three level wavelet decomposition is used for this purpose. Experimental results with this technique show performance comparable to conventional techniques which use MFCC features for speaker recognition. Four styles of stressed speech data have been selected from the stressed speech database (SUSAS) for speaker recognition.

Index Terms— Linear frequency cepstral coefficients, vector quantization, speaker recognition, multi-band analysis, stressed speech, wavelet transform, cepstral derivatives.

1. INTRODUCTION

Stressed speech [4] is defined as speech produced under any condition that causes the speaker to vary speech production from the neutral condition. In this study, the following four perceptually induced stressed conditions from the SUSAS database are considered: Anger, Lombard, Neutral and Question.

Speaker recognition can be performed as speaker identification or speaker verification. A speaker identification system identifies a speaker from a set of reference speakers, while the validity of a claimed speaker can be checked by a speaker verification system [1]. A speaker recognition system consists of feature extraction and classification. In the literature, feature extraction is mostly based on spectral information. Cepstral coefficients are derived from the short time spectrum of the speech signal. Since speech production is usually modeled as a convolution of the impulse response of the vocal tract filter with an excitation source, the cepstrum effectively deconvolves these two parts, resulting in a low time component corresponding to the vocal tract system and a high time component corresponding to the excitation source [2].

Linear prediction cepstral coefficients and Mel-frequency cepstral coefficients are well known features for speaker recognition [3]. Although cepstral coefficients provide a good set of feature vectors, it is difficult to extract dynamic information from speech signals with them. Also, linear predictive (LP) based cepstrum is sensitive to noise. Therefore a new method is required to extract specific features from speech signals, and these features should not be sensitive to noise. The wavelet transform has therefore recently become an alternative for the analysis of speech signals.

Fig. 1: Speech signal and spectrogram of the word "break" of a male speaker in different emotions: (a) Neutral (b) Lombard (c) Question (d) Anger


Four stressed speech signals and their spectrograms are shown in Fig. 1. It is observed that the spectral content of the four signals is different. This variation is due to the stress effect and it reduces speaker recognition performance. Therefore we have used the wavelet transform to improve speaker recognition under stressed conditions. Though the SUSAS database contains 12 different stress conditions, we have used Anger, Lombard, Question and Neutral for evaluation. The Lombard effect is the tendency to increase one's vocal intensity when noise exists in the background. There are 35 words uttered by 9 speakers with three different accents, and each word is uttered twice.

This paper is organized as follows. Section 2 describes the methodology. Multi-band analysis of stressed speech is discussed in Section 3. Results and discussion are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. METHODOLOGY

The block diagram of the proposed speaker recognition system is shown in Fig. 2. The system implements two main stages, i.e., a wavelet transform based feature extraction stage and a vector quantization based classification stage. In the first step the speech signal is partitioned into N frames, each of 20 ms with an overlap of 10 ms. Silence detection is performed to discard silent parts of the speech. The purpose of pre-emphasis is to spectrally flatten the speech signal by increasing the relative energy of its high frequency components. The next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame.
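A minimal front end along these lines is sketched below; the energy threshold for silence removal and the choice of a Hamming window are assumptions, since the paper does not specify them.

import numpy as np

def preprocess(x, fs=8000, frame_len=0.02, hop=0.01, alpha=0.97, sil_thresh=1e-4):
    # Pre-emphasis, 20 ms framing with 10 ms overlap, simple energy-based
    # silence removal and per-frame windowing.
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] - alpha * x[:-1])                    # pre-emphasis
    n, h = int(frame_len * fs), int(hop * fs)
    frames = [x[i:i + n] for i in range(0, len(x) - n, h)]
    frames = [f for f in frames if np.mean(f ** 2) > sil_thresh]   # discard silence
    window = np.hamming(n)
    return [f * window for f in frames]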

Fig. 2: The block diagram of the proposed speaker recognition system

The wavelet transform is defined as the inner product of a signal with a collection of wavelet functions, which are scaled and translated versions of a mother wavelet. The discrete wavelet transform (DWT) analyzes signals in different frequency bands with different resolutions. It decomposes a signal into approximation and detail information. The original signal passes through two complementary filters, namely low pass and high pass filters, giving two sets of approximation and detail coefficients. The approximation coefficients are the high scale, low frequency components and the detail coefficients are the low scale, high frequency components [8]. In this paper we use the Daubechies wavelet of order 6 and the decomposition is done up to level 3. Most present day systems use the characteristics of the vocal tract system for speaker recognition. The information is extracted using short time spectrum analysis of segments of 20-30 ms of the speech signal. Cepstral analysis is very useful for speaker recognition as it separates the excitation signal from the vocal tract. Cepstral coefficients are derived from the short time spectrum of the speech signal; the cepstrum is the inverse Fourier transform of the log magnitude spectrum of the speech signal. From [7],

s(n) = g(n) ∗ v(n)    (1)

where s(n), g(n) and v(n) are the speech signal, the excitation signal and the vocal tract impulse response respectively.

S(f) = G(f)V(f)    (2)

log(S(f)) = log(G(f)) + log(V(f))    (3)

c(n) = (1/N) Σ_{k=0}^{N−1} log|S(k)| e^{j2πkn/N}    (4)

In a conventional speaker recognition system, feature vectors are extracted over the whole frequency band of the input signal and cepstral analysis is applied over that same band. In this work the input speech signal is analyzed using the mother wavelet "db6" with three levels of decomposition. The coefficients of the three detail levels, i.e., first (cD1), second (cD2) and third (cD3), and of the third approximation level (cA3) are used for feature extraction. In this way acoustic features are extracted using cepstral analysis for each sub-band individually. After extracting the cepstral coefficients from each wavelet band, they are combined into one single vector. The cepstral analysis given in equation (4) is applied to each wavelet band to obtain twelve linear frequency cepstral coefficients, and the coefficients from the four wavelet bands are combined to give a feature vector of 48 coefficients. A vector quantization classifier is used for both the conventional and the wavelet band based speaker recognition under stressed conditions.
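A hedged sketch of this sub-band feature extraction, using the PyWavelets package for the "db6" decomposition, is given below; the FFT size is an assumption, and the cepstrum follows equation (4).

import numpy as np
import pywt

def band_cepstrum(band, n_ceps=12, nfft=256):
    # Real cepstrum of one wavelet band (eq. (4)), truncated to n_ceps coefficients.
    spectrum = np.abs(np.fft.fft(band, nfft)) + 1e-10    # avoid log(0)
    cep = np.real(np.fft.ifft(np.log(spectrum)))
    return cep[:n_ceps]

def frame_lfcc(frame, wavelet="db6", level=3, n_ceps=12):
    # 48-D vector: 12 LFCCs from each of the four bands cA3, cD3, cD2, cD1.
    coeffs = pywt.wavedec(frame, wavelet, level=level)   # [cA3, cD3, cD2, cD1]
    return np.concatenate([band_cepstrum(c, n_ceps) for c in coeffs])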

In this paper we have also used an additional feature to improve speaker recognition under stressed conditions. The cepstrum represents the local spectral properties of stressed speech; it does not characterize temporal or transitional information. For text related applications, improved performance has been found by introducing cepstral derivatives into the feature space, because cepstral derivatives capture the transitional information in the speech. The first derivative of the cepstrum is also known as the delta cepstrum. The cepstral coefficients and the delta cepstrum together have been used to improve speaker recognition performance. First, each speech frame is decomposed into four frequency sub-bands using the mother wavelet "db6". A number of linear frequency cepstral coefficients (LFCCs) are calculated from each sub-band. These are then combined into one feature vector together with their first order derivatives.
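One common way to append such first order derivatives is a regression over neighbouring frames, as in the sketch below (the regression width is an assumption; lfcc is the per-frame feature matrix from the previous step).

import numpy as np

def add_deltas(lfcc, width=2):
    # lfcc: (n_frames, 48) array; returns (n_frames, 96) static + delta features.
    padded = np.pad(lfcc, ((width, width), (0, 0)), mode="edge")
    n = len(lfcc)
    num = sum(k * (padded[width + k: n + width + k] - padded[width - k: n + width - k])
              for k in range(1, width + 1))
    deltas = num / (2.0 * sum(k * k for k in range(1, width + 1)))
    return np.hstack([lfcc, deltas])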

3. MULTI-BAND ANALYSIS FOR STRESSED SPEECH

The basic idea of multi-band analysis is to recognize the speaker under different stressed conditions by using multiple frequency bands and extracting the acoustic features from each band individually. In multi-band analysis sufficient information can be extracted. Speech signals are very complex because they are composed of various frequency components [5]. The important problem in speaker recognition under stressed conditions is the extraction of the information content of the stressed speech. There are two available sources of information, involving the time and frequency domains of the speech signal. In the time domain, the sharp variations in the amplitude of the speech signal are the most meaningful features. In the frequency domain, different speakers may have different responses in different frequency regions. Therefore a method that considers only fixed frequency channels may lose some useful information in the feature extraction process [5]. The use of the discrete cosine transform in MFCC based feature extraction has this drawback. Therefore the wavelet transform, which has good time and frequency resolution, is an ideal tool in this respect when used instead of the discrete cosine transform.

Fig. 3: A schematic diagram of the wavelet transform based multi-band technique

Fig. 3 shows the wavelet transform based multi-band technique with feature combination. By using an appropriate mother wavelet, the speech frame is decomposed into 3 levels giving four sets of coefficients. Cepstral features extracted from each band are combined to form a feature vector which is given to the classifier for speaker recognition. Fig. 4 shows the three level decomposition coefficients, obtained using the mother wavelet "db6", of a speech signal in the anger stressed condition. The speech signal with one approximation (cA3) and three details (cD3, cD2 and cD1) is shown. It captures the important features in different frequency bands. The original signal is sampled at 8 kHz. The coefficients cD1 capture the features between 2-4 kHz, called the first octave. Similarly, the second and third octaves are cD2 and cD3, which capture 1-2 kHz and 500 Hz-1 kHz respectively. The different frequency band coefficients of the same utterance "break" in the Lombard emotion are also given in Fig. 5.

Fig. 4: Different frequency band representation of a speech utterance "break" in anger emotion

4. RESULTS AND DISCUSSION

In this paper two different speech features, mel frequency cepstral coefficients (MFCCs) and linear frequency cepstral coefficients (LFCCs), are used for the evaluation of speaker recognition under stressed conditions. A vector quantization classifier with a codebook size of 64 is used.
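A minimal sketch of such a 64-entry VQ speaker model, using k-means from SciPy to train one codebook per speaker and average quantization distortion for scoring, is shown below (illustrative only; the training and test data handling is assumed).

import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(features, size=64):
    # features: (N, D) training feature matrix for one speaker (e.g. 48-D LFCC vectors).
    codebook, _ = kmeans(np.asarray(features, dtype=float), size)
    return codebook

def identify_speaker(test_features, codebooks):
    # codebooks: dict speaker -> (size, D) codebook; lowest mean distortion wins.
    test_features = np.asarray(test_features, dtype=float)
    def distortion(cb):
        _, dists = vq(test_features, cb)
        return dists.mean()
    return min(codebooks, key=lambda spk: distortion(codebooks[spk]))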

Table 1: Speaker recognition rate (%) using MFCC

Training     Testing
             Anger    Lombard   Neutral   Question   Average
Anger        80.88    40.00     40.44     36.88      49.55
Lombard      36.00    89.33     54.66     47.11      56.77
Neutral      29.33    49.77     90.22     66.66      58.99
Question     29.77    50.66     75.55     88.88      61.21


Fig. 5: Different frequency band representation of a speech utterance "break" in Lombard emotion

Table 2: Speaker recognition rate (%) using LFCC

Training     Testing
             Anger    Lombard   Neutral   Question   Average
Anger        75.11    35.11     43.55     37.77      47.88
Lombard      26.66    90.66     61.33     47.55      56.55
Neutral      31.55    58.22     92.00     68.00      62.44
Question     27.11    48.00     73.33     88.88      59.33

In Tables 1 and 2, the four rows indicate codebooks for which the training data are taken from a particular emotion category, and the four columns indicate test patterns from the four different emotions. The diagonal values show the results when the testing and training data are taken from the same emotion. The last column of each table gives the average recognition rate, computed for a particular stress-trained model tested with all the stresses considered. The highest recognition rates occur on the diagonal, and we observe that when a system is trained and tested in different conditions the recognition rate degrades dramatically. In Table 1, the highest speaker recognition (SR) rate (90.22%) is obtained when neutral data is tested against a neutral codebook using MFCC features. The SR rate has its lowest value of 29.77% when question data is tested against an anger codebook. The question-trained model has the highest average SR rate of 61.21%, while the anger-trained model gives the lowest average performance of 49.55%. As shown in Table 2, for each stress-trained model the recognition rate with LFCC is better than with MFCC, except for anger. LFCC features show the maximum recognition rate for neutral stress and the least performance for anger, as with MFCC. To find the optimal number of LFCCs that should be extracted from each sub-band, we measured the recognition rate with various numbers of LFCCs (3 to 10). The recognition performance versus the number of linear frequency cepstral coefficients per sub-band is shown in Fig. 6. It is observed that the maximum recognition rate (91%) for Lombard stress was obtained when 9 LFCCs are extracted from each sub-band. Similarly, for anger and question stress the maximum rates were 72% and 85% respectively, obtained with 8 LFCCs. For neutral, the maximum rate (93%) was obtained when 10 LFCCs are extracted.

Fig. 6: Recognition rate vs number of LFCCs per sub-band


5. CONCLUSION

In this paper a new wavelet transform based technique for speaker recognition is applied under different stressed conditions. A feature recombination technique is used: linear frequency cepstral coefficients of the wavelet decomposed speech signals are extracted and combined. The performance is comparable with the conventional system. The drawback is the large dimensionality of the feature vector; therefore a robust algorithm should be applied to reduce the dimensionality of the feature vector. A stress compensation technique should also be applied to further improve speaker recognition under stressed conditions.

6. REFERENCES

[1] D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Mag., vol. 3, no. 4, pp. 4–17, Oct. 1986.

[2] D. O'Shaughnessy, Speech Communication: Human and Machine. Reading, MA: Addison-Wesley, 1987.

[3] J. P. Campbell, "Speaker Recognition: A Tutorial," Proc. IEEE, vol. 85, no. 9, Sep. 1997.

[4] B. D. Womack and J. H. L. Hansen, "N-channel hidden Markov models for combined stressed speech classification and recognition," IEEE Trans. Speech and Audio Processing, vol. 7, no. 6, Nov. 1999.

[5] C. T. Hsieh, E. Lai and Y. C. Wang, "Robust speech features based on wavelet transform with application to speaker identification," IEE Proc. Vis. Image Signal Process., vol. 149, no. 2, April 2002.

[6] W. Alkhaldi, W. Fakhr and N. Hamdy, "Automatic speech/speaker recognition in noisy environments using wavelet transform," in Proc. 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), vol. 1, Aug. 2002.

[7] W. Alkhaldi, W. Fakhr and N. Hamdy, "Multi-band based recognition of spoken Arabic numerals using wavelet transform," Nineteenth National Radio Science Conference, Alexandria, March 2002.

[8] S. C. Woo, C. P. Lim and R. Osman, "Development of speaker recognition system using wavelets and artificial neural networks," Proc. International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, May 2001.

[9] S. Ramamohan and S. Dandapat, "Sinusoidal Model-Based Analysis and Classification of Stressed Speech," IEEE Trans. Speech Audio Process., 2006.

[10] G. S. Raja and S. Dandapat, "Sinusoidal model based speaker identification," Proc. NCC-2004 conference, IISc, Bangalore, pp. 523–527, Jan.-Feb. 2004.

[11] K. P. Markov and S. Nakagawa, "Text-independent speaker recognition using non-linear frame likelihood transformation," Speech Communication, vol. 39, pp. 301–310, 1998.

[12] D. A. Cairns and J. H. L. Hansen, "Nonlinear analysis and detection of speech under stressed conditions," J. Acoust. Soc. Amer., vol. 96, no. 6, pp. 3392–3400, 1994.

[13] K. H. Yuo, T. H. Hwang, and H. C. Wang, "Combination of autocorrelation-based features and projection measure technique for speaker identification," IEEE Trans. Speech Audio Process., vol. 13, no. 4, pp. 565–574, Jul. 2005.

[14] B. Yegnanarayana, S. R. M. Prasanna, J. M. Zachariah, and C. S. Gupta, "Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system," IEEE Trans. Speech Audio Process., vol. 13, no. 4, Jul. 2005.

[15] K. S. R. Murty and B. Yegnanarayana, "Combining evidence from residual phase and MFCC features for speaker recognition," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52–55, Jan. 2006.

[16] G. S. Raja and S. Dandapat, "Sinusoidal model based speaker identification using VQ and DHMM," Proc. IEEE INDICON-2004 conference, IIT Kharagpur, pp. 338–343, Dec. 2004.


VLSI


A 2.4 GHz CMOS DIFFERENTIAL LNA WITH MULTILEVEL PYRAMIDICALLY WOUND SYMMETRIC INDUCTOR IN 0.18 μm TECHNOLOGY

Genemala Haobijam, Deepak Balemarthy and Roy Paily*

VLSI and Digital System Design Laboratory, Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati,

Assam, India - 781039.

ABSTRACT

In this paper an integrated 2.4 GHz differential Low Noise Amplifier (LNA), designed in a standard 0.18 μm TSMC CMOS process, is presented. The LNA employs a new multilevel pyramidally wound symmetric (MPS) inductor to minimize the chip area. The symmetric inductor is realized by winding the metal trace of the spiral coil down and up in a pyramidal manner, exploiting the multilevel VLSI interconnect technology. Being multilevel, it achieves a higher inductance to area ratio. The MPS inductor is symmetric and therefore eliminates the need for a pair of planar inductors in the differential configuration. The performance of the differential LNA with MPS inductors is satisfactory, achieving a gain of 11.1 dB with a noise figure of 3.9 dB. The MPS inductors of the differential LNA are characterized using a full wave electromagnetic simulator, and the LNA is designed and simulated using Eldo from Mentor Graphics.

Index Terms— multilevel inductors, symmetric inductor, differential LNA, EM simulation, gain

1. INTRODUCTION

Complementary Metal Oxide Semiconductor (CMOS) technology is a competing technology for the radio transceiver implementation of various wireless communication systems due to its low cost, higher level of integrability with technology scaling, etc. [1]. In a typical radio receiver, the low noise amplifier (LNA) is one of the key components and its main function is to provide enough gain to overcome the noise of subsequent stages. In the process, the LNA should add the minimum possible noise and accommodate large signals without distortion. It must also have good impedance matching to its source and to its subsequent stages, which include the mixer etc. The design of an LNA always requires a good understanding of the tradeoffs between noise figure (NF), gain, linearity, impedance matching and power dissipation. Among the various LNA topologies available, the inductively source degenerated LNA is the most promising topology [2]. The LNA configuration can be single-ended or differential. Generally, the differential configuration is mostly used in analog integrated circuits since it is less sensitive to noise and interference. The differential structure also offers a stable reference point. For any type of circuit, the measured values are always taken with respect to a reference point. In the case of a non-differential structure, this reference would be the on-chip ground. However, the on-chip ground may not be very reliable due to the presence of parasitic resistance and capacitance, leading to unpredictable results. On the other hand, in a differential structure, the measured results of one half-circuit are always taken with respect to the other half-circuit. This minimizes the chance of getting unexpected results. The differential output also eliminates the need for single ended to differential conversion circuitry. Another important advantage is that the differential configuration enables us to bias the amplifier and to couple amplifier stages together without the need for bypass and coupling capacitors. This is another reason why differential circuits are ideally suited for IC fabrication, where large capacitors are impossible to fabricate economically. Besides, the demand for small size, low cost and high performance circuits has driven the integration of passive devices along with the active devices. With integration, the parasitic reactance decreases because of shorter lead lengths, without the need for individual packaging and with a simpler internal structure. Moreover, while fabricating differential configurations, the required chip area will be large due to the increased number of passives required. The monolithic inductor is a critical passive component and its integration continues to be a deciding factor in performance and processing cost. Integrated inductors occupy a large area of the chip, so new inductor topologies which occupy a smaller area have to be exploited to minimize the area and cost of integration.

In this paper, we present a differential LNA with a new multilevel pyramidically wound symmetric inductor [3] in which the metal trace spirals up and/or down. This results in a compact layout with a higher inductance to area ratio, thereby reducing the area of the inductor to a large extent. Being symmetric, it also eliminates the need for a pair of inductors in the differential LNA. The LNA is designed to function at 2.4 GHz for Bluetooth applications. The paper is organized as follows. In Section 2, the inductor design is discussed in detail. In Section 3, the design of the differential LNA is explained. The results are presented in Section 4 and finally conclusions are drawn in Section 5.

*email: [email protected]

Fig. 1. Layout of a pair of asymmetric planar inductors for differential circuit implementation.

2. INTEGRATED INDUCTOR DESIGN

Spiral inductor integrated on Silicon was first reported in 1990by Nguyen and Meyer [4]. The inductors were square spiralsof values 1.3 and 9.3 nH with a peak quality factor (Q) of 8at 4.1 GHz and 3 at 0.9 GHz respectively. They also provedits performance in an LC voltage controlled oscillator and RFbandpass filter circuits [5], [6]. Integrated spiral inductorssuffer not only from ohmic losses in the metal due to its fi-nite resistance but also from losses in the substrate due to lowresistivity of silicon resulting in low Q. On the other hand, itoccupies a large area of the die as compared to active devices.A trade off exists between the cost and performance and there-fore the structure can be designed and optimized with a goodunderstanding of the performance trends with respect to thelayout and process parameters [7]. There has been an enor-mous progress in the research on the performance trends, de-sign and optimization, modeling, quality factor enhancementtechniques etc. to combat the design complexity of spiral in-ductors. The first and basic step in the design of integrated in-ductor is the selection of a particular topology and lay out in achosen process technology. Since inductance value is decidedmore by its layout parameters [8] it is important to understandthe performance and limitations of different structures to ar-rive at an efficient and area optimized design. Various asym-metric or symmetric spiral inductor topologies, both planarand multilevel exist today. In integrated circuits, the differ-ential topology is preferred because of its less sensitivity tonoise and interference. Hence, for such circuits a symmetric

Fig. 2. Multilevel pyramidically wound symmetric inductor.

structure is required. For differential circuit implementation,a pair of planar spiral inductors can be used with their in-ner loops connected together in series [9] as shown in Fig.1. Since the currents always flow in opposite directions inthese two inductors, there must be enough spacing betweenthem to minimize electromagnetic coupling. As a result, theoverall area occupied by the inductors is very large. To elim-inate the use of two inductors and reduce the chip area con-sumption, the centre tapped spiral inductor was presented in1995 by Kuhn et al [10] for balanced circuits. Later, in 2002Danesh and Long [11] presented a symmetrical inductor withenhanced Q for differential circuits. The symmetric inductoris realized by joining groups of coupled microstrip from oneside of an axis of symmetry to the other using a number ofcross-over and cross-under connections. The symmetrical in-ductor under differential excitation results in a higher Q andself resonating frequency (fres). A 7.8 nH inductor with outerdiameter of 250 μm, metal W of 8 μm and spacing of 2.8 μmrespectively has a peak Q of 9.3 at 2.5 GHz under differentialexcitation while Q is only 6.6 at 1.6 GHz under single endedcondition. It also occupies less area than its equivalent asym-metrical pair of inductors. This type of winding of the metaltrace was first applied to monolithic transformers by Rabjohnin 1991 [12]. A ‘group cross’ symmetric inductor structure[13] manufactured on a printed circuit board (PCB), in whichthe metal traces crosses each other in groups, was also shown

WISP-2007, IIT Guwahati, India, 28-29 Dec. 2007_________________________________________________________________________________________________________________129

Page 154: Workshop on Image and Signal Processing - Indian … Proc-2.pdfDirector, AIMSCS, Hyderabad and Prof. P. K. Bora, Head, Dept. of ECE, IITG for their constant support and valuable guidance


Fig. 3. (a) Inductively source degenerated LNA topology. (b) Small signal equivalent of the inductively degenerated LNA.

to have lower effective parasitic capacitance between the two input ports and higher fres and Q. However, the area of all these inductors is still large. Other forms of symmetric windings have also been studied [14]-[16].

Multilevel symmetric inductors can be designed by exploiting the multilevel interconnects available in a standard CMOS technology. To minimize the area, the inductors of the LNA are implemented with the new multilevel pyramidically wound symmetric (MPS) inductor structure [3], in which the metal trace spirals up and down in a pyramidal manner, avoiding overlap so as to reduce the inter-turn parasitic capacitances. The structure is shown in Fig. 2. This form of metal winding results in a compact symmetric layout, wherein one inductor coil is folded inside the other. The MPS inductor is realized by connecting two inductors in series, indicated as Inductor 1 and Inductor 2 in the figure. For a four layer MPS inductor, Inductor 1 has its outermost (first) turn on the topmost metal, M4, its second turn on M3, and continues similarly down to the bottom metal level, M1, which carries the innermost (fourth) turn. The second pyramidal spiral inductor starts winding from the bottom metal level, M1, with its outermost (first) turn, places its second turn on M2, and repeats until it ends with the innermost turn on M4. The two inductors are connected at the bottom metal level, M1. Thus, for each inductor, the turns of the metal track run in different metal layers. There is coupling between these two inductors, and the total inductance is given by the sum of the self and mutual inductances. If W is the metal width and Off is the offset between the edges of two consecutive turns in different layers, then the inner diameter decreases or increases by 2(W + Off) as the inductor winds down and up respectively. The design parameters comprise W, Off, Dout, Din and the number of metal layers.
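As a quick illustration of this geometry (a minimal sketch, not code from the paper), the following Python snippet lists the per-turn inner diameters implied by the 2(W + Off) rule; the layout values used are those quoted for the 5.2 nH inductor in Section 4.

def mps_inner_diameters(d_out_um, w_um, off_um, n_layers=4):
    """Inner diameter of each turn, topmost metal layer first (all in um)."""
    step = 2.0 * (w_um + off_um)                 # shrink per layer step, from the 2(W + Off) rule
    first_inner = d_out_um - 2.0 * w_um          # inner diameter of the outermost turn
    return [first_inner - i * step for i in range(n_layers)]

# Using the 5.2 nH inductor layout quoted in Section 4 (Dout = 126 um, W = 10 um, Off = 2 um):
print(mps_inner_diameters(126, 10, 2))           # [106.0, 82.0, 58.0, 34.0] -> Din = 34 um

The innermost value reproduces the 34 μm inner diameter reported for that inductor.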

3. LNA DESIGN

In this section the design of the differential LNA with the new multilevel pyramidically wound symmetric inductor is discussed. The single-ended inductively source degenerated LNA is shown in Fig. 3(a). An LNA must have good matching to the input source and to the load, whose impedances are generally 50 Ω. The source-degenerating inductance Ls interacts with the gate-source capacitance cgs of the main transistor M1 to produce a real term in the input impedance. The inductance Lg

provides the additional degree of freedom required in general to allow for a desired resonant frequency of the input loop [1]. Fig. 3(b) shows the small signal equivalent of the inductively degenerated LNA. The input impedance Zin(ω) of the LNA obtained from the small signal model is

Zin(ω) = ωt Ls + j [ (Ls + Lg) ω − 1/(ω cgs) ]    (1)

where ωt = gm/cgs is the transit frequency of the MOS device. Evidently, the first term in (1) is a real impedance, with the advantage that it does not have the thermal noise of a resistor. One immediate observation from (1) is that the impedance becomes real only at a single frequency, more precisely, the resonance frequency of the inductors Ls, Lg and the gate-source capacitance cgs. The input matching of 50 Ω is obtained by selecting the source inductance Ls such that ωtLs = 50 Ω. A cascode transistor M2 is used to reduce the interaction of the tuned output with the tuned input and to reduce the effect of M1's gate to drain capacitance cgd. The inductor Ld resonates with the total capacitance at the output node and tunes the output impedance to 50 Ω. The minimum noise figure of the single-ended LNA, Fmin, is given by [1]

Fmin = 1 + 1.62 (ωo/ωt)    (2)


Fig. 4. Inductively source degenerated differential LNA.

where ωt = gm/cgs is the transit frequency, ωo = 2πfo, and fo = 2.4 GHz is the resonant frequency. As stated earlier, the differential configuration offers several advantages over single-ended circuits. In order to make the LNA circuit of Fig. 3(a) differential, another half-circuit was built, where each transistor and circuit component has a complementary transistor or component. The source degenerated differential LNA is shown in Fig. 4. The positive input voltage is applied at the gate of one of the half-circuit CS amplifiers, while the negative input voltage is applied at the gate of the other half-circuit. In such a case, it is the difference between the two input signals that is amplified. The overall output of the LNA is taken between the drains of the two half-circuits as shown in Fig. 4. The LNA is designed in a standard CMOS 0.18 μm TSMC process using the fixed power optimization procedure discussed in [1]. The bias current Ibias is taken as 15 mW. The width of the main transistor is selected to obtain the minimum noise figure. The width of the cascode transistor M2 is taken to be the same as that of M1. The designed values are M1 = M2 = M3 = M4 = 300 μm, Ls = 1 nH, Lg = 5.2 nH, Ld = 2 nH, Rs = 50 Ω. The expected noise figure of the single-ended LNA is 1.7 dB. As the differential configuration comprises two single-ended structures, the noise figure of the differential LNA will be around 3.5 dB.
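As a rough consistency check (an illustrative calculation under the stated design values, not the authors' design flow), plugging the quoted components into Eqs. (1) and (2) reproduces the expected single-ended noise figure:

import math

Rs = 50.0            # source resistance (ohm)
Ls = 1e-9            # source degeneration inductance (H)
Lg = 5.2e-9          # gate inductance (H)
fo = 2.4e9           # operating frequency (Hz)

wo = 2 * math.pi * fo
wt = Rs / Ls                          # from the matching condition wt*Ls = 50 ohm
cgs = 1 / (wo**2 * (Ls + Lg))         # from resonance of Ls + Lg with cgs at wo
gm = wt * cgs                         # since wt = gm/cgs
Fmin = 1 + 1.62 * (wo / wt)           # Eq. (2)

print(f"ft  = {wt/(2*math.pi)/1e9:.2f} GHz")    # ~7.96 GHz
print(f"cgs = {cgs*1e12:.2f} pF")               # ~0.71 pF
print(f"gm  = {gm*1e3:.1f} mS")                 # ~35.5 mS
print(f"NF  = {10*math.log10(Fmin):.2f} dB")    # ~1.73 dB, close to the quoted 1.7 dB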

4. RESULTS AND DISCUSSION

In our LNA design with the MPS inductor, instead of the pair of Ls of 1 nH each shown in Fig. 4, a differential MPS inductor of 2 nH is used as the source degeneration inductor. The Lg and Ld inductors are also implemented with MPS inductors. The structures were designed in 0.18 μm CMOS technology using the full wave electromagnetic simulator Intellisuite of Intellisense Software [17]. The 2 nH MPS inductor

Fig. 5. Inductance (nH) and quality factor versus frequency (GHz) for the 2 nH MPS inductor.

Fig. 6. Inductance (nH) and quality factor versus frequency (GHz) for the 5.2 nH MPS inductor.

is designed using the upper three metal layers. The inductor has an outer diameter of 100 μm, inner diameter of 64 μm, metal width of 8 μm and offset of 2 μm. The inductance and the quality factor under differential excitation are shown in Fig. 5. The 5.2 nH MPS inductor is designed using the upper four metal layers and has an outer diameter of 126 μm, inner diameter of 34 μm, metal width of 10 μm and offset of 2 μm. The inductance and the quality factor under single ended excitation are shown in Fig. 6. The quality factors under single ended and differential excitation are measured as in [11].

The performance of the LNA with the new MPS inductor is evaluated using Eldo from Mentor Graphics. The extracted lumped parameters at 2.4 GHz are incorporated in the simulation of the LNA circuit. The gain of the amplifier and the input and output reflection coefficients, given by the S



Fig. 7. S parameter curves (a) |S11|, (b) |S22| (c) |S21| and (d) |S12|.

parameters, are plotted in Fig. 7. From Fig. 7(a) we can see that the LNA shows an |S11| of -20 dB at 2.4 GHz. The |S11| is less than -10 dB, showing that the LNA exhibits good input matching.

Fig. 7(b) shows the |S22| curve. The LNA gives an |S22| of -16.8 dB at 2.4 GHz. The |S22| is less than -10 dB, showing that the LNA exhibits good output matching. The output matching is slightly poorer than the input matching. This is due to the higher number of parasitics at the output node, as the differential output is taken across the two inductors, which have parasitics associated with them.

Fig. 7(c) shows the |S21| curve. The LNA shows an |S21| of 11.1 dB at 2.4 GHz. The LNA exhibits a good gain (|S21|) at the desired frequency. The 3 dB points observed from the |S21| curve are 2.33 GHz and 2.48 GHz, and hence the LNA meets the bandwidth requirement for Bluetooth applications.

Fig. 7(d) shows the |S12| curve. The LNA exhibits a very good reverse isolation (|S12|) of -44.1 dB at 2.4 GHz. In summary, the designed LNA exhibits quite satisfactory S-parameters at 2.4 GHz.
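To put these reflection coefficients in perspective, the short helper below (a hypothetical aid, not part of the reported work) converts the quoted |S11| and |S22| values to reflection magnitude, reflected power and VSWR:

def match_quality(s_db):
    """Convert a return loss in dB to |Gamma|, reflected power fraction and VSWR."""
    gamma = 10 ** (s_db / 20.0)
    return gamma, gamma ** 2, (1 + gamma) / (1 - gamma)

for label, s in [("input  |S11| = -20 dB", -20.0), ("output |S22| = -16.8 dB", -16.8)]:
    g, p, vswr = match_quality(s)
    print(f"{label}: |Gamma| = {g:.3f}, reflected power = {100*p:.1f}%, VSWR = {vswr:.2f}")

At -20 dB only about 1% of the incident power is reflected, and at -16.8 dB about 2%, which is why both are considered good matches.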

4.1. Noise Figure

Noise figure is the measure of the signal-to-noise ratio degradation as the signal traverses through the LNA. Fig. 8 shows the noise figure curve. The noise figure of the LNA at 2.4 GHz

Fig. 8. Noise Figure curve.

is 3.9 dB. The noise figure ranges from 3.85 dB to 4.0 dB over the Bluetooth bandwidth. Therefore, the noise performance of the LNA is satisfactory.

4.2. Linearity

The 1 dB compression point and IIP3 of the designed LNA are -3.7dBm and 3.98dBm respectively. Hence the designed LNA exhibits good linearity. Table 1 gives the summary of the performance of the differential LNA.


Table 1. Performance summary of the differential LNA

Technology (μm): 0.18
Frequency (GHz): 2.4
|S11| (dB): -20
|S12| (dB): -44.1
|S21| (dB): 11.1
|S22| (dB): -16.8
Noise Figure (dB): 3.9
IIP3 (dBm): 3.98
Pin-1dB (dBm): -3.7
Total Power Dissipation (mW): 35

5. CONCLUSION

In this paper the design of an integrated differential LNA using the multilevel pyramidically wound inductor in a 0.18 μm CMOS technology was demonstrated. Being multilevel, the MPS structure achieves a higher inductance to area ratio and occupies a smaller area. The structure also eliminates the need for a pair of planar inductors for the differential configuration. The performance of the differential LNA was satisfactory. The differential LNA exhibited good input and output matching with a gain of 11.1 dB and reverse isolation of -44.1 dB at 2.4 GHz. The area of the LNA chip will be reduced significantly with the MPS inductors.

6. ACKNOWLEDGMENT

This work was carried out using Intellisuite of IntelliSense Software Corp. procured under the NPSM project at Indian Institute of Technology Guwahati.

7. REFERENCES

[1] D. K. Shaeffer and T. H. Lee, "A 1.5V, 1.5 GHz CMOS low noise amplifier," IEEE J. Solid-State Circuits, vol. 32, pp. 745-758, May 1997.

[2] T. H. Lee, The Design of CMOS Radio-Frequency Integrated Circuits, 2nd edition, Cambridge University Press, 2004.

[3] Genemala Haobijam and Roy Paily, "Multilevel pyramidically wound symmetric spiral inductor," in Proc. 11th IEEE VLSI Design and Test Symposium, Kolkata, India, Aug. 8-11, 2007.

[4] N. M. Nguyen and R. G. Meyer, "Si IC-compatible inductors and LC passive filters," IEEE J. Solid-State Circuits, vol. 25, pp. 1028-1031, Apr. 1990.

[5] N. M. Nguyen and R. G. Meyer, "A Si bipolar monolithic RF bandpass amplifier," IEEE J. Solid-State Circuits, vol. 27, no. 1, pp. 123-127, Jan. 1992.

[6] N. M. Nguyen and R. G. Meyer, "A 1.8-GHz monolithic LC voltage-controlled oscillator," IEEE J. Solid-State Circuits, vol. 27, no. 3, pp. 444-450, Mar. 1992.

[7] Genemala Haobijam and Roy Paily, "Efficient optimization of integrated spiral inductor with bounding of layout design parameters," Analog Integrated Circuits and Signal Processing, vol. 51, no. 3, pp. 131-140, June 2007.

[8] Y. K. Koutsoyannopoulos and Y. Papananos, "Systematic analysis and modeling of integrated inductors and transformers in RFIC design," IEEE Trans. Circuits Syst. II, vol. 47, pp. 699-713, Aug. 2000.

[9] J. Craninckx and M. Steyaert, "A 1.8-GHz low-phase-noise CMOS VCO using optimized hollow spiral inductors," IEEE J. Solid-State Circuits, vol. 32, no. 5, pp. 736-744, May 1997.

[10] W. B. Kuhn, A. Elshabini-Riad, and F. W. Stephenson, "Centre-tapped spiral inductors for monolithic bandpass filters," Electron. Lett., vol. 31, pp. 625-626, April 1995.

[11] M. Danesh and J. R. Long, "Differentially driven symmetric microstrip inductors," IEEE Trans. Microwave Theory Tech., vol. 50, pp. 332-341, Jan. 2002.

[12] G. G. Rabjohn, Monolithic Microwave Transformers, M.Eng. thesis, Dept. of Electronics, Carleton Univ., Ottawa, ON, Canada, Apr. 1991.

[13] Yu-Yang Wang and Zheng-Fan Li, "Group-Cross Symmetrical Inductor (GCSI): A new inductor structure with higher self-resonance frequency and Q factor," IEEE Trans. Magnetics, vol. 42, pp. 1681-1686, June 2006.

[14] Wei-Zen Chen and Wen-Hui Chen, "Symmetric 3D passive components for RF ICs application," in Digest of Papers, IEEE Radio Frequency Integrated Circuits (RFIC) Symposium, pp. 599-602, June 2003.

[15] S. Kodali and D. J. Allstot, "A symmetric miniature 3D inductor," in Proc. International Symposium on Circuits and Systems (ISCAS), vol. 1, pp. I-89-I-92, May 2003.

[16] H. Y. D. Yang, "Design considerations of differential inductors in CMOS technology for RFIC," in Digest of Papers, IEEE Radio Frequency Integrated Circuits (RFIC) Symposium, pp. 449-452, June 2004.

[17] M. Farina and T. Rozzi, "A 3-D integral equation-based approach to the analysis of real-life MMICs - application to microelectromechanical systems," IEEE Trans. Microwave Theory Tech., vol. 49, pp. 2235-2240, Dec. 2001.


A DESIGN OVERVIEW OF LOW ERROR RATE 800MHz 6-BIT CMOS FLASH ADCs

Niket Agrawal, Roy Paily

VLSI and Digital System Design Laboratory, Dept. of ECE, Indian Institute of Technology Guwahati, Assam, India - 781039

ABSTRACT

The monolithic flash ADC is a key component of digital applications and is the fastest ADC architecture for a given IC process. This paper provides a brief design overview of the fundamental building blocks of a 6-bit 800MHz flash ADC. The resistor averaging and NAND gate bubble error correction techniques are incorporated for performance enhancement. The ADC reaches a 1GHz sampling frequency with an effective number of bits (ENOB) of 5.5. The resistor averaging technique improves the performance of the ADC by reducing the offset by a factor of two. The ADC works up to the Nyquist frequency and shows an ENOB of 5.33 at this frequency. The peak DNL (Differential Non Linearity error) is less than 0.4LSB.

1. INTRODUCTION

In today's era, a lot of information has to be stored and frequently accessed within a short period. The application can be speech data files for playback or a disk drive read channel. The data is always processed and stored in digital format for reliability and noise immunity [1]. This requires an analog to digital conversion block between the sensor and the signal processing circuitry. The demand on the sampling rate of an analog to digital converter (ADC) varies with the application. For speech signals the demand is more on resolution than on sampling rate. For image applications, the demand is on sampling rate as a lot of raw data has to be frequently accessed and processed. Generally, a resolution of 8 bits is sufficient for image applications. Nowadays flash converters are very popular in high sampling rate applications [2]. They have high data conversion speed, but that comes at the expense of large power dissipation [3]. Silicon area and power considerations generally limit the resolution to 6-8 bits. There are many techniques available in the literature to reduce the power consumption of flash ADCs [4]. Other architectures like sigma delta, successive approximation (SAR) and dual slope promise high resolution but with much lower data rates in comparison to flash converters. For speech applications, sigma delta converters

Fig. 1. Block diagram of Flash ADC

are the preferred choice as they are robust and provide high resolution with very low error rates. This paper provides a brief design overview of an 800MHz 6-bit flash ADC suitable for digital image applications. Section 2 covers the basic building block description of the flash ADC, Section 3 covers circuit details, Section 4 contains digital error correction, Section 5 contains the results, and finally the paper is concluded in Section 6.

2. BASIC BUILDING BLOCK OF FLASH ADC

The block diagram of a typical flash architecture is shown in Fig.1. It mainly consists of the following blocks: reference ladder, T/H (track and hold), preamplifier, first and second stage comparators, resistor averaging termination network, digital error correction circuitry and Gray encoded ROM. For a high speed ADC, the T/H is used to widen the input bandwidth by holding the input for half the clock duration. Without the T/H it would be difficult for the preamplifier to sense the rapidly varying input signal. A resistive divider with 2^N resistors provides the reference voltages. For better linearity and accuracy, poly resistors are



used. The input signal is compared with each of these references at the same time and the preamplifier amplifies the difference. The preamplifier is used to reduce the input referred offset by its gain. Nevertheless, for high input bandwidth the gain of the preamplifier is limited by its gain bandwidth product. For an N-bit converter the circuit employs 2^N − 1 preamplifiers. The output of the preamplifier is fed to two stage comparators, which generate the correct logic levels depending on the difference at the amplifier input. The two-stage comparator architecture has the advantage of faster regeneration. The output of the first stage comparator may not provide rail-to-rail swing. However, it is fed to the second stage comparator, which generates the correct logic swing to drive the digital circuitry. Ideally, the outputs from the comparators are a string of zeros followed by a string of ones, which is called a thermometer code. For very fast input signals, small timing differences between the comparators and offsets in the comparators can cause a situation where a one is found above a zero. This situation is normally called a bubble in the thermometer code [1]. Depending on the number of successive zeros, the bubbles are characterized as first, second and higher order. There exists a second type of error in flash ADCs, which is caused by a very small difference between the input and reference voltage. The small difference between input and reference is amplified by the preamplifier and fed to the comparators. If this amplified difference is very small, the comparator is unable to resolve it, resulting in a Vmeta output which is neither '1' nor '0'. This situation is referred to as metastability, which causes glitches in the output code. Faster regeneration and high gain reduce the probability of metastability. The techniques used to address metastability and bubble errors are given in Section 4 of the paper.

3. CIRCUIT DETAILS OF ANALOG DESIGN BLOCK

The circuit details of each functional block introduced in Fig.1 are presented in this section.

3.1. Track and Hold (T/H)

For a high speed ADC, a T/H circuit is used to widen the input bandwidth by holding the input signal for half a clock cycle [2]. The circuit is shown in Fig.2. When the clock is high the input is sampled and stored on capacitor C1. When the clock is low, the charge on the capacitor faces the high impedance of the PMOS (M3) source follower, which acts as a buffer. The output of the source follower is fed to the 64 preamplifiers. The source follower is sized appropriately to drive the preamplifiers. A large input transistor was used for sensing the low level input signal; however, this introduces considerable charge injection [5]. Therefore, a dummy NMOS switch (M1) is used to reduce this effect. The common mode voltage is 0.5V while the input range is 0.1V to 0.8V, which corresponds to an output swing between 1.2V and 1.95V. An overlapped range of 1.3V to 1.8V is chosen as the final range of the input to the preamplifier, which corresponds to an LSB

Fig. 2. Track and hold circuit

of around 16mV. We used a differential topology in our design to reduce the nonlinearity error and common mode noise [2]. This directly corresponds to a 1.4V peak-to-peak input range for the ADC. A smaller capacitance reduces the RC time constant but worsens the output by introducing thermal noise (kT/C). The 4pF capacitor (C1) limits the second harmonic distortion to -110dB and the third harmonic distortion to -65dB, which is sufficient for this input range and better than similar work reported [6].
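A quick back-of-the-envelope check (assuming room temperature, T = 300 K, which is not stated in the paper) confirms that the kT/C noise of the 4 pF hold capacitor is far below the roughly 16 mV LSB:

import math

k_B = 1.380649e-23      # Boltzmann constant (J/K)
T = 300.0               # assumed temperature (K)
C = 4e-12               # hold capacitance from the text (F)
lsb = 16e-3             # LSB quoted in the text (V)

v_noise = math.sqrt(k_B * T / C)    # rms kT/C sampling noise
print(f"kT/C noise = {v_noise*1e6:.1f} uV rms, about {lsb / v_noise:.0f}x below one LSB")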

3.2. Preamplifier

The preamplifier amplifies the difference between the input signal and the reference voltage so that it can be fully resolved by the later stage (the comparator). The circuit diagram of the preamplifier used in our design is shown in Fig.3. As a first stage, high bandwidth as well as high gain is required from the preamplifier, and these conflicting requirements should be traded off appropriately [2]. The unity gain frequency is given as

ft = gm / (2π Cgs)

To recover from large overdrive voltages, reset switches are invariably used in the preamplifier. We have designed the preamplifier for a gain of 3 while providing bandwidth up to 1 GHz. The transistors are sized using the following gain equation

A = 2 gm(in) / gm(load) = 3

gm = sqrt( 2 βn (W/L) ID ),  where  βn = μn Cox

For the selected input range from the sample and hold, the transistors should be in saturation, which is given by the conditions

1.3 ≥ Vtn + Vdsat(in) + Vdsat(bias)

1.8 ≤ Vdd − |Vtp| − |Vdsat(load)| + Vtn

Vdsat(bias) is reduced by sizing the bias transistor with a larger width. To get the desired Vdsat in both the input and load transistors, a gate length larger than the typical minimum was used. The final output range is given by

Vdsat(in) + Vdsat(bias) ≤ Vout ≤ Vdd − |Vtp| − |Vdsat(load)|

The final output range was 1.2V to 2V. With the reset switch on, the gain is given by


Fig. 3. Preamplifier Circuit

A = 2 gm(in) / (gm(load) + 2/Rreset)

The targeted gain was 0.2, and the reset switch was sized to satisfy the Rreset obtained above.

3.3. Comparators

Two cascaded comparators are used to decrease the latching time as well as to increase the gain. The first stage comparator, shown in Fig.4, consists of a preamplifier with a diode-connected load cascaded with a regenerative latch. The first comparator is designed to have a low gain of 1.5 but very high bandwidth. The function of this comparator is to further magnify the difference from the preamplifier by latching. Keeping the overall gain low in the preamplifier stage has the advantage of high input bandwidth, but at the same time the offset is not reduced. Therefore, a resistor averaging termination technique is used in our design to reduce the offset [7]. The gain of the first stage comparator is given by

A = 2 gm(in) / (gm(load) + 2/Rreset) = 1.5

This was set to 1.5 for high bandwidth. For the 1.2V to 2V range the first stage comparator should be in saturation, and the conditions are given by

1.2 ≥ Vtn + Vdsat(in) + Vdsat(bias)

2 ≤ Vdd − |Vtp| − |Vdsat(load)| + Vtn

The output range of the first stage comparator is then

Vdsat(in) + Vdsat(bias) ≤ Vout ≤ Vdd − |Vtp| − |Vdsat(load)|

This output range corresponds to 1V to 2.2V, which is fed to the second stage comparator. The reset and regeneration time constants are given by

τreset = Cload / (gm(load) + 2/Rreset)   and   τregen = Cload / (gm(latch) − gm(load))

Fig. 4. First stage comparator

Fig. 5. Second stage comparator

The transistors are sized and optimized to satisfy all the above requirements. The two time constants should be reduced to get faster latching and faster reset. Increasing the size of the load transistor increases gm(load), reducing the time constant, but at the same time Cload increases. Through simulations, proper sizes were evaluated. The second stage comparator, shown in Fig.5, is basically two inverters connected back to back. The switching threshold of the inverters is adjusted according to the input range. The width is increased until CLoad is dominant. The input transistors are designed such that they remain in saturation for the 1V to 2.2V range.

3.4. Offset cancelation by resistor averaging

In a practical scenario, devices can have mismatch, which can lead to DNL (Differential Non Linearity) and INL (Integral Non Linearity) errors. The input referred offset standard deviation is given by

σinp = sqrt( σpreamp^2 + σcomp^2 / (Apre^2 Acomp^2) )

where Apre is the gain of the preamplifier and Acomp is the gain of the comparator. σinp, σpreamp and σcomp are the input referred offset standard deviation, the offset standard deviation of the preamplifier


and the offset standard deviation of the comparator respectively. For 0.35μm technology, the standard deviation due to Vth for PMOS and NMOS is around 8mV/√WL and 10mV/√WL respectively. This gives an offset of around 12mV, which is very close to the LSB and hence unacceptable. The resistor-averaging scheme works on the principle of finding a stable bias point for all the comparators [7]. For example, if the output voltage of a preamplifier changes, then current will flow in the resistor averaging termination and the bias point will shift in such a way as to restore the linearity. When the output of a preamplifier is connected with a resistor to the adjacent preamplifier and so on, the error due to mismatch is averaged out, leading to an improvement in the DNL and INL specifications as shown in the result section of this paper. The first stage averaging resistor termination value depends on the preamplifier load, which is around 3.5kΩ in this case. Therefore, putting a 100kΩ resistor at the load will not affect the preamplifier gain. The second stage averaging resistor termination value depends on the first stage comparator load, which is again around 4kΩ, and hence adding a 100kΩ resistor will not affect its operation. The resistor averaging technique suffers from a large nonlinear response at the extreme ends. To restore the linearity at the edges, we have used a Ravg−Rload resistor value at the first and last averaging termination networks, which requires a dummy preamplifier and comparator at each end for proper termination. With resistor averaging termination the offset is reduced to 4mV.
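The cascade effect captured by the expression above can be illustrated with a short sketch; the per-stage offset sigmas used here (11 mV and 20 mV) are purely hypothetical example values, while the gains are the Apre = 3 and Acomp = 1.5 used in this design.

import math

A_pre, A_comp = 3.0, 1.5      # preamplifier and first-comparator gains from the text
sigma_pre = 11e-3             # hypothetical preamplifier offset sigma (V)
sigma_comp = 20e-3            # hypothetical comparator offset sigma (V)

# The comparator contribution is divided by the preceding gains, as in the expression above.
sigma_in = math.sqrt(sigma_pre**2 + sigma_comp**2 / (A_pre**2 * A_comp**2))
print(f"input-referred offset sigma = {sigma_in*1e3:.1f} mV")   # ~11.9 mV with these example values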

4. DIGITAL ERROR CORRECTION

Comparators in a flash converter typically generate a thermometer code [2]. If a particular comparator input is high with respect to its reference level then the corresponding output is high. The zero to one transition point rises and falls with the input level, and this point is decoded and used to address the ROM. As we are processing high frequency input signals, small timing differences between adjacent comparators and unfavorable reference variations can cause errors in the output codes. Due to comparator delay and changes in reference voltage a '1' is found above a '0', and this situation is referred to as a bubble in the thermometer code. There exist bubbles of first to higher orders, as shown in Fig.6, where the bold line indicates the actual reference voltage while the dashed line shows the shifted reference. The bits in circles show correct bits while the bits in squares show erroneous bits due to the shifted reference and delay in adjacent comparator stages. It can be seen in the figure that the probability of a second order bubble is much lower than that of a first order bubble and requires a large timing mismatch. There are many techniques available in the literature to address the first order bubble problem. We have used a three input NAND gate technique in which two zeros and a one are qualified as a one and address the respective ROM line. The second stage comparator outputs are fed into NAND gates as shown in Fig.7. The NAND gate plus inverter output will be '1' in only one case, when a 001 code is detected. This technique totally removes

Fig. 6. Voltage levels vs time instants depicting the cause of first and second order bubbles

Fig. 7. 3-Input NAND gate bubble error correction

the first order bubble problem but cannot address second and higher order bubbles. In our design, we used a Gray encoded ROM that suppresses the error due to second and higher order bubbles. In case of two ROM line selections, the output code will be the AND function of two Gray codes. These codes have only a one-bit change between them and hence the error will be small.
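A small behavioural sketch (one possible reading of the scheme above, not the authors' gate-level netlist) shows how the '001' qualification keeps a single ROM line selected even when a first-order bubble corrupts the thermometer code, with each selected line addressing a Gray-coded word:

def rom_line_selects(thermo):
    """thermo[0] is the bottom comparator; line i is qualified when comparator i
    reads '1' and the two comparators directly above it read '0' (the 001 pattern)."""
    padded = list(thermo) + [0, 0]        # comparators above the top are treated as '0'
    return [bool(padded[i]) and not padded[i + 1] and not padded[i + 2]
            for i in range(len(thermo))]

def gray(n):
    return n ^ (n >> 1)

def adc_output(thermo):
    """Wired-AND of the Gray words of all selected ROM lines (normally just one)."""
    word = 0b111111
    for i, sel in enumerate(rom_line_selects(thermo)):
        if sel:
            word &= gray(i + 1)           # line i corresponds to output code i + 1
    return word

ideal   = [1, 1, 1, 1, 1, 0, 0, 0]        # clean thermometer code for level 5
bubbled = [1, 1, 1, 0, 1, 0, 0, 0]        # same level with a first-order bubble
print(adc_output(ideal) == adc_output(bubbled))   # True: both select line 5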

Metastability is another source of error in the output codes of a flash ADC. This is a situation in which the comparator is unable to resolve the difference between the reference and the input, and the output lies between logic '0' and logic '1'. Due to this error, no ROM line is addressed, and this contributes a severe error of no ROM line selection. This error can be addressed by keeping in mind that in the case of metastability the output of the comparison is Vmeta. The inverter thresholds of the three input NAND gates are designed in such a way that for a Vmeta input they generate a logic '1' which addresses the ROM line, and thus the error of no ROM line selection can be reduced.

5. RESULT & DISCUSSION

The circuit was simulated in Design Architect of Mentor Tools with TSMC 0.35μm technology and a 3.3V power supply. The static parameters DNL and INL were found by using the histogram test technique. The output data was analyzed using


Fig. 8. DNL/INL without resistor averaging

Fig. 9. DNL/INL with resistor averaging

MATLAB software. The circuit was simulated with and without the resistor averaging technique. The INL and DNL of the ADC without resistor averaging are shown in Fig.8. The other case (with resistor averaging) is shown in Fig.9. The peak INL and DNL for the ADC without the averaging network are 0.9LSB and 0.58LSB respectively. With resistor averaging the INL and DNL are 0.4LSB and 0.3LSB respectively. The DNL without resistor averaging is greater than 0.5LSB, which is reduced to 0.3LSB by resistor averaging; this proves the effectiveness of the resistor averaging technique. The SNDR and SFDR of the ADC as a function of input frequency are shown in Fig.10. An input sine wave of amplitude greater than the full-scale range was used to simulate the ADC so as to successfully hit all the codes. The frequency was carefully chosen to get coherence.

Fig. 10. SNDR and SFDR of ADC vs input frequency

A total of 1024 samples were used for the FFT analysis. The plot shows that for an input frequency of 62.5MHz the SNDR is around 36dB, which corresponds to an ENOB of 5.7. The SFDR at this frequency is around 45dB. With increase in the frequency of the input signal, the SNDR and SFDR decrease. For frequencies near the Nyquist frequency, the ADC resolution decreases rapidly. This is due to the increase in bit errors in the ADC as well as additional noise components. Fig.11 shows a plot of the ADC SNDR at a fixed low input frequency and varying sampling rate. The graph shows that for sampling frequencies greater than 1GHz, the ENOB is still greater than 5.4 bits. The ADC performance summary is shown in Table 1.

Table 1. Summary of ADC performance
Technology: TSMC 0.35μm
Power supply: 3.3V
Sampling Frequency: 800MHz
Power Consumption: 200mW
DNL/INL: 0.3/0.4LSB
ENOB: 5.88 bits
SFDR: 44.4dB
SNDR: 33.65dB

6. CONCLUSION

The design of an 800 MHz 6-bit flash converter was discussed in this paper. The design summary of the preamplifier, the first and second stage comparators, the resistor averaging termination network, the digital error correction circuitry and the Gray encoded ROM


Fig. 11. SNDR and SFDR of ADC vs sampling frequency

was provided. The 3-input NAND gate and the Gray encoded ROM suppressed bubble errors in the ADC and improved its performance. The ADC showed an ENOB of 5.7 bits at an input frequency of 62.5MHz, which verified the functionality of the ADC. The peak INL and DNL were 0.4LSB and 0.3LSB with the resistor averaging technique. The ADC worked well above 1GHz sampling frequency while maintaining an ENOB greater than 5.4.

7. REFERENCES

[1] C. W. Mangelsdorf, "A 400-MHz input flash converter with error correction," IEEE Journal of Solid-State Circuits, vol. 25, no. 1, pp. 184-191, February 1990.

[2] M. Choi and A. A. Abidi, "A 6-b 1.3-Gsample/s A/D converter in 0.35-um CMOS," IEEE Journal of Solid-State Circuits, vol. 36, pp. 1847-1858, December 2001.

[3] P. C. S. Scholtens and M. Vertregt, "A 6-b 1.6-Gsample/s flash ADC in 0.18um CMOS using averaging termination," IEEE Journal of Solid-State Circuits, vol. 37, no. 7, pp. 1499-1505, December 2002.

[4] C. L. Portmann and T. H. Y. Meng, "Power-efficient metastability error reduction in CMOS flash A/D converters," IEEE Journal of Solid-State Circuits, vol. 31, no. 8, pp. 1132-1140, August 1996.

[5] K. Uyttenhove and M. S. J. Steyaert, "A 1.8-V 6-bit 1.3-GHz flash ADC in 0.25um CMOS," IEEE Journal of Solid-State Circuits, vol. 38, no. 7, pp. 1115-1122, July 2003.

[6] S. S. Mohan, M. del Mar Hershenson, S. P. Boyd, and T. H. Lee, "Bandwidth extension in CMOS with optimized on-chip inductors," IEEE Journal of Solid-State Circuits, vol. 35, no. 3, pp. 346-355, March 2000.

[7] S. Padoan, A. Boni, C. Morandi, and F. Venturi, "A novel coding scheme for the ROM of parallel ADCs, featuring reduced conversion noise in the case of single bubbles in the thermometer code," in IEEE ICECS, 1998, pp. 271-274.

[8] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, Oxford University Press, New York, 2004.


ACQUISITION, STORAGE AND ANALYSIS OF REAL SIGNALS USING FPGA

P. Chopra, A. Kapoor, and R. Paily*

Department of ECE, Indian Institute of Technology Guwahati, India

ABSTRACT

In this paper, we present a preliminary design for acquisition, storage and subsequent analysis of a real speech signal. A simple approach of converting this signal to digital form using an Analog to Digital Converter (ADC) and storing the samples in the memory of a Field Programmable Gate Array (FPGA) board is suggested. The frequency and amplitude components of the stored signal can be analyzed at any time using appropriate Fourier transform implementations on the FPGA. This approach combines both the hardware and software aspects to implement a high performance system at low cost. The main advantage of this approach is that the design is basic and essential to all speech processing methodologies and can be easily extended to other real time signals as well.

Index Terms — Speech Signal Processing, ADC, FPGA Implementation, Signal Conditioning

1. INTRODUCTION

Today, speech processing research is interdisciplinary, drawing upon work in fields as diverse as biology, computer science, electrical engineering, linguistics, mathematics, physics, and psychology. Within these disciplines, pertinent work is being done in the areas of acoustics, artificial intelligence, computer algorithms, linear algebra, pattern recognition, phonetics, probability theory, signal processing, and syntactic theory. With advances in Very Large Scale Integration (VLSI) technology and high performance compilers, it has become possible to incorporate these algorithms in hardware. There are many Application Specific Integrated Circuit (ASIC) solutions which offer small-sized, high performance systems. However, these suffer from low flexibility and longer design cycle times [1]. In this context, hardware implementation on Field Programmable Gate Arrays (FPGAs) is an advantage.

The reasons for using FPGAs are in fact many. They have the flexibility to model software processors and design them according to the requirements of the programmer. They offer large logic capacity, exceeding several million equivalent logic gates, and include dedicated memory resources for storing large data. Moreover, they include special hardware circuitry that is often needed in digital systems, such as DSP blocks (with multiply and accumulate functionality) and Phase Locked Loops (PLLs) that support complex clocking schemes. These various features are extremely helpful in analyzing the efficacy of the whole system.

In this paper, we present a preliminary design for acquisition, storage and subsequent analysis of a real speech signal. A simple approach of converting this signal to digital form using an ADC and storing the samples in the memory of an FPGA board is suggested. The paper is organized as follows. Section 2 presents an overview of the project with a block level description. Section 3 describes the hardware design in detail. Section 4 discusses the results obtained at various stages during the sample run of the system. Conclusions and future scope are discussed in Section 5.

2. BASIC BUILDING BLOCKS OF THE SYSTEM

The present scope of the project is to use an FPGA to analyze the characteristics of a real time signal. The basic flow of the project can be summarized through three blocks of operation as shown in Fig. 1. The data from the real world is transformed to a real time electrical signal using a transducer. The obtained real time signal is conditioned by suitable input interface circuit. This electrical signal can be converted to the digital domain by sampling using an ADC and the samples can be stored into an FPGA. FPGA boards have many types of memory embedded on them that can be utilized to store this digital data. Using a Hardware Description Language (HDL), this data can be processed to extract the desired parameters with suitable algorithms.

Fig. 1. Block Diagram level flow of the project

Blocks of Fig. 1: Signal Acquisition Module (Transducer + Interface Circuit); Analog to Digital Conversion Module; Storage and Processing Module.


Fig. 2 Circuit for preprocessing stage

This idea was implemented to process a speech signal and analyze the characteristics of the same. The speech signal is fed to the interface circuit through a microphone. The microphone acts as a transducer converting speech into electrical variations. After some initial preprocessing steps, the signal obtained was fed to an Analog to Digital Converter (ADC) for sampling at a desired rate. The optimum rate of sampling the voice data is an important factor. Though Nyquist criterion provides the sufficient condition for reconstruction of signal, it is not optimum for speech processing. So, sampling at a rate much higher than Nyquist rate to observe the subtle nuances of incoming voice signal is required. The digital samples obtained were stored in the on-chip memory of the FPGA. The stored samples were processed to obtain desired parameters.

3. HARDWARE DESIGN

The overall hardware design consists of the following three modules (i) Signal Acquisition and Pre-processing of signal using the interface circuit, (ii) Sampling the real-time signal to obtain digital data and (iii) Storing the samples in FPGA for analyzing desired parameters.

Signal Acquisition and Preprocessing Stage

The interface circuit is designed to condition the real time signal obtained from the transducer [2] and the circuitry for the pre processing is shown in Fig. 2.

Microphone: The speech signal is fed to the interface circuit through a microphone. The microphone essentially acts as a transducer, converting Speech into electrical variations constituting the real time speech signal.

Low Pass Filter: The speech signal is fed to an active Low Pass Filter stage to remove high frequency interferences.

We know that the speech signal essentially spans the frequency range 0.3 KHz to 3.3 KHz approximately. So an LPF with cutoff frequency of fc = 4 KHz is used to obtain a smooth signal without affecting the spectral contents of the signal itself. It is followed by a buffer for proper interface with the ADC [4].

Amplifier: The signal obtained at this stage is very weak. So an amplification stage is necessary such that the input to the next stage is discernible. The amplifier is realized using an op-amp circuit in its non-inverting configuration. As our ADC has an analog input range of -0.5V to +0.5V, so the gain of the amplifier has been adjusted accordingly.

After these initial preprocessing steps, the signal obtained is a smooth and amplified version of the original speech. This is then input to sampling stage.

Sampling ADC Stage

The input data has to be sampled to obtain the digital samples that can then be stored in the FPGA. The speech signal has frequency components ranging from 0.3 KHz up to about 3.3 KHz. The Nyquist criterion requires the signal to be sampled at rates greater than twice the maximum frequency present (i.e. about 8 KHz) for faithful reconstruction of the speech signal from the samples. However, sampling at this rate would mean sampling twice in one period of the maximum frequency component. This could possibly lead to loss of spectral resolution that can be obtained from the signal. So sampling at higher rate is preferred to obtain higher spectral resolution.
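The trade-off between oversampling and on-chip storage can be sketched with the figures used in this work (a 3.3 kHz speech band, the 6 MHz sampling rate chosen next, and the 8192-word sample buffer described later); the snippet below is illustrative only.

f_max = 3.3e3          # highest speech frequency considered (Hz)
f_s = 6e6              # chosen sampling rate (Hz)
depth = 8192           # on-chip RAM depth in 8-bit samples

print(f"Nyquist rate        : {2 * f_max / 1e3:.1f} kHz")
print(f"Oversampling factor : {f_s / (2 * f_max):.0f}x")
print(f"Capture duration    : {depth / f_s * 1e3:.2f} ms")   # ~1.37 ms, matching the ~1.36 ms quoted later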

The Sampling Rate was fixed at 6 MHz by supplying an external encoded clock. The Analog-to-Digital Converter (ADC) used for our purpose is the Analog Devices AD9057. It is an 8-bit monolithic ADC optimized for low cost, low


power, small size and ease of use. The encoded clock is both TTL and CMOS compatible and can have a frequency ranging from 5 MHz to 80 MHz. It is a low power ADC with a typical power dissipation of 200 mW. Its 8-bit digital outputs can be operated from +3V or +5V power supplies. The permissible analog input range is 1V p-p.

The ADC converts the signal value at each positive clock edge into an 8-bit sample value, which is then supplied to the FPGA for storage and processing. The handshake is obtained by writing a suitable code in the FPGA that accepts the digital samples only for the desired duration.

Storage and Processing

After obtaining the samples from the ADC, there needs to be a mechanism of storage and processing. The Altera Cyclone board (EP1C6Q240C8) had been used to implement the required logic in this project.

Fig 3. Modules in the FPGA implementation

The block diagram shown in Fig. 3 shows the various blocks that have been implemented on the FPGA board. The Input Interface is a user operable interface which directs the FPGA to start storing the data for a finite duration of time. This duration is pre-decided according to the desired number of samples and the sampling frequency. It is connected to on-chip memory where the data is being stored [7]. Along with this, the processing module could be used to compute the FFT of the stored data and then analyze it. This could be done using codes such as the FFT Megacore Function made available by Altera. A 16-point Discrete Fourier Transform using the Radix-2 FFT algorithm could be implemented on the Cyclone FPGA. A 32-point DFT is more suitable for Stratix or Cyclone II boards [6], [8].

Noise needs to be removed for effective computation of the speech signal [3]. The idea is to convert the digital signals back to the analog domain and display them on an oscilloscope. The signals are to be directly fed to a DAC in sequential format, to produce the analog signal.

4. ACTUAL WORKING AND RESULTS

This section includes some waveforms obtained at various stages during the sample run of the entire system. A human speech signal is shown below, which has been generated by continuously repeating the word "Hello". Its waveform displayed on a Digital Storage Oscilloscope (DSO) is shown in Fig. 4. The peak amplitude of the speech before feeding to the low pass filter is 350mV. A noisy interference is clearly visible in the starting portion of the magnified waveform in Fig. 5 during the time the speaker is silent, and it must be removed using filters. The waveforms after passing through the pre-processing stage are shown in Fig. 6.

Fig 4. Noisy Speech Signal displayed on a DSO

Fig 5. Magnified view of the Noisy Speech Signal

Comparing it with the first plot, it is much smoother and has less distortion than the original speech signal. The approximate distance between two peaks is observed to be 9.5 ms, which corresponds to a pitch of 105.2 Hz. From Fig. 6, it can be seen that after the filter stage, the amplitude of the speech signal varies from -1V to +1V. As the input range of the ADC is from -0.5V to +0.5V, the gain should be adjusted accordingly so that it does not violate the ADC range. Hence, the signal obtained from the amplifier stage was fed to the 8-bit ADC for the sampling stage.


Fig 6. Speech Signal after the pre-processing stage

Fig 7. Magnified view of the pre-processed Speech Signal

A RAM of length 8192 words with 8-bit width was instantiated. The Verilog code to write and read its memory contents was synthesized and implemented on the board [5]. The compilation summary of the hardware implemented on the FPGA board is shown below in Table 1. With the memory contents utilized and the sampling frequency of 6 MHz, we can store samples for a duration of 1.36 ms. But with the pitch corresponding to 9.5 ms approximately, it would be difficult to analyze such a signal. We are presently investigating to overcome this limitation by using boards with larger on-chip RAM and by reducing the sampling frequency.

Table 1. Compilation report summarizing the components used, timing analysis and power dissipation
Components Used
  Total Logic Elements: 53
  Total Pins Used: 20
  Total Memory Bits Used: 65,536
Timing Analyzer Summary
  Worst Case Tsu (Setup Time): 7.535 ns
  Worst Case Tco (Processing Time): 11.850 ns
  Worst Case Th (Hold Time): 5.542 ns
Power Consumption Summary
  Total Thermal Power Dissipation: 160.12 mW
  Core Dynamic Thermal Power Dissipation: -
  Core Static Thermal Power Dissipation: 60.00 mW
  I/O Thermal Power Dissipation: 75.69 mW

5. CONCLUSION

In this work, we have presented a simple and faster way to acquire and obtain the characteristics of a signal in general. The idea has been actually implemented in hardware for a speech signal, and the acquisition, pre-processing, sampling and storage have been accomplished. From the graphs, it can be concluded that pre-processing of such a signal is necessary, as noise interference is high even when the speech is fed directly to the microphone and not collected from some distant source. The dynamic power consumption on the FPGA board was quite high, due to the fact that power consumption increases during memory storage and retrieval. The total cost involved for the system is comparatively lower and could be modeled easily with the flexibility that the FPGA provides. The design is reconfigurable, with the number of samples and number of bits parameterized, as well as the type of analyses that can be performed on the stored data. This work could also be extended to study an ECG or EEG signal or to detect any speech disorder by designing an appropriate input interface module.


6. REFERENCES

[1] C. Gonzalez-Concejero, V. Rodellar, A. Alvarez-Marquina, E. Martinez de Icaya and P.Gomez-Vilda, “Designing an Independent Speaker Isolated Speech Recognition System on an FPGA”, IEEE, 2006.

[2] Qifeng Zhu, Abeer Alwan, “The Effect of Additive Noise on Speech Amplitude Spectra: A Quantitative Analysis” IEEE Signal Processing Letters, Vol. 9, No. 9, September 2002.

[3] C. Y. Espy, “Effects of Noise in Signal Reconstruction From its Fourier Transform Phase”, S.M. Thesis, Dept. Elec. Engg. Comput. Sci., Massachusetts Inst. Technol., Cambridge, May 1981.

[4] Ramakant A. Gayakwad, "Op-Amps and Linear Integrated Circuits," Third Edition, Prentice Hall College Div., 1992.

[5] S. Palnitkar, “Verilog HDL”, Second Edition, Prentice Hall, PTR, 2003.

[6] Z. Szadkowski, “16-point Discrete Fourier Transform based on the Radix-2 FFT algorithm implemented into cyclone FPGA as the UHECR trigger for horizontal air showers in the Pierre Auger Observatory”, Nuclear Instruments and Methods in Physics Research, Vol 560, Issue 2, pp. 309-316, 2006.

[7] On-Chip Memory Implementations Using Cyclone Memory Blocks, Jan 2007, online literature available on Altera Cyclone Device family in the following link: www.altera.com/literature/hb/cyc/cyc_c51007.pdf.

[8] Fast Fourier Transform (FFT) with 32K-Point Transform Length, Jan 2007, online literature available on Altera Cyclone Device family in the following link: www.altera.com/support/examples/verilog/ver-fft-32k.html.


Author Index

A. G. Ramakrishnan, 28, 47, 74, 78, 82, 92, 96

A. Kapoor, 140

A. Mitra, 57

Ananthakrishnan G, 47, 78

Anupam Mandal, 86

Anvita Bajpai, 63

Ashok Rao, 2, 6, 32

B. Yegnanarayana, 63, 106

C. Chandra Sekhar, 86

D. Ebenezer, 9

Deepak Balemarthy, 128

Deepak. J. Jayaswal, 36

G. Athithan, 86

Genemala Haobijam, 128

Gupteswar Sahu, 67

H. S. Jayanna, 110

Hemant A. Patil, 102

Hemantha Kumar G, 2, 6, 32

J. Kumar, 28

J. Rout, 121

K. Partha Sarathy, 74

K. R. Prasanna Kumar, 86

K. Sreenivasa Rao, 115

L. N. Sarma, 18

Laxmi Narayana M, 82, 92, 96

M. A. Ansari, 13

M. S. Manikandan, 18

Manjunath Aradhya V. N, 32

Mukesh A. Zaveri, 36

Niket Agrawal, 134

Noushath. S, 2

P. Chopra, 140

P. Krishnamoorthy, 51

Preeti Rao, 42

R. Kumara Swamy, 106

R. Paily, 140

R. S. Anand, 13

Raghavendra. R, 6

Ranjani H. G, 47, 78

Roy Paily, 128, 134

S. Dandapat, 18, 67, 121

S. Manikandan, 9

S. R. M. Prasanna, 51, 57, 110, 115

S. R. Nirmala, 18

T. K. Basu, 102

T. Kasar, 28

T. Vidya Sagar, 115

V. Anil Kumar, 57

Veena Karjigi, 42


