
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 7, NO. 2, APRIL 2012

Recognition of Brand and Models of Cell-Phones From Recorded Speech Signals

Cemal Hanilçi, Figen Ertaş, Tuncay Ertaş, and Ömer Eskidere

Abstract—Speech signals convey various pieces of information such as the identity of the speaker, the language spoken, and linguistic information about the text being spoken. In this paper, we extract information about cell phones from their speech recordings by using mel-frequency cepstrum coefficients and identify their brands and models. Closed-set identification rates of 92.56% and 96.42% have been obtained on a set of 14 different cell phones in experiments using vector quantization and support vector machine classifiers, respectively.

Index Terms—Cell phone recognition, mel-frequency cepstrum coefficients (MFCCs), support vector machines (SVMs), vector quantization (VQ).

I. INTRODUCTION

Speech science has been one of the most challenging areas of research for decades. Speech recognition [1], speaker recognition [2], language recognition [3], speaker diarization [4], emotion recognition [5], and gender recognition [6] are the most popular applications in speech technology. All of these methods use the information that the speech signal carries. Speaker recognition systems use features that represent the speaker's identity, while in speech recognition the information that parameterizes the text being spoken is used. Both types of features are extracted from speech signals. Since speech is a natural signal, voice can be characterized as a biometric. In this work, we address a new problem: recognizing cell phones from their recorded speech. To be more precise, using speech signals, we try to identify the brand and model of the cell phone by which they are recorded. The term brand of a cell phone denotes the manufacturer, e.g., Nokia, Samsung, etc., and the term model of a cell phone denotes the type of product from the same manufacturer, e.g., Nokia 3600, Samsung E250, etc. Our motivation for studying cell-phone-recorded speech is the fact that cell phones have inevitably become an integral part of human life. From a forensic point of view, their widespread use means that much evidence will exist in speech signals recorded through cell phones, and the identification/verification of their brand and model may be a significantly valuable step in the investigation of evidence. Therefore, identification/verification of cell phones from their recorded speech may well be addressed within forensic science.

Manuscript received June 06, 2011; revised November 20, 2011; accepted November 25, 2011. Date of publication December 07, 2011; date of current version March 08, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jiwu Huang.

C. Hanilçi, F. Ertaş, and T. Ertaş are with the Department of Electronic Engineering, Uludağ University, 16059 Bursa, Turkey (e-mail: [email protected]; [email protected]; [email protected]).

Ö. Eskidere is with the Department of Mechatronics, Uludağ University, 16059 Bursa, Turkey (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIFS.2011.2178403

In the literature, there exist works that try to identify source cameras (digital cameras or cell phones) using their recorded images, reporting high identification performance [7]–[14]. For instance, Celiktutan et al. identify cell phones from images using support vector machines [14], and Dirik et al. identify digital single-lens cameras and compact cameras using sensor dust characteristics [7]. However, to the best of our knowledge, there is no previous study aiming to identify or verify cell phones, and hence their brands and models, using their recorded speech, which may also be used as strong evidence by courts or law enforcement officers.

Because of tolerances in the production of electronic components, every realization of an electronic circuit (particularly one including a microphone) cannot have exactly the same transfer function. The transfer function exhibits dissimilarities from one realization to another, which increase further when circuits performing the same task come from different manufacturers. This implies that cell phones may be identified using their recorded speech, as each phone introduces a convolutional distortion on the input speech, leaving the tell-tale footprint (the impact of the transfer function) of its recording circuitry. The recorded speech can therefore be considered as a signal whose original frequency spectrum is multiplied by a transfer function specific to each phone, and it may be explored by signal processing techniques to identify the phone by which the speech was recorded. In the literature of speaker recognition, mel-frequency cepstrum coefficients (MFCCs) have been commonly employed as features to characterize speakers. In this paper, we use MFCCs to characterize recording cell phones and hence identify their brand and model from a given recorded voice sample. However, we do not consider the recording of calls through an RF connection, where the problem becomes more complicated as the characteristics of both the transmitting and receiving ends plus some degree of time-varying channel effects (despite channel equalization at the receiving end) are all involved. We consider only the case of employing cell phones like an ordinary tape recorder.

The rest of the paper is organized as follows. The system definition and fundamental steps of the cell phone recognition algorithm are described in Section II. In Section III, the motivation for identifying cell phones and the MFCC extraction procedure are given in detail. The classification methods are briefly described in Section IV. The experimental results are provided in Section V, and finally the future work and conclusions are discussed in Section VI.



Fig. 1. Generic cell phone recognition system using MFCC feature extraction.

II. SYSTEM DEFINITION

Cell phone recognition can be used as a generic term which refers to two different tasks: cell phone identification and cell phone verification. In the verification task, an identity claim is given to the system as an input, and the system accepts or rejects the given claim. Identification, on the other hand, encompasses two types of application known as closed-set and open-set identification, which are both of academic interest as considerably different tasks. In closed-set identification, the aim is to match the unknown input with one of the recorded voice samples from a set of a priori known cell phones, from which the input is assumed to come. An open-set identification system aims to detect first whether or not the input comes from the cell phone set known to the system a priori, and then matches the input with one of the phones in the set if it decides it does, or rejects the input otherwise. In this paper, we consider only the closed-set identification problem, in which we identify the brand and model of an unknown cell phone using recorded voice samples from a set of cell phones.

A cell phone recognition system consists of three important steps: feature extraction, the modeling algorithm, and the matching algorithm. Feature extraction is the process of extracting the useful information which characterizes the cell phone from a set of given voice samples. The modeling algorithm generates a cell phone model from these features for each phone. The matching algorithm calculates a similarity measure and makes comparisons among the models. The recognition procedure is shown in Fig. 1. As seen from the figure, both the training and recognition stages include the feature extraction step. In the training stage, a model is created for each cell phone, and these models are stored for use in the recognition stage. In the recognition stage of the identification system considered in this study (closed-set), feature vectors for an unknown cell phone are computed from the voice sample and compared with each model stored in the database, and the model which produces the maximum similarity is assigned as the identity of the unknown cell phone; a minimal sketch of this loop is given below.
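To make this closed-set loop concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation); `extract_features`, `train_model`, and `similarity` are hypothetical stand-ins for the MFCC front-end and the VQ/SVM back-ends described in Sections III and IV:

```python
# Illustrative closed-set identification loop (not the authors' code).
# extract_features, train_model and similarity are hypothetical stand-ins
# for the MFCC front-end and VQ/SVM back-ends described in Sections III-IV.

def enroll(training_recordings):
    """Build one model per known cell phone from its training recordings."""
    models = {}
    for phone_id, wavs in training_recordings.items():
        features = [extract_features(w) for w in wavs]
        models[phone_id] = train_model(features)
    return models

def identify(test_recording, models):
    """Closed-set decision: the model with maximum similarity wins."""
    features = extract_features(test_recording)
    return max(models, key=lambda phone_id: similarity(features, models[phone_id]))
```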

III. FEATURE EXTRACTION

The most common features used in speech applications are based on the spectrum of the speech signal, in which the desired information is embedded. For our purpose, the desired information is only the device-specific portion of the whole information contained in the recorded speech, and the aim of feature extraction is to capture device-specific discriminatory information using a suitable representation that effectively reflects differences.

A. Motivation

Let us consider the recording section of the cell phone as a linear time-invariant filter with impulse response $h(n)$, and therefore the recorded speech signal $y(n)$ as the output of this filter in response to the original speech signal $x(n)$, given by $y(n) = x(n) \ast h(n)$. Since speech is not a stationary signal, it is divided into overlapped frames within which the signal is assumed to be stationary. The $m$th short-time segment (frame) of the recorded input speech can be written as

$y_m(n) = \left[ x(n) \ast h(n) \right] w(n - mL) \qquad (1)$

where $w(n)$ is a window function of length $N$ and $L$ is the frame shift. Therefore, the impact of the cell phone on the recorded speech can be considered as a convolutional distortion that may help identify the recording cell phone. However, the identity of the cell phone is embedded in the recorded signal in the form of a convolution through the impulse response of its recording stage, which needs to be converted into a form better suited to identification. For this purpose, taking the short-time Fourier transform for a better perception of the device effect, the windowed speech segment in (1) is given by

$Y_m(\omega) = X_m(\omega) H(\omega). \qquad (2)$

It is seen from (2) that the device leaves its tell-tale footprints on the recorded speech by modifying the spectrum of the input speech signal. For instance, Fig. 2 shows the spectrum envelopes of the same utterance recorded by cell phones of different brands. It can be clearly seen from the figure that each cell phone introduces its own differences on the spectrum. One way of identifying the recording phone is to find a means to dig out its footprints from the recorded speech and directly use them in a suitable form to accomplish identification. An indirect way, however, is to consider $H(\omega)$ as concatenated with the vocal tract transfer function that produced the input speech (the input to the cell phone's recorder stage) and perceive the recorded speech $y(n)$ as a new original speech stemming from the cell phone's recorder. Then, with the equivalent transfer function of the vocal tract and the cell phone's recorder, one may consider the cell phone as the original source of the recorded speech. Considering $X_m(\omega) = E_m(\omega) V_m(\omega)$ and $G_m(\omega) = V_m(\omega) H(\omega)$, where $E_m(\omega)$ is the excitation function, $V_m(\omega)$ is the vocal tract transfer function for the speech in the $m$th frame, and $G_m(\omega)$ is the equivalent transfer function that characterizes the cell phone, (2) may now be represented as

$Y_m(\omega) = E_m(\omega) G_m(\omega). \qquad (3)$
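As a toy numerical illustration of (1)–(3) (a sketch with invented filter coefficients, not data from the paper), the following Python snippet "records" the same signal through two slightly different FIR circuits and shows that the two recordings differ by a device-specific spectral factor:

```python
import numpy as np

# Toy illustration of the convolutional device footprint in (1)-(3):
# the same "speech" x(n) recorded through two slightly different
# impulse responses h1(n), h2(n) (coefficients invented for illustration).
rng = np.random.default_rng(0)
x = rng.standard_normal(8000)            # stand-in for a speech signal
h1 = np.array([1.0, 0.45, 0.10])         # "phone 1" recording circuit
h2 = np.array([1.0, 0.30, 0.22, 0.05])   # "phone 2" recording circuit

y1 = np.convolve(x, h1)                  # y(n) = x(n) * h(n)
y2 = np.convolve(x, h2)

# In the frequency domain the footprint is multiplicative, Y = X.H, so the
# ratio of the two recorded spectra is |H1(w)| / |H2(w)|: X(w) cancels.
spec = lambda s: np.abs(np.fft.rfft(s, 8192))
diff_db = 20 * np.log10(spec(y1) / spec(y2))
print(diff_db[:8])                        # device-specific spectral tilt, in dB
```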


Fig. 2. Spectrum envelopes for different cell phones computed from the same speech sample.

By this approach, device (cell phone) identification is tantamount to speaker identification, and the procedures of speaker identification can be directly applied to the task of cell phone identification. There are various speech feature extraction techniques based on spectrum analysis, such as the cepstrum with a mel-scale filter bank and linear predictive cepstrum coefficients (LPCCs), the former of which is the most commonly used speech parametrization [15]. Hence, in this paper we investigate the performance of MFCCs with vector quantization (VQ) and support vector machine (SVM) classifiers in identifying the brand and model of cell phones from their recorded speech samples.

B. Mel-Frequency Cepstrum Coefficients

The speech spectrum has a lot of detail, but the details themselves are not of interest as they stand. Rather, the envelope of the spectrum multiplied by a filter bank is preferred [15]. Although mel spacing of the filters in the filter bank offers little gain over the Bark scale (another perceptually motivated filter-bank spacing) or uniform spacing in automatic speech recognition, especially in matched training and testing conditions [19], as in our case, it provides considerable benefit in speaker identification. For instance, in an identification experiment on the whole set of cell phones given in Table I (14 phones in total) using the VQ classifier, an identification performance of 81.79% is achieved with the mel-frequency filter bank, while the Bark-frequency filter bank yields a 55% identification rate. Therefore, the cepstrum with a mel-scale filter bank is more suitable than the cepstrum with a Bark-scale filter bank as front-end processing.

TABLE I
BRANDS AND MODELS OF CELL-PHONES USED IN EXPERIMENTS AND THEIR CLASS NAMES

After filter-bank smoothing with transfer function $W(\omega)$ and the subsequent logarithm operation on the output energies, (3) can be written as

$\log |Y_m(\omega) W(\omega)| = \log \left( |E_m(\omega)|\, |G_m(\omega)|\, |W(\omega)| \right) \qquad (4)$


and by further representing $\tilde{G}_m(\omega) = G_m(\omega) W(\omega)$ as the weighted equivalent transfer function, (4) may be put into a more convenient form as

$\log |Y_m(\omega) W(\omega)| = \log |E_m(\omega)| + \log |\tilde{G}_m(\omega)| \qquad (5)$

where we see the nonlinearly transformed terms in additive form. The success of the MFCC in characterizing the device probably lies in this nonlinear transformation with the additive property. Now, transforming back to the time domain, we obtain the features extracted from the $m$th frame of the recorded speech as the sum of device-specific and source-specific (excitation) features, given by

$c_y(m, j) = c_e(m, j) + c_{\tilde{g}}(m, j) \qquad (6)$

in which $c_y$, $c_e$, and $c_{\tilde{g}}$ are the cepstra of the recorded speech, the excitation, and the weighted equivalent impulse response of the cell phone recorder characterizing the cell phone, respectively. In (6), $j$ indexes the filters in the filter bank. Note that the convolutionally embedded device-specific information in the recorded speech has now been converted to additive form with a representation suited to identification. Being in a suitable form, $c_y$ may then be used for further processing to identify the recording phone. The significance of the nonlinear (logarithmic) transformation of features into additive form on identification rates can be clearly seen by inspection of the experimental results given in Table VI in Section V.
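The additive decomposition in (6) is easy to verify numerically. The following sketch (illustrative, with an invented impulse response) checks that the real cepstrum of a convolved signal equals the sum of the individual cepstra:

```python
import numpy as np

# Numerical check of the additive cepstral decomposition in (6):
# cepstrum(x * h) ~= cepstrum(x) + cepstrum(h). The FFT length covers the
# full linear convolution, so the equality holds up to float error.
def real_cepstrum(signal, n_fft=4096):
    log_mag = np.log(np.abs(np.fft.rfft(signal, n_fft)) + 1e-12)
    return np.fft.irfft(log_mag)

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)           # stand-in excitation/speech
h = np.array([1.0, 0.5, 0.2, 0.05])     # stand-in device impulse response
y = np.convolve(x, h)                   # recorded signal

c_sum = real_cepstrum(x) + real_cepstrum(h)
c_y = real_cepstrum(y)
print(np.max(np.abs(c_y - c_sum)))      # ~0: convolution became addition
```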

The computation steps of the MFCC are shown in Fig. 1. The most often used frame lengths in speech signal processing are 20 or 30 ms with 10 or 15 ms overlap. In the experiments, we have used 30-ms-long frames with 15 ms overlap. Each frame is windowed by a window function; the Hamming and Hanning windows are the two most commonly used, and in the experiments we have used the Hamming window. From the windowed frame, the fast Fourier transform (FFT) magnitude spectrum is computed, and the spectrum is filtered using a bank of triangular bandpass filters. Filter banks are characterized by the shape of the filters and the localization of their frequencies (edge and central frequencies). We have used $K$ triangular filters equally spaced on the mel scale. The mel scale is the most commonly used frequency scale; it is linear up to 1000 Hz and logarithmic above 1000 Hz, and the central frequencies of the filters on the mel scale are computed by [16], [17]

$f_{\text{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (7)$

where the logarithm is taken to base 10. The logarithms of the filter bank outputs are taken and multiplied by 20 to obtain the spectral envelope in decibels. Finally, the MFCCs are obtained by taking the discrete cosine transform (DCT) of the spectral envelope. Now, denoting $c_y(m, i)$ as $c_i$ hereafter for convenience, the cepstrum coefficients are obtained as

$c_i = \sum_{k=1}^{K} S_k \cos\left[ \frac{i (k - 0.5) \pi}{K} \right], \qquad i = 1, \ldots, M \qquad (8)$

where $c_i$ is the $i$th MFCC coefficient, $K$ is the number of triangular filters in the filter bank, $S_k$ is the log-energy output of the $k$th filter, and $M$ is the number of MFCC coefficients that we want to calculate. The 0th MFCC coefficient, $c_0$, is the average log energy of the speech frame and is commonly discarded. For more details about MFCC extraction, readers are referred to [2] and [15].
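The front-end just described can be sketched in a few lines of Python. The frame length, shift, window, mel scale of (7), log, and DCT follow the text; the filter and coefficient counts are placeholder values, since the paper's exact figures did not survive extraction:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, frame_ms=30, shift_ms=15, n_filt=26, n_ceps=12):
    """Sketch of the MFCC front-end described above: 30 ms Hamming frames,
    15 ms shift, triangular mel filter bank, log envelope, DCT. n_filt and
    n_ceps are placeholders for the paper's (elided) values."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_fft = 512
    window = np.hamming(frame_len)

    # Triangular filters with center frequencies equally spaced on the
    # mel scale of (7): f_mel = 2595 * log10(1 + f / 700).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filt + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft))          # FFT magnitude
        log_energy = 20 * np.log10(fbank @ power + 1e-12)  # envelope in dB
        c = dct(log_energy, type=2, norm='ortho')[1:n_ceps + 1]  # drop c0
        coeffs.append(c)
    return np.array(coeffs)
```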

In addition to the MFCC features, their derivatives (also known as delta features) are also important for characterizing the embedded information. In particular, if there is variability in the recording conditions, such as different microphones, transmission channels (landline or wireless), or sessions between recordings, delta features improve the recognition accuracy. The first-order derivative of a set of MFCC feature vectors, $\{c_t\}_{t=1}^{T}$, is computed by

$\Delta c_t = \frac{\sum_{p=1}^{P} p \left( c_{t+p} - c_{t-p} \right)}{2 \sum_{p=1}^{P} p^2} \qquad (9)$

where each feature vector $c_t$, $t = 1, \ldots, T$, is of dimension $M$, $T$ is the number of frames to be analyzed, and $P$ is the number of neighboring frames used in the delta cepstrum computation. Typically, $P$ is selected as 1, 2, or 3. The second-order derivatives (also known as double-delta features) of the MFCCs can be computed in the same way using the deltas. In this paper, we have used the MFCCs along with their first-order derivatives as the feature set for each frame in the experiments.
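A direct transcription of the delta regression in (9), as a sketch:

```python
import numpy as np

def delta(ceps, P=2):
    """First-order deltas per (9): regression over +/-P neighboring frames.
    ceps is a (T, M) array of MFCC vectors; boundary frames are handled by
    replicating the edge vectors."""
    T = len(ceps)
    padded = np.pad(ceps, ((P, P), (0, 0)), mode='edge')
    denom = 2 * sum(p * p for p in range(1, P + 1))
    return np.array([
        sum(p * (padded[t + P + p] - padded[t + P - p])
            for p in range(1, P + 1)) / denom
        for t in range(T)
    ])
```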

A way of visualizing the differences introduced by cell phones on the features is to use histograms. For this purpose, we illustrate in Fig. 3 the histograms of a single feature, the sixth MFCC coefficient $c_6$, for a set of cell phones of different brands (one model from each), as an example of the behavior of the same feature on each cell phone. In the figure, the same speech sample from a particular speaker is used to obtain the features, so as to isolate the impact of the cell phone brand. It can be seen from the figure that each cell phone makes its own difference in the features: different brands induce different behavior in the same feature, hence resulting in different histograms. Fig. 3 has been obtained by averaging over 100 trials of different utterances on each unit.

In Fig. 3, only one coefficient was used to illustrate the discriminatory effect of MFCC features on cell phones of different brands. However, MFCC features are effective in characterizing not only cell phones of different brands but also units of the same model of a fixed brand. To illustrate this, let us pick two phones of exactly the same brand and model and examine the squared Euclidean distance between each component of their MFCC feature vectors. For this purpose, we choose two Nokia 3600, two Samsung E250, and two Sony W880 cell phones. For each pair, the average squared Euclidean distances $(c_j - c_j')^2$, $j = 1, \ldots, M$, are calculated over 100 different speech samples, and the distances, marked with the associated standard deviation, are illustrated in Fig. 4. Here, $c_j$ and $c_j'$ represent the resulting $j$th cepstrum coefficient of each phone in a pair in response to the same input. As seen from the figure, each cell phone pair exhibits some distance between corresponding coefficients although the phones do not differ in brand or model, justifying the effectiveness of MFCC features in characterizing cell phones in all cases. These differences in feature behavior are extremely useful information and therefore enable classifiers to discriminate between phones even when they are of the same brand and model (see the results of the experiments with the Nokia subset in Tables II and VII); a short sketch of this pairwise comparison follows the figure captions below.


Fig. 3. Histograms of feature $c_6$ for each cell phone model.

Fig. 4. Average squared Euclidean distances of each MFCC feature for cell phone pairs of the same brand and model.

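The pairwise comparison behind Fig. 4 can be sketched as follows (assuming the `mfcc()` sketch above; `recordings_a` and `recordings_b` are hypothetical lists of utterances recorded by two units of the same brand and model):

```python
import numpy as np

# Sketch of the per-coefficient comparison behind Fig. 4 (assumes the
# mfcc() sketch above; recordings_a / recordings_b are hypothetical lists
# of the same utterances recorded by two units of the same brand and model).
def mean_coefficient_distances(recordings_a, recordings_b):
    dists = []
    for wav_a, wav_b in zip(recordings_a, recordings_b):
        c_a = mfcc(wav_a).mean(axis=0)   # average cepstrum per recording
        c_b = mfcc(wav_b).mean(axis=0)
        dists.append((c_a - c_b) ** 2)   # squared distance per coefficient
    dists = np.array(dists)
    return dists.mean(axis=0), dists.std(axis=0)  # mean, std over samples
```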

IV. CLASSIFICATION ALGORITHMS

In this section, we briefly describe the classification algorithms.

A. Vector Quantization-Based Classification

VQ [18] is one of the simplest classification algorithms. It was originally developed as a compression algorithm but was later used for classification [20], [21]. It is one of the easiest algorithms to implement in real-time applications. For a given input data set (training feature vectors) $X = \{x_1, x_2, \ldots, x_T\}$ consisting of $T$ vectors, each of dimension $D$, the VQ algorithm aims to partition $X$ into $K$ separate clusters. Each cluster is represented by a code vector $r_i$, which is the average vector (centroid) of the cluster, and the whole set of code vectors $R = \{r_1, r_2, \ldots, r_K\}$ is known as the codebook and represents a cell phone's model.


TABLE II
RECOGNITION RATES (IN %) FOR VQ-BASED SYSTEM ON NOKIA SUBSET

TABLE III
CONFUSION MATRIX FOR NOKIA SUBSET USING TIMIT DATA

TABLE IV
RECOGNITION ACCURACIES (IN %) OF VQ-BASED CELL-PHONE RECOGNITION SYSTEM FOR DIFFERENT CODEBOOK SIZES ON WHOLE SET

VQ finds the codebook for the given training feature vectors by minimizing an objective function. The most widely used objective function is the mean-squared error (MSE), defined as

$\text{MSE}(X, R) = \frac{1}{T} \sum_{t=1}^{T} \min_{1 \le i \le K} d(x_t, r_i) \qquad (10)$

where $d(x, r)$ is the squared Euclidean distance between the vectors $x$ and $r$ in $D$-dimensional vector space,

$d(x, r) = \| x - r \|^2 = \sum_{l=1}^{D} (x_l - r_l)^2. \qquad (11)$

In the recognition stage of a cell phone recognition system, the MSE between the feature vectors $X = \{x_1, \ldots, x_T\}$ of the unknown cell phone and the codebook $R_i$ of each trained cell phone, $i = 1, \ldots, C$, is calculated, and the cell phone whose model (codebook) generates the minimum MSE is determined as the identity of the unknown cell phone, namely

$\hat{\imath} = \arg\min_{1 \le i \le C} \text{MSE}(X, R_i). \qquad (12)$

Selection of the codebook generation algorithm is one of the most important issues to be considered in VQ. There are various codebook generation algorithms in the literature, such as LBG [18], K-means [22], pairwise nearest neighbor (PNN) [23], and self-organizing maps (SOM) [24]. LBG and K-means are the most popular. Since the K-means algorithm is highly dependent on the selection of the initial codebook, we generated codebooks starting with the LBG algorithm followed by 20 K-means iterations, for coarse and fine tuning respectively, as in [25].
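A compact sketch of this VQ back-end is given below; a plain K-means stands in for the paper's LBG initialization followed by 20 K-means iterations, and `codebooks` is a hypothetical dict mapping phone labels to trained codebooks:

```python
import numpy as np

def train_codebook(features, K=64, iters=20, seed=0):
    """Toy codebook trainer: plain K-means stands in for the paper's
    LBG initialization plus 20 K-means refinement iterations.
    features is a (T, D) array; returns a (K, D) codebook."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest code vector (squared Euclidean).
        d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(K):
            members = features[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)  # centroid update
    return codebook

def mse(features, codebook):
    """Objective (10): mean over frames of the minimum distance (11)."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()

def identify_vq(features, codebooks):
    """Decision rule (12): the phone whose codebook gives minimum MSE."""
    return min(codebooks, key=lambda phone: mse(features, codebooks[phone]))
```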

B. Support Vector Machine-Based Classification

SVM is a powerful discriminative classifier and has become the state-of-the-art classification method in many pattern recognition applications, such as speaker verification [26]. Kocal et al. have applied SVMs to speech steganalysis [27], and Kinnunen et al. have used SVMs to detect speech and silence parts of a voice sample [28]. The SVM is a binary classifier which models the decision boundary between two classes as a separating hyperplane. In the training stage, the SVM finds, through an optimization process, a separating hyperplane which maximizes the margin of separation between the two classes. An interesting area of application of SVMs is speech processing. A challenging problem in applying SVMs to speech processing is dealing with a huge amount of data: since features are extracted from every 30-ms frame with a 50% frame shift, the training/test data for a speech sample is a sequence of vectors rather than a single vector. In [29], the size of the training data was first reduced via a clustering method to avoid this problem. In recent years, however, the use of sequence kernels for applying SVMs to speech processing has gained more attention. The generalized linear discriminant sequence (GLDS) kernel is a well-known and powerful application of SVMs in speaker and language recognition [30]–[32]. The GLDS method creates a single characteristic vector from the sequence of feature vectors extracted from a speech sample. The feature vectors are mapped into the kernel feature space by a polynomial expansion [33]. For example, a second-order polynomial expansion of a 2-D vector $x = [x_1\ x_2]^T$ is given by $\phi(x) = [1\ x_1\ x_2\ x_1^2\ x_1 x_2\ x_2^2]^T$. During the training of the SVM in our cell phone recognition system, each cell phone is represented by a single vector which is the average of the expanded feature vectors:

$b = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t). \qquad (13)$
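A sketch of the GLDS mapping and the averaging in (13); `poly_expand` enumerates all monomials up to the chosen degree, matching the second-order example above:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(x, degree=2):
    """All monomials of x up to `degree`; for x = [x1, x2] and degree 2
    this yields [1, x1, x2, x1^2, x1*x2, x2^2], as in the example above."""
    x = np.asarray(x)
    terms = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(np.prod(x[list(idx)]))
    return np.array(terms)

def glds_vector(features, degree=2):
    """Average expanded feature vector b of (13) for one speech sample."""
    return np.mean([poly_expand(f, degree) for f in features], axis=0)
```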

Since SVMs are two-class classifiers, we have used the LibSVM package [34], in which a multi-class SVM classifier is available. LibSVM uses the one-against-one approach for multi-class classification, constructing $C(C-1)/2$ classifiers, where $C$ is the number of classes. An advantage of this package is that it predicts class probability information


TABLE V
CONFUSION MATRIX OF VQ-BASED CELL-PHONE RECOGNITION SYSTEM ON TIMIT DATABASE

besides the class labels. The details of the implementation of multi-class SVMs can be found in [35]. For a given test vector $x$, the algorithm produces a $C$-dimensional confidence vector $p = [p_1, p_2, \ldots, p_C]^T$, where each component $p_i$ indicates the probability that the vector belongs to the $i$th class. The decision rule is defined as $\hat{c} = \arg\max_{1 \le i \le C} p_i$.
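Scikit-learn's `SVC` wraps LibSVM and exposes the same one-against-one training and pairwise-coupling probability estimates, so the back-end can be sketched as follows (using the hypothetical `glds_vector` from the previous sketch):

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps LibSVM

# Sketch of the multi-class back-end: one GLDS vector per training sample,
# one-against-one SVM with probability estimates, arg-max decision rule.
# `train_samples` maps phone labels to lists of (T, M) feature arrays.
def train_svm(train_samples, degree=2):
    X = [glds_vector(f, degree) for fs in train_samples.values() for f in fs]
    y = [label for label, fs in train_samples.items() for _ in fs]
    clf = SVC(kernel='linear', probability=True)  # pairwise probabilities
    return clf.fit(np.array(X), np.array(y))

def identify_svm(features, clf, degree=2):
    p = clf.predict_proba([glds_vector(features, degree)])[0]
    return clf.classes_[np.argmax(p)]  # class with maximum probability
```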

V. EXPERIMENTS

In the experiments, we have used 14 models of cell phones. Compared to the work in [8], [10], [11], and [14], where 4, 16, 9, and 6 camera models were used, respectively, using 14 models in our experiments can therefore be considered adequate, at least for the purpose of presenting preliminary results in this emerging field. It is expected that the topic will be explored by researchers from various aspects. The collection of brands includes Nokia, Samsung, Sony Ericsson, LG, Motorola, and HP: five Nokia, three Samsung, one LG, three Sony Ericsson, one Motorola, and one HP model have been used in the experiments. The brands and models of the cell phones are listed in Table I.

We have used two different databases to investigate the performance of our cell phone recognition system. First, we have used the TIMIT database [36] as the source of the voice samples recorded by the cell phones. TIMIT is a popular speech/speaker recognition database which has been used in many applications such as speech recognition, speaker identification, and speech steganalysis [27], [37], [38]. It consists of 630 speakers from different dialects of American English (192 female and 438 male), and each speaker reads ten sentences, each approximately 3 s long. Two of the ten sentences are read by every speaker, and the remaining eight sentences differ. We have selected 24 speakers from the test portion of the database, and the 240 sentences of these 24 speakers were played and recorded by each cell phone in a silent environment. With this, we have 240 sentences for each cell phone, totalling 3360 voice samples. For each cell phone, we have used 120 sentences for training and the remaining 120 sentences for testing (a total of 1680 individual tests). Apart from TIMIT, we have built a database by recording speech using each of the 14 cell phones, which we refer to as LIVE RECORDS in the sequel. For each cell phone, speech was recorded in the same room: about 10 min of speech spoken by the same speaker. Half of the recording (5 min) is used to train each phone, and the remaining 5-min portion is segmented into 3-s-long chunks for testing (100 test sentences for each phone, a total of 1400 tests).

Experiments have been conducted with two different setups. In one, a small group of cell phones of the same brand (five Nokia cell phones) has been used to evaluate the performance of the recognition systems on phones from the same manufacturer; this small group is hereafter referred to as the Nokia subset. In the other, the whole set of cell phones (14 phones in total) with various brands and models has been used; this large group is referred to as the whole set in the following. Clearly, the recognition task on the former group is more challenging. The total number of test samples employed in the Nokia subset is 600 (120 tests per phone) and 500 (100 tests per phone) for the TIMIT and LIVE RECORDS databases, respectively. The recorded speech signals are in the adaptive multi-rate (AMR) compression format for all phones, with 8 kHz sampling frequency and 12.2 kb/s bit rate.

A. Experiments Using VQ

For the VQ-based classification system, we have used seven different codebook sizes to analyze the impact of codebook size on recognition accuracy. Experiments have been conducted on the Nokia subset and the whole set using the TIMIT and LIVE RECORDS databases. Table II summarizes the recognition rates for different codebook sizes for each database on the Nokia subset. The confusion


TABLE VI
PERFORMANCE COMPARISON OF MFCCS AND DCT OF MFBE

TABLE VII
RECOGNITION RATES (IN %) OF SVM-BASED CELL-PHONE RECOGNITION SYSTEM ON NOKIA SUBSET

table of this small group on the TIMIT database is given in Table III. It is seen from the table that the system makes wrong decisions mostly on the same units (separate cell phones of the same brand and model, N2 and N3, which are both Nokia 3600).

The performance of the VQ-based system for different codebook sizes on the whole set (14 phones in total) is given in Table IV, and the corresponding confusion matrix is given in Table V. As seen from Table IV, the recognition performance improves consistently with increasing codebook size. In the experiments with the Nokia subset (Table II), however, the performance improvement slows down beyond a certain codebook size, while a significant increase in recognition rate is observed for the smaller codebook sizes.

An experiment has also been conducted to illustrate the significance of the logarithmic transformation in the frequency domain on identification performance, as emphasized in Section III-B. The experiment has been performed on the whole set with the TIMIT and LIVE RECORDS data without taking the logarithm of the mel-frequency filter bank output energies. The identification results are presented in Table VI, labeled as the DCT of MFBE (mel filter bank energies), with the results given in Table IV for comparison. It is seen from the table that omitting the log operation in feature extraction results in a dramatic loss of identification performance. This result itself illustrates the significance of the nonlinear transformation with additive property in capturing the relevant information, and hence the suitability of the cepstrum for cell phone (or, equivalently, speaker) identification.

B. Experiments Using SVM

With the SVM-based classification, two values are used as the degree of the polynomial expansion for the

TABLE VIII
CONFUSION MATRIX FOR NOKIA SUBSET ON TIMIT DATA USING SVM

TABLE IX
RECOGNITION RATES (IN %) OF SVM-BASED CELL-PHONE RECOGNITION SYSTEM

GLDS kernel described in Section IV, and recognition experiments have been conducted on the Nokia subset and the whole set using both databases, as in the case of the VQ-based system. The recognition rates of the SVM-based system with the different polynomial degrees on the two databases for the Nokia subset are given in Table VII, and the confusion table for the TIMIT database is given in Table VIII. It can be seen from Tables VII and VIII that the SVM achieves great performance in recognizing cell phones from the same manufacturer. It can also be seen that the SVM outperforms VQ (see Tables II and III) both in terms of recognition rate in general and in distinguishing between phones of the same brand and model (the N2–N3 pair). The system performance on the whole set is given in Table IX for both databases, and the confusion table is given in Table X. As seen from both tables, the performance of the SVM is superior to that of VQ on both databases in terms of recognition rates. It should be noted that, compared to the VQ-based system, the SVM also achieves great performance in discriminating cell phones of the same brand and model (the N2–N3, SO2–SO3, and SA1–SA2 pairs), such as those in the Nokia subset.

C. Discussion

Experimental results show that the SVM classifier achieves higher identification rates than the VQ-based system. This is probably due to the fact that the SVM better captures the differences among the classes [39]. It is generally seen that the results obtained with LIVE RECORDS are higher in the experiments on the whole set with VQ classification. The reason for this is that the LIVE RECORDS database consists of only one speaker and the content of the speech used for each device is the same, whereas both vary in the TIMIT database, which explains the difference. Mathematically speaking, in the TIMIT case, both $V_m(\omega)$ and $H(\omega)$ change from device to device. In the LIVE RECORDS case, however, the vocal tract transfer function $V_m(\omega)$ is fixed, since there is only one speaker, and the only change is in the device transfer function $H(\omega)$, thereby yielding higher identification rates.


TABLE X
CONFUSION MATRIX OF SVM-BASED CELL-PHONE RECOGNITION SYSTEM ON TIMIT DATABASE

It is seen that the performance gap closes as the codebook size increases with the VQ classifier. Experiments on the Nokia subset show that the identification rate on the LIVE RECORDS database worsens beyond a certain codebook size. This is due to the fact that the changes in the character of the equivalent transfer functions of the cell phones are less pronounced, since $H(\omega)$ changes less from phone to phone (same manufacturer). With the SVM classification, however, performance on the TIMIT database is higher regardless of whether the experiment is on the subset or the whole set. This is due to the fact that the small variations of the equivalent transfer function $G_m(\omega) = V_m(\omega) H(\omega)$ from device to device in the LIVE RECORDS case cannot be as discriminatory as in the case of the TIMIT database, where $G_m(\omega)$ changes more from device to device, since SVM classifiers map the features into a higher dimension compared to VQ classifiers. Therefore, SVM classifiers perform better on the TIMIT database, with either the whole set or the Nokia subset.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have addressed the new problem of recognizing the brand and model of a cell phone from its recorded speech. The following conclusions are evident from the results:

1) The speech signal conveys information about the source device.

2) MFCC features capture the characteristics of the source device and can be used as forensic features to recognize cell phones from their recorded speech.

3) Both VQ and SVM classifiers achieve remarkable recognition performance.

4) SVM outperforms the VQ-based classification system in terms of recognition performance. However, the VQ algorithm requires less computation and is easier to implement.

5) SVM is superior to VQ in recognizing cell phones of the same brand and model.

As further work, we plan to extend this preliminary study in the following ways: 1) cell phone recognition with a larger number of phones, including larger subsets of the same brand and model, to investigate the ability to recognize identical units; 2) extending cell phone identification to source device identification using a mixture of cell phones and voice recorders; 3) cell phone identification with new feature sets; and 4) analyzing the performance of a cell phone verification system.

ACKNOWLEDGMENT

The authors would like to thank İ. Avcıbaş for his valuablecomments.

REFERENCES

[1] Y. Gong, "Speech recognition in noisy environments: A survey," Speech Commun., vol. 16, no. 3, pp. 261–291, 1995.
[2] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., vol. 52, no. 1, pp. 12–40, 2010.
[3] W. Campbell, F. Richardson, and D. Reynolds, "Language recognition with word lattices and support vector machines," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2007, pp. 989–992.
[4] S. Tranter and D. Reynolds, "An overview of automatic speaker diarization systems," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 5, pp. 1557–1565, May 2006.
[5] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model based speech emotion recognition systems," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2003.
[6] H. Ting, Y. Yingchun, and W. Zhaohui, "Combining MFCC and pitch to enhance the performance of gender recognition," in Proc. 8th Int. Conf. Signal Processing, 2006.
[7] E. Dirik, H. T. Sencar, and N. Memon, "Source camera identification based on sensor dust characteristics," in Proc. IEEE Workshop Signal Processing Applications for Public Security and Forensics, 2007.
[8] A. E. Dirik, H. T. Sencar, and N. Memon, "Digital single lens reflex camera identification from traces of sensor dust," IEEE Trans. Inform. Forensics Security, vol. 3, no. 3, pp. 539–552, Jun. 2008.
[9] Y. Fang, A. E. Dirik, X. Sun, and N. Memon, "Source class identification for DSLR and compact cameras," in Proc. IEEE Int. Workshop Multimedia Signal Processing, Oct. 2009.
[10] J. Lukas, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Trans. Inform. Forensics Security, vol. 1, no. 2, pp. 205–214, Apr. 2006.


[11] C.-T. Li, "Source camera identification using enhanced sensor pattern noise," IEEE Trans. Inform. Forensics Security, vol. 5, no. 2, pp. 280–287, Apr. 2010.
[12] Y. Long and Y. Huang, "Image based source camera identification using demosaicking," in Proc. IEEE Int. Workshop Multimedia Signal Processing, Oct. 2006.
[13] M. J. Tsai, C. L. Lai, and J. Liu, "Camera/mobile phone source identification for digital forensics," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2007.
[14] O. Celiktutan, B. Sankur, and I. Avcibas, "Blind identification of source cell phone model," IEEE Trans. Inform. Forensics Security, vol. 3, no. 3, pp. 553–566, Jun. 2008.
[15] F. Bimbot, J. F. Bonastre, C. Fredouille, G. Gravier, I. M. Chagnolleau, S. Meignier, T. Merlin, J. O. Garcia, D. P. Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Processing, vol. 4, pp. 430–451, 2004.
[16] T. Kinnunen, V. Hautamaki, and P. Franti, "Fusion of spectral feature sets for accurate speaker identification," in Proc. Int. Conf. Speech and Computer, 2004.
[17] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Piscataway, NJ: IEEE Press, 1999.
[18] Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–94, Jan. 1980.
[19] B. J. Shannon and K. K. Paliwal, "A comparative study of filter bank spacing for speech recognition," in Proc. Microelectronic Eng. Res. Conf., 2003.
[20] C. Hanilci and F. Ertas, "Principal component based classification for text-independent speaker identification," in Proc. IEEE Int. Conf. Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control, Sep. 2009.
[21] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, "A vector quantization approach to speaker recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1985.
[22] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, "An efficient K-means clustering algorithm: Analysis and implementation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, pp. 881–892, 2002.
[23] H. W. Equitz, "A new vector quantization clustering algorithm," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 10, pp. 1568–1575, 1989.
[24] N. M. Nasrabadi and Y. Feng, "Vector quantization of images based upon the Kohonen self-organizing feature maps," in Proc. IEEE Int. Conf. Neural Networks, 1988, pp. 101–108.
[25] C. Hanilçi and F. Ertaş, "Comparison of the impact of some Minkowski metrics on VQ/GMM based speaker recognition," Computers Electrical Eng., vol. 37, no. 1, pp. 41–56, 2011.
[26] W. M. Campbell, J. P. Campbell, T. P. Gleason, D. A. Reynolds, and W. Shen, "Speaker verification using support vector machines and high-level features," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 7, pp. 2085–2094, 2007.
[27] O. H. Kocal, E. Yuruklu, and I. Avcibas, "Chaotic type features for speech steganalysis," IEEE Trans. Inform. Forensics Security, vol. 3, no. 4, pp. 651–665, Aug. 2008.
[28] T. Kinnunen, E. Chernenko, M. Tuononen, P. Franti, and H. Li, "Voice activity detection using MFCC features and support vector machine," in Proc. Int. Conf. Speech and Computer, 2007, pp. 556–561.
[29] Z. Lei, Y. Yang, and Z. Wu, "Mixture of support vector machines for text-independent speaker recognition," in Proc. Interspeech, 2005.
[30] W. M. Campbell, J. Campbell, D. A. Reynolds, and P. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Computer Speech Language, vol. 20, no. 2, pp. 210–229, 2006.
[31] W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo, and D. A. Reynolds, "Language recognition with support vector machines," in Proc. Odyssey: Speaker and Language Recognition Workshop, 2004, pp. 41–44.
[32] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2002, pp. 161–164.
[33] W. M. Campbell, K. T. Assaleh, and C. C. Broun, "Speaker recognition with polynomial classifiers," IEEE Trans. Speech Audio Process., vol. 10, no. 4, pp. 205–212, Apr. 2002.
[34] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001, software [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[35] T. F. Wu, C. J. Lin, and R. C. Weng, "Probability estimates for multi-class classification by pairwise coupling," J. Machine Learning Res., vol. 5, pp. 975–1005, 2004.
[36] J. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," National Inst. Standards and Technology (NIST), 1988.
[37] O. Dehzangi, B. Ma, E. S. Cheng, and H. Li, "Discriminative output coding features for speech recognition," in Proc. Int. Symp. Chinese Spoken Language Processing, 2008.
[38] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17, no. 1–2, pp. 91–108, 1995.
[39] R. P. Ramachandran, K. R. Farrell, R. Ramachandran, and R. J. Mammone, "Speaker recognition-general classifier approaches and data fusion methods," Pattern Recognition, vol. 35, pp. 2801–2821, 2002.

Cemal Hanilçi received the B.Sc. and M.Sc. degrees from the Department of Electronic Engineering, Uludağ University, Bursa, Turkey, in 2005 and 2007, respectively. Currently, he is working toward the Ph.D. degree in the same department. His research interests include speaker recognition, speech signal processing, and voice biometrics.

Figen Ertaş received the B.Sc. and M.Sc. degrees from Uludağ University, Turkey, in 1985 and 1988, respectively, and the Ph.D. degree from Sussex University, U.K., in 1997, all in electronic engineering. Currently, she is an Assistant Professor in the Department of Electronic Engineering, Uludağ University, and her research interests include speaker recognition and forensic applications of speech signal processing.

Tuncay Ertaş received the B.Sc. and M.Sc. degrees from Dokuz Eylül University, Turkey, in 1986 and 1988, respectively, and the Ph.D. degree from Sussex University, U.K., in 1994, all in electronic engineering. He is currently with the Department of Electronic Engineering, Uludağ University, Turkey. His research interests include wireless communications and signal processing.

Ömer Eskidere received the B.Sc., M.Sc., and Ph.D. degrees from Uludağ University, Turkey, in 1997, 2000, and 2007, respectively, all from the Department of Electronic Engineering. He is currently with the Department of Mechatronics, Uludağ University, and his research interests include speaker recognition and forensic applications of speech signal processing.