Robust Audio-Fingerprinting With Spectral Entropy Signatures
Antonio Camarena-Ibarrola*, Student Member, IEEE, and Edgar Chavez, Member, IEEE
{camarena,elchavez}@umich.mx
Universidad Michoacana de San Nicolas de Hidalgo
Av. Francisco J. Mujica S/N Ciudad Universitaria CP 58000
Morelia, Michoacan, Mexico
Abstract
In this paper we propose a highly robust audio-fingerprint (AFP). We call this AFP the Spectral
Entropy Signature (SES). To extract the SES of a song, Shannon's entropy is determined from the
spectral coefficients of each of the first 24 critical bands of the Bark scale; the entropy values
are then binary coded to obtain a very compact AFP of only 0.13 kbps.
To put the SES in context, we compared it with a Spectral Flatness Signature (SFS) and a Time-domain
Entropy Signature (TES). The SES, TES and SFS were determined for every song in an assorted-genre
collection of 4,000 elements. Four hundred songs were severely degraded and searched for using excerpts
of five seconds. The SES showed higher robustness than both the TES and the SFS for the degradations
of white noise addition, equalization, lossy compression, re-recording in a noisy environment, low-pass
filtering, time-shifting and cropping.
EDICS: AUD-CONT
Index Terms
Audio-Fingerprint, Entropy, Music Information Retrieval.
I. INTRODUCTION
Audio-Fingerprints (AFPs) are essential characteristics of digital audio streams used to score the
perceptual similarity between audio signals. Ideally, an AFP should be an invariant of the signal, an
intrinsic characteristic found in the signal even if it has suffered severe degradations, as long as a human
August 14, 2007 DRAFT
being is still able to correctly identify the audio stream. The potential applications of a robust AFP cover
a wide spectrum; some of them are listed below:
1) Broadcast monitoring. The assessment of sponsorship effectiveness may be done by computers
equipped with multi-channel FM/TV cards [1].
2) Duplicate detection. Detecting duplicates is very important for maintaining the integrity of any
multimedia database.
3) Automatic labelling. Modern MP3 (MPEG-1/2 Layer 3) players provide the user with tools for
organizing songs. They rely on the contents of meta-data labels (e.g. album title); when these
labels are empty they can be automatically filled using fingerprinting techniques [2].
4) Querying by example. A song may be identified using a small excerpt of audio captured by a
mobile phone, as explained in [3].
5) Filtering in p2p networks. When music is transmitted in a peer-to-peer network, the audio-fingerprint
is determined from the packets and searched for in a list of copyrighted songs to prevent illegal
copies [4].
A. Characteristics of an AFP
To accomplish the tasks described above, an AFP should have the following properties:
1) Robustness. Audio signals may be subject to a variety of signal degradations such as noise contamination,
lossy compression, loudspeaker-to-microphone transmission (LsMic), low-pass filtering
simulating narrow-band telephone line transmission, equalization, cropping, time-shifting and
loudness variation. The AFP of a song should not be too different from the AFP of a degraded
version of the same song.
2) Compactness. Some applications need to store the AFP of every song from a possibly big collection;
other applications need to transmit the AFP over the internet. These facts make compactness a very
desirable characteristic of an AFP.
3) Granularity. Some Music Information Retrieval applications require the ability to identify a song
using only a small excerpt; for example, in querying by humming we do not want the user having
to hum the entire song he is searching for. Granularity is also known as robustness to cropping.
4) Time complexity. The AFP should be determined with as little computer effort as possible. The AFPs
of the whole collection of songs have to be determined in a reasonable time. Real-time systems
have to extract the AFP of a song on line; furthermore, in broadcast monitoring it is desirable to
be able to compute the AFP of several audio channels simultaneously.
5) Scalability. This is defined as the ability of an audio-fingerprinting system to operate with large databases;
this feature is conditioned by a low time complexity, a compact AFP size and a good indexing
technique.
B. Audio-fingerprint modelling
The first thing an audio-fingerprinting system has to do is to extract features from the signal. The
module in charge of extracting relevant perceptual features of the audio signal is known as the front end;
once this module delivers the features of the signal, the AFP system models the songs in a way that best
serves the purpose of the application for which it has been designed. Some AFP models are listed below:
• Sequences of feature vectors. This kind of AFP is also known as a trajectory or trace. The
features extracted at equally spaced periods of time are simply stored in a list of vectors or in a
table, one row per frame. An example of this kind of AFP is the binary vector sequence described
in [3].
• Statistics. Instead of storing every feature vector, only statistical data over the set of feature vectors
are stored. The audio-fingerprint designed for MPEG-7 [5] computes the means, variances, minimum
and maximum values every 32 frames. The minimum and maximum values are used for delimiting
the search, and the means and variances are used for the actual search using some measure like the
Mahalanobis distance.
• Codebooks. The sequence of feature vectors extracted from a song is replaced by a small number
of representative code vectors stored in a codebook, which from then on represents the song. This
model disregards the temporal evolution of the audio signal.
• Strings. Trajectories can be converted into long strings of integers using vector quantization. This
model allows the treatment of the songs as texts that can be compared using flexible string matching
techniques [6].
• Single vectors. These are the smallest AFPs; they are usually built with average features extracted
from the whole song. For example, an AFP can be a vector containing the beats per minute, the
average zero-crossing rate and the average spectrum [2].
• Hidden Markov Models (HMM). These finite state machines model non-stationary stochastic processes
(e.g. songs). For each song of the collection an HMM is built. The features extracted from the test
song are considered to be a sequence of acoustic events and then used as the input for the candidate's
HMM. The candidate's HMM in turn reports the probability that the test song matches the candidate
song; this probability is used as a proximity measure for choosing the right song [7].
C. Feature extraction
Audio-fingerprinting systems extract features from the signal, normally on a frame-by-frame basis. Most
systems extract the signal features in the frequency domain using a variety of linear transforms such as
the Discrete Cosine Transform, the Discrete Fourier Transform, the Modulation Frequency Transform [8]
and some Discrete Wavelet Transforms like Haar's and Walsh-Hadamard's [9].
Early work on audio-fingerprinting inherited the benefits of decades of research in speech processing.
Looking for more relevant features of music, a variety of perceptual variables have been used, such as
Loudness (PL) [10], the Joint Acoustic and Modulation Frequency (JAMF) [8], the Spectral Flatness
Measure (SFM) [11], the Spectral Crest Factor (SCF) [11], Spectral Subband Centroids (SSC) [12],
tonality [13], the sign of energy's second derivative [3] and chroma values [14], among others [15]. In
[12] it is shown that the Normalized SSC are more robust than MFCC and tonality for lossy compression
and equalization. In [8] it is reported that the Normalized JAMF has superior robustness over a spectral
estimate for compression and equalization. In [11] it is reported that the SFM has superior robustness over
PL and SCF as well. The SFM was adopted by MPEG-7 for audio-fingerprinting purposes [16]. We now
present a brief definition of the SFM due to the importance of this feature and because we included the
use of the SFM in our experiments as a reference.
The SFM is a feature related to the tonality aspect of the audio signal. The SFM is defined as the ratio
of the geometric mean to the arithmetic mean of the power spectrum coefficients. The SFM for band b
with bandwidth n_b can be computed using formula (1). The SFM reports values between zero and one;
values near one mean that the spectrum is flat and the audio is noisy, while values near zero show that
the audio signal is more tone-like.
SFM_b = [\prod_{i=1}^{n_b} c(i)]^{1/n_b} / [(1/n_b) \sum_{i=1}^{n_b} c(i)]   (1)
where c is the vector where the power spectrum coefficients are stored.
To put our work in context, we implemented an audio-fingerprint using the SFM as the relevant
perceptual feature; we will refer to it as the Spectral Flatness Signature (SFS). The SFM is computed
for each frame and band using the same resolution both in frequency and time as the Spectral Entropy
Signature (SES). This was done so that any possible improvement or decline in robustness could
not be attributed to anything else but the perceptual capabilities of the features put into comparison.
Some systems extract signal features directly in the time domain, as in [17], where the sign of the time
derivative of the signal was found to be robust to lossy compression and low-pass filtering. Another
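As an illustration (not taken from the paper), formula (1) can be sketched in a few lines of Python; the log-domain geometric mean and the small epsilon are our own numerical-stability choices:

```python
import numpy as np

def sfm(power_band):
    """Spectral Flatness Measure for one band, eq. (1):
    geometric mean over arithmetic mean of the power spectrum coefficients."""
    c = np.asarray(power_band, dtype=float)
    # Geometric mean computed in the log domain to avoid over/underflow
    # of the product; the epsilon guards against log(0).
    geo = np.exp(np.mean(np.log(c + 1e-12)))
    arith = np.mean(c)
    return geo / arith
```

A flat band (all coefficients equal) yields an SFM near one, while a band dominated by a single coefficient yields a value near zero, as the text describes.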
example is the signature presented in [18], which is thoroughly described next.
D. The Entropy of a Signal as the Relevant Perceptual Feature
Searching for features in audio signals that would still be present if those audio signals were severely
degraded, we decided to explore the use of entropy for audio-fingerprinting purposes. We began by using
the time-domain entropy as explained in [18]. For completeness, we include below a brief discussion of
entropy and some interesting properties.
The entropy of a signal is a measure of the amount of information the signal carries [19]. Shannon's
entropy is computed using (2), and its continuous version, called "differential entropy", is defined as in (3)
[19].
H(x) = -\sum_{i=1}^{n} p_i ln(p_i)   (2)

where p_i is the probability for any sample of the signal to adopt value i, n being the number of possible
values the samples may adopt; for example, if the sample size is 8 bits, then n = 2^8 = 256.

H(X) = -\int_{-\infty}^{+\infty} p(x) ln[p(x)] dx   (3)
The entropy of a signal is a measure of how unpredictable it is. If the signal is a constant k, then
its probability distribution function (PDF) is a unitary impulse located at k, that is p_i = δ(i − k), and its
entropy or unpredictability is zero, as shown in (4); observe that 0 ln(0) needs to be considered zero for
this to be true. In the opposite case, if the signal has a uniform distribution then the entropy is
maximum; that is, if p_i = 1/n for n possible values then its entropy is ln(n), as in (5).

H_min = -\sum_{i} δ(i − k) ln[δ(i − k)] = -ln(1) = 0   (4)

H_max = -\sum_{i=1}^{n} (1/n) ln(1/n) = -ln(1/n) = ln(n)   (5)
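A minimal Python sketch (ours, not from the paper) of estimating Shannon's entropy (2) from an empirical histogram of sample values; absent values are simply skipped, which realizes the 0 ln(0) = 0 convention:

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Shannon entropy (natural log) of a discrete signal,
    estimated from the empirical histogram of its sample values."""
    counts = Counter(samples)
    total = len(samples)
    h = 0.0
    for c in counts.values():
        p = c / total
        h -= p * math.log(p)  # values with zero count never appear, so 0*ln(0) is skipped
    return h
```

A constant signal gives entropy 0 as in (4), and a signal uniform over n values gives ln(n) as in (5).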
Entropy has been used on speech signals in noisy environments as a segmentation tool [20]. Also,
entropy has been used in choosing the desirable frame rate in the analysis of speech signals [21].
By processing an audio signal in frames of two seconds overlapped 50% and computing Shannon's
entropy for every frame, a sequence of entropy values is obtained; we will refer to this sequence as the
entropy curve. The entropy curves of several degradations of the song Diosa del cobre are shown in
figure 1. Please note how similar the entropy curves look between the original and the lossy compressed
(i.e. mp3@32kbps) version, the low-pass filtered (i.e. 1 KHz cutoff) version and the scaled (i.e. 50 percent
louder) version. The profile of these four entropy curves is almost identical, so we can safely use the sign
of the derivative to build a binary string that we call in this paper the Time-domain Entropy Signature
(TES). As reported in [18], the TES is not only extremely compact and easy to compute but turned out to
be very robust to the specific degradations of low-pass filtering, scaling and lossy compression. On the
other hand, the entropy curve is severely deformed when the song is degraded by equalization, noise
mixing and re-recording (i.e. loudspeaker-to-microphone transmission in a noisy environment). The fact
that the TES is not robust under equalization when compared to Haitsma's AFP [3] is acknowledged in
[18]; robustness under noise mixing and re-recording was not assessed in [18], but further experiments
conducted now and reported in this paper reveal the weakness of the TES for noise mixing and re-recording
in a noisy environment. To cope with these deformations, we combined the use of entropy with the
robust AFP design described in [3], only using spectral entropy instead of energy's second derivative. An
extremely robust AFP was obtained, which will be described next.
II. OUR CONTRIBUTION: THE SPECTRAL ENTROPY SIGNATURE (SES)
The human ear perceives the lower frequencies better than the higher ones. The Bark scale defines 25
critical bands, each of which corresponds to a section of the cochlea of about 1.3 mm [10]. Equation
(6) can be used to convert Hertz to Barks.
z = 13 tan^{-1}(0.76 f / 1000) + 3.5 tan^{-1}[(f / 7500)^2]   (6)

where f is the frequency in Hertz and z is the frequency in Barks.
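Equation (6) translates directly into code; as a sanity check, it maps 15.5 KHz to roughly Bark 24, consistent with discarding the 25th band later. A sketch (the function name is ours):

```python
import math

def hz_to_bark(f):
    """Critical-band rate in Barks for a frequency f in Hertz, eq. (6)."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)
```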
If the entropy of the spectral coefficients that correspond to a specific critical band is computed for
every frame of an audio signal, we obtain a sequence that we call the spectral entropy curve. Remember from
the preceding section how equalization deformed the entropy curve, making the TES practically unsuitable
for this kind of degradation. We found that this was not the case when the spectral entropy curves
were obtained for each critical band. To show this effect, we included figure 2, where we can see the
spectral entropy curves for critical bands 4, 8, 12, 16 and 20 of the song Diosa del cobre. The curves
at the left in figure 2 correspond to the original song, while the curves at the right correspond to the
equalized version. Remarkably, the spectral entropy curves seem almost unaffected by equalization. Not
Fig. 1. Entropy curves of several degraded versions of the song Diosa del cobre: original, equalized, low-pass filtered, noisy, lossy compressed, re-recorded and louder.
all 24 critical bands are shown, to avoid overcrowding; the other bands behave just alike. This early
experiment was quite encouraging for the design of an audio-fingerprint based on spectral entropy, the
Spectral Entropy Signature (SES).
A. Entropygram Determination
The first steps in the determination of the SES of a song are related to the determination of its
entropygram (defined below); these steps are:
1) Stereo audio signals are first converted to monaural by averaging both channels.
Fig. 2. Spectral entropy curves for critical bands 4, 8, 12, 16 and 20 (from the top down) according to the Bark scale. Not all
24 bands are shown so the figure is not overcrowded. Left: original. Right: equalized version.
2) The signal is processed in frames of 370 ms; this frame size ensures an adequate time support for
entropy computation. The frame sizes normally used in audio-fingerprinting range from 10 ms to
500 ms according to [15]. The frame size used in [3] is precisely 370 ms.
3) Our frames are overlapped fifty percent; therefore, 5.4 frames per second will be the frame rate for
the SES extraction. A low frame rate like this will result in a compact audio-fingerprint.
4) To each frame the Hann window is applied and then its DFT is determined.
5) Shannon's entropy is computed for the first 24 critical bands according to the Bark scale, discarding
only the 25th critical band (frequencies between 15.5 KHz and 20 KHz). For any given band b, the
elements of the DFT corresponding to b are used to build two histograms, one for the real parts
and another one for the imaginary parts of these elements. The histograms are used to estimate
the probability distribution functions. Shannon's entropy for the real and imaginary parts of the
DFT are computed separately; call them h_br and h_bi respectively. The entropy h_b for band b is
determined as the sum of h_br and h_bi.
For each frame of the audio signal a vector with 24 values of spectral entropy is obtained. The sequence
of vectors corresponding to a short excerpt of audio of a few seconds makes a matrix of 24 rows and
a number of columns that depends on the duration of the excerpt. Such a matrix can be shown as an
image where the horizontal axis represents time, the vertical axis represents frequency and the gray
levels represent the amount of information (i.e., entropy) for every band and frame. We call these images
entropygrams. Some entropygrams are shown in figure 7; we will refer to them later.
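The five steps above can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: the histogram bin count and the band_edges_hz parameter (the 25 Bark band edges in Hertz) are assumptions, since the text does not fix them here.

```python
import numpy as np

def band_entropy(dft_band, bins=32):
    """Step 5: histogram-estimated Shannon entropy of the real and
    imaginary parts of one band's DFT coefficients, summed."""
    h = 0.0
    for part in (np.real(dft_band), np.imag(dft_band)):
        counts, _ = np.histogram(part, bins=bins)
        p = counts[counts > 0] / counts.sum()
        h -= np.sum(p * np.log(p))
    return h

def entropygram(signal, rate, band_edges_hz, frame_len=None):
    """Steps 2-5: frames of 370 ms, 50% overlap, Hann window, DFT,
    then one entropy value per critical band and frame (24 rows)."""
    n = frame_len or int(0.370 * rate)
    hop = n // 2
    freqs = np.fft.rfftfreq(n, d=1.0 / rate)
    cols = []
    for start in range(0, len(signal) - n + 1, hop):
        frame = signal[start:start + n] * np.hanning(n)
        spec = np.fft.rfft(frame)
        col = [band_entropy(spec[(freqs >= lo) & (freqs < hi)])
               for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])]
        cols.append(col)
    return np.array(cols).T  # bands x frames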
B. The codification step
We used figure 2 to show how the spectral entropy curve of any specific critical band is practically not
deformed when a song is equalized; the profile of the curve remains almost unchanged. There is, however,
a vertical shift; therefore, just as we did for the TES, we keep for each frame only an indication of whether
the spectral entropy is increasing or not for each band. Equation (7) states how the bit corresponding
to band b and frame n of the AFP is determined using the entropy values of frames n and n − 1 for
band b. Only 3 bytes (i.e. 24 bits) are needed for each frame of audio signal; that was another reason for
dropping the 25th critical band. A diagram of the process of determining the SES, including the steps for
the entropygram determination and the codification step, is depicted in figure 3.
F(n, b) = { 1  if [h_b(n) − h_b(n − 1)] > 0
          { 0  otherwise                       (7)
Since the SES of a song is a binary matrix, it can be shown as a black-and-white image like the one
shown in figure 4, where a piece of the AFP of Diosa del cobre is shown. In the same figure a 5
second excerpt is magnified.
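A sketch of the codification step under our own naming: ses_bits implements eq. (7) column-wise over a bands-by-frames entropy matrix, and pack_frames stores each 24-bit frame in 3 bytes as the text describes (24 bits at 5.4 frames/s is roughly 0.13 kbps):

```python
import numpy as np

def ses_bits(entropygram):
    """Eq. (7): one bit per band and frame, set where the band's
    spectral entropy increased with respect to the previous frame."""
    return (np.diff(entropygram, axis=1) > 0).astype(np.uint8)

def pack_frames(bits):
    """Pack each 24-bit frame column into 3 bytes."""
    return [np.packbits(bits[:, j]).tobytes() for j in range(bits.shape[1])]
```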
III. EXPERIMENTS
The experiments on robustness were carried out taking into consideration the following deformations:
1) Cropping. The songs will be identified using excerpts of only 5 seconds.
Fig. 3. Information content analysis and coding for SES extraction: framing, FFT, band division, per-band entropy computation (real and imaginary parts summed into h_0(n) through h_23(n)) and codification of the entropy differences into the bits F(n, 0) through F(n, 23).
Fig. 4. A fragment of the AFP of the song Diosa del cobre. A 5 sec excerpt is magnified
2) Desynchronization. Since the excerpts used to look for a song may begin at any instant, to reproduce
a real scenario no frame from the excerpt will correspond to exactly the same period of time as
any frame from the original song. This deformation is also known as time-shifting.
3) Lossy compression to MP3 using a bit rate of only 32 kbps; this is frequently represented as
mp3@32kbps. This particular degradation also introduces a time-shift.
4) Equalization according to the diagrams shown in figure 5; these are common equalization styles
from [22].
5) Mixing with white noise; this kind of noise contaminates all bands in the same way. Once mixed
with white noise the songs have a Signal-to-Noise Ratio (SNR) between 3 dB and 5 dB. The SNR
is computed using equation (8), where P_signal is the power of the original signal and P_noise is the
power of the noise added to the signal.
(a) 1965 (b) Classic V (c) Louder (d) Pop (e) Soft Bass
Fig. 5. Equalization styles used. Eighteen bars spread from the lowest band (i.e. leftmost) at 55 Hz to the highest band at 20
KHz. A bar above the horizontal axis indicates the amplification of its corresponding band. A bar below the horizontal axis
indicates the attenuation of its corresponding band.
SNR = 20 log_{10}(P_signal / P_noise)   (8)
Noise produced by big fans falls in the range that is referred to as white noise. The noise we used to
contaminate songs can be accessed under the name turbofan-hifi.wav at:
http://www.asti-usa.com/skinny/sampler.html. Please note that colored noise is
not as severe since it will affect only some bands of the signal.
6) Low-pass filtering with a cutoff frequency of 1 KHz. A second-order Butterworth filter with -20
dB/decade was used, meaning that the signal's amplitude has declined to a tenth at a frequency
that is ten times the cutoff frequency.
7) Loudspeaker-Microphone transmission (Ls-Mic). This degradation consisted of playing the music
with the pair of loudspeakers of a multimedia system and recapturing it with an omnidirectional
microphone with a sensitivity of -54±3 dB and a frequency response of 50 Hz to 16 KHz, in a noisy
environment.
8) Scaling. The signal was amplified 50 percent without clipping prevention; in fact, approximately 30
percent of the signal's peaks were clipped during this degradation.
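The noise-mixing degradation (5) can be sketched as follows; the gain solves eq. (8) as printed, with its 20 log10 convention, for a target SNR, and the function name is our own:

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale the noise and add it to the signal so the mixture has the
    requested SNR under eq. (8): SNR = 20*log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 20*log10(p_signal / (g^2 * p_noise)) = snr_db for the gain g.
    g = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 20.0)))
    return signal + g * noise[:len(signal)]
```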
A. Experiment 1. Using whole songs
Since the TES is a signature designed for whole songs, it was only fair to include it exclusively in
experiments where whole songs were compared. The first experiment carried out is described as follows:
1) The audio-fingerprints (i.e. SES, TES and SFS) of 4,000 songs from all kinds of genres (rock, pop,
classical, etc.) were extracted.
2) Four hundred songs (i.e. ten percent) were subjected to the six signal degradations numbered (3) to
(8) at the beginning of this section.
3) The signatures of the 2,400 audio files obtained in the previous step were also extracted.
4) The audio-fingerprints of the degraded songs were searched among the audio-fingerprints of the
collection of 4,000 using the nearest-neighbor criterion.
As a distance measure between the SES of two songs the Hamming distance was used. The Hamming
distance was also used when comparing the TES of two songs. Finally, the Euclidean distance was used
when comparing the SFS of two songs.
Table I shows the precision rate for the TES, SES and SFS that resulted from this experiment. The
precision rate is defined as the fraction of the correctly identified songs (i.e. true positives) over the number
of queries performed (i.e. true positives plus false positives) [23]. In this first experiment the SES showed high
robustness to every considered degradation, which was very encouraging. The SFS showed high robustness to
equalization, lossy compression and scaling. Finally, the TES showed high robustness to low-pass filtering,
lossy compression and scaling; its robustness to re-recording is also acceptable.
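For two equal-sized binary matrices, the normalized Hamming distance used throughout these experiments can be sketched as (our own helper):

```python
import numpy as np

def normalized_hamming(a, b):
    """Fraction of differing bits between two equal-sized binary matrices."""
    a, b = np.asarray(a), np.asarray(b)
    return np.count_nonzero(a != b) / a.size
```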
TABLE I
PRECISION RATE FOR DIFFERENT SIGNAL DEGRADATIONS USING TES, SES AND SFS WITHOUT CROPPING (WHOLE
SONGS)
Degradation                          TES      SES     SFS
Equalization                         53.7 %   100 %   100 %
Noise contamination (SNR ≈ 4 dB)     63.2 %   100 %   55.3 %
Re-recording (LsMic)                 92.1 %   100 %   80 %
Low-Pass filtering (1 KHz)           100 %    100 %   72.1 %
Lossy Compression (32 kbps)          100 %    100 %   100 %
Scaling (50 percent louder)          100 %    100 %   100 %
B. Experiment 2. Using small excerpts
Experiment 1 was done using whole songs; however, for some applications it is important to recognize
a song using only a small excerpt of a few seconds. To verify the robustness to the degradations
considered in experiment 1 combined with cropping and desynchronization at the same time, the following
experiment was carried out:
1) The signatures (i.e. SES and SFS) of 4,000 songs from all kinds of genres (rock, pop, tropical,
classical, etc.) were extracted and stored.
2) 400 of these songs were degraded in six different ways: lossy compression, equalization, mixing
with noise, low-pass filtering, scaling and finally loudspeaker-microphone transmission in a noisy
environment.
3) From each of the 2,400 audio files (including originals) obtained in the previous step an excerpt of
5 seconds was extracted; therefore all those degradations were combined with cropping and, at the
same time, with desynchronization.
4) The short signatures of the 2,400 excerpts that resulted from the previous step were determined.
In figure 7 the entropygrams of an excerpt of the song Diosa del cobre corresponding to the seven
degraded versions considered are shown; their corresponding signatures are shown in figure 8.
5) All the short signatures determined in the previous step were searched inside every whole song's
signature from the collection of 4,000 determined in the first step using the nearest-neighbor
criterion. For example, the nearest SES signature to those shown in figure 8 was found inside the
piece of the whole song's SES signature that is magnified in figure 4 and shown again in figure 6.
Fig. 6. The piece of SES magnified in figure 4, at the same size as that used in figures 8 and 9
The Hamming distance is used to establish how different the SES of two excerpts are from each
other. The Hamming distance between two binary matrices can be conceived as a measure of the fullness
of the matrix that results from computing the absolute difference between them. In figure 9 we show
the differences found between the degraded versions of an excerpt of the song Diosa del cobre and the
nearest neighbor, which was indeed found inside that song, precisely the one shown in figure 6.
Not even the excerpts extracted from the original songs were found without errors. To understand this,
consider that the probability for the first frame of the randomly selected excerpts to be aligned with
any frame of the song is very small, so the experiment is reproducing a real scenario; this effect is
known as desynchronization or time-shift. In figure 10, the distance between an excerpt of a song and
the most similar (i.e. closest in Hamming distance) segment of audio inside the same song is plotted as a
function of the time-shift. To generate this curve a song at 44100 samples per second was used; therefore
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 7. Entropygrams of the same excerpt of five seconds from several degraded versions of the song Diosa del cobre
a frame of 0.37 sec is made of 16384 samples. The first excerpt was extracted beginning at a position
that was a multiple of the frame size (i.e. zero time-shift); this excerpt was of course found inside the
song without errors (i.e. zero distance) and corresponds to the first point of the curve. The second excerpt
was extracted beginning 100 samples (i.e. 2.2 ms) after the first excerpt; the most similar piece of audio
inside the song was found at a normalized Hamming distance of 0.013, and it corresponds to the second
point of the curve. The third excerpt was extracted beginning 100 samples after the second excerpt, and
so on. Since the frames are overlapped fifty percent, a distance of zero is found again at a time-shift of
185 ms (i.e. half the frame size). It is clear from figure 10 that increasing the overlap will result in a
more robust audio-fingerprint; if, for example, the overlap is increased to 90%, the normalized Hamming
distance between a random excerpt extracted from a song and the most similar segment of audio inside
the same song could not be greater than 0.1 instead of 0.2, which is the maximum distance observed in
figure 10. Of course, a price in processing time and disk space would have to be paid if an increase in
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 8. Signatures of excerpts of five seconds of the degraded versions of the song Diosa del cobre.
robustness is desired.
The SES of a song is a binary matrix with a number of rows that depends on the duration of the
song and a fixed number of columns (e.g. 1501 × 24 for a song of 4 minutes and 39 seconds). On the
other hand, the SES of an excerpt is a binary matrix with a fixed size (i.e. 24 × 24 in our experiment)
as long as the duration of the excerpts does not change. Using the Hamming distance, we compared the
short binary matrix of the excerpt with every possible submatrix of the same size from every signature
of the collection in order to find the song to which the five-second excerpt belongs. For example,
the nearest SES to those shown in figure 8 was found inside the piece of the whole song's SES that
is magnified in figure 4. The brute-force search procedure took approximately 50 seconds to answer a
query on a 2.8 GHz Pentium 4 PC with 512 Mbytes of RAM. The search time was reduced to about
20 seconds using the following strategy: instead of finishing the computation of the Hamming distance
between any submatrix and the SES of the query excerpt just to find out that they are too different, skip
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 9. Absolute differences; the Hamming distance as an indicator of fullness.
to the next submatrix as soon as the normalized Hamming distance between the first columns of the
submatrix and the first columns of the SES of the query excerpt is higher than 0.3.
In table II the rates of correctly identified songs using SES and SFS are shown for the signal
degradations considered. The SES shows higher robustness to noise addition, re-recording and low-pass
filtering.
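The early-abandoning search described above can be sketched as follows. This is our own reconstruction; the 0.3 threshold comes from the text, but how many "first columns" to test before skipping is not fixed there, so prefix_cols is an assumption:

```python
import numpy as np

def find_best_offset(song_ses, query_ses, prefix_cols=4, threshold=0.3):
    """Slide the query SES over the whole-song SES and return
    (best_offset, best_normalized_hamming). Candidate submatrices whose
    first prefix_cols columns already differ by more than threshold are
    skipped without computing the full distance (early abandoning)."""
    bands, q_frames = query_ses.shape
    best = (None, 1.0)
    for off in range(song_ses.shape[1] - q_frames + 1):
        window = song_ses[:, off:off + q_frames]
        prefix_d = np.count_nonzero(
            window[:, :prefix_cols] != query_ses[:, :prefix_cols]
        ) / (bands * prefix_cols)
        if prefix_d > threshold:
            continue  # already too different; skip this submatrix
        d = np.count_nonzero(window != query_ses) / query_ses.size
        if d < best[1]:
            best = (off, d)
    return best
```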
C. Experiment 3.
Since experiments 1 and 2 were not able to find any weakness in the SES, we designed a third
experiment where degraded songs are not only compared with originals but with other degraded versions
as well; for example, the equalized version of a song can be compared with its noisy version. In the
problem of querying by example the degraded versions are always compared with originals; however,
this is not the case for other applications. For example, in radio broadcast monitoring the audio signal
that is going to be used as the reference for a specific commercial spot is normally captured in the same
Fig. 10. Normalized Hamming distance between an excerpt of a song and the most similar segment of audio inside the same song as a function of
the time-shift
TABLE II
PRECISION RATE FOR DIFFERENT SIGNAL DEGRADATIONS. EXPERIMENT 2 (USING EXCERPTS OF 5 SECONDS)
Degradation                                      SES     SFS
Cropping and time-shift                          100 %   100 %
Equalization, cropping and time-shift            100 %   100 %
Noise contamination, cropping and time-shift     100 %   63 %
Re-recording in noisy environment,
cropping and time-shift                          100 %   75 %
Low-Pass filtering, cropping and time-shift      100 %   82 %
Lossy Compression, cropping and time-shift       100 %   100 %
Scaling, cropping and time-shift                 100 %   100 %
way as the audio signal to be monitored. As another example, consider a p2p application that is always
looking in the network for audio files with better quality than the ones in the local host; this application
would be comparing all kinds of degraded versions, including those obtained from old tapes.
We used thirty-eight songs for this experiment, each one in six degraded versions, making a total of
228 audio files. Each audio file were compared with every otherone to fill up theconfusion matrixof
228 rows and 228 columns. The5 1984 locations of the confusion matrix correspond to the same number
of distances between audio-signatures that had to be computed. The confusion matrix that results from
this comparisons using SES is shown as an image in figure 11. Every pixel of figure 11 has a gray level
according to the distance it represents (darker means closer). The first row of pixels represents the set
of distances between the first audio file and the rest of them, the second row of pixels represents the
distances between the second audio file and all of the others and so on. The 228 pixels along the diagonal
ar all black because the distance between any audio file and itself is always zero. The symmetry of the
Hamming distance can also be appreciated in the confusion matrix. The names of the audio files have
a prefix according to the song’s name they belong to and a suffixthat denotes the kind of degradation
the song suffered. The audio files are maintained in alphabetical order according to its name, for this
reason, the ideal confusion matrix would be all white with 38black squares along the main diagonal,
each of which would be of size6 × 6 (i.e six degradations). The confusion matrix that results from this
comparisons using SFS is shown in figure 12. The confusion matrix that results from this comparisons
using TES is shown in figure 13. The expected6 × 6 black squares along the diagonal are not as well
defined in 12 or in 13 as they are in figure 11. Figure 11 even resemble us the ideal confusion matrix
described above, this fact reveals SES as a more robust audiofingerprint than SFS or TES.
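The pairwise comparison above can be sketched in a few lines: fill a symmetric matrix of normalized Hamming distances between binary signatures and check the properties visible in figure 11 (zero diagonal, symmetry). The randomly generated stand-in signatures and all variable names below are ours, for illustration only; they are not the actual fingerprints from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 228 binary audio signatures: each row is a
# song excerpt's sequence of 24-bit frames flattened into one bit vector.
n_files, n_bits = 228, 24 * 27  # 27 frames of 24 bits each (illustrative sizes)
signatures = rng.integers(0, 2, size=(n_files, n_bits), dtype=np.uint8)

def normalized_hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of differing bits between two equal-length bit vectors."""
    return np.count_nonzero(a != b) / a.size

# Fill the 228 x 228 confusion matrix of pairwise distances (51,984 entries).
confusion = np.empty((n_files, n_files))
for i in range(n_files):
    for j in range(n_files):
        confusion[i, j] = normalized_hamming(signatures[i], signatures[j])

# Properties visible in Fig. 11: black (zero) diagonal and symmetry.
assert np.all(np.diag(confusion) == 0)
assert np.allclose(confusion, confusion.T)
```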
Fig. 11. Confusion matrix resulting from Experiment 3 using SES. [Image omitted: 228 × 228 gray-level matrix.]
Fig. 12. Confusion matrix resulting from Experiment 3 using SFS. [Image omitted: 228 × 228 gray-level matrix.]
Fig. 13. Confusion matrix resulting from Experiment 3 using TES. [Image omitted: 228 × 228 gray-level matrix.]
D. Sensitivity Analysis and Optimal Threshold Selection for Experiment 3
When two degraded versions of the same song have a distance below some threshold th, we say that we
are in the presence of a true positive; if those two degraded versions have a distance above th,
we are dealing with a false negative. On the other hand, when comparing two different songs, if
the distance between them falls below th we call that a false positive, and if the distance is greater
than th, it is a true negative. Table III summarizes these definitions.
TABLE III
DEFINITIONS FOR THE SENSITIVITY ANALYSIS

                    dist < th             dist > th
Same songs          True Positive (TP)    False Negative (FN)
Different songs     False Positive (FP)   True Negative (TN)
The True Prediction Rate (TPR) is the fraction of songs the system correctly identifies (i.e., true
positives) out of all the songs the system should have identified. The TPR is also known as sensitivity or
recall, and it is estimated with (9). The TPR equals 1 − FRR, where FRR is the well-known False Rejection
Rate.
The False Prediction Rate (FPR) is a measure of how often the system mistakes a song for another,
and it is defined as in (10). The FPR is also known as the False Alarm Rate and equals 1 − specificity.

TPR = TP / (TP + FN)    (9)

FPR = FP / (FP + TN)    (10)
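Equations (9) and (10) can be applied to a distance matrix directly, counting the four categories of Table III at a given threshold. The sketch below uses toy distance values and same-song labels of our own making, not data from the experiment.

```python
import numpy as np

def tpr_fpr(dist: np.ndarray, same: np.ndarray, th: float) -> tuple[float, float]:
    """Eqs. (9)-(10): TPR = TP/(TP+FN), FPR = FP/(FP+TN).

    dist: pairwise distance matrix; same: boolean matrix, True where the
    two files are versions of the same song (Table III definitions).
    """
    below = dist < th
    tp = np.count_nonzero(same & below)    # same song, close     -> TP
    fn = np.count_nonzero(same & ~below)   # same song, far       -> FN
    fp = np.count_nonzero(~same & below)   # different, close     -> FP
    tn = np.count_nonzero(~same & ~below)  # different, far       -> TN
    return tp / (tp + fn), fp / (fp + tn)

# Toy example: 4 files, files 0-1 share one song and files 2-3 share another.
dist = np.array([[0.0, 0.1, 0.6, 0.7],
                 [0.1, 0.0, 0.5, 0.6],
                 [0.6, 0.5, 0.0, 0.2],
                 [0.7, 0.6, 0.2, 0.0]])
same = np.array([[True,  True,  False, False],
                 [True,  True,  False, False],
                 [False, False, True,  True],
                 [False, False, True,  True]])

tpr, fpr = tpr_fpr(dist, same, th=0.3)
# Every same-song pair falls below 0.3 and no cross-song pair does:
assert (tpr, fpr) == (1.0, 0.0)
```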
The ROC space is the plane whose vertical axis is the TPR and whose horizontal axis is the FPR; a
single point in this plane gives the performance of the system for a given threshold. By varying the threshold
over all its possible values a ROC curve is generated. Figure 14 shows the ROC curves for the systems that use
SES, TES and SFS; there we can clearly see that the area under the ROC curve for the SES is
greater than the area under the ROC curve for the SFS or the TES.
The analysis used to generate the ROC curves also yields the optimal threshold for each system:
the threshold corresponding to the point of the curve that is closest to the upper-left corner. Using the optimal
threshold, the precision rates for all possible combinations of degradations (e.g., low-pass filtered against
equalized) when the SES was used are shown in Table IV. Table V shows the precision rates when the SFS was
used, and Table VI those when the TES was used.
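The threshold sweep and the closest-to-corner criterion can be sketched as follows, reusing the toy distance matrix idea from above (the data and the threshold grid are illustrative, not the experiment's):

```python
import numpy as np

def roc_points(dist, same, thresholds):
    """Trace the ROC curve as (FPR, TPR) points, one per threshold."""
    pts = []
    for th in thresholds:
        below = dist < th
        tp = np.count_nonzero(same & below)
        fn = np.count_nonzero(same & ~below)
        fp = np.count_nonzero(~same & below)
        tn = np.count_nonzero(~same & ~below)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

def optimal_threshold(pts, thresholds):
    """Threshold whose ROC point is closest to the ideal corner (FPR=0, TPR=1)."""
    d = np.hypot(pts[:, 0], 1.0 - pts[:, 1])
    return thresholds[int(np.argmin(d))]

# Toy data: files 0-1 share one song, files 2-3 share another.
dist = np.array([[0.0, 0.1, 0.6, 0.7],
                 [0.1, 0.0, 0.5, 0.6],
                 [0.6, 0.5, 0.0, 0.2],
                 [0.7, 0.6, 0.2, 0.0]])
same = np.array([[True,  True,  False, False],
                 [True,  True,  False, False],
                 [False, False, True,  True],
                 [False, False, True,  True]])

thresholds = np.linspace(0.05, 0.95, 19)
pts = roc_points(dist, same, thresholds)
opt = optimal_threshold(pts, thresholds)
# The first threshold that separates same-song from cross-song distances
# perfectly (point (0, 1)) is selected:
assert abs(opt - 0.25) < 1e-6
```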
Segments from several degraded versions of one of the songs in the test set are available at
http://lc.fie.umich.mx/~camarena/Audiofiles.html
Fig. 14. ROC curves for Experiment 3. [Plot omitted: True Prediction Rate vs. False Prediction Rate for the SES, TES and SFS systems.]
TABLE IV
PRECISION RATE OBTAINED USING SES AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original   100 %     100 %   100 %   100 %   100 %
LowPass              100 %   100 %    97 %   100 %
EQ                           100 %    97 %   100 %
Loud                                 100 %   100 %
Noisy                                         95 %
IV. CONCLUSIONS
1) Regarding robustness. The spectral entropy signature proposed in this paper has proved to be highly
robust to heavy degradations of the audio signals. The SES turned out to be a more robust AFP
than the TES, specifically for equalization, noise contamination, and loudspeaker-to-microphone
transmission in a noisy environment (LsMic). The SES turned out to be more robust than the SFS
specifically for noise-contaminated songs, LsMic, and low-pass filtered songs. All the five-second
excerpts were correctly identified using the SES no matter what kind of degradation the audio signal
was subject to. These results do not contradict those reported in [11], since the level of degradation to
which the songs were subject in our experiments was higher; for example, in [11] the songs
were contaminated with noise only down to a “reasonable SNR of 20-25 dB simulating background
noise”, whereas we mixed in noise down to an SNR of 4-5 dB. Please note that an SNR of 0 dB means
that noise and music have the same intensity level, which would make it difficult even for the human
auditory system to identify the songs.

TABLE V
PRECISION RATE OBTAINED USING SFS AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original    97 %     100 %   100 %    77 %    95 %
LowPass              100 %   100 %    71 %    90 %
EQ                           100 %    76 %    87 %
Loud                                  77 %    95 %
Noisy                                         74 %

TABLE VI
PRECISION RATE OBTAINED USING TES AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original    87 %      52 %   100 %    74 %    85 %
LowPass               52 %    87 %    61 %    74 %
EQ                            42 %    26 %    42 %
Loud                                  74 %    55 %
Noisy                                         37 %
It is very interesting that the SES of the low-pass filtered songs did not change significantly even
though only 8 of the 24 considered critical bands fall below the cutoff frequency of 1 kHz. To
understand this effect, remember that a low-pass filter attenuates the content of the signal above the
cutoff frequency gradually as the frequency increases (the Butterworth filter that we used attenuates
the signal at a rate of -20 dB/decade); also, the entropy value depends on the distribution of
the spectrum within each considered band, regardless of its amplitude. Observe how the absolute
differences shown in figure 9(c) are not relevant below the first 20 bands; only the last 4 bands (at
the top) seem affected.
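The amplitude-invariance argument can be verified directly: Shannon's entropy of a band depends only on the normalized shape of its spectrum, so uniformly attenuating a band, as a gentle low-pass roll-off nearly does within a narrow critical band, leaves its entropy unchanged. The synthetic magnitudes below are for illustration only.

```python
import numpy as np

def band_entropy(spectrum: np.ndarray) -> float:
    """Shannon entropy (bits) of a band's spectral distribution."""
    p = spectrum / spectrum.sum()          # normalize magnitudes to sum to 1
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
band = rng.random(32) + 1e-6               # synthetic magnitudes in one Bark band

# Rescaling the whole band changes its energy but not its distribution,
# so the entropy, and hence the SES bit derived from it, is unchanged.
assert np.isclose(band_entropy(band), band_entropy(0.01 * band))

# Sanity check: a flat band of n bins has entropy log2(n) bits.
assert band_entropy(np.ones(8)) == 3.0
```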
2) Regarding Compactness. Twenty-four bits every 185 ms is a very compact fingerprint of only 0.13
kbit/s. As a reference, Haitsma-Kalker's AFP requires 2.6 kbit/s [3]. Of course, the resolution can
be tuned (i.e., the size of the sliding window and the overlap percentage) depending on the application.
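The bitrate figure follows from simple arithmetic on the frame parameters stated above:

```python
bits_per_frame = 24      # one entropy bit per Bark critical band
frame_period_s = 0.185   # a new 24-bit frame every 185 ms

bitrate_kbps = bits_per_frame / frame_period_s / 1000.0
assert round(bitrate_kbps, 2) == 0.13   # about 20x more compact than 2.6 kbit/s [3]
```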
3) Regarding Time complexity. The time it takes to determine the SES of a song is approximately 7
percent of the duration of the song on a Pentium 4 personal computer at 2.8 GHz with 512 MB
of RAM. This parameter is important for real-time applications. Again, the higher the resolution
adopted, the longer it takes to determine the SES of a song.
4) Regarding Granularity. In this work we show the results of experiments where excerpts of five
seconds were used to identify a song. We experimented with excerpts of 10, 15, and 20
seconds as well. An elementary observation from those experiments is that the shorter the excerpt,
the higher the resolution (i.e., the greater the overlap percentage) required to identify a song.
5) Regarding Scalability. The first row of Table IV is the accuracy rate of searching for original songs
among 228 audio files. In Experiment 2, songs were searched inside a collection of 4,000 of them;
no decrease in the precision rate of the SES is observed as the database size grows.
Metric indexes, as surveyed in [24], could be used to speed up searches.
REFERENCES
[1] S. Shin, O. Kim, J. Kim, and J. Choil, “A robust audio watermarking algorithm using pitch scaling,” in 14th International
Conference on Digital Signal Processing, vol. 2, 2002, pp. 701-704.
[2] (2002) Musicbrainz trm musicbrainz-1.1.0.tar.gz. [Online]. Available: ftp://ftp.musicbrainz.org/pub/musicbrainz/
[3] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in International Symposium on Music Information
Retrieval (ISMIR), 2002.
[4] P. Shrestha and T. Kalker, “Audio fingerprinting in peer-to-peer networks,” in 5th International Conference on Music
Information Retrieval (ISMIR), 2004.
[5] O. Hellmuth, E. Allamanche, M. Cremer, T. Kastner, C. Neubauer, S. Schmidt, and F. Siebenhaar, “Content-based broadcast
monitoring using mpeg-7 audio fingerprints,” in International Symposium on Music Information Retrieval (ISMIR), 2001.
[6] A. Y. Guo and S. Hava, “Time-warped longest common subsequence algorithm for music retrieval,” in 5th International
Conference on Music Information Retrieval (ISMIR), 2004.
[7] E. Batlle, J. Masip, and E. Guaus, “Amadeus: a scalable hmm-based audio information retrieval system,” in First
International Symposium on Control, Communications and Signal Processing, March 2004, pp. 731-734.
[8] S. Sukittanon and E. Atlas, “Modulation frequency features for audio fingerprinting,” in International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, University of Washington, USA, 2002, pp. II 1773-1776.
[9] S. Subramanya, R. Simha, B. Narahari, and A. Youssef, “Transform-based indexing of audio data for multimedia databases,”
in International Conference on Multimedia Applications, 1999.
[10] E. Zwicker and H. Fastl, Psycho-Acoustics. Facts and Models. Springer, 1990.
[11] J. Herre, E. Allamanche, and O. Hellmuth, “Robust matching of audio signals using spectral flatness features,” IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127-130, 2001.
[12] J. S. Seo, M. Jin, S. Lee, D. Jang, S. Lee, and C. D. Yoo, “Audio fingerprinting based on normalized spectral subband
centroids,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[13] R. P. Hellman, “Asymmetry of masking between noise and tone,” Perception and Psychophysics, vol. 11, pp. 241-246,
1972.
[14] S. Pauws, “Musical key extraction from audio,” in International Symposium on Music Information Retrieval (ISMIR), 2004.
[15] P. Cano, E. Battle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” Multimedia Signal
Processing, IEEE Workshop on, pp. 169-167, December 2002.
[16] M. A. Group, Text of ISO/IEC Final Draft International Standard 15938-4 Information Technology - Multimedia Content
Description Interface - Part 4: Audio, July 2001.
[17] F. Kurth and R. Scherzer, “A unified approach to content-based and fault tolerant music recognition,” in 114th AES
Convention, Amsterdam, NL, 2003.
[18] A. C. Ibarrola and E. Chavez, “A robust entropy-based audio-fingerprint,” in IEEE International Conference on Multimedia
and Expo (ICME 2006), July 2006, pp. 1729-1732.
[19] C. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois Press, 1949.
[20] J.-L. Shen, J.-W. Hung, and L.-S. Lee, “Robust entropy-based endpoint detection for speech recognition in noisy
environments,” in International Conference on Spoken Language Processing, Dec 1998.
[21] H. You, Q. Zhu, and A. Alwan, “Entropy-based variable frame rate analysis of speech signal and its applications to asr,”
in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2004.
[22] (2006) Foobar2000 equalizer presets eqpresets.zip. [Online]. Available: http://sjeng.org/ftp/fb2k/eq presets.zip
[23] T. Fawcett, “Roc graphs: Notes and practical considerations for researchers,” HP Labs, Tech. Rep. HPL-2003-4, 2003.
[24] E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, “Proximity searching in metric spaces,” ACM Computing
Surveys, vol. 33(3), pp. 273-321, 2001.
Antonio Camarena-Ibarrola received the B.Eng. degree in Electrical Engineering in 1987 from the
Electrical Engineering School at the Universidad Michoacana de San Nicolas de Hidalgo in Mexico. He
received an M.S. degree in Computer Science in 1996 from the Instituto Tecnologico de Toluca in Mexico.
His research interests are in pattern recognition and signal processing.
He has been a teacher at the Electrical Engineering School since 1988 and is currently working
towards his Ph.D. degree.
Edgar Chavez received an M.S. degree in Computer Science from the Universidad Nacional Autonoma de
Mexico, and a Ph.D. in Computer Science from the Centro de Investigacion en Matematicas, Mexico. He is
currently a full professor at the Universidad Michoacana, a fellow of the Mexican Research System (SNI), and
president of the Mexican Computer Science Society. He has published over 50 papers in international
conferences, journals, and book chapters. He has been general chair of SPIRE 1999, ENC 2004, and LATIN
2001, and co-chair of the technical program of CPM 2003, AWIC 2004, and AdHocNow! 2005. His
research interests include pattern recognition, algorithms, and information retrieval.