Robust Audio-Fingerprinting With Spectral Entropy Signatures
Antonio Camarena-Ibarrola*, Student Member, IEEE, and Edgar Chavez, Member, IEEE
{camarena,elchavez}@umich.mx
Universidad Michoacana de San Nicolas de Hidalgo
Av. Francisco J. Mujica S/N Ciudad Universitaria CP 58000
Morelia, Michoacan, Mexico
Abstract
In this paper we propose a highly robust audio-fingerprint (AFP). We call this AFP the Spectral
Entropy Signature (SES). To extract the SES of a song, Shannon's entropy is determined from the
spectral coefficients of each of the first 24 critical bands of the Bark scale; the entropy values
are then binary coded to obtain a very compact AFP of only 0.13 kbps.
To put the SES in context, we compared it with a Spectral Flatness Signature (SFS) and a Time-domain
Entropy Signature (TES). The SES, TES and SFS were determined for every song in an assorted-genre
collection of 4,000 elements. Four hundred songs were severely degraded and searched for using excerpts
of five seconds. The SES showed higher robustness than both the TES and the SFS for the degradations
of white noise addition, equalization, lossy compression, re-recording in a noisy environment, low-pass
filtering, time-shifting and cropping.
EDICS: AUD-CONT
Index Terms
Audio-Fingerprint, Entropy, Music Information Retrieval.
I. INTRODUCTION
Audio-Fingerprints (AFPs) are essential characteristics of digital audio streams used to score the
perceptual similarity between audio signals. Ideally, an AFP should be an invariant of the signal, an
intrinsic characteristic found in the signal even if it has suffered severe degradations, as long as a human
August 14, 2007 DRAFT
being is still able to correctly identify the audio stream. The potential applications of a robust AFP cover
a wide spectrum; some of them are listed below:
1) Broadcast monitoring. The assessment of sponsorship effectiveness may be done by computers
equipped with multi-channel FM/TV cards [1].
2) Duplicate detection. Detecting duplicates is very important for maintaining the integrity of any
multimedia database.
3) Automatic labelling. Modern MP3 (MPEG-1/2 Layer 3) players provide the user with tools for
organizing songs. They rely on the contents of meta-data labels (e.g. album title); when these
labels are empty they can be automatically filled using fingerprinting techniques [2].
4) Querying by example. A song may be identified using a small excerpt of audio captured by a
mobile phone, as explained in [3].
5) Filtering in p2p networks. When music is transmitted in a peer-to-peer network, the audio-fingerprint
is determined from the packets and searched for in a list of copyrighted songs to prevent illegal
copies [4].
A. Characteristics of an AFP
To accomplish the tasks described above, an AFP should have the following properties:
1) Robustness. Audio signals may be subject to a variety of signal degradations such as noise contamination,
lossy compression, loudspeaker-to-microphone transmission (LsMic), low-pass filtering
simulating narrow-band telephone line transmission, equalization, cropping, time-shifting and
loudness variation. The AFP of a song should not be too different from the AFP of a degraded
version of the same song.
2) Compactness. Some applications need to store the AFP of every song from a possibly big collection;
other applications need to transmit the AFP over the internet. These facts make compactness a very
desirable characteristic of an AFP.
3) Granularity. Some Music Information Retrieval applications require the ability to identify a song
using only a small excerpt; for example, in querying by humming we do not want the user having
to hum the entire song he is searching for. Granularity is also known as robustness to cropping.
4) Time complexity. The AFP should be determined with as little computer effort as possible. The AFPs
of the whole collection of songs have to be determined in a reasonable time. Real-time systems
have to extract the AFP of a song on line; furthermore, in broadcast monitoring it is desirable to
be able to compute the AFP of several audio channels simultaneously.
5) Scalability. This is defined as the ability of an audio-fingerprinting system to operate with large databases;
this feature is conditioned by a low time complexity, a compact AFP size and a good indexing
technique.
B. Audio-fingerprint modelling
The first thing an audio-fingerprinting system has to do is to extract features from the signal. The
module in charge of extracting relevant perceptual features of the audio signal is known as the front end;
once this module delivers the features of the signal, the AFP system models the songs in a way that best
serves the purpose of the application for which it has been designed. Some AFP models are listed below:
• Sequences of feature vectors. This kind of AFP is also known as a trajectory or trace. The
features extracted at equally spaced periods of time are simply stored in a list of vectors or in a
table, one row per frame. An example of this kind of AFP is the binary vector sequence described
in [3].
• Statistics. Instead of storing every feature vector, only statistical data over the set of feature vectors
are stored. The audio-fingerprint designed for MPEG-7 [5] computes the means, variances, minimum
and maximum values every 32 frames. The minimum and maximum values are used for delimiting
the search, and the means and variances are used for the actual search using some measure like the
Mahalanobis distance.
• Codebooks. The sequence of feature vectors extracted from a song is replaced by a small number
of representative code vectors stored in a codebook, which from then on represents the song. This
model disregards the temporal evolution of the audio signal.
• Strings. Trajectories can be converted into long strings of integers using vector quantization. This
model allows the treatment of the songs as texts that can be compared using flexible string matching
techniques [6].
• Single vectors. These are the smallest AFPs; they are usually built with average features extracted
from the whole song. For example, an AFP can be a vector containing the beats per minute, the
average zero-crossing rate and the average spectrum [2].
• Hidden Markov Models (HMM). These finite state machines model non-stationary stochastic processes
(e.g. songs). For each song of the collection an HMM is built. The features extracted from the test
song are considered to be a sequence of acoustic events and then used as the input for the candidate's
HMM. The candidate's HMM in turn reports the probability that the test song matches the candidate
song; this probability is used as a proximity measure for choosing the right song [7].
C. Feature extraction
Audio-fingerprinting systems extract features from the signal, normally on a frame-by-frame basis. Most
systems extract the signal features in the frequency domain using a variety of linear transforms such as
the Discrete Cosine Transform, the Discrete Fourier Transform, the Modulation Frequency Transform [8]
and some Discrete Wavelet Transforms like Haar's and Walsh-Hadamard's [9].
Early work on audio-fingerprinting inherited the benefits of decades of research in speech processing.
Looking for more relevant features of music, a variety of perceptual variables have been used, such as
Loudness (PL) [10], the Joint Acoustic and Modulation Frequency (JAMF) [8], the Spectral Flatness
Measure (SFM) [11], the Spectral Crest Factor (SCF) [11], Spectral Subband Centroids (SSC) [12],
tonality [13], the sign of energy's second derivative [3] and chroma values [14], among others [15]. In
[12] it is shown that the Normalized SSC are more robust than MFCC and tonality for lossy compression
and equalization. In [8] it is reported that the Normalized JAMF has superior robustness over a spectral
estimate for compression and equalization. In [11] it is reported that the SFM has superior robustness over
PL and SCF as well. The SFM was adopted by MPEG-7 for audio-fingerprinting purposes [16]. We now
present a brief definition of the SFM due to the importance of this feature and because we included the
use of the SFM in our experiments as a reference.
The SFM is a feature related to the tonality aspect of the audio signal. The SFM is defined as the ratio
of the geometric mean to the arithmetic mean of the power spectrum coefficients. The SFM for band b
with bandwidth n_b can be computed using formula (1). The SFM reports values between zero and one;
values near one mean that the spectrum is flat and the audio is noisy, while values near zero show that
the audio signal is more tone-like.
SFM_b = [\prod_{i=1}^{n_b} c(i)]^{1/n_b} / [(1/n_b) \sum_{i=1}^{n_b} c(i)]   (1)
where c is the vector where the power spectrum coefficients are stored.
To put our work in context, we implemented an audio-fingerprint using the SFM as the relevant
perceptual feature; we will refer to it as the Spectral Flatness Signature (SFS). The SFM is computed
for each frame and band using the same resolution both in frequency and time as the Spectral Entropy
Signature (SES). This was done so that any possible improvement or decline in robustness could
not be attributed to anything else but the perceptual capabilities of the features put into comparison.
Some systems extract signal features directly in the time domain, as in [17], where the sign of the time
derivative of the signal was found to be robust to lossy compression and low-pass filtering. Another
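As an illustration (not taken from the paper), formula (1) can be sketched in a few lines of Python; the log-domain geometric mean and the small epsilon are our own numerical-stability choices:

```python
import numpy as np

def sfm(power_band):
    """Spectral Flatness Measure for one band, eq. (1):
    geometric mean over arithmetic mean of the power spectrum coefficients."""
    c = np.asarray(power_band, dtype=float)
    # Geometric mean computed in the log domain to avoid over/underflow
    # of the product; the epsilon guards against log(0).
    geo = np.exp(np.mean(np.log(c + 1e-12)))
    arith = np.mean(c)
    return geo / arith
```

A flat band (all coefficients equal) yields an SFM near one, while a band dominated by a single coefficient yields a value near zero, as the text describes.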
example is the signature presented in [18], which is thoroughly described next.
D. The Entropy of a Signal as the Relevant Perceptual Feature
Searching for features in audio signals that would still be present if those audio signals were severely
degraded, we decided to explore the use of entropy for audio-fingerprinting purposes. We began by using
the time-domain entropy as explained in [18]. For completeness, we include below a brief discussion of
entropy and some interesting properties.
The entropy of a signal is a measure of the amount of information the signal carries [19]. Shannon's
entropy is computed using (2), and its continuous version, called "differential entropy", is defined as in (3)
[19].
H(x) = -\sum_{i=1}^{n} p_i ln(p_i)   (2)

where p_i is the probability for any sample of the signal to adopt value i, n being the number of possible
values the samples may adopt; for example, if the sample size is 8 bits, then n = 2^8 = 256.

H(X) = -\int_{-\infty}^{+\infty} p(x) ln[p(x)] dx   (3)
The entropy of a signal is a measure of how unpredictable it is. If the signal is a constant k, then
its probability distribution function (PDF) is a unitary impulse located at k, that is p_i = δ(i − k), and its
entropy or unpredictability is zero, as shown in (4); observe that 0 ln(0) needs to be considered zero for
this to be true. In the opposite case, if the signal has a uniform distribution then the entropy is
maximum; that is, if p_i = 1/n for n possible values then its entropy is ln(n), as in (5).

H_min = -\sum_{i} δ(i − k) ln[δ(i − k)] = -ln(1) = 0   (4)

H_max = -\sum_{i=1}^{n} (1/n) ln(1/n) = -ln(1/n) = ln(n)   (5)
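A minimal Python sketch (ours, not from the paper) of estimating Shannon's entropy (2) from an empirical histogram of sample values; absent values are simply skipped, which realizes the 0 ln(0) = 0 convention:

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Shannon entropy (natural log) of a discrete signal,
    estimated from the empirical histogram of its sample values."""
    counts = Counter(samples)
    total = len(samples)
    h = 0.0
    for c in counts.values():
        p = c / total
        h -= p * math.log(p)  # values with zero count never appear, so 0*ln(0) is skipped
    return h
```

A constant signal gives entropy 0 as in (4), and a signal uniform over n values gives ln(n) as in (5).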
Entropy has been used on speech signals in noisy environments as a segmentation tool [20]. Also,
entropy has been used in choosing the desirable frame rate in the analysis of speech signals [21].
By processing an audio signal in frames of two seconds overlapped 50% and computing Shannon's
entropy for every frame, a sequence of entropy values is obtained; we will refer to this sequence as the
entropy curve. The entropy curves of several degradations of the song Diosa del cobre are shown in
figure 1. Please note how similar the entropy curves look between the original and the lossy compressed
(i.e. mp3@32kbps) version, the low-pass filtered (i.e. 1 KHz cutoff) version and the scaled (i.e. 50 percent
louder) version. The profile of these four entropy curves is almost identical, so we can safely use the sign
of the derivative to build a binary string that we call in this paper the Time-domain Entropy Signature
(TES). As reported in [18], the TES is not only extremely compact and easy to compute but turned out to
be very robust to the specific degradations of low-pass filtering, scaling and lossy compression. On the
other hand, the entropy curve is severely deformed when the song is degraded by equalization, noise
mixing and re-recording (i.e. loudspeaker-to-microphone transmission in a noisy environment). The fact
that the TES is not robust under equalization when compared to Haitsma's AFP [3] is acknowledged in
[18]; robustness under noise mixing and re-recording was not assessed in [18], but further experiments
conducted now and reported in this paper reveal the weakness of the TES for noise mixing and re-recording
in a noisy environment. To cope with these deformations, we combined the use of entropy with the
robust AFP design described in [3], only using spectral entropy instead of energy's second derivative. An
extremely robust AFP was obtained, which will be described next.
II. OUR CONTRIBUTION: THE SPECTRAL ENTROPY SIGNATURE (SES)
The human ear perceives the lower frequencies better than the higher ones. The Bark scale defines 25
critical bands, each of which corresponds to a section of the cochlea of about 1.3 mm [10]. Equation
(6) can be used to convert Hertz to Barks.
z = 13 tan^{-1}(0.76 f / 1000) + 3.5 tan^{-1}[(f / 7500)^2]   (6)

where f is the frequency in Hertz and z is the frequency in Barks.
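Equation (6) translates directly into code; as a sanity check, it maps 15.5 KHz to roughly Bark 24, consistent with discarding the 25th band later. A sketch (the function name is ours):

```python
import math

def hz_to_bark(f):
    """Critical-band rate in Barks for a frequency f in Hertz, eq. (6)."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)
```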
If the entropy of the spectral coefficients that correspond to a specific critical band is computed for
every frame of an audio signal, we obtain a sequence that we call the spectral entropy curve. Remember from
the preceding section how equalization deformed the entropy curve, making the TES practically unsuitable
for this kind of degradation. We found that this was not the case when the spectral entropy curves
were obtained for each critical band. To show this effect, we included figure 2, where we can see the
spectral entropy curves for critical bands 4, 8, 12, 16 and 20 of the song Diosa del cobre. The curves
at the left in figure 2 correspond to the original song, while the curves at the right correspond to the
equalized version. Remarkably, the spectral entropy curves seem almost unaffected by equalization. Not
Fig. 1. Entropy curves of several degraded versions of the song Diosa del cobre: original, equalized, low-pass filtered, noisy, lossy compressed, re-recorded and louder.
all 24 critical bands are shown, to avoid overcrowding; the other bands behave just alike. This early
experiment was quite encouraging for the design of an audio-fingerprint based on spectral entropy, the
Spectral Entropy Signature (SES).
A. Entropygram Determination
The first steps in the determination of the SES of a song are related to the determination of its
entropygram (defined below); these steps are:
1) Stereo audio signals are first converted to monaural by averaging both channels.
Fig. 2. Spectral entropy curves for critical bands 4, 8, 12, 16 and 20 (from the top down) according to the Bark scale. Not all
24 bands are shown so the figure is not overcrowded. Left: original. Right: equalized version.
2) The signal is processed in frames of 370 ms; this frame size ensures an adequate time support for
entropy computation. The frame sizes normally used in audio-fingerprinting range from 10 ms to
500 ms according to [15]. The frame size used in [3] is precisely 370 ms.
3) Our frames are overlapped fifty percent; therefore, 5.4 frames per second will be the frame rate for
the SES extraction. A low frame rate like this will result in a compact audio-fingerprint.
4) To each frame the Hann window is applied and then its DFT is determined.
5) Shannon's entropy is computed for the first 24 critical bands according to the Bark scale, discarding
only the 25th critical band (frequencies between 15.5 KHz and 20 KHz). For any given band b, the
elements of the DFT corresponding to b are used to build two histograms, one for the real parts
and another one for the imaginary parts of these elements. The histograms are used to estimate
the probability distribution functions. Shannon's entropy for the real and imaginary parts of the
DFT are computed separately; call them h_br and h_bi respectively. The entropy h_b for band b is
determined as the sum of h_br and h_bi.
For each frame of the audio signal a vector with 24 values of spectral entropy is obtained. The sequence
of vectors corresponding to a short excerpt of audio of a few seconds makes a matrix of 24 rows and
a number of columns that depends on the duration of the excerpt. Such a matrix can be shown as an
image where the horizontal axis represents time, the vertical axis represents frequency and the gray
levels represent the amount of information (i.e., entropy) for every band and frame. We call these images
entropygrams. Some entropygrams are shown in figure 7; we will refer to them later.
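The five steps above can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: the histogram bin count and the band_edges_hz parameter (the 25 Bark band edges in Hertz) are assumptions, since the text does not fix them here.

```python
import numpy as np

def band_entropy(dft_band, bins=32):
    """Step 5: histogram-estimated Shannon entropy of the real and
    imaginary parts of one band's DFT coefficients, summed."""
    h = 0.0
    for part in (np.real(dft_band), np.imag(dft_band)):
        counts, _ = np.histogram(part, bins=bins)
        p = counts[counts > 0] / counts.sum()
        h -= np.sum(p * np.log(p))
    return h

def entropygram(signal, rate, band_edges_hz, frame_len=None):
    """Steps 2-5: frames of 370 ms, 50% overlap, Hann window, DFT,
    then one entropy value per critical band and frame (24 rows)."""
    n = frame_len or int(0.370 * rate)
    hop = n // 2
    freqs = np.fft.rfftfreq(n, d=1.0 / rate)
    cols = []
    for start in range(0, len(signal) - n + 1, hop):
        frame = signal[start:start + n] * np.hanning(n)
        spec = np.fft.rfft(frame)
        col = [band_entropy(spec[(freqs >= lo) & (freqs < hi)])
               for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])]
        cols.append(col)
    return np.array(cols).T  # bands x frames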
B. The codification step
We used figure 2 to show how the spectral entropy curve of any specific critical band is practically not
deformed when a song is equalized; the profile of the curve remains almost unchanged. There is, however,
a vertical shift; therefore, just as we did for the TES, we keep for each frame only an indication of whether
the spectral entropy is increasing or not for each band. Equation (7) states how the bit corresponding
to band b and frame n of the AFP is determined using the entropy values of frames n and n − 1 for
band b. Only 3 bytes (i.e. 24 bits) are needed for each frame of audio signal; that was another reason for
dropping the 25th critical band. A diagram of the process of determining the SES, including the steps for
the entropygram determination and the codification step, is depicted in figure 3.
F(n, b) = { 1  if [h_b(n) − h_b(n − 1)] > 0
          { 0  otherwise                       (7)
Since the SES of a song is a binary matrix, it can be shown as a black-and-white image like the one
shown in figure 4, where a piece of the AFP of Diosa del cobre is shown. In the same figure a 5
second excerpt is magnified.
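A sketch of the codification step under our own naming: ses_bits implements eq. (7) column-wise over a bands-by-frames entropy matrix, and pack_frames stores each 24-bit frame in 3 bytes as the text describes (24 bits at 5.4 frames/s is roughly 0.13 kbps):

```python
import numpy as np

def ses_bits(entropygram):
    """Eq. (7): one bit per band and frame, set where the band's
    spectral entropy increased with respect to the previous frame."""
    return (np.diff(entropygram, axis=1) > 0).astype(np.uint8)

def pack_frames(bits):
    """Pack each 24-bit frame column into 3 bytes."""
    return [np.packbits(bits[:, j]).tobytes() for j in range(bits.shape[1])]
```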
III. EXPERIMENTS
The experiments on robustness were carried out taking into consideration the following deformations:
1) Cropping. The songs will be identified using excerpts of only 5 seconds.
Fig. 3. Information content analysis and coding for SES extraction: framing, FFT, band division, per-band entropy computation (real and imaginary parts summed into h_0(n) through h_23(n)) and codification of the entropy differences into the bits F(n, 0) through F(n, 23).
Fig. 4. A fragment of the AFP of the song Diosa del cobre. A 5 sec excerpt is magnified
2) Desynchronization. Since the excerpts used to look for a song may begin at any instant, to reproduce
a real scenario no frame from the excerpt will correspond to exactly the same period of time as
any frame from the original song. This deformation is also known as time-shifting.
3) Lossy compression to MP3 using a bit rate of only 32 kbps; this is frequently represented as
mp3@32kbps. This particular degradation also introduces a time-shift.
4) Equalization according to the diagrams shown in figure 5; these are common equalization styles
from [22].
5) Mixing with white noise; this kind of noise contaminates all bands in the same way. Once mixed
with white noise the songs have a Signal-to-Noise Ratio (SNR) between 3 dB and 5 dB. The SNR
is computed using equation (8), where P_signal is the power of the original signal and P_noise is the
power of the noise added to the signal.
(a) 1965 (b) Classic V (c) Louder (d) Pop (e) Soft Bass
Fig. 5. Equalization styles used. Eighteen bars spread from the lowest band (i.e. leftmost) at 55 Hz to the highest band at 20
KHz. A bar above the horizontal axis indicates the amplification of its corresponding band. A bar below the horizontal axis
indicates the attenuation of its corresponding band.
SNR = 20 log_{10}(P_signal / P_noise)   (8)
Noise produced by big fans falls in the range that is referred to as white noise. The noise we used to
contaminate songs can be accessed under the name turbofan-hifi.wav at:
http://www.asti-usa.com/skinny/sampler.html. Please note that colored noise is
not as severe since it will affect only some bands of the signal.
6) Low-pass filtering with a cutoff frequency of 1 KHz. A second-order Butterworth filter with -20
dB/decade was used, meaning that the signal's amplitude has declined to a tenth at a frequency
that is ten times the cutoff frequency.
7) Loudspeaker-Microphone transmission (Ls-Mic). This degradation consisted of playing the music
with the pair of loudspeakers of a multimedia system and recapturing it with an omnidirectional
microphone with a sensitivity of -54±3 dB and a frequency response of 50 Hz to 16 KHz, in a noisy
environment.
8) Scaling. The signal was amplified 50 percent without clipping prevention; in fact, approximately 30
percent of the signal's peaks were clipped during this degradation.
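The noise-mixing degradation (5) can be sketched as follows; the gain solves eq. (8) as printed, with its 20 log10 convention, for a target SNR, and the function name is our own:

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale the noise and add it to the signal so the mixture has the
    requested SNR under eq. (8): SNR = 20*log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 20*log10(p_signal / (g^2 * p_noise)) = snr_db for the gain g.
    g = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 20.0)))
    return signal + g * noise[:len(signal)]
```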
A. Experiment 1. Using whole songs
Since the TES is a signature designed for whole songs, it was only fair to include it exclusively in
experiments where whole songs were compared. The first experiment carried out is described as follows:
1) The audio-fingerprints (i.e. SES, TES and SFS) of 4,000 songs from all kinds of genres (rock, pop,
classical, etc.) were extracted.
2) Four hundred songs (i.e. ten percent) were subjected to the six signal degradations numbered (3) to
(8) at the beginning of this section.
3) The signatures of the 2,400 audio files obtained in the previous step were also extracted.
4) The audio-fingerprints of the degraded songs were searched among the audio-fingerprints of the
collection of 4,000 using the nearest-neighbor criterion.
As a distance measure between the SES of two songs the Hamming distance was used. The Hamming
distance was also used when comparing the TES of two songs. Finally, the Euclidean distance was used
when comparing the SFS of two songs.
Table I shows the precision rate for the TES, SES and SFS that resulted from this experiment. The
precision rate is defined as the fraction of the correctly identified songs (i.e. true positives) over the number
of queries performed (i.e. true positives plus false positives) [23]. In this first experiment the SES showed high
robustness to every considered degradation, which was very encouraging. The SFS showed high robustness to
equalization, lossy compression and scaling. Finally, the TES showed high robustness to low-pass filtering,
lossy compression and scaling; its robustness to re-recording is also acceptable.
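For two equal-sized binary matrices, the normalized Hamming distance used throughout these experiments can be sketched as (our own helper):

```python
import numpy as np

def normalized_hamming(a, b):
    """Fraction of differing bits between two equal-sized binary matrices."""
    a, b = np.asarray(a), np.asarray(b)
    return np.count_nonzero(a != b) / a.size
```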
TABLE I
PRECISION RATE FOR DIFFERENT SIGNAL DEGRADATIONS USING TES, SES AND SFS WITHOUT CROPPING (WHOLE
SONGS)
Degradation                          TES      SES     SFS
Equalization                         53.7 %   100 %   100 %
Noise contamination (SNR ≈ 4 dB)     63.2 %   100 %   55.3 %
Re-recording (LsMic)                 92.1 %   100 %   80 %
Low-Pass filtering (1 KHz)           100 %    100 %   72.1 %
Lossy Compression (32 kbps)          100 %    100 %   100 %
Scaling (50 percent louder)          100 %    100 %   100 %
B. Experiment 2. Using small excerpts
Experiment 1 was done using whole songs; however, for some applications it is important to recognize
a song using only a small excerpt of a few seconds. To verify the robustness to the degradations
considered in experiment 1 combined with cropping and desynchronization at the same time, the following
experiment was carried out:
1) The signatures (i.e. SES and SFS) of 4,000 songs from all kinds of genres (rock, pop, tropical,
classical, etc.) were extracted and stored.
2) 400 of these songs were degraded in six different ways: lossy compression, equalization, mixing
with noise, low-pass filtering, scaling and finally loudspeaker-microphone transmission in a noisy
environment.
3) From each of the 2,400 audio files (including originals) obtained in the previous step an excerpt of
5 seconds was extracted; therefore all those degradations were combined with cropping and, at the
same time, with desynchronization.
4) The short signatures of the 2,400 excerpts that resulted from the previous step were determined.
In figure 7 the entropygrams of an excerpt of the song Diosa del cobre corresponding to the seven
degraded versions considered are shown; their corresponding signatures are shown in figure 8.
5) All the short signatures determined in the previous step were searched inside every whole song's
signature from the collection of 4,000 determined in the first step using the nearest-neighbor
criterion. For example, the nearest SES signature to those shown in figure 8 was found inside the
piece of the whole song's SES signature that is magnified in figure 4 and shown again in figure 6.
Fig. 6. The piece of SES magnified in figure 4, at the same size as that used in figures 8 and 9
The Hamming distance is used to establish how different the SES of two excerpts are from each
other. The Hamming distance between two binary matrices can be conceived as a measure of the fullness
of the matrix that results from computing the absolute difference between them. In figure 9 we show
the differences found between the degraded versions of an excerpt of the song Diosa del cobre and the
nearest neighbor, which was indeed found inside that song, precisely the one shown in figure 6.
Not even the excerpts extracted from the original songs were found without errors. To understand this,
consider that the probability for the first frame of the randomly selected excerpts to be aligned with
any frame of the song is very small, so the experiment is reproducing a real scenario; this effect is
known as desynchronization or time-shift. In figure 10, the distance between an excerpt of a song and
the most similar (i.e. closest in Hamming distance) segment of audio inside the same song is plotted as a
function of the time-shift. To generate this curve a song at 44100 samples per second was used; therefore
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 7. Entropygrams of the same excerpt of five seconds from several degraded versions of the song Diosa del cobre
a frame of 0.37 sec is made of 16384 samples. The first excerpt was extracted beginning at a position
that was a multiple of the frame size (i.e. zero time-shift); this excerpt was of course found inside the
song without errors (i.e. zero distance) and corresponds to the first point of the curve. The second excerpt
was extracted beginning 100 samples (i.e. 2.2 ms) after the first excerpt; the most similar piece of audio
inside the song was found at a normalized Hamming distance of 0.013, and it corresponds to the second
point of the curve. The third excerpt was extracted beginning 100 samples after the second excerpt, and
so on. Since the frames are overlapped fifty percent, a distance of zero is found again at a time-shift of
185 ms (i.e. half the frame size). It is clear from figure 10 that increasing the overlap will result in a
more robust audio-fingerprint; if, for example, the overlap is increased to 90%, the normalized Hamming
distance between a random excerpt extracted from a song and the most similar segment of audio inside
the same song could not be greater than 0.1 instead of 0.2, which is the maximum distance observed in
figure 10. Of course, a price in processing time and disk space would have to be paid if an increase in
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 8. Signatures of excerpts of five seconds of the degraded versions of the song Diosa del cobre.
robustness is desired.
The SES of a song is a binary matrix with a number of rows that depends on the duration of the
song and a fixed number of columns (e.g. 1501 × 24 for a song of 4 minutes and 39 seconds). On the
other hand, the SES of an excerpt is a binary matrix with a fixed size (i.e. 24 × 24 in our experiment)
as long as the duration of the excerpts does not change. Using the Hamming distance, we compared the
short binary matrix of the excerpt with every possible submatrix of the same size from every signature
of the collection in order to find the song to which the five-second excerpt belongs. For example,
the nearest SES to those shown in figure 8 was found inside the piece of the whole song's SES that
is magnified in figure 4. The brute-force search procedure took approximately 50 seconds to answer a
query on a 2.8 GHz Pentium 4 PC with 512 Mbytes of RAM. The search time was reduced to about
20 seconds using the following strategy: instead of finishing the computation of the Hamming distance
between any submatrix and the SES of the query excerpt just to find out that they are too different, skip
(a) Original (b) Equalized (c) Low-pass filtered (d) Noisy (e) Lossy compressed (f) Ls-Mic (g) Louder
Fig. 9. Absolute differences; the Hamming distance as an indicator of fullness.
to the next submatrix as soon as the normalized Hamming distance between the first columns of the
submatrix and the first columns of the SES of the query excerpt is higher than 0.3.
In table II the rates of correctly identified songs using SES and SFS are shown for the signal
degradations considered. The SES shows higher robustness to noise addition, re-recording and low-pass
filtering.
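The early-abandoning search described above can be sketched as follows. This is our own reconstruction; the 0.3 threshold comes from the text, but how many "first columns" to test before skipping is not fixed there, so prefix_cols is an assumption:

```python
import numpy as np

def find_best_offset(song_ses, query_ses, prefix_cols=4, threshold=0.3):
    """Slide the query SES over the whole-song SES and return
    (best_offset, best_normalized_hamming). Candidate submatrices whose
    first prefix_cols columns already differ by more than threshold are
    skipped without computing the full distance (early abandoning)."""
    bands, q_frames = query_ses.shape
    best = (None, 1.0)
    for off in range(song_ses.shape[1] - q_frames + 1):
        window = song_ses[:, off:off + q_frames]
        prefix_d = np.count_nonzero(
            window[:, :prefix_cols] != query_ses[:, :prefix_cols]
        ) / (bands * prefix_cols)
        if prefix_d > threshold:
            continue  # already too different; skip this submatrix
        d = np.count_nonzero(window != query_ses) / query_ses.size
        if d < best[1]:
            best = (off, d)
    return best
```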
C. Experiment 3.
Since experiments 1 and 2 were not able to find any weakness in the SES, we designed a third
experiment where degraded songs are not only compared with originals but with other degraded versions
as well; for example, the equalized version of a song can be compared with its noisy version. In the
problem of querying by example the degraded versions are always compared with originals; however,
this is not the case for other applications. For example, in radio broadcast monitoring the audio signal
that is going to be used as the reference for a specific commercial spot is normally captured in the same
Fig. 10. Normalized Hamming distance between an excerpt of a song and the most similar segment of audio inside the same song as a function of
the time-shift
TABLE II
PRECISION RATE FOR DIFFERENT SIGNAL DEGRADATIONS. EXPERIMENT 2 (USING EXCERPTS OF 5 SECONDS)
Degradation                                      SES     SFS
Cropping and time-shift                          100 %   100 %
Equalization, cropping and time-shift            100 %   100 %
Noise contamination, cropping and time-shift     100 %   63 %
Re-recording in noisy environment,
cropping and time-shift                          100 %   75 %
Low-Pass filtering, cropping and time-shift      100 %   82 %
Lossy Compression, cropping and time-shift       100 %   100 %
Scaling, cropping and time-shift                 100 %   100 %
way as the audio signal to be monitored. As another example, consider a p2p application that is always
looking in the network for audio files with better quality than the ones in the local host; this application
would be comparing all kinds of degraded versions, including those obtained from old tapes.
We used thirty-eight songs for this experiment, each one in six degraded versions, making a total of
228 audio files. Each audio file were compared with every otherone to fill up theconfusion matrixof
228 rows and 228 columns. The5 1984 locations of the confusion matrix correspond to the same number
of distances between audio-signatures that had to be computed. The confusion matrix that results from
this comparisons using SES is shown as an image in figure 11. Every pixel of figure 11 has a gray level
according to the distance it represents (darker means closer). The first row of pixels represents the set
of distances between the first audio file and the rest of them, the second row of pixels represents the
distances between the second audio file and all of the others and so on. The 228 pixels along the diagonal
ar all black because the distance between any audio file and itself is always zero. The symmetry of the
Hamming distance can also be appreciated in the confusion matrix. The names of the audio files have
a prefix according to the song’s name they belong to and a suffixthat denotes the kind of degradation
the song suffered. The audio files are maintained in alphabetical order according to its name, for this
reason, the ideal confusion matrix would be all white with 38black squares along the main diagonal,
each of which would be of size6 × 6 (i.e six degradations). The confusion matrix that results from this
comparisons using SFS is shown in figure 12. The confusion matrix that results from this comparisons
using TES is shown in figure 13. The expected6 × 6 black squares along the diagonal are not as well
defined in 12 or in 13 as they are in figure 11. Figure 11 even resemble us the ideal confusion matrix
described above, this fact reveals SES as a more robust audiofingerprint than SFS or TES.
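The pairwise comparison above can be sketched in a few lines: fill a symmetric matrix of normalized Hamming distances between binary signatures and check the properties visible in figure 11 (zero diagonal, symmetry). The randomly generated stand-in signatures and all variable names below are ours, for illustration only; they are not the actual fingerprints from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 228 binary audio signatures: each row is a
# song excerpt's sequence of 24-bit frames flattened into one bit vector.
n_files, n_bits = 228, 24 * 27  # 27 frames of 24 bits each (illustrative sizes)
signatures = rng.integers(0, 2, size=(n_files, n_bits), dtype=np.uint8)

def normalized_hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of differing bits between two equal-length bit vectors."""
    return np.count_nonzero(a != b) / a.size

# Fill the 228 x 228 confusion matrix of pairwise distances (51,984 entries).
confusion = np.empty((n_files, n_files))
for i in range(n_files):
    for j in range(n_files):
        confusion[i, j] = normalized_hamming(signatures[i], signatures[j])

# Properties visible in Fig. 11: black (zero) diagonal and symmetry.
assert np.all(np.diag(confusion) == 0)
assert np.allclose(confusion, confusion.T)
```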
Fig. 11. Confusion matrix resulting from Experiment 3 using SES. [Image omitted: 228 × 228 gray-level matrix.]
Fig. 12. Confusion matrix resulting from Experiment 3 using SFS. [Image omitted: 228 × 228 gray-level matrix.]
Fig. 13. Confusion matrix resulting from Experiment 3 using TES. [Image omitted: 228 × 228 gray-level matrix.]
D. Sensitivity Analysis and Optimal Threshold Selection for Experiment 3
When two degraded versions of the same song have a distance below some threshold th, we say that we
are in the presence of a true positive; if those two degraded versions have a distance above th,
we are dealing with a false negative. On the other hand, when comparing two different songs, if
the distance between them falls below th we call that a false positive, and if the distance is greater
than th, it is a true negative. Table III summarizes these definitions.
TABLE III
DEFINITIONS FOR THE SENSITIVITY ANALYSIS

                    dist < th             dist > th
Same songs          True Positive (TP)    False Negative (FN)
Different songs     False Positive (FP)   True Negative (TN)
The True Prediction Rate (TPR) is the fraction of songs the system correctly identifies (i.e., true
positives) out of all the songs the system should have identified. The TPR is also known as sensitivity or
recall, and it is estimated with (9). The TPR equals 1 − FRR, where FRR is the well-known False Rejection
Rate.
The False Prediction Rate (FPR) is a measure of how often the system mistakes a song for another,
and it is defined as in (10). The FPR is also known as the False Alarm Rate and equals 1 − specificity.

TPR = TP / (TP + FN)    (9)

FPR = FP / (FP + TN)    (10)
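Equations (9) and (10) can be applied to a distance matrix directly, counting the four categories of Table III at a given threshold. The sketch below uses toy distance values and same-song labels of our own making, not data from the experiment.

```python
import numpy as np

def tpr_fpr(dist: np.ndarray, same: np.ndarray, th: float) -> tuple[float, float]:
    """Eqs. (9)-(10): TPR = TP/(TP+FN), FPR = FP/(FP+TN).

    dist: pairwise distance matrix; same: boolean matrix, True where the
    two files are versions of the same song (Table III definitions).
    """
    below = dist < th
    tp = np.count_nonzero(same & below)    # same song, close     -> TP
    fn = np.count_nonzero(same & ~below)   # same song, far       -> FN
    fp = np.count_nonzero(~same & below)   # different, close     -> FP
    tn = np.count_nonzero(~same & ~below)  # different, far       -> TN
    return tp / (tp + fn), fp / (fp + tn)

# Toy example: 4 files, files 0-1 share one song and files 2-3 share another.
dist = np.array([[0.0, 0.1, 0.6, 0.7],
                 [0.1, 0.0, 0.5, 0.6],
                 [0.6, 0.5, 0.0, 0.2],
                 [0.7, 0.6, 0.2, 0.0]])
same = np.array([[True,  True,  False, False],
                 [True,  True,  False, False],
                 [False, False, True,  True],
                 [False, False, True,  True]])

tpr, fpr = tpr_fpr(dist, same, th=0.3)
# Every same-song pair falls below 0.3 and no cross-song pair does:
assert (tpr, fpr) == (1.0, 0.0)
```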
The ROC space is the plane whose vertical axis is the TPR and whose horizontal axis is the FPR; a
single point in this plane gives the performance of the system for a given threshold. By varying the threshold
over all its possible values a ROC curve is generated. Figure 14 shows the ROC curves for the systems that use
SES, TES and SFS; there we can clearly see that the area under the ROC curve for the SES is
greater than the area under the ROC curve for the SFS or the TES.
The analysis used to generate the ROC curves also yields the optimal threshold for each system:
the threshold corresponding to the point of the curve that is closest to the upper-left corner. Using the optimal
threshold, the precision rates for all possible combinations of degradations (e.g., low-pass filtered against
equalized) when the SES was used are shown in Table IV. Table V shows the precision rates when the SFS was
used, and Table VI those when the TES was used.
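The threshold sweep and the closest-to-corner criterion can be sketched as follows, reusing the toy distance matrix idea from above (the data and the threshold grid are illustrative, not the experiment's):

```python
import numpy as np

def roc_points(dist, same, thresholds):
    """Trace the ROC curve as (FPR, TPR) points, one per threshold."""
    pts = []
    for th in thresholds:
        below = dist < th
        tp = np.count_nonzero(same & below)
        fn = np.count_nonzero(same & ~below)
        fp = np.count_nonzero(~same & below)
        tn = np.count_nonzero(~same & ~below)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

def optimal_threshold(pts, thresholds):
    """Threshold whose ROC point is closest to the ideal corner (FPR=0, TPR=1)."""
    d = np.hypot(pts[:, 0], 1.0 - pts[:, 1])
    return thresholds[int(np.argmin(d))]

# Toy data: files 0-1 share one song, files 2-3 share another.
dist = np.array([[0.0, 0.1, 0.6, 0.7],
                 [0.1, 0.0, 0.5, 0.6],
                 [0.6, 0.5, 0.0, 0.2],
                 [0.7, 0.6, 0.2, 0.0]])
same = np.array([[True,  True,  False, False],
                 [True,  True,  False, False],
                 [False, False, True,  True],
                 [False, False, True,  True]])

thresholds = np.linspace(0.05, 0.95, 19)
pts = roc_points(dist, same, thresholds)
opt = optimal_threshold(pts, thresholds)
# The first threshold that separates same-song from cross-song distances
# perfectly (point (0, 1)) is selected:
assert abs(opt - 0.25) < 1e-6
```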
Segments from several degraded versions of one of the songs in the test set are available at
http://lc.fie.umich.mx/~camarena/Audiofiles.html
Fig. 14. ROC curves for Experiment 3. [Plot omitted: True Prediction Rate vs. False Prediction Rate for the SES, TES and SFS systems.]
TABLE IV
PRECISION RATE OBTAINED USING SES AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original   100 %     100 %   100 %   100 %   100 %
LowPass              100 %   100 %    97 %   100 %
EQ                           100 %    97 %   100 %
Loud                                 100 %   100 %
Noisy                                         95 %
IV. CONCLUSIONS
1) Regarding robustness. The spectral entropy signature proposed in this paper has proved to be highly
robust to heavy degradations of the audio signals. The SES turned out to be a more robust AFP
than the TES, specifically for equalization, noise contamination, and loudspeaker-to-microphone
transmission in a noisy environment (LsMic). The SES turned out to be more robust than the SFS
specifically for noise-contaminated songs, LsMic, and low-pass filtered songs. All the five-second
excerpts were correctly identified using the SES no matter what kind of degradation the audio signal
was subject to. These results do not contradict those reported in [11], since the level of degradation to
which the songs were subject in our experiments was higher; for example, in [11] the songs
were contaminated with noise only down to a “reasonable SNR of 20-25 dB simulating background
noise”, whereas we mixed in noise down to an SNR of 4-5 dB. Please note that an SNR of 0 dB means
that noise and music have the same intensity level, which would make it difficult even for the human
auditory system to identify the songs.

TABLE V
PRECISION RATE OBTAINED USING SFS AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original    97 %     100 %   100 %    77 %    95 %
LowPass              100 %   100 %    71 %    90 %
EQ                           100 %    76 %    87 %
Loud                                  77 %    95 %
Noisy                                         74 %

TABLE VI
PRECISION RATE OBTAINED USING TES AND ITS OPTIMAL THRESHOLD (EXPERIMENT 3)

           LowPass   EQ      Loud    Noisy   LsMic
Original    87 %      52 %   100 %    74 %    85 %
LowPass               52 %    87 %    61 %    74 %
EQ                            42 %    26 %    42 %
Loud                                  74 %    55 %
Noisy                                         37 %
It is very interesting that the SES of the low-pass filtered songs did not change significantly even
though only 8 of the 24 considered critical bands fall below the cutoff frequency of 1 kHz. To
understand this effect, remember that a low-pass filter attenuates the content of the signal above the
cutoff frequency gradually as the frequency increases (the Butterworth filter that we used attenuates
the signal at a rate of -20 dB/decade); also, the entropy value depends on the distribution of
the spectrum within each considered band, regardless of its amplitude. Observe how the absolute
differences shown in figure 9(c) are not relevant below the first 20 bands; only the last 4 bands (at
the top) seem affected.
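The amplitude-invariance argument can be verified directly: Shannon's entropy of a band depends only on the normalized shape of its spectrum, so uniformly attenuating a band, as a gentle low-pass roll-off nearly does within a narrow critical band, leaves its entropy unchanged. The synthetic magnitudes below are for illustration only.

```python
import numpy as np

def band_entropy(spectrum: np.ndarray) -> float:
    """Shannon entropy (bits) of a band's spectral distribution."""
    p = spectrum / spectrum.sum()          # normalize magnitudes to sum to 1
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
band = rng.random(32) + 1e-6               # synthetic magnitudes in one Bark band

# Rescaling the whole band changes its energy but not its distribution,
# so the entropy, and hence the SES bit derived from it, is unchanged.
assert np.isclose(band_entropy(band), band_entropy(0.01 * band))

# Sanity check: a flat band of n bins has entropy log2(n) bits.
assert band_entropy(np.ones(8)) == 3.0
```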
2) Regarding Compactness. Twenty-four bits every 185 ms is a very compact fingerprint of only 0.13
kbit/s. As a reference, Haitsma-Kalker's AFP requires 2.6 kbit/s [3]. Of course, the resolution can
be tuned (i.e., the size of the sliding window and the overlap percentage) depending on the application.
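The bitrate figure follows from simple arithmetic on the frame parameters stated above:

```python
bits_per_frame = 24      # one entropy bit per Bark critical band
frame_period_s = 0.185   # a new 24-bit frame every 185 ms

bitrate_kbps = bits_per_frame / frame_period_s / 1000.0
assert round(bitrate_kbps, 2) == 0.13   # about 20x more compact than 2.6 kbit/s [3]
```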
3) Regarding Time complexity. The time it takes to determine the SES of a song is approximately 7
percent of the duration of the song on a Pentium 4 personal computer at 2.8 GHz with 512 MB
of RAM. This parameter is important for real-time applications. Again, the higher the resolution
adopted, the longer it takes to determine the SES of a song.
4) Regarding Granularity. In this work we show the results of experiments where excerpts of five
seconds were used to identify a song. We experimented with excerpts of 10, 15, and 20
seconds as well. An elementary observation from those experiments is that the shorter the excerpt,
the higher the resolution (i.e., the greater the overlap percentage) required to identify a song.
5) Regarding Scalability. The first row of Table IV is the accuracy rate of searching for original songs
among 228 audio files. In Experiment 2, songs were searched inside a collection of 4,000 of them;
no decrease in the precision rate of the SES is observed as the database size grows.
Metric indexes, as surveyed in [24], could be used to speed up searches.
REFERENCES
[1] S. Shin, O. Kim, J. Kim, and J. Choil, “A robust audio watermarking algorithm using pitch scaling,” in 14th International
Conference on Digital Signal Processing, vol. 2, 2002, pp. 701-704.
[2] (2002) Musicbrainz trm musicbrainz-1.1.0.tar.gz. [Online]. Available: ftp://ftp.musicbrainz.org/pub/musicbrainz/
[3] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system,” in International Symposium on Music Information
Retrieval (ISMIR), 2002.
[4] P. Shrestha and T. Kalker, “Audio fingerprinting in peer-to-peer networks,” in 5th International Conference on Music
Information Retrieval (ISMIR), 2004.
[5] O. Hellmuth, E. Allamanche, M. Cremer, T. Kastner, C. Neubauer, S. Schmidt, and F. Siebenhaar, “Content-based broadcast
monitoring using mpeg-7 audio fingerprints,” in International Symposium on Music Information Retrieval (ISMIR), 2001.
[6] A. Y. Guo and S. Hava, “Time-warped longest common subsequence algorithm for music retrieval,” in 5th International
Conference on Music Information Retrieval (ISMIR), 2004.
[7] E. Batlle, J. Masip, and E. Guaus, “Amadeus: a scalable hmm-based audio information retrieval system,” in First
International Symposium on Control, Communications and Signal Processing, March 2004, pp. 731-734.
[8] S. Sukittanon and E. Atlas, “Modulation frequency features for audio fingerprinting,” in International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, University of Washington, USA, 2002, pp. II 1773-1776.
[9] S. Subramanya, R. Simha, B. Narahari, and A. Youssef, “Transform-based indexing of audio data for multimedia databases,”
in International Conference on Multimedia Applications, 1999.
[10] E. Zwicker and H. Fastl, Psycho-Acoustics. Facts and Models. Springer, 1990.
[11] J. Herre, E. Allamanche, and O. Hellmuth, “Robust matching of audio signals using spectral flatness features,” IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127-130, 2001.
[12] J. S. Seo, M. Jin, S. Lee, D. Jang, S. Lee, and C. D. Yoo, “Audio fingerprinting based on normalized spectral subband
centroids,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[13] R. P. Hellman, “Asymmetry of masking between noise and tone,” Perception and Psychophysics, vol. 11, pp. 241-246,
1972.
[14] S. Pauws, “Musical key extraction from audio,” in International Symposium on Music Information Retrieval (ISMIR), 2004.
[15] P. Cano, E. Battle, T. Kalker, and J. Haitsma, “A review of algorithms for audio fingerprinting,” Multimedia Signal
Processing, IEEE Workshop on, pp. 169-167, December 2002.
[16] M. A. Group, Text of ISO/IEC Final Draft International Standard 15938-4 Information Technology - Multimedia Content
Description Interface - Part 4: Audio, July 2001.
[17] F. Kurth and R. Scherzer, “A unified approach to content-based and fault tolerant music recognition,” in 114th AES
Convention, Amsterdam, NL, 2003.
[18] A. C. Ibarrola and E. Chavez, “A robust entropy-based audio-fingerprint,” in IEEE International Conference on Multimedia
and Expo (ICME 2006), July 2006, pp. 1729-1732.
[19] C. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois Press, 1949.
[20] J.-L. Shen, J.-W. Hung, and L.-S. Lee, “Robust entropy-based endpoint detection for speech recognition in noisy
environments,” in International Conference on Spoken Language Processing, Dec 1998.
[21] H. You, Q. Zhu, and A. Alwan, “Entropy-based variable frame rate analysis of speech signal and its applications to asr,”
in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2004.
[22] (2006) Foobar2000 equalizer presets eqpresets.zip. [Online]. Available: http://sjeng.org/ftp/fb2k/eq presets.zip
[23] T. Fawcett, “Roc graphs: Notes and practical considerations for researchers,” HP Labs, Tech. Rep. HPL-2003-4, 2003.
[24] E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, “Proximity searching in metric spaces,” ACM Computing
Surveys, vol. 33(3), pp. 273-321, 2001.
Antonio Camarena-Ibarrola received the B.Eng. degree in Electrical Engineering in 1987 from the
Electrical Engineering School at the Universidad Michoacana de San Nicolas de Hidalgo in Mexico. He
received an M.S. degree in Computer Science in 1996 from the Instituto Tecnologico de Toluca in Mexico.
His research interests are in pattern recognition and signal processing.
He has been a teacher at the Electrical Engineering School since 1988 and is currently working
towards his Ph.D. degree.
Edgar Chavez received an M.S. degree in Computer Science from the Universidad Nacional Autonoma de
Mexico, and a Ph.D. in Computer Science from the Centro de Investigacion en Matematicas, Mexico. He is
currently a full professor at the Universidad Michoacana, a fellow of the Mexican Research System (SNI), and
president of the Mexican Computer Science Society. He has published over 50 papers in international
conferences, journals, and book chapters. He has been general chair of SPIRE 1999, ENC 2004, and LATIN
2001, and co-chair of the technical program of CPM 2003, AWIC 2004, and AdHocNow! 2005. His
research interests include pattern recognition, algorithms, and information retrieval.