changing the mfcc intensity functionnwj2/masters.pdf · the section \tuner mfcc code" is...

$: CHANGING THE MFCC INTENSITY FUNCTIONnwj2/masters.pdf · The section \tuneR MFCC code" is intended to be supplemental to the following paragraphs which explain the MFCC extraction$
CHANGING THE MFCC INTENSITY FUNCTION

Abstract of

a thesis presented to the Faculty

of the University at Albany, State University of New York

in partial fulfillment of the requirements

for the degree of

Master of Arts

College of Arts & Sciences

Department of Mathematics & Statistics

Nicholas W. Jarrett

2010

ABSTRACT

Mel-Frequency Cepstral Coefficients (MFCCs) are commonly used for a variety of

problems in signal processing. There can be limitations to this approach when you at-

tempt to model something which cannot be easily discriminated with the human ear.

The MFCC intensity function is: 1125 · ln(1 + f

700

), where f denotes frequency. This

thesis investigates the results that can be obtained with a more objective intensity

function to generate coefficients. These coefficients will be referred to as NMFCCs.

The new intensity function was obtained by using regression techniques to ap-

proximate the sorted frequencies of a collection of signals. The NMFCCs were used

in lieu of the MFCCs in an existing algorithm which was designed to discriminate

between the A,L,Q, and P keys using the sounds produced by striking them. The

results obtained through the use of MFCCs in this algorithm were used as a baseline

to assess the performance of the NMFCCs. With the substitution of NMFCCs for

MFCCs into the algorithm, the accuracy increased by approximately 30.51%.

Each problem has a different family of characteristic frequencies. The imposed

intensity function can either obfuscate or illuminate them. While these results do not

prove that objective intensity functions are superior to the traditional MFCCs, they

provide an example where the NMFCCs reveal more important characteristics than

the MFCCs.

ii

CHANGING THE MFCC INTENSITY FUNCTION

A thesis presented to the Faculty

of the University at Albany, State University of New York

in partial fulfillment of the requirements

for the degree of

Master of Arts

College of Arts & Sciences

Department of Mathematics & Statistics

Nicholas W. Jarrett

2010

ACKNOWLEDGEMENTS

“The world we have created is a product of our thinking;

it cannot be changed without changing our thinking.”

- Albert Einstein

Thank you to all of the inspirational people who have opened my understanding

of the world.

My thesis advisor, Dr. Carlos Rodriguez truly redefined my perception of statis-

tics. It’s my life’s passion, and his enthusiasm for Bayesian statistics has truly inspired

me. The best thing that’s happened to me here at the University at Albany, was be-

ing lucky enough to be in his classroom.

Dr. Karin Reinhold is my graduate advisor. Professor Reinhold has always been

someone I could go to with my troubles from choosing the classes which will most

benefit my subject area to just needing someone to talk to. She is truly an amazing

individual.

I have given several presentations for Dr. Martin Hildebrand’s master’s seminar.

His advice has helped me to come to where I am now, particularly in my abilities as

an instructor. He taught me something invaluable which I know no textbook, only

his life experience, could have taught me.

I would especially like to thank my good friend, Matthew Lester. We spent

countless nights working on problems together with little to no sleep. A lot of this

thesis would be radically different without his input.

iv

TABLE OF CONTENTS

ABSTRACT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

ACKNOWLEDGEMENTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

TABLE OF CONTENTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

CHAPTER

1 INTRODUCTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 ORIGIN OF THE MFCCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3 SHORT-TIME FOURIER TRANSFORMS. . . . . . . . . . . . . . . . . . . . . . . . . 5

4 CONSTRUCTION OF THE MFCCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6

5 AUDITORY FEATURE EXTRACTION ON A COMPUTER. . . .11

6 MFCCS IN ACTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

7 THE FOUR KEY CLASSIFICATION PROBLEM. . . . . . . . . . . . . . . . . .14

8 STRUCTURE IMPOSED BY MFCCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18

9 CONSTRUCTION OF THE NMFCCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20

10 PERFORMANCE & CONCLUSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

11 TUNER MFCC CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30

v

1 Introduction

Mel Frequency Cepstral Coefficients (MFCCs) are used in many different areas

of signal processing. They are very similar in principle with the human ear, and are

especially good for speech recognition. MFCCs are defined in part by the intensity

function: 1125 · ln(1 + f

700

), where f denotes frequency. This thesis investigates the

results that can be obtained by using a more objective intensity function to generate

coefficients. These coefficients will be referred to as NMFCCs.

The new intensity function was obtained by using regression techniques to ap-

proximate the sorted frequencies of a collection of signals. The NMFCCs were used

in lieu of the MFCCs in an existing algorithm which was designed to discriminate

between the A,L,Q, and P keyboard keys using the sounds produced by striking them.

The results obtained through the use of MFCCs in this algorithm were used as a base-

line to assess the performance of the NMFCCs. With the substitution of NMFCCs

for MFCCs into the algorithm, the accuracy increased by approximately 30.51%.

All programming for this thesis was done for the R statistical package. The R

packages e1071, tuneR, and DAAG were used throughout this analysis.

This thesis was typeset with LATEX. Utilized packages and style files include

graphicx, geometry, setspace, and amssymb.

1

2 Origin of the MFCCs

The size of data sets for sound files are very large. In order to be able to analyze

the data, it must first be cut down to a reasonable size. The MFCCs are often used

for speech recognition, and essentially mimic the functionality the human ear. The

physiology of the human ear is used to motivate a solution for the data size problem.

Concepts which are at the heart of MFCCs will be presented in this section in their

relevance for the human ear as well.

One of the challenges in audio and visual analysis is simply to find a method

to deal with the sheer size of the dataset. Humans have been very successful in

performing discrimination for these kinds of problems: both in accuracy and speed.

Consequentially, many techniques directed at understanding such phenomena have

been developed to mimic human physiology.

At first, it may not be clear that characteristic feature extraction from the raw

data is necessary. Humans don’t record a list of amplitudes when we hear a sound,

so we’re not aware of the volume of data encoded in sound. For standard sampling

rates, the text file of raw amplitudes associated with 30 seconds of sound can be

over a gigabyte in size. Not only does our ear filter the sound significantly, but our

brain passes the information through filters as well. When standard speech is passed

through even simple audio filters, it can become unrecognizable. However, when the

listener is told what is being said before hand, they often become able to hear it. This

is an example of how the brain’s expectations place filters on auditory data. More

than this though, the information arriving at the brain has already been cut down

by the ear. This is supported by both anatomical and physiological reasons as well

as experimental evidence relating to the perceptibility of slight differences in sound

frequencies.

2

The human ear generally consists of three sections: outer, middle, and inner. See

Figure 1. For the purpose at hand, the inner is the most important. Sound waves are

collected in the outer ear and channeled in through the middle ear. They enter the

inner ear via the oval window. The cochlea, located in the inner ear, is essentially the

auditory feature extractor for the human body. Figure 2 gives a frequency-location

breakdown along the basilar membrane. The amplitude of a wave increases as it

moves along the tube until it hits its maximum. At this point it decays rapidly.

Figure 1: The Human Ear [6]

Figure 2: Basilar Membrane [10]

3

We are able to perceive sounds by the use of nerves located throughout the basilar

membrane which sense the point at which rapid decay begins and the intensity at it.

The idea is basically that there are holes in our ability to determine precise location.

Just as a mosquito can land on your arm without you feeling it, we may be unable

to precisely determine the exact location at which a wave begins to decay along the

basilar membrane. As a result, our ability to distinguish between sounds in these

regions may be diminished or even non-existent. This phenomenon is characterized

by considering critical bands, centered at critical frequencies, over which we are unable

to tell one frequency from another.

4

3 Short-Time Fourier Transforms (STFTs)

The MFCCs essentially transform and regroup the frequencies that make up

a sound wave; however, most waves are stored by listing their amplitudes, not by

their frequency decomposition. Therefore, as a prerequisite to the extraction of the

MFCCs, the raw data must be transformed into the frequency domain.

Short-Time Fourier Transforms (STFTs) are used to perform this transforma-

tion. Some of the theory and math behind STFTs will be presented in this section.

Understanding the general form of the STFT will be helpful in understanding the

process by which MFCCs are computed.

Let x[n] be a time-series.

Let w[n] represent the window over which analysis is to be performed.

Assumption: w[n] = 0 on the set {(−∞, 0)⋃

(Nw − 1,∞)}

Then, the STFT at time n can be expressed by:

X(n, ω) =m=∞∑m=−∞

x[m]w[n−m]ejωm

All of the transforms which will be utilized in this thesis are discrete. The reason

for this is discussed later in the section, “Auditory Feature Extraction on a Com-

puter.” The discrete transform is given by sampling X(n, ω) over the unit circle.

X(n, k) = X(n, ω)|ω= 2πNk =

m=∞∑m=−∞

x[m]w[n−m]e−j2πNkm

2πN

in the above equation represents the frequency sampling interval.

5

4 Construction of the MFCCs

Two methods by which MFCCs can be extracted from the signal will be pre-

sented in this section. The first is from the tuneR package for R, as well as [6]. A

slightly more mathematically rigorous approach can be seen in [11], which will be

presented after the traditional approach is given.

The general idea of MFCCs is very similar to the method of human auditory feature

extraction.

The intensity function for the Mel scale is:

M(f) = 1125 · ln(

1 +f

700

)

Figure 3: Mel Intensity & Critical Bands [6]

Sampling uniformly from the Mel scale and mapping down to the axis yields the

critical bands and frequencies. MFCCs attempt to mimic the filtering process of the

ear by combining the energy within critical bands. Naturally, this filtering process

works well for the purpose of speech processing; however, MFCCs are sometimes ap-

plied in more abstract problems.

More rigorously, in the traditional approach to extract MFCCs from a sound

file, the cochlea of the human ear is incorporated into the model by considering a

6

collection of bandpass filters. These filters are designed to ignore all frequencies out-

side of the critical band which they represent. As stated earlier, these critical bands

are centered at critical frequencies. These frequencies are typically chosen such that

the ability to distinguish between each successive critical frequency is some constant

amount.

1. Map the frequencies onto the intensity function: M(f)

2. Let c be a constant.

3. The set of critical frequencies fj must then satisfy M(fj+1)−M(fj) = c ∀j.

The section “tuneR MFCC code” is intended to be supplemental to the following

paragraphs which explain the MFCC extraction process as it is executed in traditional

methods. It can be skipped by the reader especially if they are not familiar with R.

As stated previously, the MFCC extraction process works with the frequency

decomposition of a sound file. STFTs were used to obtain this decomposition. STFTs

are FTs over incremental windows which cover the whole signal.

1. The frequency decomposition of the signal over each of these windows is stored

in [STFTs].

2. The intensity function, desired number of coefficients, and frequency overlapping

are used to define the [Scaling] matrix which takes the name T.windows in

“tuneR MFCC code.”

7

3. [STFTs] and [Scaling] are composed by matrix multiplication to combine the

energy over respective critical bands.

4. The log of the absolute value of this result is negated. This prepares it for

composition with the Inverse Discrete Cosine Transform (IDCT).

− log |[STFTs]x[Scaling]|

5. Finally matrix multiplication on the right by [IDCT] yields the MFCCs.

− log |[STFTs]x[Scaling]| x[IDCT ]

Since the IDCT can be interpreted as an inverse FT. It can be seen that this step

essentially undoes the transformation of the signal into frequencies using FTs from

the beginning of the process.

Figure 4: Mel Decomposition of a Signal [6]

Alternatively, the MFCCs are computed directly on the power spectrum in [11].

As with the traditional method, STFTs must be used first. An equivalent way to

express the STFT is: 12π

∫ −ππ

dw log |X(ejw)| · ejwk.

8

1. Transform the signal into its frequency decomposition:

ck =1

2π

∫ −ππ

dw log |X(ejw)| · ejwk (1)

Where |X(·)| represents the power spectrum.

2. The sequential application of a monotone invertible frequency warping function

g: [−π, π] → [−π, π] and Discrete Cosine Transform (DCT) can be expressed

as follows:

w → w̃ = g(w)

ck =1

2π

∫ −ππ

dw̃ log |X(ejw−1

(w̃))| · ejw̃k

3. By changing the variable of integration and using the derivative of the warping

function, dw̃/dw, the warping is directly incorporated into the cosine transfor-

mation.

ck =1

2π

∫ −ππ

dw log∣∣X(ejw)

∣∣ · ejg(w)k · g′(w) (2)

4. This is then approximated by a discrete sum.

ck ∼=1

N

N−22∑

n=0

{log∣∣∣X(ej

2πnN )∣∣∣ · cos

(g

(2πn

Nk

))· g′(

2πn

N

)}(3)

5. The specific frequency warping we are interested in for MFCCs is given by the

intensity function:

µ(w) = 2595 · log

(1 +

wfs2π · 700Hz

)(4)

Where fs denotes sampling frequency.

6. The Mel-Warping function must meet the criteria µ̃(π) = π for integration into

9

the cosine transformation.

µ̃(w) =π

µ(π)· µ(w) = d · log

(1 +

wfs2π · 700Hz

)(5)

where

d =π

log(1 + fs

2·700Hz

) (6)

7. By replacing g(·) by µ̃ in (3) the implementation of the MFCC computation

can be completed.

10

5 Auditory Feature Extraction on a Computer

Much like the human ear, computers acquire sound information from their en-

vironment by evaluating changes in air pressure induced by a sound wave. The

mechanical analog to the ear’s oval window is the microphone’s elastic membrane. It

fluctuates back and forth as a sound wave compresses it. By measuring the amount of

compression, a computer is able to obtain a list of amplitudes at discrete times which

represent the sound. The frequency of the compression is in fact identical to the

frequency of the corresponding sound, and the square of the pressure is proportional

to the sound’s intensity. Thus a sound wave can be completely characterized simply

by measuring fluctuations in the pressure induced along the membrane.

It is important to note that sound waves are continuous, however we are only

able to measure them discretely. This can potentially lead to some problems. As

stated previously, the FT of the data is typically interpreted as yielding its frequency

components, but the discrete nature by which we sample the data can actually obscure

the true frequencies. The signal which is reconstructed from these samples need not

necessarily be identical to the original wave. This behavior is referred to as aliasing.

Aliasing also happens in visual media. A typical example is the apparent counter

clockwise motion of cart-wheels in old cowboy movies. For an illustration of this

effect for waves, see figure 5.

Figure 5: Aliasing [6]

11

In fact, there are an infinite number of possible sine curves that can generate

the same regularly sampled data points. There are several ways in which aliasing can

be reduced. Clearly, a larger sampling rate reduces aliasing, since the reconstructed

function has to agree with the true function at the end points of the sampled intervals.

Additionally, the algorithm by which the frequencies are reconstructed can influence

how profound the effects of aliasing are. This issue was not investigated in great

detail for this thesis; however, keyboard strokes are hardly ordinary signals. It would

be interesting to see if improving the characteristic frequencies we attribute to each

signal by addressing the issue of aliasing could reveal more unique features for different

keys.

12

6 MFCCs in Action

MFCCs have been widely applied to problems everywhere from speech recog-

nition to analyzing mechanically produced signals. They are often used to address

the problem of music genre classification. They were shown to be highly successful

for this problem in [4]. MFCCs have been experimentally shown to work well for

instrument recognition, artist recognition, and music information retrieval [5].

MFCCs are also used for problems with more abstract sounds. They were suggested

as the best method of characteristic feature extraction for discrimination of computer

keyboard emanations by [1] and [2]. They were also used in [3], to discriminate

keyboard key strokes with SVMs. For the A, L, Q, and P keys, [3] was able to attain

an accuracy of 65.56% using MFCCs. The implications of using MFCCs are explored

in reference to this type of problem, and NMFCCs are developed in an attempt to

extract features which are more useful for discrimination. The results from [3] will be

used in the remainder of this thesis as a baseline which NMFCCs will be compared to.

Results must be interpreted in the context of the algorithm used to obtain them.

The following section describes the approach used in [3] for the four key classification

problem. It is not necessary for the theory of MFCCs and NMFCCs, and can be tem-

porarily skipped, if the reader wishes to continue developing them from a theoretical

stand point.

13

7 The Four Key Classification Problem

The training data set consists of 50 strokes of each of the A, L, Q, and P keys.

The key stroke signals are first isolated in the sound file, and then two sets of features

are extracted from them: autocorrelation coefficients and MFCCs. For each key,

65 autocorrelation coefficients are taken with lags ranging from 2 to 66. This is

augmented by a random number of MFCCs determined by 7 tuning parameters.

This matrix of features will be referred to as Xtraining. Furthermore, the true labels

for the training set will be denoted Ytraining. The SVM then learns the classification

rule for {Xtraining, Ytraining}. It finds this rule subject to the following constraints:

1. SVM-Type: C-classification

2. SVM-Kernel: radial

The discrimination rule is used on the features of the test data, Xtest, to generate

predicted labels Ypredicted. The accuracy of the discrimination rule is then evaluated

by comparing the true and predicted labels, Ytest, and Ypredicted. More explicitly, Ytest

is the following 90x1 matrix of labels:

> ytest =c("L", "A", "A", "L", "A", "L", "A", "A", "A", "L",

+ "L", "L", "L", "L", "L", "A", "A", "A", "A",

+ "A", "L", "A", "A", "L", "A", "A", "A", "L",

+ "A", "L", "L", "A", "A", "A", "A", "L", "A",

+ "L", "L", "A", "L", "A", "L", "A", "L","P","Q","Q","Q",

+ "Q","Q","Q","Q","P","P","Q","Q","Q",

+ "Q","Q","Q","P","P","P","P","P","P","Q","Q","Q",

+ "Q","Q","Q","P","Q","Q","Q","Q","P","Q","P","Q",

+ "P","Q","P","P","P","P","P","Q")

14

The discrimination was done with three binary decisions:

1. A against not A.

2. L against not L in what was not classified as an A in the first step.

3. Q against not Q in what was not classified as an L in the second step.

4. P is classified as whatever is not classified as Q in the third step.

Each step of the binary classification behaves as described above. More precisely,

each binary classification step goes through the below process:

A against not A:

1. Let 0 denote “not A.”

2. Define a new set of labels for the training data, YAtraining which contains only

A’s and 0’s.

3. Train the SVM on {Xtraining, YAtraining}.

4. Compare the learned decision rule with YAtest which contains only A’s and 0’s.

5. Remove everything classified as an A in this binary decision step.

6. Move on to the L against not L classification.

The following output is for the A against 0 binary classification step with the initial

values of the parameters.

15

Predicted Labels:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

0 A A A A A A A A A 0 0 A 0 0 A A

18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

A A A A A A 0 A A A 0 A 0 A A A A

35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51

A 0 A A A A A A 0 A A A A A A A 0

52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68

0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 0 0

69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85

0 0 0 A 0 0 0 0 0 0 0 0 0 0 0 0 0

86 87 88 89 90

0 0 0 0 0

True Labels:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

0 A A 0 A 0 A A A 0 0 0 0 0 0 A A

18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

A A A 0 A A 0 A A A 0 A 0 0 A A A

35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51

A 0 A 0 0 A 0 A 0 A 0 0 0 0 0 0 0

52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

86 87 88 89 90

0 0 0 0 0

16

Summary of results:

true

predicted 0 A

0 48 0

A 17 25

The algorithm guessed 25 A’s and 48 0’s correctly, while it incorrectly classified 17

0’s as A’s.

The total accuracy of the algorithm is defined as the percentage of total keys

correctly predicted. As stated previously, the highest number reached using MFCCs

was 65.56%.

17

8 Structure Imposed by MFCCs

Humans are not very good at distinguishing keyboard strokes. The whole point

of MFCCs is to emulate the human auditory filtration process to extract the key

features of the data. This only truly makes sense if humans have been shown to be

able to discriminate well for the data set in question. i.e. if the features extracted

by humans have been shown to work well for discrimination. MFCCs may be very

appropriate for problems involving speech recognition, but it’s not immediately clear

if they are the best choice for the keyboard classification problem.

On a more fundamental level, the steepness of the intensity function of the Mel

scale determines the shape of the bands of frequencies whose energies are summed to-

gether. It can thus be seen that this intensity function is actually imposing a structure

on the characteristic features you can observe. The intensity function rises sharply

over low frequencies and then slowly flattens out. This results in thin critical bands

at low frequencies and wide bands at high frequencies.

Figure 6: Mel Scale and Incompatible True Characteristics [6]

Suppose that the true characteristic features of your data which allow discrim-

ination are not concentrated at low frequencies. An algorithm, which combines fre-

quencies in critical bands as defined by the intensity function above, might group

them together. See Figure 6 for an illustration. The characteristic features that are

useful for discrimination are labeled by x’s. Only the joint effects of characteristic

18

frequencies belonging to the same band are maintained. An intensity function which

is steep for high frequencies and relatively flat for low frequencies, would yield critical

bands which are more compatible with the data in Figure 6. Using the same number

of critical bands, it would be possible for such a function to maintain the individual

effects of more of these features.

This is just one example of a possible distribution of features. The idea behind

NMFCCs is that it is better to allow the data to tell you where its characteristic

features are, rather than presuming their density over the frequency domain a priori.

19

9 Construction of the NMFCCs

NMFCCs are for the most part identical to MFCCs. The key difference is that

their intensity function is modeled to fit the types of sounds which are being ana-

lyzed, as opposed to being based on the principles of human physiology. As with the

traditional MFCCs, the derivation of the NMFCCs begins with the STFT of the data.

This maps the data from the time-amplitude domain into the time-frequency domain.

The goal is to obtain a new intensity function that will represent the structure of the

characteristic features better. To do this, the frequencies that were obtained from

the Fourier Transform (FT) were sorted from least to greatest. This removes time

ordering sensitive data from the intensity function.

Recall: The STFT’s are composed with the ordered scale thus summing along

respective critical bands and recovering at least some of the time ordering.

[STFTs]x[Scaling]

Since the frequencies which make up a key stroke may not be the same each time

it is pressed, the frequency decompositions for many keys were combined to get a bet-

ter model for an average key hit. As a result, the size of the representation obtained

for the new intensity function was large. For the problem of key discrimination, it

was characterized by approximately 90,000 entries. In order to construct the new

[Scaling] matrix, new frequencies, not necessarily in the original 90,000 characteris-

tics need to be mapped by the new intensity function. To achieve this, a parametric

model for the intensity function was created.

First the frequency decomposition for a collection of keys was stored in F. The

data was sorted from least to greatest among each STFT window, and then from

least to greatest for each position over all windows. The result was that the columns

20

contain the nth smallest entry for each window sorted from least to greatest.

Figure 7: 1st smallest Figure 8: 25th smallest

Figure 9: 50th smallest

These are initially compared with exponentials centered about the mean of domain.

The residuals from this regression are used to help define the next set of parameters

which will be introduced. In greater detail:

1. Define T to be a sequence starting from 1.

2. Find the optimal β0 for frequency = β0·e0.01(T−µ) subject to standard regression

assumptions.

21

Figure 10: Residuals for 1st Figure 11: Residuals for 25th

Figure 12: Residuals for 50th

The data points after the minimum residual occurs are not considered. It is possible

that there could be important information in these few final points, but they do not

follow the same form as the rest of the data. Instead of adding more parameters into

the model to explain them sufficiently, they were thrown out. Exploring the amount

of information which is contained in these points is an area which further research

could investigate.

The locations of the maximum and minimum residuals from the above regressions

were used to define the cutoffs for a spline model which is exponential over the first

region and the sum of an exponential and a line over the second region. Some examples

of the residuals after the separation of the exponential and introduction of the linear

22

component are given below:

Figure 13: Residuals for 1st Figure 14: Residuals for 25th

Figure 15: Residuals for 50th

23

The fully sorted data is shown in the following graph. The information gained over

the aforementioned regressions will be used to help create a parametric model.

Figure 16: Combined Sorted Frequencies

These regressions were applied on a larger scale to the combined data set. The

information gained about the residuals was used to search more efficiently for the

optimal cutoff points for the spline. The functions defined on this region are also

dependent on the indices at which the maximum and minimum residuals were found

in the preliminary steps.

Summary of the Regression applied to the Combined Sorted Data

Call:

lm(formula = AllscfL[1:K2] ~ y2f + ys2f + linf)

Residuals:

Min 1Q Median 3Q Max

-87988.3 -531.9 107.1 792.7 44758.7

24

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.488e+02 1.267e+01 59.11 <2e-16 ***

y2f 7.152e+02 7.101e-01 1007.16 <2e-16 ***

ys2f 1.586e-183 0.000e+00 Inf <2e-16 ***

linf -7.222e+02 2.454e+00 -294.26 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 3267 on 86773 degrees of freedom

Multiple R-squared: 0.9473, Adjusted R-squared: 0.9473

F-statistic: 5.199e+05 on 3 and 86773 DF, p-value: < 2.2e-16

This will serve as the NMFCC intensity function for the four key classification prob-

lem.

25

10 Performance & Conclusion

The NMFCCs were used in [3]’s algorithm in lieu of the MFCCs. Without

increasing the number of coefficients, the accuracy was improved by approximately

30.51% from 65.56% to 85.56%. The output from R below is for the initial values of

the parameters using the new NMFCCs.

Results from the First Binary Classification:

true

predicted 0 A

0 58 2

A 7 23

Results from the Second Binary Classification:

true

predicted 0 L

0 40 0

L 0 20

Results from the Third Binary Classification:

true

predicted 0 Q

0 12 0

Q 4 24

26

Results from classifying P as what remains:

true

predicted 0 P

0 0 0

P 2 10

The algorithm guesses correctly for 23 + 20 + 24 + 10 = 77 keys out of a total of 90

in the test set.

There are many aspects of NMFCCs which remain to be investigated. The ap-

proximation for the intensity function which was used still had very large errors in

some spots, but performed well with classification nevertheless. NMFCCs are con-

structed by the prevalence of certain frequencies in a signal, but this may not be truly

representative of the characteristic frequencies which allow for discrimination. Inten-

sity functions which are defined by the variability in the prevalence of frequencies in

certain areas may perform better.

In this case, it is clear that objective intensity functions yield better features

for discrimination. For speech recognition problems, MFCCs are logically very good

features to use; however, to make these results more general, future research could

look into the classes of problems for which NMFCCs and objective frequency functions

provide significant improvement of extracted features.

27

11 tuneR MFCC code

The following code is taken from the MFCC function in the tuneR package for R. It

is supplemental to the traditional approach presented in the section “Construction of

the MFCCs.”

> MFCC

function (object, a = 0.1, HW.width = 0.025, HW.overlapping = 0.25,

T.number = 24, T.overlapping = 0.5, K = 12)

{

if (!is(object, "Wave"))

stop("’object’ needs to be of class ’Wave’")

validObject(object)

if (object@stereo)

stop("Stereo processing not yet implemented...")

S <- object@left

sampling.rate <- [email protected]

S.length <- length(S)

S.fil <- filter(S, filter = c(1, -a))

S.fil <- S.fil[-S.length]

HW.length <- sampling.rate * HW.width

HW.shift <- HW.length * (1 - HW.overlapping)

coefficients <- floor(S.length/2) + 1

STDFT <- stftMFCC(S.fil, win = HW.length, inc = HW.shift,

coef = coefficients, wtype = hamming.window)

HW.number <- dim(STDFT$values)[1]

time <- (STDFT$center - 1)/(sampling.rate - 1)

frequency <- (0:(coefficients - 1))/(2 * coefficients) *

sampling.rate

28

Mel <- 2595 * log10(1 + frequency/700)

T.length <- (max(Mel) - min(Mel))/(T.number * (1 - T.overlapping) +

T.overlapping)

T.centers <- min(Mel) + 0.5 * T.length + (0:(T.number - 1)) *

(1 - T.overlapping) * T.length

T.windows <- matrix(0, coefficients, T.number)

for (m in 1:T.number) {

T.windows[, m] <- 1 - (abs(Mel - T.centers[m]))/(0.5 *

T.length)

T.windows[T.windows[, m] < 0, m] <- 0

}

IDCT <- sapply(1:K, function(x) cos(x * (1:T.number - 0.5) *

pi/T.number))

X <- log(abs(STDFT$values %*% T.windows)) %*% IDCT

X <- cbind(time, X)

return(X)

}

<environment: namespace:tuneR>

29

References

[1] Agrawal, Rakesh, and Asonov, Dmitri, “Keyboard Acoustic Emanations,” IBM

Almaden Research Center, pp. 1-9, 2004.

[2] Meadows, Catherine, and Paul Syverson,“Keyboard Acoustic Emanations

Revisited,” Proceedings of the 12th ACM Conference on Computer and

Communications Security pp.373-382, New York: ACM, Nov. 2005.

[3] Lester, Matthew D. “Sound Classification” Thesis. University At Albany, State

University of New York, 2010. Print.

[4] Jensen, Jesper Højvang, and Christensen, Mads Græsbøll, “Evaluation Of

MFCC Estimation Techniques For Music Similarity,”

“http://kom.aau.dk/̃ jhj/files/jensen06mfcc.pdf.”

[5] Jensen, Jesper Højvang, and Christensen, Mads Græsbøll, “Quantitative

Analysis of a Common Audio Similarity Measure,”

“http://www.ee.columbia.edu/̃ dpwe/pubs/JensCEJ09-quantmfcc.pdf.”

[6] Camastra, Francesco, and Alessandro, Vinciarelli, “Machine Learning for Audio,

Image and Video Analysis: Theory and Applications,” London: Springer, 2008.

Print.

[7] “Fourier Transform,” Wikipedia, the Free Encyclopedia,

“http://en.wikipedia.org/wiki/Fourier transform.”

[8] “Discrete Cosine Transform,” Wikipedia, the Free Encyclopedia,

“http://en.wikipedia.org/wiki/Discrete cosine transform.”

[9] “Mel-frequency cepstrum,” Wikipedia, the Free Encyclopedia,

“http://en.wikipedia.org/wiki/Mel-frequency cepstral coefficient.”

[10] Encyclopaedia Britannica. Chicago: Encyclopaedia Britannica, 1997. Print.

30

[11] Molau, Shirko and Pitz, Michael, “Computing Mel-Frequency Cepstral

Coefficients On The Power Spectrum,”

“http://acccn.net/cr569/Rstuff/keys/mfccs.pdf.”

[12] T. F. Quatieri, “Discrete-Time Speech Signal Processing: Principles and

Practice,” PHI, 2002. Print.

31

changing the mfcc intensity functionnwj2/masters.pdf · the section \tuner mfcc code" is...

Documents