STATISTICAL TOOLS FOR AUDIO PROCESSING
Signal Image (ECN), Mathieu Lagrange
Some material taken from Dan Ellis's courses
Machine Learning
• Machine Learning deals with sub-problems in engineering and the sciences rather than the global "intelligence" issue!
• Applied
• A set of well-defined approaches, each with its own limits, that can be applied to a problem set
• Classification / Pattern Recognition / Sequential Reasoning / Induction / Parameter Estimation, etc.
Machine Learning
• Provides tools and reasoning for the design process of a given problem
• Is an empirical science
• Has a profound theoretical background
• Is extremely diverse
• Keep in mind:
• Algorithms SHALL NOT be applied blindly to your data/problem set!
• The MATLAB Toolbox syndrome: examine the hypotheses and limitations of each approach before hitting enter!
• Do not forget your own intelligence!
Sample Example (I)
• Communication theory:
• Question: What should an optimal decoder do to recover Y from X?
• X is usually referred to as the observation and is a random variable.
• In most problems, the real state of the world (y) is not observable to us, so we try to infer it from the observation.
Sample Example (I)
• This is a typical Classification problem
• Intuitive solution: threshold on 0.5
• But let's make life more difficult!
Sample Example (I)
• Simple solution 2:
• Try to find an optimal boundary, defined as g(x), that best separates the two.
• Define the decision function as the + or - (signed) distance from this boundary.
• We are thus assuming a family of functions g(x) that can discriminate between the classes of X.
(Figure: decision boundary g(x))
Sample Example (I)
• In the real world, things are not as simple
• Consider the following 2-dimensional problem
1. To what extent does our solution generalize to new data?
• The central aim of designing a classifier is to correctly classify novel input!
2. How do we know when we have collected an adequately large and representative set of examples for training?
3. How do we trade off model complexity against performance?
Sample Example (II)
• This is a typical Regression problem
• Polynomial Curve Fitting
Sample Example (II)
• Polynomial Curve Fitting
• Sum-of-squares error function
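The error function itself did not survive the transcript; in the standard presentation of this example (Bishop, 2006, listed in the Resources), the error for a polynomial y(x, w) fit to targets t_n is:

```latex
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big\{ y(x_n, \mathbf{w}) - t_n \big\}^2,
\qquad
y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j
```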
Sample Example (II)
• Polynomial Curve Fitting
• Fits with 0th-, 1st-, 3rd-, and 9th-order polynomials (figures)
Sample Example (II)
• Polynomial Curve Fitting
• Over-fitting
• Root-Mean-Square (RMS) error:
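The definition is missing from the transcript; the standard form used with this example (Bishop, 2006) normalizes the error of the fitted coefficients w* by the number of data points:

```latex
E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{*}) / N}
```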
Sample Example (II)
• Polynomial Curve Fitting
• Over-fitting and regularization
• Effect of data set size (9th-order polynomial)
Sample Example (II)
• Polynomial Curve Fitting
• Regularization: penalize large coefficient values
• 9th-order polynomial with regularization (figure)
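The penalized error function is also missing from the transcript; the usual form adds an L2 penalty on the coefficients to the sum-of-squares error:

```latex
\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big\{ y(x_n, \mathbf{w}) - t_n \big\}^2
+ \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2
```

A minimal numpy sketch of this fit (function and variable names are ours, not from the slides; the toy data mimics the noisy-sinusoid example):

```python
import numpy as np

def fit_poly_ridge(x, t, order, lam):
    """Least-squares polynomial fit with an L2 penalty on the coefficients."""
    Phi = np.vander(x, order + 1, increasing=True)  # column j holds x**j
    # Closed-form solution of the penalized normal equations:
    # (Phi^T Phi + lam * I) w = Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(order + 1), Phi.T @ t)

# Toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

w_unreg = fit_poly_ridge(x, t, order=9, lam=0.0)   # over-fits the 10 points
w_reg   = fit_poly_ridge(x, t, order=9, lam=1e-3)  # penalty tames coefficients
```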
TRAIN MACHINES
• Interaction between:
• the machine
• the designer

The machine
• Pattern recognition in action
• Examples: Speaker Detection, Music genre classification, and many more
The designer
• Pattern recognition design cycle
• Examples: Speaker Detection, Music genre classification, and many more
Feature Extraction
• The right features are critical
• Invariance under irrelevant modifications
• Theoretically equivalent features may act very differently in a particular classifier
• Representations make important aspects explicit
• Remove irrelevant information
• Feature design incorporates 'domain knowledge'
• although more data -> less need for 'cleverness'
• A smaller feature space (fewer dimensions) means:
• simpler models (fewer parameters)
• less training data needed
• faster training
The right features for audio?
• Completely depends on the task at hand:
• speaker recognition
• musical genre detection
• Most common perceptual dimensions:
• Loudness (amplitude)
• Pitch (frequency)
• Timbre (spectral envelope)
What is important for humans?
Frequency Decomposition
• A great idea that can be implemented in various ways:
• mechanically
• with analog circuits
• numerically (fortunately)
Discrete Fourier Transform (DFT)
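The transform is worth writing out; for a length-N signal x[n], the DFT is:

```latex
X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}, \qquad k = 0, \dots, N-1
```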
Short-Time Fourier Transform (STFT)
• Want to localize energy in time and frequency:
• break the sound into short-time pieces
• calculate the DFT of each one (a sketch follows below)
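A minimal numpy sketch of this recipe (the frame length, hop size, and window choice are illustrative, not from the slides):

```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Cut the signal into short windowed pieces and take the DFT of each."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # rows: frequency bins, cols: frames

# The spectrogram of the next slide is the squared magnitude:
# S = np.abs(stft(signal)) ** 2
```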
The Spectrogram
Focus on the spectral envelope
MFCCs?
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform (DCT) of the list of mel log powers.
5. The MFCCs are the amplitudes of the resulting spectrum.
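A minimal numpy/scipy sketch of these five steps for a single frame; the filter count, coefficient count, and filterbank construction are illustrative choices, and a library implementation such as librosa.feature.mfcc differs in details:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs of one windowed frame, following the five steps above."""
    # 1. Fourier transform of the windowed frame -> power spectrum
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    # 2. Triangular filters with centers equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)  # rising edge
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)  # falling edge
    # 3. Log of the power in each mel band
    log_energies = np.log(fbank @ spectrum + 1e-10)
    # 4-5. DCT of the mel log powers; keep the first few amplitudes
    return dct(log_energies, norm='ortho')[:n_coeffs]
```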
MFCCs Rules?
Example
• Audio
Potentials of the DCT step
• Pols observed that the main components capture most of the variance using a few smooth basis functions, smoothing away the pitch ripples
• Principal components of vowel spectra on a warped frequency scale aren't so far from the cosine basis functions
• Decorrelates the features
• This is important because the MFCCs are, in most cases, modelled by Gaussians with diagonal covariance matrices
Classification
• Given some data x and some classes C_i, the optimal classifier picks the class with the maximum posterior probability: ĉ = argmax_i P(C_i | x)
• Can model the data distribution directly:
• nearest neighbor, SVMs, AdaBoost, neural nets
• leads to a discriminative model
• Can consider the data likelihood:
• thanks to Bayes' rule
• leads to a generative model
Basics on random variables
• Random variables have joint distributions p(x, y)
• The marginal distribution of y is obtained by integrating (or summing) the joint over x: p(y) = ∫ p(x, y) dx
• Knowing one value in a joint distribution constrains the remainder
Bayes Rule
• Bayes' rule is powerful
• For generative models, it boils down to the decision rule below
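The formulas were lost in the transcript; Bayes' rule and the generative decision rule it yields are:

```latex
P(C_i \mid x) = \frac{p(x \mid C_i) \, P(C_i)}{p(x)},
\qquad
\hat{c} = \arg\max_i \; p(x \mid C_i) \, P(C_i)
```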
Gaussian models
• The easiest way to model distributions is via a parametric model
• Assume a known form, estimate a few parameters
• The Gaussian model is simple and useful
• Parameters to fit:
• Mean
• Variance
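The density itself is missing from the transcript; the one-dimensional Gaussian is:

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```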
In d dimensions
• Described by:
• a d-dimensional mean
• a d×d covariance matrix
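For completeness, the d-dimensional density these two parameters define is:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
\frac{1}{(2\pi)^{d/2} \, \lvert \boldsymbol{\Sigma} \rvert^{1/2}}
\exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top}
\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```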
Gaussian mixture models
• Single Gaussians cannot model:
• distributions with multiple modes
• distributions with nonlinear correlations
• What about a weighted sum? (see the mixture formula below)
• Can fit anything given enough components
• Interpretation: each observation is generated by one of the Gaussians, chosen with probability π_k
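The mixture formula is missing from the transcript; the weighted sum in question, with mixing weights π_k, is:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```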
Gaussian mixtures
• Can approximate nonlinear correlations
• Problem: estimate the parameters of the model
• Easy if we knew which Gaussian generated each x
Expectation-Maximisation (EM)
• General procedure for estimating model parameters when some are unknown
• e.g., which GMM component generated a point
• Iteratively update the model parameters to maximize Q, the expected log-probability of the observed data x and hidden data z (written out below)
• E step: compute the posterior distribution of the hidden data under the current model
• M step: find the model parameters that maximize Q
• One can prove that the likelihood is non-decreasing
• hence a maximum-likelihood model
• local optimum -> depends on initialization
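The transcript drops the definition of Q; in the standard formulation (e.g. Bishop, 2006), it is the expected complete-data log-likelihood under the current posterior over the hidden data:

```latex
Q(\theta, \theta^{\text{old}}) = \sum_{z} p(z \mid x, \theta^{\text{old}}) \, \ln p(x, z \mid \theta),
\qquad
\theta^{\text{new}} = \arg\max_{\theta} \, Q(\theta, \theta^{\text{old}})
```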
Fitting GMMs with EM
• Want to find:
• the parameters of the Gaussians
• the weights / priors on the Gaussians
• that maximize the likelihood of the training data x
• If one could assign each training sample x to a particular Gaussian, the estimation would be trivial (model fitting)
• Hence, we treat the mixture indices, z, as hidden
• Want to optimize Q of the form given above
• Differentiate w.r.t. the model parameters
• This leads to the update equations of the next slide
Update equations
• Parameters that maximize Q
• Each involves a "soft assignment" of x_n to Gaussian k
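The equations themselves are missing from the transcript; the standard GMM updates (e.g. Bishop, 2006) are, with γ_nk the "soft assignment" of x_n to Gaussian k:

```latex
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
\qquad N_k = \sum_{n=1}^{N} \gamma_{nk}
```

```latex
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n} \gamma_{nk} \, \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n} \gamma_{nk}
(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top},
\qquad
\pi_k = \frac{N_k}{N}
```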
E-M example
• Start, then iterations 1, 2, 3, 4, 5, 6, and 20 (figures from A. Moore's tutorial)
Density Estimation (figure from Wikipedia)
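A minimal sketch of GMM density estimation, assuming scikit-learn is available (class and parameter names are scikit-learn's; the data is a toy bimodal sample):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two modes
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(1.5, 1.0, 700)]).reshape(-1, 1)

# Fit a 2-component GMM with EM (several restarts to dodge local optima)
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(x)

# score_samples returns log p(x); exponentiate for the density estimate
grid = np.linspace(-5, 5, 200).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))
```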
What about K-means, then?
• A special case of EM for GMMs, where:
• the membership assignment is thresholded (hard assignment)
• the Gaussians are fully described by their means
K-means (iteration figures)
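A minimal numpy sketch of K-means as "hard EM" (function names are ours): the E step thresholds the soft assignment to the nearest mean, and the M step refits the means only.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """K-means on data x of shape (N, d)."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), k, replace=False)]  # init at k samples
    for _ in range(n_iter):
        # "E step": hard-assign each point to its nearest mean
        dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # "M step": each mean becomes the centroid of its assigned points
        means = np.stack([x[labels == j].mean(axis=0) if np.any(labels == j)
                          else means[j] for j in range(k)])
    return means, labels
```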
Now
• You have:
• features for representing audio in a meaningful way
• the MFCCs are able to compactly describe the spectral envelope
• a tool to learn GMMs from training data (complete data for which you know the memberships)
• Thanks to Bayes' theorem,
• you know that, given an observation, the model under which this observation has the maximum likelihood is the best one to consider.
• You can:
• abstract recorded audio in a meaningful way
• learn models for each class
• given an unlabeled sample, decide which label is the most suitable
• So:
• have some rest and some food
• see you this afternoon for some hands-on practice
Resources
• Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig, Prentice Hall.
• Pattern Classification, R. Duda, P. Hart, D. Stork, Wiley Interscience, 2000.
• Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006.
• Introduction to Machine Learning, Ethem Alpaydin, MIT Press, 2004.
• The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman, Springer Verlag, 2001. Also available online: http://www-stat.stanford.edu/~tibs/ElemStatLearn/