TRANSCRIPT
On Learning with Kernels for Audio Signal Processing: the old and the new
Hachem Kadri
QARMA team - LIF, Aix-Marseille University
GIPSA-Lab 2013
Background - Functional Learning (1/2)
y_i = f(x_i) + ε_i

▸ Supervised learning
  → Data: n training examples {(x_1, y_1), ..., (x_n, y_n)}
  → Goal: learn f

  Predictor ↦ Response         Model
  ℝ^d       ↦ {−1, 1}          Binary Classification
  ℝ^d       ↦ {1, 2, 3, ...}   Multi-class Classification
  ℝ^d       ↦ ℝ                Multiple Regression
H. Kadri, QARMA Learning with kernels: the old and the new 2/1
Background - Functional Learning (2/2)
▸ Minimization problem

  min_{f∈F} Σ_{i=1}^n V(y_i, f(x_i))

  → V: loss function - e.g. square loss: (y_i − f(x_i))²

▸ Overfitting problem

▸ Regularized minimization

  min_{f∈F} Σ_{i=1}^n V(y_i, f(x_i)) + λ Ω(f)

  → Ω: regularization - e.g. L2-norm: Ω(f) = ‖f‖²_F
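Since the slide is schematic, a minimal numerical sketch may help: assuming the linear model class f(x) = ⟨a, x⟩ so that Ω(f) = ‖a‖², the regularized square-loss minimization above is ridge regression and has a closed-form solution. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1          # n examples in R^d, regularization weight lam
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

# Regularized empirical risk: sum_i (y_i - <a, x_i>)^2 + lam * ||a||^2
# Closed-form minimizer: a = (X^T X + lam I)^{-1} X^T y
a = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The regularizer shrinks the solution toward 0 compared to plain least squares
a_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(a) <= np.linalg.norm(a_ls))  # True
```

The shrinkage toward 0 is exactly the mechanism that combats the overfitting problem mentioned on the slide.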
Background - Learning with Kernels (1/2)
y_i = f(x_i) + ε_i ;  y_i ∈ ℝ

▸ Linear model: f(x) = ⟨a, x⟩ + b

▸ Kernels: nonlinear/nonparametric estimation

[Figure: mapping data from the input space to a feature space]

The RKHS associated with a positive definite kernel k gives a desired feature space!
Background - Learning with Kernels (2/2)
2 Perspectives

▸ Feature space
  → nonlinear in input space
  → projecting data into a feature space
  → linear in the feature space
  → kernel trick: ⟨Φ(x_1), Φ(x_2)⟩ = k(x_1, x_2)

[Diagram: feature map Φ : X → F, with f : X → ℝ factoring as g ∘ Φ]

▸ RKHS theory
  → Mercer theorem: integral operator + positive kernel
  → reproducing property: ⟨f, k(x, ·)⟩ = f(x)
  → representer theorem: f(·) = Σ_i α_i k(x_i, ·) ;  α_i ∈ ℝ
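A small sketch of these ingredients working together (positive definite kernel, representer theorem, square loss), assuming scalar inputs on [0, 1]; the Gaussian kernel, its bandwidth, and the data are illustrative choices:

```python
import numpy as np

def k(x1, x2, gamma=10.0):
    """Gaussian (RBF) kernel on scalars -- positive definite."""
    return np.exp(-gamma * (x1 - x2) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

# Representer theorem: f(.) = sum_i alpha_i k(x_i, .)
# Square loss + lam * ||f||^2  =>  linear system (K + lam I) alpha = y
K = k(x[:, None], x[None, :])
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(x)), y)

# Evaluating f uses only kernel values -- the feature map never appears
f = lambda t: k(np.asarray(t)[:, None], x[None, :]) @ alpha
t = np.linspace(0, 1, 5)
print(np.max(np.abs(f(t) - np.sin(2 * np.pi * t))))  # small fit error
```

Note that only the Gram matrix K is ever computed: this is the kernel trick from the slide in action.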
Background - Applications in audio processing
▸ Music segmentation (Davy et al., 2006 - IRCCyN)
▸ Speaker verification (Louradour et al., 2007 - IRIT)
▸ Speaker change detection (Harchaoui et al., 2008 - LTCI)
▸ Sound recognition (Rabaoui et al., 2008 - LAGIS)
▸ Speech inversion (Toutios et al., 2008 - LORIA)
Learning with kernels - Limitations and Challenges
+ Geometric intuition and interpretation
− Choosing the kernel in advance
− Sequential, time-varying characteristics
− Limited to single task / scalar output

▸ "Sophisticated" kernel methods
  • learning kernels → MKL: Multiple Kernel Learning
  • probability distributions → RKHS embedding of distributions
  • geometric/time-varying connection → FDA
  • multi-task/complex outputs → operator-valued kernels
  • Deep Learning - Representation Learning → ? ...
→ Learning with kernels / RKHS embedding
  1950: Aronszajn · 1995-2002: Vapnik, Cortes, Schölkopf, Smola · 2005-...: Gretton, Le Song, Fukumizu

→ Learning [with] kernels (the "with" struck out: learning the kernel itself)
  2004: Lanckriet, Bach · 2006: Sonnenburg · 2007: Rakotomamonjy · 2010: Cortes, Kloft

→ Learning with operator-valued kernels
  1958-1960: Pedrick, Schwartz · 2005: Micchelli & Pontil · 2008: Caponnetto · 2010/2011: Kadri, d'Alché-Buc

→ Learning [with] operator-valued kernels (learning the operator-valued kernel)
  2011: Dinuzzo · 2012: Kadri · 2013: Sindhwani
FDA - Examples
[Figures: four examples of functional data -
 Spectrometric Curves (absorbances vs. wavelengths),
 Speech (amplitude vs. frequencies),
 Handwriting (meters vs. meters),
 Electricity Consumption (differentiated log vs. months)]

Regression · Classification · Time warping · Forecasting

Ramsay and Silverman (2002) - Ferraty and Vieu (2006)
FDA - Functional inputs & functional outputs
y_i = f(x_i) + ε_i

  Predictor ↦ Response    Model
  L²        ↦ L²          Functional Model - Functional Responses

[Figures: (a) Temperature (Deg C vs. Day), (b) Precipitation (mm vs. Day)]

• Operator estimation

  → min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F
Learning from functional responses - Discrete case
[Diagram: learning from multiple response data]

▸ Statistics: multiple output regression
  → C&W procedure (Breiman and Friedman, 1997)

▸ Machine learning: learning vector-valued functions
  → multi-task learning (Micchelli and Pontil, 2005)
Reproducing kernels - From scalar to functional
Scalar-valued                            Function-valued
[Diagram: Φ : X → F, f : X → ℝ]          [Diagram: Φ : X × Y → F, f : X → Y]

• Operator-valued kernels & function-valued RKHS → nonlinear FDA
Outline
• Hilbert space of operators with reproducing kernels
  → Function-valued RKHS
  → Operator-valued kernels

• Operator estimation
  → L²-regularized operator learning algorithm
  → Block operator kernel matrix inversion

• Application to audio and speech processing
  → Speech inversion
  → Environmental sound recognition
Operator-valued kernels - Definition
• (x_i(s), y_i(t))_{i=1}^n ∈ X × Y
• X : Ω_x → ℝ ;  Y : Ω_y → ℝ
• Ω ⊆ ℝ: curve ;  Ω ⊆ ℝ²: image

Definition
K_F(·, ·) : X × X → L(Y)

▸ K_F is Hermitian if K_F(w, z) = K_F(z, w)*,
▸ it is nonnegative on X if for any {(w_i, u_i)}_{i=1,...,r} ∈ X × Y

  Σ_{i,j} ⟨K_F(w_i, w_j) u_i, u_j⟩_Y ≥ 0
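For the separable construction K_F(w, z) = k(w, z) T used later in the talk, the nonnegativity condition can be checked numerically on discretized data: the block Gram matrix is the Kronecker product of the scalar Gram matrix with T, and must be positive semidefinite. A sketch under these assumptions, with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4                       # n inputs, Y discretized on m grid points

# Scalar Gram matrix of a Gaussian kernel -- positive semidefinite
Wpts = rng.normal(size=(n, 2))
G = np.exp(-np.sum((Wpts[:, None] - Wpts[None, :]) ** 2, axis=-1))

# A PSD operator T on the discretized Y (here: a random PSD matrix)
A = rng.normal(size=(m, m))
T = A @ A.T

# Block Gram matrix: block (i, j) is K(w_i, w_j) = G[i, j] * T
K_block = np.kron(G, T)

# Nonnegativity: sum_{i,j} <K(w_i, w_j) u_i, u_j> = u^T K_block u >= 0
eigmin = np.linalg.eigvalsh(K_block).min()
print(eigmin >= -1e-10)  # True: the block kernel matrix is PSD
```

The same check fails (negative eigenvalues appear) if either G or T is not positive semidefinite, which is a convenient sanity test when designing new operator-valued kernels.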
Operator-valued kernels - Function-valued RKHS
• Extending real/vector-valued RKHS theory to FDA (Kadri et al., AISTATS 2010)
• RKHS of function-valued functions

Definition
A Hilbert space F = {f : X → Y} is called a reproducing kernel Hilbert space if there is an operator-valued kernel K_F such that:

▸ h : z ↦ K_F(w, z)g  ⟹  h ∈ F ,  ∀w ∈ X and g ∈ Y
▸ ∀f ∈ F ,  ⟨f, K_F(w, ·)g⟩_F = ⟨f(w), g⟩_Y  (reproducing property)
Operator-valued kernels - Uniqueness & Bijection
Lemma
F function-valued RKHS  ⟹  K_F(w, z) is unique

▸ Proof:
  ⟨K′(w′, ·)g′, K(w, ·)g⟩_F = ⟨K′(w′, w)g′, g⟩_Y
  ⟨K(w, ·)g, K′(w′, ·)g′⟩_F = ⟨K(w, w′)g, g′⟩_Y
                             = ⟨g, K(w, w′)* g′⟩_Y = ⟨g, K′(w′, w)g′⟩_Y

Theorem
K_F(w, z) nonnegative  ⟺  RKHS F

▸ Proof:
  (⇐)  Σ_{i,j=1}^n ⟨K(w_i, w_j)u_i, u_j⟩_Y = Σ_{i,j=1}^n ⟨K(w_i, ·)u_i, K(w_j, ·)u_j⟩_F ≥ 0
  (⇒)  build F_0 :  ∀f ∈ F_0 ,  f(·) = Σ_{i=1}^n K_F(w_i, ·) α_i
Operator-valued kernels - Construction
• Multi-task kernel  ⟹  K(w, z) = k(w, z) T
  ▸ k: real-valued kernel
  ▸ T: diagonal matrix + low-rank matrix (finite dimension)

• FDA kernel  ⟹  T ∈ L(Y) (infinite dimension)?
  ▸ Concurrent functional linear model
    → y(t) = α(t) + β(t) x(t)
    → multiplication operator
    → varying coefficient model (Hastie and Tibshirani, 1993)
  ▸ Functional linear model for functional responses (Ramsay and Silverman, 2005)
    → y(t) = α(t) + ∫ β(s, t) x(s) ds
    → Hilbert-Schmidt integral operator
Operator-valued kernels - Examples
1. Multiplication operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ k_x(x_1, x_2) T_{k_y} ;  T_h y(t) := h(t) y(t)

2. Hilbert-Schmidt integral operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ k_x(x_1, x_2) T_{k_y} ;  T_h y(t) := ∫ h(s, t) y(s) ds

3. Composition operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ C_ψ(x_1) C*_ψ(x_2) ;  C_φ : f ↦ f ∘ φ
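On a grid {t_1, ..., t_m}, the first two operators become concrete matrices: the multiplication operator T_h is diagonal, while the Hilbert-Schmidt integral operator is a dense matrix scaled by the quadrature step. A discretized sketch; the multipliers h chosen below are illustrative, not the specific kernels from the slide:

```python
import numpy as np

m = 100
t = np.linspace(0, 1, m)
dt = t[1] - t[0]
y = np.sin(2 * np.pi * t)           # a function y in (discretized) L^2([0, 1])

# 1. Multiplication operator T_h y(t) = h(t) y(t): a diagonal matrix
h = np.exp(-t)                      # illustrative multiplier h(t)
T_mult = np.diag(h)

# 2. Hilbert-Schmidt integral operator T_h y(t) = \int h(s, t) y(s) ds:
#    a dense kernel matrix times the quadrature weight dt
H = np.exp(-5.0 * (t[:, None] - t[None, :]) ** 2)   # illustrative h(s, t)
T_int = H * dt

out_mult = T_mult @ y               # pointwise product h(t) y(t)
out_int = T_int @ y                 # smoothed (integrated) version of y

print(np.allclose(out_mult, h * y))  # True: multiplication acts pointwise
```

The diagonal vs. dense structure is exactly the finite-dimensional shadow of the concurrent model y(t) = α(t) + β(t)x(t) vs. the functional linear model y(t) = α(t) + ∫β(s, t)x(s)ds from the previous slide.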
Operator-valued kernels - Feature map (Kadri et al., ICML 2011)

▸ An operator-valued kernel admits a feature map representation
  → ⟨K(x_1, x_2)y_1, y_2⟩_Y = ⟨Φ(x_1, y_1), Φ(x_2, y_2)⟩_{L(X,Y)}
  → ⟨K(x_1, ·)y_1, K(x_2, ·)y_2⟩_F = ⟨K(x_1, x_2)y_1, y_2⟩_Y

▸ Complex/infinite-dimensional inputs
  → multiple functional data: x_i ∈ (L²)^p

▸ FDA viewpoint
  → one observation = one continuous curve

  Real-valued RKHS:      Φ_k : (L²)^p → L((L²)^p, ℝ) ,  x ↦ k(x, ·)       dim: p → 1
  Function-valued RKHS:  Φ_K^y : (L²)^p → L((L²)^p, L²) ,  x ↦ K(x, ·)y   dim: p → ∞
Optimization problem - Representer theorem
Theorem
The solution of the minimization problem

  min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F

is achieved by a function of the form

  f*(·) = Σ_{i=1}^n K_F(x_i, ·) β_i
Optimization problem - Solution
min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F

using the representer theorem & the reproducing property

⟺  min_{β_i∈Y} Σ_{i=1}^n ‖y_i − Σ_{j=1}^n K_F(x_i, x_j) β_j‖²_Y + λ Σ_{i,j} ⟨K_F(x_i, x_j) β_i, β_j⟩_Y

▸ Discretization (Kadri et al., AISTATS 2010)
  → grid {t_1, ..., t_m}  ⟹  β_i(t_1), ..., β_i(t_m)

▸ Approximation (Kadri et al., Tech. Report 2011)
  → Y a real RKHS  ⟹  β_i = Σ_{l=1}^m α_{il} k(t_l, ·)

▸ Analytic solution (Kadri et al., ICML 2011)
  → (K + λI)β = y ;  β ∈ Y^n and K ∈ [L(Y)]^{n×n}
Optimization problem - Block operator kernel matrix inversion

▸ (Block) numerical range: spectral and operator theory
▸ Spectral theory of block operator matrices (C. Tretter, 2008)
▸ K(x_i, x_j) = G(x_i, x_j) T ,  ∀x_i, x_j ∈ X
▸ Kronecker product

  → K = ( G(x_1, x_1)T  ...  G(x_1, x_n)T )
        (      ...      ...       ...     )  = G ⊗ T
        ( G(x_n, x_1)T  ...  G(x_n, x_n)T )

  → K⁻¹ = G⁻¹ ⊗ T⁻¹
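The identity K⁻¹ = G⁻¹ ⊗ T⁻¹ is what makes the block system tractable: one n×n and one m×m inversion instead of a single nm×nm one. A numerical check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3

# Invertible G (scalar Gram matrix + ridge) and T (PSD + ridge), illustrative
X = rng.normal(size=(n, 2))
G = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)) + 0.1 * np.eye(n)
A = rng.normal(size=(m, m))
T = A @ A.T + 0.1 * np.eye(m)

K = np.kron(G, T)                          # block operator kernel matrix
K_inv_direct = np.linalg.inv(K)            # O((nm)^3)
K_inv_kron = np.kron(np.linalg.inv(G), np.linalg.inv(T))  # O(n^3 + m^3)

print(np.allclose(K_inv_direct, K_inv_kron))  # True
```

This relies only on the standard Kronecker identity (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹ for invertible A, B, applied here to the separable kernel K(x_i, x_j) = G(x_i, x_j)T.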
Algorithm 1: L²-Regularized Operator Learning Algorithm

Input
  data x_i ∈ (L²([0, 1]))^p ,  y_i ∈ L²([0, 1]) ,  size n

Eigendecomposition of G = (G(x_i, x_j))_{i,j=1}^n ∈ ℝ^{n×n}
  eigenvalues α_i ∈ ℝ, eigenvectors v_i ∈ ℝ^n, size n

Eigendecomposition of T ∈ L(Y)
  Initialize k: number of eigenfunctions
  eigenvalues δ_i ∈ ℝ, eigenfunctions w_i ∈ L²([0, 1]), size k

Eigendecomposition of K = G ⊗ T
  K = (K(x_i, x_j))_{i,j=1}^n ∈ (L(Y))^{n×n}
  eigenvalues θ_i ∈ ℝ, eigenfunctions z_i ∈ (L²([0, 1]))^n, size n·k
  θ = α ⊗ δ ,  z = v ⊗ w

Solution β = (K + λI)⁻¹ y
  Initialize λ: regularization parameter
  β = Σ_{i=1}^{n·k} (θ_i + λ)⁻¹ Σ_{j=1}^n ⟨z_{ij}, y_j⟩ z_i
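A discretized sketch of Algorithm 1, assuming the responses live on a grid of m points so that T is an m×m matrix and each y_i an m-vector; the symbol names (α, δ, θ, β) follow the slide, the data are illustrative. The eigenpairs of K = G ⊗ T are Kronecker products of those of G and T, so (K + λI)⁻¹y is computed without ever forming the nm×nm matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 6, 20, 0.1                 # n examples, m grid points, ridge lam

# Scalar Gram matrix G over the inputs, PSD operator T on Y (illustrative)
Xf = rng.normal(size=(n, 3))
G = np.exp(-np.sum((Xf[:, None] - Xf[None, :]) ** 2, axis=-1))
B = rng.normal(size=(m, m))
T = B @ B.T / m
Y = rng.normal(size=(n, m))            # functional responses y_i on the grid

# Eigendecompositions: G -> (alpha_i, v_i), T -> (delta_i, w_i)
alpha, V = np.linalg.eigh(G)
delta, W = np.linalg.eigh(T)

# Eigenpairs of K = G (x) T: theta = alpha (x) delta, z = v (x) w, so
# beta = sum_i (theta_i + lam)^{-1} <z_i, y> z_i, done in the eigenbasis:
Yhat = V.T @ Y @ W                     # coordinates of y in the z basis
beta = V @ (Yhat / (np.outer(alpha, delta) + lam)) @ W.T

# Check against the explicit (n*m x n*m) Kronecker solve
beta_ref = np.linalg.solve(np.kron(G, T) + lam * np.eye(n * m),
                           Y.ravel()).reshape(n, m)
print(np.allclose(beta, beta_ref))  # True
```

In the truly functional setting of the slide, the eigendecomposition of T would be truncated to k eigenfunctions; here T is already finite-dimensional, so all m eigenpairs are kept.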
Applications - Speech inversion

[Figure: Acoustic to articulatory inversion. Speech production vs. speech inversion;
 articulators: 1 upper lip, 2 lower lip, 3 jaw, 4 tongue tip, 5 tongue body, 6 velum, 7 glottis;
 speech signal (amplitude vs. time)]

▸ Speech inversion
  → learning the acoustic-to-articulatory mapping
  → from MFCCs to vocal-tract time functions (VTTF)
  → improving speech technology and understanding
  → helping individuals with speech and hearing disorders
Applications - Speech inversion
[Figure: estimated vocal-tract variables (LA, LP, TTCD, TTCL, TBCD, TBCL, VEL, GLO)
 as functions of time for the utterances "beautiful", "conversation", "smooth"]
Applications - Speech inversion
Tab. 2: Average RSSE for the tract variables

  VT variable   ε-SVR    Multi-task   Functional
  LA            2.763    2.341        1.562
  LP            0.532    0.512        0.528
  TTCD          3.345    1.975        1.647
  TTCL          7.752    5.276        3.463
  TBCD          2.155    2.094        1.582
  TBCL          15.083   9.763        7.215
  VEL           0.032    0.034        0.029
  GLO           0.041    0.052        0.064
  Total         3.962    2.755        2.011

▸ ε-SVR (Mitra et al., ICASSP 2009)
▸ Multi-task kernel (Kadri et al., ICASSP 2011)
Applications - Sound recognition
▸ Sound recognition
  → surveillance and security applications
Applications - Sound recognition
▸ Feature extraction
  → temporal, spectral, cepstral, ... characteristics

[Figures: waveform (amplitude vs. time); evolution of the Zero Crossing Rate (ZCR);
 spectral roll-off (SRF); 13 cepstral coefficients (MFCC)]
Applications - Sound recognition

▸ Limitations - multivariate data modeling
  → features contain discrete values of various parameters
  → feature vector ∈ ℝ^{DP} obtained by concatenating samples of the different features

▸ Solution - multivariate functional data modeling
  → modeling each audio signal by a vector of functions in (L²)^D
Applications - Sound recognition
Tab. 3: Classes of sounds and number of samples in the database used for performance evaluation.

  Class            ID   Train   Test   Total   Duration (s)
  Human screams    C1   40      25     65      167
  Gunshots         C2   36      19     55      97
  Glass breaking   C3   48      25     73      123
  Explosions       C4   41      21     62      180
  Door slams       C5   50      25     75      96
  Phone rings      C6   34      17     51      107
  Children voices  C7   58      29     87      140
  Machines         C8   40      20     60      184
  Total                 327     181    508     18 mn 14 s
Applications - Sound recognition
Figure: Structural similarities between two different classes
Applications - Sound recognition
Figure: Structural diversity inside the same sound class and between classes
Applications - Sound recognition
Tab. 4: Confusion matrix obtained when using the Regularized Least Squares Classification (RLSC) algorithm (Rifkin et al., 2003)

       C1    C2    C3     C4    C5     C6    C7     C8
  C1   92    4     4.76   0     5.27   11.3  6.89   0
  C2   0     52    0      14    0      2.7   0      0
  C3   0     20    76.2   0     0      0     17.24  5
  C4   0     16    0      66    0      0     0      0
  C5   4     8     0      4     84.21  0     6.8    0
  C6   4     0     0      0     10.52  86    0      0
  C7   0     0     0      8     0      0     69.07  0
  C8   0     0     19.04  8     0      0     0      95

  Total Recognition Rate = 77.56%
Applications - Sound recognition
Tab. 5: Confusion matrix obtained when using the Functional Regularized Least Squares algorithm

       C1    C2   C3    C4   C5     C6    C7    C8
  C1   100   0    0     2    0      5.3   3.4   0
  C2   0     82   0     8    0      0     0     0
  C3   0     14   90.9  8    0      0     3.4   0
  C4   0     4    0     78   0      0     0     0
  C5   0     0    0     1    89.47  0     6.8   0
  C6   0     0    0     0    10.53  94.7  0     0
  C7   0     0    0     0    0      0     86.4  0
  C8   0     0    9.1   3    0      0     0     100

  Total Recognition Rate = 90.18%
Applications - Beyond Audio Processing

▸ Functional outputs - BCI

[Figures: five recording channels (Ch. 1-5) over time samples; finger movement state; finger movement]

▸ Structured outputs - image, text, graph prediction

[Figure: grid of predicted images]

▸ Tensor outputs - multilinear multitask

  Athlete performance: technical score / artistic score / ... , rated by Jury 1, Jury 2, Jury 3
Conclusion & Perspectives
▸ Conclusion
  → RKHS framework for functional data - nonlinear FDA
  → FDA kernels
  → audio and speech processing applications

▸ Perspectives
  → mixed data (discrete, continuous, ...)
  → learning the operator-valued kernel
  → multilinear representation learning