
On Learning with Kernels for Audio Signal Processing: the old and the new

Hachem Kadri

QARMA team - LIF, Aix-Marseille University

hachem.kadri@lif.univ-mrs.fr

GIPSA-Lab 2013

Background - Functional Learning (1/2)

$y_i = f(x_i) + \varepsilon_i$

• Supervised learning
  → Data: $n$ training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
  → Goal: learn $f$

Predictor → Response:
  $\mathbb{R}^d \to \{-1, 1\}$: binary classification
  $\mathbb{R}^d \to \{1, 2, 3, \ldots\}$: multi-class classification
  $\mathbb{R}^d \to \mathbb{R}$: multiple regression


Background - Functional Learning (2/2)

• Minimization problem

  $\min_{f \in \mathcal{F}} \sum_{i=1}^{n} V\big(y_i, f(x_i)\big)$

  → $V$: loss function, e.g. the square loss $\big(y_i - f(x_i)\big)^2$

• Overfitting problem

• Regularized minimization

  $\min_{f \in \mathcal{F}} \sum_{i=1}^{n} V\big(y_i, f(x_i)\big) + \lambda\,\Omega(f)$

  → $\Omega$: regularization term, e.g. the L2-norm $\Omega(f) = \|f\|_{\mathcal{F}}^2$
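As a minimal numerical sketch of this regularized objective in its simplest setting (square loss, linear predictor, L2 penalty), the ridge closed form can be computed directly; the synthetic data and the value of $\lambda$ below are illustrative, not from the slides.

```python
# Minimal sketch: regularized least squares (square loss + L2 penalty)
# for a linear predictor f(x) = <a, x>; synthetic data, illustrative lambda.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                     # n training inputs in R^d
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1                                       # regularization parameter
# Closed-form minimizer of sum_i (y_i - <a, x_i>)^2 + lam * ||a||^2
a = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("estimated coefficients:", a)
```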


Background - Learning with Kernels (1/2)

$y_i = f(x_i) + \varepsilon_i$ ; $y_i \in \mathbb{R}$

• Linear model: $f(x) = \langle a, x \rangle + b$

• Kernels: nonlinear/nonparametric estimation

  input space → feature space

  The RKHS associated with a positive definite kernel $k$ gives a desired feature space!


Background - Learning with Kernels (2/2)

2 perspectives

• Feature space
  → nonlinear in the input space
  → projecting data into a feature space
  → linear in the feature space
  → kernel trick: $\langle \Phi(x_1), \Phi(x_2) \rangle = k(x_1, x_2)$

[Diagram: $\Phi: \mathcal{X} \to \mathcal{F}_{\mathcal{X}}$, $g: \mathcal{F}_{\mathcal{X}} \to \mathbb{R}$, with $f = g \circ \Phi$]

• RKHS theory
  → Mercer theorem: integral operator + positive kernel
  → reproducing property: $\langle f, k(x, \cdot) \rangle = f(x)$
  → representer theorem: $f(\cdot) = \sum_i \alpha_i\, k(x_i, \cdot)$ ; $\alpha_i \in \mathbb{R}$
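To make the reproducing property and the representer theorem concrete, here is a short kernel ridge regression sketch with a Gaussian kernel, where the learned function has exactly the representer form $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$; the data, $\gamma$, and $\lambda$ are illustrative.

```python
# Sketch of the representer theorem in action: kernel ridge regression.
import numpy as np

def gaussian_kernel(A, B, gamma=10.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
x = rng.uniform(size=(40, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-3
K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # (K + lam*I) alpha = y

x_test = np.linspace(0, 1, 5).reshape(-1, 1)
f_test = gaussian_kernel(x_test, x) @ alpha            # f(x*) = sum_i alpha_i k(x_i, x*)
print(f_test)
```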


Background - Applications in audio processing

• Music segmentation (Davy et al., 2006 - IRCCyN)

• Speaker verification (Louradour et al., 2007 - IRIT)

• Speaker change detection (Harchaoui et al., 2008 - LTCI)

• Sound recognition (Rabaoui et al., 2008 - LAGIS)

• Speech inversion (Toutios et al., 2008 - LORIA)


Learning with kernels - Limitations and Challenges

+ Geometric intuition and interpretation
− Choosing the kernel in advance
− Sequential, time-varying characteristics
− Limited to a single task / scalar outputs

• "Sophisticated" kernel methods
  • learning kernels → MKL: Multiple Kernel Learning
  • probability distributions → RKHS embedding of distributions
  • geometric / time-varying connection → FDA
  • multi-task / complex outputs → operator-valued kernels
  • Deep Learning - Representation Learning → ? ...


→ Learning with kernels / RKHS embedding
  1950: Aronszajn; 1995-2002: Vapnik, Cortes, Schölkopf, Smola; 2005-...: Gretton, Le Song, Fukumizu

→ Learning kernels (MKL)
  2004: Lanckriet, Bach; 2006: Sonnenburg; 2007: Rakotomamonjy; 2010: Cortes, Kloft

→ Learning with operator-valued kernels
  1958-1960: Pedrick, Schwartz; 2005: Micchelli & Pontil; 2008: Caponnetto; 2010/2011: Kadri, d'Alché-Buc

→ Learning operator-valued kernels
  2011: Dinuzzo; 2012: Kadri; 2013: Sindhwani


FDA - Examples

[Figure: four examples of functional data: spectrometric curves (absorbances vs. wavelengths), speech (amplitude vs. frequencies), handwriting (meters vs. meters), and electricity consumption (differentiated log vs. months)]

Tasks: regression, classification, time warping, forecasting

Ramsay and Silverman (2002) - Ferraty and Vieu (2006)


FDA - Functional inputs & functional outputs

$y_i = f(x_i) + \varepsilon_i$

Predictor → Response: $L^2 \to L^2$ (functional model, functional responses)

[Figure: (a) temperature (Deg C) and (b) precipitation (mm), each as a function of the day of the year]

• Operator estimation

  $\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \|y_i - f(x_i)\|_{\mathcal{Y}}^2 + \lambda \|f\|_{\mathcal{F}}^2$


Learning from functional responses - Discrete case

Learning from multiple response data:

• Statistics: multiple output regression; the C&W procedure (Breiman and Friedman, 1997)

• Machine learning: learning vector-valued functions; multi-task learning (Micchelli and Pontil, 2005)


Reproducing kernels - From scalar to functional

Scalar-valued vs. function-valued

[Diagram: scalar case $\Phi: \mathcal{X} \to \mathcal{F}_{\mathcal{X}}$ with $f = g \circ \Phi: \mathcal{X} \to \mathbb{R}$; functional case $\Phi: \mathcal{X} \to \mathcal{F}_{\mathcal{X}\mathcal{Y}}$ with $f = g \circ \Phi: \mathcal{X} \to \mathcal{Y}$]

• Operator-valued kernels & function-valued RKHS → nonlinear FDA


Outline

• Hilbert space of operators with reproducing kernels
  → function-valued RKHS
  → operator-valued kernels

• Operator estimation
  → $L^2$-regularized operator learning algorithm
  → block operator kernel matrix inversion

• Application to audio and speech processing
  → speech inversion
  → environmental sound recognition


Operator-valued kernels - Definition

• $(x_i(s), y_i(t))_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y}$

• $\mathcal{X}: \Omega_x \to \mathbb{R}$ ; $\mathcal{Y}: \Omega_y \to \mathbb{R}$

• $\Omega \subseteq \mathbb{R}$: curve ; $\Omega \subseteq \mathbb{R}^2$: image

Definition. $K_{\mathcal{F}}(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$

• $K_{\mathcal{F}}$ is Hermitian if $K_{\mathcal{F}}(w, z) = K_{\mathcal{F}}(z, w)^*$,

• it is nonnegative on $\mathcal{X}$ if for any $\{(w_i, u_i)\}_{i=1,\ldots,r} \in \mathcal{X} \times \mathcal{Y}$:

  $\sum_{i,j} \langle K_{\mathcal{F}}(w_i, w_j)\,u_i, u_j \rangle_{\mathcal{Y}} \;\ge\; 0$
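As a quick numerical sanity check of this definition, here is a sketch for a separable kernel $K(w, z) = k(w, z)\,T$ with the output space discretized; the sizes, the Gaussian scalar kernel, and the PSD matrix $T$ are illustrative choices, not from the slides.

```python
# Sanity check (a sketch): for K(w, z) = k(w, z) T with k positive definite
# and T positive semidefinite, the block Gram matrix [K(w_i, w_j)] is
# symmetric and the quadratic form in the definition is nonnegative.
import numpy as np

rng = np.random.default_rng(0)
r, m = 5, 4                                   # r inputs; Y discretized on m points
W = rng.normal(size=(r, 2))

k = np.exp(-((W[:, None] - W[None, :]) ** 2).sum(-1))   # scalar Gaussian kernel
B = rng.normal(size=(m, m))
T = B @ B.T                                   # PSD stand-in for the operator T

K_block = np.kron(k, T)                       # (r*m) x (r*m) block kernel matrix
U = rng.normal(size=(r, m))                   # arbitrary u_1, ..., u_r in Y
u = U.ravel()
print("Hermitian:", np.allclose(K_block, K_block.T))
print("sum_ij <K(w_i,w_j)u_i, u_j> =", u @ K_block @ u, "(should be >= 0)")
```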


Operator-valued kernels - Function-valued RKHS

• Extending real/vector-valued RKHS theory to FDA (Kadri et al., AISTATS 2010)

• RKHS of function-valued functions

Definition. A Hilbert space $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$ is called a reproducing kernel Hilbert space if there is an operator-valued kernel $K_{\mathcal{F}}$ such that:

• $h: z \mapsto K_{\mathcal{F}}(w, z)\,g$ belongs to $\mathcal{F}$, $\forall w \in \mathcal{X}$ and $g \in \mathcal{Y}$

• $\forall f \in \mathcal{F}$, $\langle f, K_{\mathcal{F}}(w, \cdot)g \rangle_{\mathcal{F}} = \langle f(w), g \rangle_{\mathcal{Y}}$ (reproducing property)


Operator-valued kernels - Uniqueness & Bijection

Lemma. If $\mathcal{F}$ is a function-valued RKHS, then $K_{\mathcal{F}}(w, z)$ is unique.

• Proof sketch: let $K$ and $K'$ both reproduce $\mathcal{F}$. Then
  $\langle K'(w', \cdot)g', K(w, \cdot)g \rangle_{\mathcal{F}} = \langle K'(w', w)g', g \rangle_{\mathcal{Y}}$
  $\langle K(w, \cdot)g, K'(w', \cdot)g' \rangle_{\mathcal{F}} = \langle K(w, w')g, g' \rangle_{\mathcal{Y}} = \langle g, K(w, w')^* g' \rangle_{\mathcal{Y}} = \langle g, K(w', w)g' \rangle_{\mathcal{Y}}$
  so the two kernels coincide.

Theorem. $K_{\mathcal{F}}(w, z)$ nonnegative ⟺ $\mathcal{F}$ is an RKHS.

• Proof sketch:
  (⟸) $\sum_{i,j=1}^{n} \langle K(w_i, w_j)u_i, u_j \rangle_{\mathcal{Y}} = \sum_{i,j=1}^{n} \langle K(w_i, \cdot)u_i, K(w_j, \cdot)u_j \rangle_{\mathcal{F}} \ge 0$
  (⟹) build $\mathcal{F}_0$ with $\forall f \in \mathcal{F}_0$, $f(\cdot) = \sum_{i=1}^{n} K_{\mathcal{F}}(w_i, \cdot)\,\alpha_i$, and complete it.


Operator-valued kernels - Construction

• Multi-task kernel ⟹ $K(w, z) = k(w, z)\,T$
  • $k$: real-valued kernel
  • $T$: diagonal matrix + low-rank matrix (finite dimension)

• FDA kernel ⟹ $T \in \mathcal{L}(\mathcal{Y})$ (infinite dimension)?
  • Concurrent functional linear model
    → $y(t) = \alpha(t) + \beta(t)\,x(t)$
    → multiplication operator
    → varying coefficient model (Hastie and Tibshirani, 1993)
  • Functional linear model for functional responses (Ramsay and Silverman, 2005)
    → $y(t) = \alpha(t) + \int \beta(s, t)\,x(s)\,ds$
    → Hilbert-Schmidt integral operator


Operator-valued kernels - Examples

1. Multiplication operator
   $K_{\mathcal{F}}: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$, $(x_1, x_2) \mapsto k_x(x_1, x_2)\,T_{k_y}$ ; $(T_h\,y)(t) \triangleq h(t)\,y(t)$

2. Hilbert-Schmidt integral operator
   $K_{\mathcal{F}}: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$, $(x_1, x_2) \mapsto k_x(x_1, x_2)\,T_{k_y}$ ; $(T_h\,y)(t) \triangleq \int h(s, t)\,y(s)\,ds$

3. Composition operator
   $K_{\mathcal{F}}: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$, $(x_1, x_2) \mapsto C_{\psi(x_1)}\,C^*_{\psi(x_2)}$ ; $C_{\varphi}: f \mapsto f \circ \varphi$
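On a discretized grid, the first two operators become plain matrices acting on sampled functions. Here is a small sketch; the grid size and the particular functions $h$ are illustrative choices, not from the slides.

```python
# Sketch: multiplication and integral operators on a grid t_1..t_m.
# (T_h y)(t) = h(t) y(t) is a diagonal matrix; (T_h y)(t) = \int h(s,t) y(s) ds
# is a dense matrix weighted by the quadrature step ds.
import numpy as np

m = 100
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Multiplication operator with the illustrative h(t) = 1 + t^2
T_mult = np.diag(1.0 + t ** 2)

# Integral operator with an illustrative Gaussian h(s, t)
S, Tt = np.meshgrid(t, t, indexing="ij")      # S: s along rows, Tt: t along cols
H = np.exp(-10.0 * (S - Tt) ** 2)             # H[s_idx, t_idx] = h(s, t)
T_int = H.T * dt                              # row t, column s, weighted by ds

y = np.sin(2 * np.pi * t)
print("(T_mult y) first values:", (T_mult @ y)[:3])
print("(T_int  y) first values:", (T_int @ y)[:3])
```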


Operator-valued kernels - Feature map (Kadri et al., ICML 2011)

• An operator-valued kernel admits a feature map representation
  → $\langle K(x_1, x_2)y_1, y_2 \rangle_{\mathcal{Y}} = \langle \Phi(x_1, y_1), \Phi(x_2, y_2) \rangle_{\mathcal{L}(\mathcal{X},\mathcal{Y})}$
  → $\langle K(x_1, \cdot)y_1, K(x_2, \cdot)y_2 \rangle_{\mathcal{F}} = \langle K(x_1, x_2)y_1, y_2 \rangle_{\mathcal{Y}}$

• Complex/infinite-dimensional inputs
  → multiple functional data $x_i \in (L^2)^p$

• FDA viewpoint
  → one observation = one continuous curve

Real-valued RKHS: $\Phi_k : (L^2)^p \to \mathcal{L}((L^2)^p, \mathbb{R})$, $x \mapsto k(x, \cdot)$ ; dim: $p \to 1$

Function-valued RKHS: $\Phi_K^y : (L^2)^p \to \mathcal{L}((L^2)^p, L^2)$, $x \mapsto K(x, \cdot)y$ ; dim: $p \to \infty$


Optimization problem - Representer theorem

Theorem. The solution of the minimization problem

$\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \|y_i - f(x_i)\|_{\mathcal{Y}}^2 + \lambda \|f\|_{\mathcal{F}}^2$

is achieved by a function of the form

$f^*(\cdot) = \sum_{i=1}^{n} K_{\mathcal{F}}(x_i, \cdot)\,\beta_i$


Optimization problem - Solution

$\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \|y_i - f(x_i)\|_{\mathcal{Y}}^2 + \lambda \|f\|_{\mathcal{F}}^2$

using the representer theorem & the reproducing property

⟺ $\min_{\beta_i \in \mathcal{Y}} \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} K_{\mathcal{F}}(x_i, x_j)\beta_j \Big\|_{\mathcal{Y}}^2 + \lambda \sum_{i,j} \langle K_{\mathcal{F}}(x_i, x_j)\beta_i, \beta_j \rangle_{\mathcal{Y}}$

• Discretization (Kadri et al., AISTATS 2010)
  → grid $\{t_1, \ldots, t_m\}$ ⟹ $\beta_i(t_1), \ldots, \beta_i(t_m)$

• Approximation (Kadri et al., Tech. Report 2011)
  → $\mathcal{Y}$ a real RKHS ⟹ $\beta_i = \sum_{l=1}^{m} \alpha_{il}\, k(t_l, \cdot)$

• Analytic solution (Kadri et al., ICML 2011)
  → $(\mathbf{K} + \lambda I)\beta = \mathbf{y}$ ; $\beta \in \mathcal{Y}^n$ and $\mathbf{K} \in [\mathcal{L}(\mathcal{Y})]^{n \times n}$


Optimization problem - Block operator kernel matrix inversion

• (Block) numerical range: spectral and operator theory

• Spectral theory of block operator matrices (C. Tretter, 2008)

• $K(x_i, x_j) = G(x_i, x_j)\,T$, $\forall x_i, x_j \in \mathcal{X}$

• Kronecker product (see the numerical check below)

  $\mathbf{K} = \begin{pmatrix} G(x_1, x_1)T & \cdots & G(x_1, x_n)T \\ \vdots & \ddots & \vdots \\ G(x_n, x_1)T & \cdots & G(x_n, x_n)T \end{pmatrix} = G \otimes T$

  → $\mathbf{K}^{-1} = G^{-1} \otimes T^{-1}$
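A quick numpy check of this identity, with illustrative sizes: the block operator kernel matrix never has to be inverted directly.

```python
# Verify (G ⊗ T)^{-1} = G^{-1} ⊗ T^{-1} numerically on small SPD matrices.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); G = A @ A.T + 4 * np.eye(4)   # SPD scalar Gram matrix
B = rng.normal(size=(3, 3)); T = B @ B.T + 3 * np.eye(3)   # SPD operator block

lhs = np.linalg.inv(np.kron(G, T))
rhs = np.kron(np.linalg.inv(G), np.linalg.inv(T))
print(np.allclose(lhs, rhs))   # True
```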


Algorithm 1: $L^2$-Regularized Operator Learning Algorithm

Input: data $x_i \in (L^2([0,1]))^p$, $y_i \in L^2([0,1])$, size $n$

1. Eigendecomposition of $G = (G(x_i, x_j))_{i,j=1}^{n} \in \mathbb{R}^{n \times n}$:
   eigenvalues $\alpha_i \in \mathbb{R}$, eigenvectors $v_i \in \mathbb{R}^n$, size $n$

2. Eigendecomposition of $T \in \mathcal{L}(\mathcal{Y})$:
   initialize $k$, the number of eigenfunctions;
   eigenvalues $\delta_i \in \mathbb{R}$, eigenfunctions $w_i \in L^2([0,1])$, size $k$

3. Eigendecomposition of $\mathbf{K} = G \otimes T$, where $\mathbf{K} = (K(x_i, x_j))_{i,j=1}^{n} \in (\mathcal{L}(\mathcal{Y}))^{n \times n}$:
   eigenvalues $\theta_i \in \mathbb{R}$, eigenfunctions $z_i \in (L^2([0,1]))^n$, size $n \times k$;
   $\theta = \alpha \otimes \delta$, $z = v \otimes w$

4. Solution $\beta = (\mathbf{K} + \lambda I)^{-1}\mathbf{y}$:
   initialize $\lambda$, the regularization parameter;
   $\beta = \sum_{i=1}^{n \times k} (\theta_i + \lambda)^{-1} \sum_{j=1}^{n} \langle z_{ij}, y_j \rangle\, z_i$
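Below is a discretized Python sketch of Algorithm 1, assuming the separable form $\mathbf{K} = G \otimes T$ with $\mathcal{Y}$ sampled on an $m$-point grid; the function name and test data are illustrative. It recovers $\beta = (\mathbf{K} + \lambda I)^{-1}\mathbf{y}$ from the two small eigendecompositions, without ever forming $\mathbf{K}$.

```python
# Discretized sketch of Algorithm 1 for K = G ⊗ T.
import numpy as np

def l2_operator_solve(G, T, Y, lam):
    """Solve (G ⊗ T + lam*I) vec(beta) = vec(Y) via eigendecompositions.
    G: (n, n) scalar Gram matrix; T: (m, m) discretized operator on Y;
    Y: (n, m) functional responses sampled on the grid. Returns (n, m) beta."""
    alpha, V = np.linalg.eigh(G)      # eigenvalues alpha_i, eigenvectors v_i
    delta, W = np.linalg.eigh(T)      # eigenvalues delta_i, eigenfunctions w_i
    theta = np.outer(alpha, delta)    # eigenvalues of G ⊗ T: theta = alpha ⊗ delta
    Y_hat = V.T @ Y @ W               # coordinates of y in the z = v ⊗ w basis
    return V @ (Y_hat / (theta + lam)) @ W.T

rng = np.random.default_rng(0)
n, m = 20, 50
A = rng.normal(size=(n, n)); G = A @ A.T / n     # SPD stand-in for G(x_i, x_j)
B = rng.normal(size=(m, m)); T = B @ B.T / m     # SPD stand-in for T
Y = rng.normal(size=(n, m))

beta = l2_operator_solve(G, T, Y, lam=0.1)

# Agreement with the direct dense solve on the full Kronecker system:
direct = np.linalg.solve(np.kron(G, T) + 0.1 * np.eye(n * m), Y.ravel())
print(np.allclose(beta.ravel(), direct))         # True
```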


Applications - Speech inversion

Figure : Acoustic to articulatory inversion [speech production vs. speech inversion of the speech signal (amplitude vs. time); articulators: 1 upper lip, 2 lower lip, 3 jaw, 4 tongue tip, 5 tongue body, 6 velum, 7 glottis]

• Speech inversion
  → learning the acoustic-to-articulatory mapping
  → from MFCCs to vocal-tract time functions (VTTF)
  → improving speech technology and understanding
  → helping individuals with speech and hearing disorders


Applications - Speech inversion

[Figure: estimated vocal-tract time functions (LA, LP, TTCD, TTCL, TBCD, TBCL, VEL, GLO) plotted over time for the utterances "beautiful", "conversation", and "smooth"]


Applications - Speech inversion

Tab. 2: Average RSSE for the tract variables

VT variable | ε-SVR  | Multi-task | Functional
LA          | 2.763  | 2.341      | 1.562
LP          | 0.532  | 0.512      | 0.528
TTCD        | 3.345  | 1.975      | 1.647
TTCL        | 7.752  | 5.276      | 3.463
TBCD        | 2.155  | 2.094      | 1.582
TBCL        | 15.083 | 9.763      | 7.215
VEL         | 0.032  | 0.034      | 0.029
GLO         | 0.041  | 0.052      | 0.064
Total       | 3.962  | 2.755      | 2.011

• ε-SVR (Mitra et al., ICASSP 2009)

• Multi-task kernel (Kadri et al., ICASSP 2011)


Applications - Sound recognition

• Sound recognition
  → surveillance and security applications


Applications - Sound recognition

• Feature extraction → temporal, spectral, cepstral, ... characteristics

[Figure: a sound waveform (amplitude vs. time) and three extracted feature trajectories: evolution of the zero crossing rate (ZCR), spectral roll-off (SRF), and 13 cepstral coefficients (MFCC)]
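A minimal sketch of this feature extraction step, assuming the librosa library is available; the file name is a placeholder, and default frame parameters are used.

```python
# Sketch: compute the three feature trajectories shown above with librosa.
import librosa

y, sr = librosa.load("gunshot.wav", sr=None)          # hypothetical audio file

zcr = librosa.feature.zero_crossing_rate(y)           # shape (1, n_frames)
srf = librosa.feature.spectral_rolloff(y=y, sr=sr)    # shape (1, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape (13, n_frames)

print(zcr.shape, srf.shape, mfcc.shape)
```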


Applications - Sound recognition

• Limitations - multivariate data modeling
  → features contain discrete values of various parameters
  → a feature vector $\in \mathbb{R}^{DP}$ is built by concatenating samples of different features

• Solution - multivariate functional data modeling (see the sketch below)
  → modeling each audio signal by a vector of functions in $(L^2)^D$
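One simple way to realize this modeling step, sketched below under illustrative assumptions: rather than concatenating raw samples into one vector in $\mathbb{R}^{DP}$, resample each of the $D$ feature trajectories onto a common grid, so that one recording becomes a discretized element of $(L^2)^D$. The helper name and grid size are hypothetical.

```python
# Sketch: turn D variable-length feature trajectories into D curves on a
# common grid, i.e., a discretized vector of functions in (L^2)^D.
import numpy as np

def to_functional(features, n_grid=100):
    """features: list of D 1-D arrays of possibly different lengths.
    Returns a (D, n_grid) array, each trajectory linearly interpolated
    onto the common grid [0, 1]."""
    grid = np.linspace(0.0, 1.0, n_grid)
    curves = [np.interp(grid, np.linspace(0.0, 1.0, len(f)), f)
              for f in features]
    return np.vstack(curves)

# e.g. ZCR, roll-off, and one MFCC trajectory with different frame counts
traj = [np.random.rand(300), np.random.rand(150), np.random.rand(150)]
X_func = to_functional(traj)
print(X_func.shape)   # (3, 100)
```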


Applications - Sound recognition

Tab. 3: Classes of sounds and number of samples in the database used for performance evaluation.

Class           |    | Train | Test | Total | Duration (s)
Human screams   | C1 | 40    | 25   | 65    | 167
Gunshots        | C2 | 36    | 19   | 55    | 97
Glass breaking  | C3 | 48    | 25   | 73    | 123
Explosions      | C4 | 41    | 21   | 62    | 180
Door slams      | C5 | 50    | 25   | 75    | 96
Phone rings     | C6 | 34    | 17   | 51    | 107
Children voices | C7 | 58    | 29   | 87    | 140
Machines        | C8 | 40    | 20   | 60    | 184
Total           |    | 327   | 181  | 508   | 18 mn 14 s


Applications - Sound recognition

Figure : Structural similarities between two different classes


Applications - Sound recognition

Figure : Structural diversity inside the same sound class and between classes


Applications - Sound recognition

Tab. 4: Confusion matrix obtained when using the Regularized Least Squares Classification (RLSC) algorithm (Rifkin et al., 2003)

   | C1 | C2 | C3    | C4 | C5    | C6   | C7    | C8
C1 | 92 | 4  | 4.76  | 0  | 5.27  | 11.3 | 6.89  | 0
C2 | 0  | 52 | 0     | 14 | 0     | 2.7  | 0     | 0
C3 | 0  | 20 | 76.2  | 0  | 0     | 0    | 17.24 | 5
C4 | 0  | 16 | 0     | 66 | 0     | 0    | 0     | 0
C5 | 4  | 8  | 0     | 4  | 84.21 | 0    | 6.8   | 0
C6 | 4  | 0  | 0     | 0  | 10.52 | 86   | 0     | 0
C7 | 0  | 0  | 0     | 8  | 0     | 0    | 69.07 | 0
C8 | 0  | 0  | 19.04 | 8  | 0     | 0    | 0     | 95

Total recognition rate = 77.56%


Applications - Sound recognition

Tab. 5: Confusion matrix obtained when using the Functional Regularized Least Squares algorithm

   | C1  | C2 | C3   | C4 | C5    | C6   | C7   | C8
C1 | 100 | 0  | 0    | 2  | 0     | 5.3  | 3.4  | 0
C2 | 0   | 82 | 0    | 8  | 0     | 0    | 0    | 0
C3 | 0   | 14 | 90.9 | 8  | 0     | 0    | 3.4  | 0
C4 | 0   | 4  | 0    | 78 | 0     | 0    | 0    | 0
C5 | 0   | 0  | 0    | 1  | 89.47 | 0    | 6.8  | 0
C6 | 0   | 0  | 0    | 0  | 10.53 | 94.7 | 0    | 0
C7 | 0   | 0  | 0    | 0  | 0     | 0    | 86.4 | 0
C8 | 0   | 0  | 9.1  | 3  | 0     | 0    | 0    | 100

Total recognition rate = 90.18%


Applications - Beyond Audio Processing

• Functional outputs - BCI

[Figure: five EEG channels (Ch. 1-5) plotted against time samples, alongside the corresponding finger movement state and finger movement trajectories]

• Structured outputs - image, text, graph prediction

[Figure: a sequence of small predicted image patches]

• Tensor outputs - multilinear multitask learning
  e.g. athlete performance rated by several juries (Jury 1, Jury 2, Jury 3) on several criteria (technical score, artistic score, ...)


Conclusion & Perspectives

• Conclusion
  → RKHS framework for functional data - nonlinear FDA
  → FDA kernels
  → audio and speech processing applications

• Perspectives
  → mixed data (discrete, continuous, ...)
  → learning the operator-valued kernel
  → multilinear representation learning

