TRANSCRIPT
On Learning with Kernels for Audio Signal Processing: the old and the new
Hachem Kadri
QARMA team - LIF, Aix-Marseille University
GIPSA-Lab 2013
Background - Functional Learning (1/2)
y_i = f(x_i) + ε_i

▸ Supervised learning
  → Data: n training examples {(x_1, y_1), ..., (x_n, y_n)}
  → Goal: learn f

  Predictor ↦ Response         Model
  ℝ^d       ↦ {−1, 1}          Binary Classification
  ℝ^d       ↦ {1, 2, 3, ...}   Multi-class Classification
  ℝ^d       ↦ ℝ                Multiple Regression
H. Kadri, QARMA Learning with kernels: the old and the new 2/1
Background - Functional Learning (2/2)
▸ Minimization problem

  min_{f∈F} Σ_{i=1}^n V(y_i, f(x_i))

  → V: loss function - e.g. square loss: (y_i − f(x_i))²

▸ Overfitting problem

▸ Regularized minimization

  min_{f∈F} Σ_{i=1}^n V(y_i, f(x_i)) + λ Ω(f)

  → Ω: regularization - e.g. L2-norm: Ω(f) = ‖f‖²_F
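Since the slide is schematic, a minimal numerical sketch may help: assuming the linear model class f(x) = ⟨a, x⟩ so that Ω(f) = ‖a‖², the regularized square-loss minimization above is ridge regression and has a closed-form solution. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1          # n examples in R^d, regularization weight lam
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)

# Regularized empirical risk: sum_i (y_i - <a, x_i>)^2 + lam * ||a||^2
# Closed-form minimizer: a = (X^T X + lam I)^{-1} X^T y
a = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The regularizer shrinks the solution toward 0 compared to plain least squares
a_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(a) <= np.linalg.norm(a_ls))  # True
```

The shrinkage toward 0 is exactly the mechanism that combats the overfitting problem mentioned on the slide.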
Background - Learning with Kernels (1/2)
y_i = f(x_i) + ε_i ;  y_i ∈ ℝ

▸ Linear model: f(x) = ⟨a, x⟩ + b

▸ Kernels: nonlinear/nonparametric estimation

[Figure: mapping data from the input space to a feature space]

The RKHS associated with a positive definite kernel k gives a desired feature space!
Background - Learning with Kernels (2/2)
2 Perspectives

▸ Feature space
  → nonlinear in input space
  → projecting data into a feature space
  → linear in the feature space
  → kernel trick: ⟨Φ(x_1), Φ(x_2)⟩ = k(x_1, x_2)

[Diagram: feature map Φ : X → F, with f : X → ℝ factoring as g ∘ Φ]

▸ RKHS theory
  → Mercer theorem: integral operator + positive kernel
  → reproducing property: ⟨f, k(x, ·)⟩ = f(x)
  → representer theorem: f(·) = Σ_i α_i k(x_i, ·) ;  α_i ∈ ℝ
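A small sketch of these ingredients working together (positive definite kernel, representer theorem, square loss), assuming scalar inputs on [0, 1]; the Gaussian kernel, its bandwidth, and the data are illustrative choices:

```python
import numpy as np

def k(x1, x2, gamma=10.0):
    """Gaussian (RBF) kernel on scalars -- positive definite."""
    return np.exp(-gamma * (x1 - x2) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

# Representer theorem: f(.) = sum_i alpha_i k(x_i, .)
# Square loss + lam * ||f||^2  =>  linear system (K + lam I) alpha = y
K = k(x[:, None], x[None, :])
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(x)), y)

# Evaluating f uses only kernel values -- the feature map never appears
f = lambda t: k(np.asarray(t)[:, None], x[None, :]) @ alpha
t = np.linspace(0, 1, 5)
print(np.max(np.abs(f(t) - np.sin(2 * np.pi * t))))  # small fit error
```

Note that only the Gram matrix K is ever computed: this is the kernel trick from the slide in action.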
Background - Applications in audio processing
▸ Music segmentation (Davy et al., 2006 - IRCCyN)
▸ Speaker verification (Louradour et al., 2007 - IRIT)
▸ Speaker change detection (Harchaoui et al., 2008 - LTCI)
▸ Sound recognition (Rabaoui et al., 2008 - LAGIS)
▸ Speech inversion (Toutios et al., 2008 - LORIA)
Learning with kernels - Limitations and Challenges
+ Geometric intuition and interpretation
− Choosing the kernel in advance
− Sequential, time-varying characteristics
− Limited to single task / scalar output

▸ "Sophisticated" kernel methods
  • learning kernels → MKL: Multiple Kernel Learning
  • probability distributions → RKHS embedding of distributions
  • geometric/time-varying connection → FDA
  • multi-task/complex outputs → operator-valued kernels
  • Deep Learning - Representation Learning → ? ...
→ Learning with kernels / RKHS embedding
  1950: Aronszajn · 1995-2002: Vapnik, Cortes, Schölkopf, Smola · 2005-...: Gretton, Le Song, Fukumizu

→ Learning [with] kernels (the "with" struck out: learning the kernel itself)
  2004: Lanckriet, Bach · 2006: Sonnenburg · 2007: Rakotomamonjy · 2010: Cortes, Kloft

→ Learning with operator-valued kernels
  1958-1960: Pedrick, Schwartz · 2005: Micchelli & Pontil · 2008: Caponnetto · 2010/2011: Kadri, d'Alché-Buc

→ Learning [with] operator-valued kernels (learning the operator-valued kernel)
  2011: Dinuzzo · 2012: Kadri · 2013: Sindhwani
FDA - Examples
[Figures: four examples of functional data -
 Spectrometric Curves (absorbances vs. wavelengths),
 Speech (amplitude vs. frequencies),
 Handwriting (meters vs. meters),
 Electricity Consumption (differentiated log vs. months)]

Regression · Classification · Time warping · Forecasting

Ramsay and Silverman (2002) - Ferraty and Vieu (2006)
FDA - Functional inputs & functional outputs
y_i = f(x_i) + ε_i

  Predictor ↦ Response    Model
  L²        ↦ L²          Functional Model - Functional Responses

[Figures: (a) Temperature (Deg C vs. Day), (b) Precipitation (mm vs. Day)]

• Operator estimation

  → min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F
Learning from functional responses - Discrete case
[Diagram: learning from multiple response data]

▸ Statistics: multiple output regression
  → C&W procedure (Breiman and Friedman, 1997)

▸ Machine learning: learning vector-valued functions
  → multi-task learning (Micchelli and Pontil, 2005)
Reproducing kernels - From scalar to functional
Scalar-valued                            Function-valued
[Diagram: Φ : X → F, f : X → ℝ]          [Diagram: Φ : X × Y → F, f : X → Y]

• Operator-valued kernels & function-valued RKHS → nonlinear FDA
Outline
• Hilbert space of operators with reproducing kernels
  → Function-valued RKHS
  → Operator-valued kernels

• Operator estimation
  → L²-regularized operator learning algorithm
  → Block operator kernel matrix inversion

• Application to audio and speech processing
  → Speech inversion
  → Environmental sound recognition
Operator-valued kernels - Definition
• (x_i(s), y_i(t))_{i=1}^n ∈ X × Y
• X : Ω_x → ℝ ;  Y : Ω_y → ℝ
• Ω ⊆ ℝ: curve ;  Ω ⊆ ℝ²: image

Definition
K_F(·, ·) : X × X → L(Y)

▸ K_F is Hermitian if K_F(w, z) = K_F(z, w)*,
▸ it is nonnegative on X if for any {(w_i, u_i)}_{i=1,...,r} ∈ X × Y

  Σ_{i,j} ⟨K_F(w_i, w_j) u_i, u_j⟩_Y ≥ 0
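For the separable construction K_F(w, z) = k(w, z) T used later in the talk, the nonnegativity condition can be checked numerically on discretized data: the block Gram matrix is the Kronecker product of the scalar Gram matrix with T, and must be positive semidefinite. A sketch under these assumptions, with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4                       # n inputs, Y discretized on m grid points

# Scalar Gram matrix of a Gaussian kernel -- positive semidefinite
Wpts = rng.normal(size=(n, 2))
G = np.exp(-np.sum((Wpts[:, None] - Wpts[None, :]) ** 2, axis=-1))

# A PSD operator T on the discretized Y (here: a random PSD matrix)
A = rng.normal(size=(m, m))
T = A @ A.T

# Block Gram matrix: block (i, j) is K(w_i, w_j) = G[i, j] * T
K_block = np.kron(G, T)

# Nonnegativity: sum_{i,j} <K(w_i, w_j) u_i, u_j> = u^T K_block u >= 0
eigmin = np.linalg.eigvalsh(K_block).min()
print(eigmin >= -1e-10)  # True: the block kernel matrix is PSD
```

The same check fails (negative eigenvalues appear) if either G or T is not positive semidefinite, which is a convenient sanity test when designing new operator-valued kernels.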
Operator-valued kernels - Function-valued RKHS
• Extending real/vector-valued RKHS theory to FDA (Kadri et al., AISTATS 2010)
• RKHS of function-valued functions

Definition
A Hilbert space F = {f : X → Y} is called a reproducing kernel Hilbert space if there is an operator-valued kernel K_F such that:

▸ h : z ↦ K_F(w, z)g  ⟹  h ∈ F ,  ∀w ∈ X and g ∈ Y
▸ ∀f ∈ F ,  ⟨f, K_F(w, ·)g⟩_F = ⟨f(w), g⟩_Y  (reproducing property)
Operator-valued kernels - Uniqueness & Bijection
Lemma
F function-valued RKHS  ⟹  K_F(w, z) is unique

▸ Proof:
  ⟨K′(w′, ·)g′, K(w, ·)g⟩_F = ⟨K′(w′, w)g′, g⟩_Y
  ⟨K(w, ·)g, K′(w′, ·)g′⟩_F = ⟨K(w, w′)g, g′⟩_Y
                             = ⟨g, K(w, w′)* g′⟩_Y = ⟨g, K′(w′, w)g′⟩_Y

Theorem
K_F(w, z) nonnegative  ⟺  RKHS F

▸ Proof:
  (⇐)  Σ_{i,j=1}^n ⟨K(w_i, w_j)u_i, u_j⟩_Y = Σ_{i,j=1}^n ⟨K(w_i, ·)u_i, K(w_j, ·)u_j⟩_F ≥ 0
  (⇒)  build F_0 :  ∀f ∈ F_0 ,  f(·) = Σ_{i=1}^n K_F(w_i, ·) α_i
Operator-valued kernels - Construction
• Multi-task kernel  ⟹  K(w, z) = k(w, z) T
  ▸ k: real-valued kernel
  ▸ T: diagonal matrix + low-rank matrix (finite dimension)

• FDA kernel  ⟹  T ∈ L(Y) (infinite dimension)?
  ▸ Concurrent functional linear model
    → y(t) = α(t) + β(t) x(t)
    → multiplication operator
    → varying coefficient model (Hastie and Tibshirani, 1993)
  ▸ Functional linear model for functional responses (Ramsay and Silverman, 2005)
    → y(t) = α(t) + ∫ β(s, t) x(s) ds
    → Hilbert-Schmidt integral operator
Operator-valued kernels - Examples
1. Multiplication operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ k_x(x_1, x_2) T_{k_y} ;  T_h y(t) := h(t) y(t)

2. Hilbert-Schmidt integral operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ k_x(x_1, x_2) T_{k_y} ;  T_h y(t) := ∫ h(s, t) y(s) ds

3. Composition operator
   K_F : X × X → L(Y)
         x_1, x_2 ↦ C_ψ(x_1) C*_ψ(x_2) ;  C_φ : f ↦ f ∘ φ
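On a grid {t_1, ..., t_m}, the first two operators become concrete matrices: the multiplication operator T_h is diagonal, while the Hilbert-Schmidt integral operator is a dense matrix scaled by the quadrature step. A discretized sketch; the multipliers h chosen below are illustrative, not the specific kernels from the slide:

```python
import numpy as np

m = 100
t = np.linspace(0, 1, m)
dt = t[1] - t[0]
y = np.sin(2 * np.pi * t)           # a function y in (discretized) L^2([0, 1])

# 1. Multiplication operator T_h y(t) = h(t) y(t): a diagonal matrix
h = np.exp(-t)                      # illustrative multiplier h(t)
T_mult = np.diag(h)

# 2. Hilbert-Schmidt integral operator T_h y(t) = \int h(s, t) y(s) ds:
#    a dense kernel matrix times the quadrature weight dt
H = np.exp(-5.0 * (t[:, None] - t[None, :]) ** 2)   # illustrative h(s, t)
T_int = H * dt

out_mult = T_mult @ y               # pointwise product h(t) y(t)
out_int = T_int @ y                 # smoothed (integrated) version of y

print(np.allclose(out_mult, h * y))  # True: multiplication acts pointwise
```

The diagonal vs. dense structure is exactly the finite-dimensional shadow of the concurrent model y(t) = α(t) + β(t)x(t) vs. the functional linear model y(t) = α(t) + ∫β(s, t)x(s)ds from the previous slide.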
Operator-valued kernels - Feature map (Kadri et al., ICML 2011)

▸ An operator-valued kernel admits a feature map representation
  → ⟨K(x_1, x_2)y_1, y_2⟩_Y = ⟨Φ(x_1, y_1), Φ(x_2, y_2)⟩_{L(X,Y)}
  → ⟨K(x_1, ·)y_1, K(x_2, ·)y_2⟩_F = ⟨K(x_1, x_2)y_1, y_2⟩_Y

▸ Complex/infinite-dimensional inputs
  → multiple functional data: x_i ∈ (L²)^p

▸ FDA viewpoint
  → one observation = one continuous curve

  Real-valued RKHS:      Φ_k : (L²)^p → L((L²)^p, ℝ) ,  x ↦ k(x, ·)       dim: p → 1
  Function-valued RKHS:  Φ_K^y : (L²)^p → L((L²)^p, L²) ,  x ↦ K(x, ·)y   dim: p → ∞
Optimization problem - Representer theorem
Theorem
The solution of the minimization problem

  min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F

is achieved by a function of the form

  f*(·) = Σ_{i=1}^n K_F(x_i, ·) β_i
Optimization problem - Solution
min_{f∈F} Σ_{i=1}^n ‖y_i − f(x_i)‖²_Y + λ ‖f‖²_F

using the representer theorem & the reproducing property

⟺  min_{β_i∈Y} Σ_{i=1}^n ‖y_i − Σ_{j=1}^n K_F(x_i, x_j) β_j‖²_Y + λ Σ_{i,j} ⟨K_F(x_i, x_j) β_i, β_j⟩_Y

▸ Discretization (Kadri et al., AISTATS 2010)
  → grid {t_1, ..., t_m}  ⟹  β_i(t_1), ..., β_i(t_m)

▸ Approximation (Kadri et al., Tech. Report 2011)
  → Y a real RKHS  ⟹  β_i = Σ_{l=1}^m α_{il} k(t_l, ·)

▸ Analytic solution (Kadri et al., ICML 2011)
  → (K + λI)β = y ;  β ∈ Y^n and K ∈ [L(Y)]^{n×n}
Optimization problem - Block operator kernel matrix inversion

▸ (Block) numerical range: spectral and operator theory
▸ Spectral theory of block operator matrices (C. Tretter, 2008)
▸ K(x_i, x_j) = G(x_i, x_j) T ,  ∀x_i, x_j ∈ X
▸ Kronecker product

  → K = ( G(x_1, x_1)T  ...  G(x_1, x_n)T )
        (      ...      ...       ...     )  = G ⊗ T
        ( G(x_n, x_1)T  ...  G(x_n, x_n)T )

  → K⁻¹ = G⁻¹ ⊗ T⁻¹
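The identity K⁻¹ = G⁻¹ ⊗ T⁻¹ is what makes the block system tractable: one n×n and one m×m inversion instead of a single nm×nm one. A numerical check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3

# Invertible G (scalar Gram matrix + ridge) and T (PSD + ridge), illustrative
X = rng.normal(size=(n, 2))
G = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)) + 0.1 * np.eye(n)
A = rng.normal(size=(m, m))
T = A @ A.T + 0.1 * np.eye(m)

K = np.kron(G, T)                          # block operator kernel matrix
K_inv_direct = np.linalg.inv(K)            # O((nm)^3)
K_inv_kron = np.kron(np.linalg.inv(G), np.linalg.inv(T))  # O(n^3 + m^3)

print(np.allclose(K_inv_direct, K_inv_kron))  # True
```

This relies only on the standard Kronecker identity (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹ for invertible A, B, applied here to the separable kernel K(x_i, x_j) = G(x_i, x_j)T.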
Algorithm 1: L²-Regularized Operator Learning Algorithm

Input
  data x_i ∈ (L²([0, 1]))^p ,  y_i ∈ L²([0, 1]) ,  size n

Eigendecomposition of G = (G(x_i, x_j))_{i,j=1}^n ∈ ℝ^{n×n}
  eigenvalues α_i ∈ ℝ, eigenvectors v_i ∈ ℝ^n, size n

Eigendecomposition of T ∈ L(Y)
  Initialize k: number of eigenfunctions
  eigenvalues δ_i ∈ ℝ, eigenfunctions w_i ∈ L²([0, 1]), size k

Eigendecomposition of K = G ⊗ T
  K = (K(x_i, x_j))_{i,j=1}^n ∈ (L(Y))^{n×n}
  eigenvalues θ_i ∈ ℝ, eigenfunctions z_i ∈ (L²([0, 1]))^n, size n·k
  θ = α ⊗ δ ,  z = v ⊗ w

Solution β = (K + λI)⁻¹ y
  Initialize λ: regularization parameter
  β = Σ_{i=1}^{n·k} (θ_i + λ)⁻¹ Σ_{j=1}^n ⟨z_{ij}, y_j⟩ z_i
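A discretized sketch of Algorithm 1, assuming the responses live on a grid of m points so that T is an m×m matrix and each y_i an m-vector; the symbol names (α, δ, θ, β) follow the slide, the data are illustrative. The eigenpairs of K = G ⊗ T are Kronecker products of those of G and T, so (K + λI)⁻¹y is computed without ever forming the nm×nm matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 6, 20, 0.1                 # n examples, m grid points, ridge lam

# Scalar Gram matrix G over the inputs, PSD operator T on Y (illustrative)
Xf = rng.normal(size=(n, 3))
G = np.exp(-np.sum((Xf[:, None] - Xf[None, :]) ** 2, axis=-1))
B = rng.normal(size=(m, m))
T = B @ B.T / m
Y = rng.normal(size=(n, m))            # functional responses y_i on the grid

# Eigendecompositions: G -> (alpha_i, v_i), T -> (delta_i, w_i)
alpha, V = np.linalg.eigh(G)
delta, W = np.linalg.eigh(T)

# Eigenpairs of K = G (x) T: theta = alpha (x) delta, z = v (x) w, so
# beta = sum_i (theta_i + lam)^{-1} <z_i, y> z_i, done in the eigenbasis:
Yhat = V.T @ Y @ W                     # coordinates of y in the z basis
beta = V @ (Yhat / (np.outer(alpha, delta) + lam)) @ W.T

# Check against the explicit (n*m x n*m) Kronecker solve
beta_ref = np.linalg.solve(np.kron(G, T) + lam * np.eye(n * m),
                           Y.ravel()).reshape(n, m)
print(np.allclose(beta, beta_ref))  # True
```

In the truly functional setting of the slide, the eigendecomposition of T would be truncated to k eigenfunctions; here T is already finite-dimensional, so all m eigenpairs are kept.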
Applications - Speech inversion

[Figure: Acoustic to articulatory inversion. Speech production vs. speech inversion;
 articulators: 1 upper lip, 2 lower lip, 3 jaw, 4 tongue tip, 5 tongue body, 6 velum, 7 glottis;
 speech signal (amplitude vs. time)]

▸ Speech inversion
  → learning the acoustic-to-articulatory mapping
  → from MFCCs to vocal-tract time functions (VTTF)
  → improving speech technology and understanding
  → helping individuals with speech and hearing disorders
Applications - Speech inversion
[Figure: estimated vocal-tract variables (LA, LP, TTCD, TTCL, TBCD, TBCL, VEL, GLO)
 as functions of time for the utterances "beautiful", "conversation", "smooth"]
Applications - Speech inversion
Tab. 2: Average RSSE for the tract variables

  VT variable   ε-SVR    Multi-task   Functional
  LA            2.763    2.341        1.562
  LP            0.532    0.512        0.528
  TTCD          3.345    1.975        1.647
  TTCL          7.752    5.276        3.463
  TBCD          2.155    2.094        1.582
  TBCL          15.083   9.763        7.215
  VEL           0.032    0.034        0.029
  GLO           0.041    0.052        0.064
  Total         3.962    2.755        2.011

▸ ε-SVR (Mitra et al., ICASSP 2009)
▸ Multi-task kernel (Kadri et al., ICASSP 2011)
Applications - Sound recognition
▸ Sound recognition
  → surveillance and security applications
Applications - Sound recognition
▸ Feature extraction
  → temporal, spectral, cepstral, ... characteristics

[Figures: waveform (amplitude vs. time); evolution of the Zero Crossing Rate (ZCR);
 spectral roll-off (SRF); 13 cepstral coefficients (MFCC)]
Applications - Sound recognition

▸ Limitations - multivariate data modeling
  → features contain discrete values of various parameters
  → feature vector ∈ ℝ^{DP} obtained by concatenating samples of the different features

▸ Solution - multivariate functional data modeling
  → modeling each audio signal by a vector of functions in (L²)^D
Applications - Sound recognition
Tab. 3: Classes of sounds and number of samples in the database used for performance evaluation.

  Class            ID   Train   Test   Total   Duration (s)
  Human screams    C1   40      25     65      167
  Gunshots         C2   36      19     55      97
  Glass breaking   C3   48      25     73      123
  Explosions       C4   41      21     62      180
  Door slams       C5   50      25     75      96
  Phone rings      C6   34      17     51      107
  Children voices  C7   58      29     87      140
  Machines         C8   40      20     60      184
  Total                 327     181    508     18 mn 14 s
Applications - Sound recognition
Figure: Structural similarities between two different classes
Applications - Sound recognition
Figure: Structural diversity inside the same sound class and between classes
Applications - Sound recognition
Tab. 4: Confusion matrix obtained when using the Regularized Least Squares Classification (RLSC) algorithm (Rifkin et al., 2003)

       C1    C2    C3     C4    C5     C6    C7     C8
  C1   92    4     4.76   0     5.27   11.3  6.89   0
  C2   0     52    0      14    0      2.7   0      0
  C3   0     20    76.2   0     0      0     17.24  5
  C4   0     16    0      66    0      0     0      0
  C5   4     8     0      4     84.21  0     6.8    0
  C6   4     0     0      0     10.52  86    0      0
  C7   0     0     0      8     0      0     69.07  0
  C8   0     0     19.04  8     0      0     0      95

  Total Recognition Rate = 77.56%
Applications - Sound recognition
Tab. 5: Confusion matrix obtained when using the Functional Regularized Least Squares algorithm

       C1    C2   C3    C4   C5     C6    C7    C8
  C1   100   0    0     2    0      5.3   3.4   0
  C2   0     82   0     8    0      0     0     0
  C3   0     14   90.9  8    0      0     3.4   0
  C4   0     4    0     78   0      0     0     0
  C5   0     0    0     1    89.47  0     6.8   0
  C6   0     0    0     0    10.53  94.7  0     0
  C7   0     0    0     0    0      0     86.4  0
  C8   0     0    9.1   3    0      0     0     100

  Total Recognition Rate = 90.18%
Applications - Beyond Audio Processing

▸ Functional outputs - BCI

[Figures: five recording channels (Ch. 1-5) over time samples; finger movement state; finger movement]

▸ Structured outputs - image, text, graph prediction

[Figure: grid of predicted images]

▸ Tensor outputs - multilinear multitask

  Athlete performance: technical score / artistic score / ... , rated by Jury 1, Jury 2, Jury 3
Conclusion & Perspectives
▸ Conclusion
  → RKHS framework for functional data - nonlinear FDA
  → FDA kernels
  → audio and speech processing applications

▸ Perspectives
  → mixed data (discrete, continuous, ...)
  → learning the operator-valued kernel
  → multilinear representation learning