The SPL-IT-UC Query by Example Search on Speech System for MediaEval 2015
Jorge Proença (1,2), Luis Castela (2), Fernando Perdigão (1,2)
(1) Signal Processing Lab, Instituto de Telecomunicações, Coimbra, Portugal
(2) Electrical and Computer Eng. Department, University of Coimbra, Portugal
SPL-IT-UC System Overview
MediaEval 2015 - QUESST | September 14-15, 2015, Wurzen, Germany
Pre-processing:
Tackle background noise with Spectral Subtraction.
Features:
Posterior probabilities from 5 phonetic recognizers.
Searching:
Dynamic Time Warping (DTW) - 6 alternative paths to
tackle complex queries.
Fusion and Calibration:
Modify the per-query distribution of distance values;
Include side-info for fusion.
Noise Filtering with Spectral Subtraction
1. High-pass filter to remove low-frequency artefacts.
2. Analyze the averaged energy of the signal and determine high and
low levels through the median of quartiles.
3. High-SNR signals: no SS applied, because it introduces distortions.
Other signals: find candidate "noise" segments longer than 100 ms and
apply classical SS.
Improvement: Cnxe from 0.8368 to 0.8130 with SS
Figure: average energy (dB) versus query frames (5 ms).
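The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the reading of "median of quartiles" (median of the lowest and highest quarter of the frame energies), the 30 dB decision threshold, the spectral floor and all function names are hypothetical.

```python
import numpy as np

def energy_levels_db(energy_db):
    """Low/high energy levels as medians of the bottom and top quartiles
    (one possible reading of 'median of quartiles'; an assumption)."""
    e = np.sort(np.asarray(energy_db, dtype=float))
    q = max(1, len(e) // 4)
    return float(np.median(e[:q])), float(np.median(e[-q:]))

def needs_subtraction(energy_db, snr_threshold_db=30.0):
    """Skip SS for high-SNR signals; the threshold value is hypothetical."""
    low, high = energy_levels_db(energy_db)
    return (high - low) < snr_threshold_db

def spectral_subtraction(frames_mag, noise_mask, floor=0.01):
    """Classical magnitude spectral subtraction.

    frames_mag: (T, F) magnitude spectra.
    noise_mask: (T,) boolean array marking the candidate noise frames.
    """
    noise = frames_mag[noise_mask].mean(axis=0)    # average noise spectrum
    cleaned = frames_mag - noise                   # subtract it from every frame
    return np.maximum(cleaned, floor * frames_mag) # keep a small spectral floor
```

The spectral floor avoids negative magnitudes after subtraction, which is a common safeguard in classical SS.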
Phonetic Recognizers
Neural Networks based on long temporal context (Brno Univ. of
Tech. - BUT):
Czech
Hungarian
Russian
Portuguese (trained)
English (trained)
Output: state-level posteriorgrams (3 states per phoneme).
Silence/noise frames are removed from the queries.
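A minimal sketch of how silence/noise frames could be dropped from a query posteriorgram. The slides do not say how these frames are detected; the rule used here (summed posterior of the silence/noise states above a threshold) and the names `silence_idx` and `threshold` are assumptions.

```python
import numpy as np

def drop_silence_frames(posteriors, silence_idx, threshold=0.5):
    """Keep only frames whose silence/noise posterior mass is below threshold.

    posteriors: (T, D) state-level posteriorgram of the query.
    silence_idx: indices of the silence/noise states (hypothetical).
    """
    p = np.asarray(posteriors, dtype=float)
    sil = p[:, silence_idx].sum(axis=1)   # silence/noise mass per frame
    return p[sil < threshold]
```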
Dynamic Time Warping
Local Distance matrices for Query vs. Audio:
Dot Product of Query and Audio posterior probability vectors
6 sub-systems for DTW (and final fusion):
5 distance matrices from the 5 languages
a 6th one, the average of the 5 distance matrices – ML
Improvement (Cnxe):
5 langs fusion: 0.7971
ML: 0.8136
5 langs + ML: 0.7873
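The local distance matrices and the averaged (ML) matrix can be sketched as below. The slides only mention the dot product of posterior vectors; taking the negative logarithm to turn that similarity into a distance is a common choice and is an assumption here, as are the function names and the `eps` guard.

```python
import numpy as np

def local_distance(query_post, audio_post, eps=1e-10):
    """(Tq, D) and (Ta, D) posteriorgrams -> (Tq, Ta) local distance matrix.

    Uses -log of the dot product (assumed; the slides state only 'dot product').
    """
    sim = np.asarray(query_post) @ np.asarray(audio_post).T
    return -np.log(np.maximum(sim, eps))

def average_matrix(matrices):
    """6th sub-system: element-wise average of the per-language matrices."""
    return np.mean(np.stack(matrices), axis=0)
```

All per-language matrices must share the same (Tq, Ta) shape, which holds when the recognizers use the same frame rate.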
Dynamic Time Warping
Basic DTW strategy (A1):
Smallest distance in identically weighted
unitary jumps.
Normalize accumulated distance by the length
of the path
Distance Matrix (top) and accumulated Distance matrix (bottom) of Query vs Audio
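A minimal sketch of the basic strategy (A1), assuming a subsequence search in which a match may start and end at any audio frame; the slides do not give these details, and the function name is hypothetical. Each cell takes the predecessor with the smallest accumulated distance among three equally weighted unitary steps, and the final score is normalized by path length.

```python
import numpy as np

def subsequence_dtw(dist):
    """dist: (Tq, Ta) local distances.
    Returns (best length-normalized distance, audio end frame)."""
    tq, ta = dist.shape
    acc = np.full((tq, ta), np.inf)
    length = np.ones((tq, ta), dtype=int)
    acc[0, :] = dist[0, :]                 # a match may start at any audio frame
    for i in range(1, tq):
        for j in range(ta):
            # identically weighted unitary steps: vertical, diagonal, horizontal
            cands = [(acc[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1]))
                cands.append((acc[i, j - 1], length[i, j - 1]))
            prev_acc, prev_len = min(cands)
            acc[i, j] = prev_acc + dist[i, j]
            length[i, j] = prev_len + 1
    norm = acc[-1, :] / length[-1, :]      # normalize by the path length
    end = int(np.argmin(norm))
    return float(norm[end]), end
```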
DTW Modifications
5 additional approaches/paths to tackle complex queries:
(A2) – Cutting up to 250ms at the end of the query,
(A3) – Cutting up to 250ms at the beginning of the query.
Figure: Query vs. Audio matrices (axes: Query, Audio).
Type 2: lexical variations
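One way to sketch (A2) and (A3), under assumptions: the path is allowed to end on any of the last few query rows (A2), and cutting the beginning (A3) is obtained by running the same search on the time-reversed matrix. The number of frames corresponding to 250 ms depends on the frame rate, which is not stated here, so `max_cut_frames` is left as a parameter. All names are hypothetical.

```python
import numpy as np

def normalized_accumulated(dist):
    """Length-normalized accumulated distance for every (query, audio) cell."""
    tq, ta = dist.shape
    acc = np.full((tq, ta), np.inf)
    length = np.ones((tq, ta), dtype=int)
    acc[0, :] = dist[0, :]
    for i in range(1, tq):
        for j in range(ta):
            cands = [(acc[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands += [(acc[i - 1, j - 1], length[i - 1, j - 1]),
                          (acc[i, j - 1], length[i, j - 1])]
            a, n = min(cands)
            acc[i, j], length[i, j] = a + dist[i, j], n + 1
    return acc / length

def best_with_end_cut(dist, max_cut_frames):
    """(A2) sketch: the path may stop up to max_cut_frames before the query end."""
    norm = normalized_accumulated(dist)
    first_row = max(0, dist.shape[0] - 1 - max_cut_frames)
    return float(norm[first_row:, :].min())

def best_with_start_cut(dist, max_cut_frames):
    """(A3) sketch: same search on the matrix reversed along both axes."""
    return best_with_end_cut(dist[::-1, ::-1], max_cut_frames)
```

The reversal works because the three step types are symmetric in time, so a free end on the reversed problem is a free start on the original one.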
DTW Modifications
(A4) – Allowing one jump in the path of up to ½ of the query's length;
the jump cannot occur in the initial or final 250 ms of the query.
Figure: Query vs. Audio posterior distance matrix (top) and the best path from A4 (bottom). Axes: Query, Audio.
Type 2: extra content in audio
DTW Modifications
(A5) – Swaps: accounting for re-ordering of words.
Find the best path for the beginning of the query that lies ahead of
the end of the first path, with restrictions similar to (A4).
Figure: Query vs. Audio posterior distance matrix (top) and the best path from A5 (bottom). Axes: Query, Audio.
Type 2: word re-ordering
DTW Modifications
(A6) – Allowing one 'jump' along the query, of maximum ⅓ of
query length:
To tackle possible fillers inside the spontaneous query;
Similar to (A4), but small vertical jump instead.
Type 3: Spontaneous queries
Fusion and Calibration
– 6 sub-systems x 6 paths = 36 distance vectors of audio-query pairs
1. Per query distribution:
Truncate large distances (lower them to the same value) to
the mean of the distribution.
… may help to lower the burden of critical false negatives on Cnxe.
Improvement: from 0.7939 → 0.7873 Cnxe
2. Normalize per-query: subtract mean, divide by standard
deviation.
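Steps 1 and 2 can be sketched for the distances of one query against all audio files. It is assumed here that the mean and standard deviation used for normalization are computed after truncation; the slides do not specify this, and the function name is hypothetical.

```python
import numpy as np

def calibrate_per_query(distances):
    """Truncate large distances to the mean, then z-normalize (per query)."""
    d = np.asarray(distances, dtype=float)
    d = np.minimum(d, d.mean())          # step 1: cap values above the mean
    std = d.std()
    if std == 0:                         # degenerate case: all values equal
        return np.zeros_like(d)
    return (d - d.mean()) / std          # step 2: subtract mean, divide by std
```

For example, distances `[1, 2, 3, 10]` have mean 4, so the outlier 10 is lowered to 4 before normalization.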
Fusion and Calibration
3. Side-info: 7 additional vectors for fusion:
– mean of distances per query before truncation and
normalization (from the best approach and sub-system,
ML-A2), which regains some of the information lost in those steps;
– Query size in frames and log of query size;
– 4 SNR values: original and post SS SNRs of query and
of audio.
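As an illustration, the 7 side-info values for one audio-query pair could be assembled as below. The argument names and the ordering are hypothetical; only the list of quantities comes from the slide.

```python
import math

def side_info_vector(mean_dist, n_frames, snr_query, snr_query_ss,
                     snr_audio, snr_audio_ss):
    """Seven side-info values for one audio-query pair."""
    return [
        mean_dist,              # per-query mean distance before truncation
        float(n_frames),        # query size in frames
        math.log(n_frames),     # log of query size
        snr_query,              # query SNR, original
        snr_query_ss,           # query SNR, after spectral subtraction
        snr_audio,              # audio SNR, original
        snr_audio_ss,           # audio SNR, after spectral subtraction
    ]
```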
Fusion and Calibration
(Linear Fusion with the Bosaris Toolkit)
4. Systems submitted:
1. Linear Fusion of all approaches and sub-systems + side-info
2. Harmonic Mean of approaches and Linear Fusion of sub-
systems + side-info
(In previous work, without linear fusion, Harmonic Mean of approaches
was found to be a good compromise)
3. Same as 1, without side-info
4. Same as 2, without side-info
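A rough sketch of the two ingredients of systems 2 and 4: a harmonic mean over the DTW approaches, followed by a linear combination of sub-system scores and side-info. It assumes the harmonic mean is taken over strictly positive distances, which the slides do not state. The weights and bias would come from training (the slides use the Bosaris Toolkit); here they are plain hypothetical inputs.

```python
import numpy as np

def harmonic_mean(distances, axis=0):
    """Harmonic mean of positive distances along the 'approach' axis."""
    d = np.asarray(distances, dtype=float)
    return d.shape[axis] / np.sum(1.0 / d, axis=axis)

def linear_fusion(features, weights, bias):
    """features: (n_pairs, n_features); returns one fused score per pair."""
    return np.asarray(features) @ np.asarray(weights) + bias
```

The harmonic mean is dominated by the smallest distances, so a pair that matches well under any single approach keeps a low combined distance.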
Results
– Side-info always helpful for the Cnxe metric.
– Fusion of All best on Dev set.
– Harmonic mean: best on Eval
(fusion of all may be over fitted for Dev).
Fusion system            Dev Cnxe  Dev MinCnxe  Eval Cnxe  Eval MinCnxe
1. All + side-info       0.7782    0.7716       0.7866     0.7809
2. H.mean + side-info    0.7862    0.7800       0.7842     0.7786
3. All, no side-info     0.7873    0.7816       0.7930     0.7875
4. H.mean, no side-info  0.7957    0.7893       0.7914     0.7865
Results
Cnxe on the Dev set when only a single DTW approach is used for fusion
(Hmean of all approaches: 0.7862 Cnxe):
A1: 0.8041
A2: 0.7978
A3: 0.8335
A4: 0.8137
A5: 0.8184
A6: 0.8460
(A2) overall best, cutting the end of the query
may help in all cases due to co-articulation
or intonation differences.
(A6) performs badly: a filler in the query may be
an extension rather than a gap, or the approach
may produce too many false positives.
Conclusions
Although the acoustic conditions were very challenging, several small
overall improvements were obtained.
Main contributions:
– Performing a careful Spectral Subtraction – to diminish severe
background noise;
– Using the average distance matrix of all languages as 6th sub-
system;
– Considering 6 possible DTW paths to tackle complex matches;
– Truncating large distances per-query – may help to lower the
burden of critical false negatives.
– Except for side-info, all of these improvements also improve ATWV.