mediaeval2015 - the spl-it-uc query by example search on speech system for mediaeval 2015

Jorge Proença 1,2

Luis Castela 2

Fernando Perdigão 1,2

The SPL-IT-UC Query by Example Search on Speech system for MediaEval 2015

1 Signal Processing Lab - Instituto de Telecomunicações, Coimbra, Portugal

2 Electrical and Computer Eng. Department, University of Coimbra, Portugal


SPL-IT-UC System Overview

MediaEval 2015 - QUESST | September 14-15 2015, Wurzen, Germany

Pre-processing:

Tackle background noise with Spectral Subtraction.

Features:

Posterior probabilities from 5 phonetic recognizers.

Searching:

Dynamic Time Warping (DTW) - 6 alternative paths to tackle complex queries.

Fusion and Calibration:

Modify the per-query distribution of distance values;

Include side-info for fusion.


Noise Filtering with Spectral Subtraction



1. High-pass filter to remove low-frequency artefacts.

2. Analyze the averaged energy of the signal and determine its high and low levels through the median of quartiles.

3. High-SNR signals: no SS is applied, to avoid distortions.

Others: take candidate segments longer than 100 ms as "noise" and apply classical SS.

Improvement: from 0.8368 Cnxe → 0.8130 with SS

[Figure: Average Energy (dB) vs. Query frames (5 ms)]
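The steps on this slide can be sketched as follows. This is only an illustration under stated assumptions: "median of quartiles" is read here as the median of the lowest and of the highest quartile of frame energies, the subtraction works on a magnitude spectrogram that is already computed, and the names `energy_levels`, `spectral_subtraction` and the `floor` parameter are hypothetical, not taken from the slides.

```python
import numpy as np

def energy_levels(frame_energy_db):
    """Return (low, high) energy levels.

    Assumption: low = median of the lowest quartile of frames,
    high = median of the highest quartile of frames.
    """
    e = np.sort(np.asarray(frame_energy_db, dtype=float))
    q = max(1, len(e) // 4)  # number of frames in one quartile
    return float(np.median(e[:q])), float(np.median(e[-q:]))

def spectral_subtraction(mag, noise_frames, floor=0.05):
    """Classical magnitude spectral subtraction.

    mag          : (n_frames, n_bins) magnitude spectrogram
    noise_frames : indices of frames taken as noise-only candidates
    floor        : hypothetical spectral floor, as a fraction of the input
    """
    mag = np.asarray(mag, dtype=float)
    noise = mag[noise_frames].mean(axis=0)  # average noise spectrum
    cleaned = mag - noise                   # subtract it from every frame
    return np.maximum(cleaned, floor * mag) # avoid negative magnitudes
```

The difference between the two levels gives a rough SNR estimate, which could be used to decide whether a signal is clean enough to skip the subtraction.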


Phonetic Recognizers



Neural Networks based on long temporal context (Brno Univ. of Tech. - BUT):

Czech

Hungarian

Russian

Portuguese (trained)

English (trained)

Output: state-level posteriorgrams (3 states per phoneme).

Silence/noise frames removed on queries.


Dynamic Time Warping



Local Distance matrices for Query vs. Audio:

Dot Product of Query and Audio posterior probability vectors

6 sub-systems for DTW (and final fusion):

5 distance matrices from the 5 languages

a 6th one, the average of the 5 distance matrices (ML)

Improvement (Cnxe):

5 langs fusion: 0.7971

ML: 0.8136

5 langs + ML: 0.7873
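A minimal sketch of how the local distance matrices and the ML matrix could be built. The slide only states that the dot product of posterior vectors is used; turning it into a distance with a negative logarithm is an assumption made here (a common choice in query-by-example search), and the function names and `eps` are hypothetical.

```python
import numpy as np

def local_distance(query_post, audio_post, eps=1e-10):
    """Local distance matrix for one language.

    query_post : (n_query_frames, n_states) posteriorgram
    audio_post : (n_audio_frames, n_states) posteriorgram
    Assumption: distance = -log(dot product); eps guards against log(0).
    """
    sim = np.asarray(query_post, dtype=float) @ np.asarray(audio_post, dtype=float).T
    return -np.log(np.maximum(sim, eps))

def multilingual_matrix(matrices):
    """ML sub-system: element-wise average of the per-language matrices."""
    return np.mean(np.stack(matrices), axis=0)
```

Each language has its own state inventory, so the dot products are taken per language; the five resulting matrices all have shape (query frames, audio frames) and can be averaged directly.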


Dynamic Time Warping



Basic DTW strategy (A1):

Smallest distance in identically weighted unitary jumps.

Normalize accumulated distance by the length of the path.

Distance Matrix (top) and accumulated Distance matrix (bottom) of Query vs Audio
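The basic strategy (A1) might look like the sketch below. It assumes a subsequence search (the path may start and end at any audio frame), three equally weighted unit steps, and a predecessor chosen by smallest accumulated distance; `subsequence_dtw` is a hypothetical name and the details are not confirmed by the slides.

```python
import numpy as np

def subsequence_dtw(dist):
    """Return (best normalized distance, audio frame where the match ends).

    dist : (n_query_frames, n_audio_frames) local distance matrix
    """
    dist = np.asarray(dist, dtype=float)
    n, m = dist.shape
    acc = np.empty((n, m))               # accumulated distance
    length = np.ones((n, m), dtype=int)  # number of cells on the best path
    acc[0] = dist[0]                     # free start at any audio frame
    for i in range(1, n):
        for j in range(m):
            # unit steps: vertical, horizontal, diagonal (same weight)
            cands = [(acc[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands.append((acc[i, j - 1], length[i, j - 1]))
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1]))
            best_acc, best_len = min(cands)
            acc[i, j] = best_acc + dist[i, j]
            length[i, j] = best_len + 1
    norm = acc[-1] / length[-1]          # normalize by path length
    end = int(np.argmin(norm))
    return float(norm[end]), end
```

The plain double loop is kept for readability; a practical system would vectorize it or use compiled code.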


DTW Modifications



5 additional approaches/paths to tackle complex queries:

(A2) – Cutting up to 250ms at the end of the query,

(A3) – Cutting up to 250ms at the beginning of the query.

[Figure: Query vs. Audio panels]

Type 2: lexical variations
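One simple (and inefficient) way to illustrate (A2) and (A3) is to rerun the basic search on a query trimmed by 0 up to `max_cut` frames and keep the best score. This is a sketch under assumptions: the slides do not say how the cut is implemented, `max_cut` has to be derived from 250 ms and the frame shift in use, and all function names are hypothetical.

```python
import numpy as np

def dtw_score(dist):
    """Best path-length-normalized subsequence DTW distance."""
    dist = np.asarray(dist, dtype=float)
    n, m = dist.shape
    acc = np.empty((n, m))
    length = np.ones((n, m), dtype=int)
    acc[0] = dist[0]
    for i in range(1, n):
        for j in range(m):
            cands = [(acc[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands.append((acc[i, j - 1], length[i, j - 1]))
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1]))
            a, l = min(cands)
            acc[i, j] = a + dist[i, j]
            length[i, j] = l + 1
    return float(np.min(acc[-1] / length[-1]))

def cut_end(dist, max_cut):
    """(A2) sketch: drop up to max_cut frames from the end of the query."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    return min(dtw_score(dist[: n - k]) for k in range(min(max_cut, n - 1) + 1))

def cut_start(dist, max_cut):
    """(A3) sketch: drop up to max_cut frames from the start of the query."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    return min(dtw_score(dist[k:]) for k in range(min(max_cut, n - 1) + 1))
```

A single DTW pass that reads the normalized score from the last few rows of the accumulated matrix would give the (A2) result more cheaply; the repeated search is used here only because it is easy to follow.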


DTW Modifications



(A4) – Allowing one jump in the path of up to ½ of the query's length; the jump can't occur in the initial or final 250 ms of the query.


Query vs. Audio posterior distance matrix (top) and the best path from A4 (bottom)

Type 2: extra content in audio
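A possible way to realize (A4) is to combine a forward pass (best path for the first part of the query) with a backward pass (best path for the rest) and let the two parts be separated by skipped audio frames. The sketch below assumes this forward/backward formulation; `margin` (frames equivalent to 250 ms), `max_jump` (frames equivalent to half the query) and the function names are hypothetical.

```python
import numpy as np

def accumulate(dist):
    """Accumulated distance and path length, free start on the first row."""
    n, m = dist.shape
    acc = np.empty((n, m))
    length = np.ones((n, m), dtype=int)
    acc[0] = dist[0]
    for i in range(1, n):
        for j in range(m):
            cands = [(acc[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands.append((acc[i, j - 1], length[i, j - 1]))
                cands.append((acc[i - 1, j - 1], length[i - 1, j - 1]))
            a, l = min(cands)
            acc[i, j] = a + dist[i, j]
            length[i, j] = l + 1
    return acc, length

def dtw_with_audio_jump(dist, margin, max_jump):
    """Best normalized distance of a path with exactly one jump over audio frames."""
    dist = np.asarray(dist, dtype=float)
    n, m = dist.shape
    fwd, fwd_len = accumulate(dist)
    # Backward pass: run the same recursion on the flipped matrix, flip back.
    bwd, bwd_len = accumulate(dist[::-1, ::-1])
    bwd, bwd_len = bwd[::-1, ::-1], bwd_len[::-1, ::-1]
    best = np.inf
    # The jump sits between query frames i and i+1, leaving at least
    # `margin` query frames before and after it.
    for i in range(max(margin - 1, 0), n - max(margin, 1)):
        for j in range(m):
            # k is the audio frame where the path resumes; k - j - 1 frames are skipped
            for k in range(j + 2, min(m, j + max_jump + 2)):
                total = fwd[i, j] + bwd[i + 1, k]
                steps = fwd_len[i, j] + bwd_len[i + 1, k]
                best = min(best, total / steps)
    return float(best)
```

Taking the minimum of this score and the plain DTW score would cover both queries with and without extra content in the audio.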


DTW Modifications



(A5) – Swaps: accounting for re-ordering of words.

Find the best path for the beginning of the query, located ahead of the end of the first path, with restrictions similar to (A4).

Query vs. Audio posterior distance matrix (top) and the best path from A5 (bottom)

Type 2: word re-ordering



DTW Modifications



(A6) – Allowing one 'jump' along the query, of at most ⅓ of the query length:

To tackle possible fillers inside the spontaneous query;

Similar to (A4), but with a small vertical jump instead.

Type 3: Spontaneous queries


Fusion and Calibration



– 6 sub-systems x 6 paths = 36 distance vectors of audio-query pairs

1. Per-query distribution:

Truncate large distances: every distance above the mean of the distribution is lowered to the mean.

… may help to lower the burden of critical false negatives on Cnxe.

Improvement: from 0.7939 → 0.7873 Cnxe

2. Normalize per query: subtract the mean, divide by the standard deviation.
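Steps 1 and 2 can be illustrated with the short sketch below. It assumes that the mean and standard deviation used for normalization are computed after truncation, which the slides do not state; `calibrate_per_query` is a hypothetical name.

```python
import numpy as np

def calibrate_per_query(distances):
    """Truncate and normalize the distances of one query against all audio files."""
    d = np.asarray(distances, dtype=float)
    d = np.minimum(d, d.mean())  # step 1: lower large distances to the mean
    std = d.std()
    if std == 0:                 # all values equal: nothing to normalize
        return np.zeros_like(d)
    return (d - d.mean()) / std  # step 2: zero mean, unit variance
```

After truncation all clearly non-matching audio files share the same value, so only the low-distance tail keeps its ranking information.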





3. Side-info: 7 additional vectors for fusion:

– mean of distances per query before truncation and normalization (from the best approach and sub-system: ML-A2), regaining some of the information lost;

– query size in frames and log of query size;

– 4 SNR values: original and post-SS SNRs of the query and of the audio.
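As an illustration, the 7 side-info values for one audio-query pair could be assembled as below. The function name, argument names and ordering are hypothetical; only the list of quantities comes from the slide.

```python
import numpy as np

def side_info(raw_query_distances, n_query_frames,
              snr_query_orig, snr_query_ss, snr_audio_orig, snr_audio_ss):
    """Return the 7 side-info features for one audio-query pair.

    raw_query_distances : distances of this query against all audio files,
                          before truncation and normalization
    """
    return np.array([
        np.mean(raw_query_distances),  # per-query mean distance
        n_query_frames,                # query size in frames
        np.log(n_query_frames),        # log of query size
        snr_query_orig,                # query SNR before SS
        snr_query_ss,                  # query SNR after SS
        snr_audio_orig,                # audio SNR before SS
        snr_audio_ss,                  # audio SNR after SS
    ], dtype=float)
```

Appended to the 36 distance values, this gives 43 inputs per audio-query pair for the fusion stage.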





(Linear Fusion with the Bosaris Toolkit)

4. Systems submitted:

1. Linear Fusion of all approaches and sub-systems + side-info

2. Harmonic Mean of approaches and Linear Fusion of sub-systems + side-info

(In previous work, without linear fusion, Harmonic Mean of approaches was found to be a good compromise)

3. Same as 1, without side-info

4. Same as 2, without side-info
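For systems 2 and 4, the harmonic mean over the 6 DTW approaches could be computed as in this sketch. It assumes strictly positive distance values (for example raw DTW distances rather than zero-mean normalized ones); where exactly the authors apply the mean in their pipeline is not stated on the slides, and the function name is hypothetical.

```python
import numpy as np

def harmonic_mean_of_approaches(dist):
    """Harmonic mean across approaches.

    dist : (n_approaches, n_pairs) array of strictly positive distances
    Returns one combined distance per audio-query pair.
    """
    d = np.asarray(dist, dtype=float)
    return d.shape[0] / np.sum(1.0 / d, axis=0)
```

The harmonic mean is dominated by the smallest values, so a pair that matches well under any one approach keeps a low combined distance.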


Results



– Side-info always helpful for the Cnxe metric.

– Fusion of All best on Dev set.

– Harmonic mean: best on Eval (fusion of all may be overfitted to Dev).

Fusion system            Dev Cnxe   Dev MinCnxe   Eval Cnxe   Eval MinCnxe
1. All + side-info       0.7782     0.7716        0.7866      0.7809
2. H.mean + side-info    0.7862     0.7800        0.7842      0.7786
3. All, no side          0.7873     0.7816        0.7930      0.7875
4. H.mean, no side       0.7957     0.7893        0.7914      0.7865


Results



Cnxe on the Dev set when using each DTW approach individually and fusing (for comparison, Hmean: 0.7862 Cnxe):

A1: 0.8041

A2: 0.7978

A3: 0.8335

A4: 0.8137

A5: 0.8184

A6: 0.8460

(A2) is the overall best: cutting the end of the query may help in all cases due to co-articulation or intonation differences.

(A6) performs badly: a filler in the query may be an extension and not a gap, or the approach may produce too many false positives.


Conclusions



Although the acoustic conditions were very challenging, small overall improvements could be obtained.

Main contributions:

– Performing a careful Spectral Subtraction to diminish severe background noise;

– Using the average distance matrix of all languages as a 6th sub-system;

– Considering 6 possible DTW paths to tackle complex matches;

– Truncating large distances per query, which may help to lower the burden of critical false negatives.

– Apart from side-info, all of these improvements also improve ATWV.




Thank You

Questions?
