Signal Processing Institute, Swiss Federal Institute of Technology, Lausanne

1. Feature selection for audio-visual speech recognition

Mihai Gurban


2. Outline

Feature selection and extraction
– Why select features?
– Information-theoretic criteria

Our approach
– The audio-visual recognizer
– Audio-visual integration
– Features and selection methods

Experimental results

Conclusion

3. Feature selection

Features and classification
– Features (or attributes, properties, characteristics): different types of measures that can be taken on the same physical phenomenon
– An instance (or pattern, sample, example): a collection of feature values representing simultaneous measurements
– For classification, each sample has an associated class label

Feature selection
– Finding, from the original feature set, a subset which retains most of the information relevant for a classification task
– This is needed because of the curse of dimensionality

Why dimensionality reduction?
– The number of samples required to obtain accurate models of the data grows exponentially with the dimensionality
– The computing resources required also grow with the dimensionality of the data
– Irrelevant information can decrease performance

4. Feature selection

Entropy and mutual information
– H(X), the entropy of X: the amount of uncertainty about the value of X
– I(X;Y), the mutual information between X and Y: the reduction in the uncertainty of X due to the knowledge of Y (or vice versa)

Maximum dependency
– One of the most frequently used criteria is mutual information
– Pick Y_S1, ..., Y_Sm from the set of features Y_1, ..., Y_n such that I(Y_S1, Y_S2, ..., Y_Sm; C) is maximum

How many subsets?
– It is impossible to check all subsets: there are n! / (m! (n-m)!) combinations of m features out of n
– As an approximate solution, greedy algorithms are used
– The number of candidates is reduced to n + (n-1) + ... + (n-m+1), i.e. on the order of n·m
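The definitions above can be made concrete with a short sketch. The following is an illustrative implementation (not the authors' code), estimating entropy and mutual information from discrete samples and running the greedy forward selection that maximizes I(Y_S1, ..., Y_Sk; C); the function names are my own.

```python
import math
from collections import Counter

def entropy(values):
    """H(X) for a list of discrete observations, in bits."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from co-occurring samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def greedy_max_dependency(features, labels, m):
    """Forward selection: at each step, add the feature that maximizes the
    joint mutual information I(Y_S1, ..., Y_Sk; C) with the class label,
    checking n + (n-1) + ... candidates instead of all C(n, m) subsets."""
    selected = []                    # indices of chosen features
    joint = [() for _ in labels]     # running joint value of the selected set
    for _ in range(m):
        best = max((i for i in range(len(features)) if i not in selected),
                   key=lambda i: mutual_information(
                       [j + (x,) for j, x in zip(joint, features[i])], labels))
        selected.append(best)
        joint = [j + (x,) for j, x in zip(joint, features[best])]
    return selected
```

For example, a feature that exactly matches the class label carries I(Y;C) = H(C) bits and is picked first by the greedy step.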

5. A simple example

Entropies and mutual information can be represented by Venn diagrams.

We are searching for the features Y_Si with maximum mutual information with the class label C. Assume the complete set of features is Y_1, ..., Y_5.

[Venn diagram: the class label C overlapped by features Y_1 through Y_5.]

6. A simple example

[Venn diagram: C and Y_1.]

7. A simple example

[Venn diagram: C with Y_1 and Y_2.]

8. A simple example

[Venn diagram: C with Y_1, Y_2, and Y_3.]

9. A simple example

[Venn diagram: C with Y_1, Y_2, and Y_3, continuing the selection from the previous slide.]

10. Which criterion should penalize redundancy?

Many different criteria have been proposed in the literature.

Our criterion penalizes only relevant redundancy.
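The slides do not give the exact form of the authors' criterion, so as a point of comparison, here is a sketch of a widely used redundancy-penalized score from the literature (an mRMR-style relevance-minus-redundancy trade-off), which penalizes all redundancy rather than only the relevant part. All names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """H(X) in bits, estimated from discrete observations."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mi(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mrmr_step(features, labels, selected):
    """Pick the next feature by relevance minus mean redundancy:
    score(i) = I(Y_i; C) - (1/|S|) * sum_{s in S} I(Y_i; Y_s)."""
    def score(i):
        relevance = mi(features[i], labels)
        if not selected:
            return relevance
        redundancy = sum(mi(features[i], features[s])
                         for s in selected) / len(selected)
        return relevance - redundancy
    return max((i for i in range(len(features)) if i not in selected),
               key=score)
```

With one feature already selected, an exact duplicate of it scores zero (its relevance is cancelled by its redundancy), while an equally relevant but independent feature keeps its full score.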

11. Solutions from the literature

"Natural" DCT ordering
– Zigzag scanning, as used in compression (JPEG/MPEG)

Maximum mutual information
– Typically, redundancy is not taken into account

Linear Discriminant Analysis
– A transform is applied to the features
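The zigzag ("natural") ordering can be sketched as follows: the coefficients of an n x n DCT block are visited by anti-diagonal, alternating direction, so that low spatial frequencies come first, mirroring the JPEG scan order. The function name is my own.

```python
def zigzag_order(n):
    """Return (row, col) index pairs of an n x n block in JPEG-style
    zigzag order: coefficients are visited anti-diagonal by anti-diagonal,
    alternating direction, so low spatial frequencies come first."""
    order = []
    for d in range(2 * n - 1):          # d = row + col indexes an anti-diagonal
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        if d % 2 == 0:
            cells.reverse()             # even diagonals run bottom-left to top-right
        order.extend(cells)
    return order
```

Selecting the first k visual features under this ordering simply means keeping the first k entries of `zigzag_order(n)`, independent of the data.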

12. Our application: AVSR

[Block diagram: audio and video inputs. The visual front-end performs face detection, mouth localization, and lip tracking, followed by visual feature extraction; audio feature extraction runs in parallel. Audio-visual fusion combines the two streams, yielding audio-only, visual-only, and audio-visual recognition.]

Experiments on the CUAVE database
– 36 speakers, 10 words, 5 repetitions per speaker
– Leave-one-out cross-validation
– Audio features: MFCC coefficients
– Visual features: DCT with first and second temporal derivatives
– Different levels of noise added to the audio

13. The multi-stream HMM

Audio stream: 39 MFCCs. Video stream: DCT features.

Audio-visual integration with multi-stream HMMs
– States are modeled with Gaussian mixtures
– Each modality is modeled separately
– The emission likelihood is a weighted product of the per-stream likelihoods
– The optimal weights are chosen for each SNR
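The weighted-product emission can be sketched as follows, assuming diagonal-covariance Gaussian mixtures per stream. In the log domain the product becomes a weighted sum, with the stream weight `lam` tuned per SNR (closer to 1 when the audio is clean). This is an illustrative sketch, not the authors' implementation.

```python
import math

def gmm_loglik(x, mixture):
    """Log-likelihood of observation vector x under a diagonal-covariance
    Gaussian mixture, given as a list of (weight, means, variances)."""
    comps = []
    for w, means, variances in mixture:
        ll = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                 for xi, m, v in zip(x, means, variances))
        comps.append(math.log(w) + ll)
    top = max(comps)                    # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))

def av_emission_loglik(audio_obs, video_obs, audio_gmm, video_gmm, lam):
    """Multi-stream emission score: the weighted product
    b(o) = b_a(o_a)^lam * b_v(o_v)^(1 - lam),
    computed as a weighted sum in the log domain."""
    return (lam * gmm_loglik(audio_obs, audio_gmm)
            + (1 - lam) * gmm_loglik(video_obs, video_gmm))
```

With lam = 1 the score reduces to the audio-only likelihood, and with lam = 0 to the video-only likelihood, which is why the weight can be re-tuned as the SNR degrades.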

14. Information content of different types of features

Comparison of mutual information I(X;C) between different features.

[Figure: mutual information I(X;C), on a scale of 0 to 1, for MFCC in clean conditions, MFCC at -10 dB SNR, DCT coefficients, PCA coefficients, and optical flow coefficients.]

15. Visual-only recognition rate

[Figure: word accuracy (%) versus number of features (0 to 200) for three selection methods: Max MI with redundancy penalty, Max MI, and zigzag ordering. Accuracy ranges from roughly 30% to 65%.]

16. Audio-visual performance

[Figure: word accuracy (%) versus SNR (clean, then 25 dB down to -10 dB in 5 dB steps) for audio-only, video-only, and audio-visual recognition.]

17. AV performance with clean audio

[Figure: word accuracy (%) versus number of features (0 to 200) for audio-visual (AV) and audio-only recognition; accuracy between 98.2% and 99.2%.]

18. AV performance at 10 dB SNR

[Figure: word accuracy (%) versus number of features (0 to 200) for AV and audio-only recognition; accuracy between 91% and 96%.]

19. Noisy AV and visual-only comparison

[Figure: word accuracy (%) versus number of features (0 to 200) for AV at -10 dB SNR and video-only recognition; accuracy between 40% and 70%.]

20. Conclusion and future work

Feature selection for audio-visual speech recognition
– The visual-only recognition rate is not a good predictor of audio-visual performance, because of dimensionality
– Maximum audio-visual performance is obtained for small video dimensionalities
– Algorithms that improve performance at small dimensionalities are needed

Future work
– Better methods to compute the amount of redundancy between features