TRANSCRIPT
Audio-Visual Speech Processing
Gérard Chollet, Hervé Bredin, Thomas Hueber, Rémi Landais, Patrick Perrot, Leila Zouari
NOLISP, Paris, May 23rd 2007
Page 8 NOLISP 2007, Paris, 23 May 2007
Audiovisual identity verification
Compulsory for:
– Homeland and corporate security: restricted access areas, …
– Secure computer login
– Secure on-line signing of contracts (e-Commerce)
Audiovisual identity verification
Available features
– Face / facial features (lips, eyes): Face modality
– Speech: Speech modality
– Speech synchrony: Synchrony modality
Audiovisual identity verification
Face modality
– Detection:
• Generative models (MPT toolbox)
• Temporal median filtering
• Eye detection within faces
– Normalization: geometry + illumination
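The two normalization steps on the slide can be sketched in numpy; this is a minimal illustration, not the presenters' implementation, and the canonical inter-ocular distance of 40 pixels is an assumed parameter:

```python
import numpy as np

def normalize_illumination(face):
    """Illumination normalization by histogram equalization of a
    grayscale face image with integer values in 0-255."""
    hist = np.bincount(face.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])   # map the CDF to [0, 1]
    return (cdf[face] * 255).astype(np.uint8)

def eye_alignment_params(left_eye, right_eye, target_dist=40.0):
    """Geometric normalization parameters: rotation angle and scale
    that bring the eye axis horizontal at a canonical (assumed)
    inter-ocular distance."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.arctan2(dy, dx)              # rotate eyes onto the x-axis
    scale = target_dist / np.hypot(dx, dy)  # scale to canonical distance
    return angle, scale
```

The angle and scale would then feed a similarity transform that warps the face crop before feature extraction.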
Audiovisual identity verification
Face modality
– Selection:
• Keep only the most reliable detection results
• Based on the distance Rel between a detected zone and its projection onto the eigenfaces space
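The reliability distance Rel described above can be read as a "distance from face space": project the detected zone onto the eigenfaces, reconstruct it, and measure the residual. A minimal numpy sketch, assuming the eigenfaces are stored as orthonormal rows:

```python
import numpy as np

def reliability(face_vec, mean_face, eigenfaces):
    """Distance between a detected zone and its projection onto the
    eigenfaces space; a small value means a face-like, reliable detection.
    `eigenfaces` rows are assumed orthonormal."""
    centered = face_vec - mean_face
    coeffs = eigenfaces @ centered            # project onto face space
    reconstruction = eigenfaces.T @ coeffs    # back-project
    return float(np.linalg.norm(centered - reconstruction))
```

Detections whose reliability exceeds a threshold would be discarded before verification.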
Audiovisual identity verification
Face modality:
– Two verification strategies within a single comparison framework
• Global = Eigenfaces:
– Computation of a set of directions (eigenfaces) defining a projection space
– Two faces are compared with respect to their projections onto the eigenfaces space
– Training data: BIOMET (130 persons) + BANCA (30 persons)
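The eigenfaces strategy above can be sketched with an SVD of the centered training faces; this is a generic illustration, not the exact system trained on BIOMET and BANCA:

```python
import numpy as np

def learn_eigenfaces(train, k):
    """Learn k directions (eigenfaces) defining a projection space.
    `train` is (n_faces, n_pixels); returns the mean face and k
    orthonormal eigenface rows."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def face_distance(f1, f2, mean, eigenfaces):
    """Compare two faces via their projections on the eigenfaces space."""
    p1 = eigenfaces @ (f1 - mean)
    p2 = eigenfaces @ (f2 - mean)
    return float(np.linalg.norm(p1 - p2))
```

A small distance between projections supports a client decision; the threshold would be tuned on the training data.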
Audiovisual identity verification
Face modality:
• SIFT descriptors:
– Keypoint extraction
– Keypoint representation: a 128-dimensional vector (gradient orientation histograms, …) + a 4-dimensional position vector
SIFT descriptor (dim 128)
Position (x, y) + scale + orientation (dim 4)
Audiovisual identity verification
Face modality:
• SVD-based matching method:
– Compares two videos V1 and V2
– Exclusivity principle: one-to-one correspondences between
» faces (global)
» descriptors (local)
– Principle:
» Computation of a proximity matrix between faces or descriptors
» Extraction of good pairings (made easy by SVD computation)
– Scores:
» One matching score between global representations
» One matching score between local representations
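The proximity-matrix-plus-SVD pairing step can be sketched in the spirit of the Scott and Longuet-Higgins correspondence method, which the slide appears to follow: build a Gaussian proximity matrix, replace its singular values by ones, and keep entries that dominate both their row and their column (enforcing the exclusivity principle). The bandwidth `sigma` is an assumed parameter:

```python
import numpy as np

def svd_pairings(features_a, features_b, sigma=1.0):
    """One-to-one pairings between two feature sets via SVD of a
    Gaussian proximity matrix; returns the pairings and a matching
    score (mean proximity over retained pairs)."""
    d2 = ((features_a[:, None, :] - features_b[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * sigma ** 2))        # proximity matrix
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    p = u @ vt                                 # singular values set to 1
    pairs = [(i, j)
             for i in range(p.shape[0]) for j in range(p.shape[1])
             if p[i, j] == p[i].max() and p[i, j] == p[:, j].max()]
    score = float(np.mean([g[i, j] for i, j in pairs])) if pairs else 0.0
    return pairs, score
```

Run once on face representations for the global score and once on SIFT descriptors for the local score.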
Audiovisual identity verification
Speech modality:
– GMM-based approach:
• One world model
• Each speaker model is derived from the world model by MAP adaptation
• Speech verification score: derived from the likelihood ratio
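The three bullets above are the classic GMM-UBM recipe; a minimal numpy sketch with diagonal covariances, MAP adaptation of the means only, and an assumed relevance factor of 16:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Average log-likelihood of frames x under a diagonal-covariance GMM."""
    d = x.shape[1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    sq = ((x[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = np.log(weights) + log_norm - 0.5 * sq   # per-frame, per-component
    m = log_comp.max(axis=1, keepdims=True)            # stable log-sum-exp
    return float((m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))).mean())

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Derive a speaker model from the world model by MAP adaptation
    of the component means (relevance factor is an assumed value)."""
    d = x.shape[1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    sq = ((x[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = np.log(weights) + log_norm - 0.5 * sq
    post = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)            # component posteriors
    n = post.sum(axis=0)                               # soft frame counts
    ex = (post.T @ x) / np.maximum(n, 1e-10)[:, None]  # posterior-weighted means
    alpha = (n / (n + relevance))[:, None]             # adaptation factor
    return alpha * ex + (1 - alpha) * means

def verification_score(x, world, speaker_means):
    """Speech verification score: log-likelihood ratio between the
    speaker-adapted model and the world model."""
    w, m, v = world
    return gmm_loglik(x, w, speaker_means, v) - gmm_loglik(x, w, m, v)
```

A positive score supports the claimed identity; the decision threshold would be set on development data.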
Audiovisual identity verification
Synchrony modality:
– Principle: synchrony between lips and speech carries identity information
– Process:
• Computation of a synchrony model (co-inertia analysis, CoIA) for each person, based on DCT features (visual signal) and MFCC features (speech signal)
• Comparison of the test sample with the synchrony model
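Co-inertia analysis finds pairs of projection directions that maximize the covariance between the two projected feature streams, which reduces to an SVD of their cross-covariance matrix. A minimal sketch, with a synchrony score taken (as an assumption) to be the correlation on the first CoIA axis:

```python
import numpy as np

def coia(video_feats, audio_feats, k=1):
    """Co-inertia analysis: k pairs of directions maximizing the
    covariance between projected video (e.g. DCT) and audio (e.g. MFCC)
    streams, via SVD of the cross-covariance matrix."""
    v = video_feats - video_feats.mean(axis=0)
    a = audio_feats - audio_feats.mean(axis=0)
    cross = v.T @ a / len(v)                  # cross-covariance matrix
    u, _, wt = np.linalg.svd(cross)
    return u[:, :k], wt[:k].T                 # video / audio projection bases

def synchrony_score(video_feats, audio_feats, v_axes, a_axes):
    """Correlation between the two projected streams on the first axis;
    high correlation means audio and video are synchronous for this model."""
    pv = (video_feats - video_feats.mean(axis=0)) @ v_axes[:, 0]
    pa = (audio_feats - audio_feats.mean(axis=0)) @ a_axes[:, 0]
    return float(np.corrcoef(pv, pa)[0, 1])
```

A test sample scored against a client's CoIA axes should correlate strongly only if the lip and speech streams move together as they did in enrollment.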
Audiovisual identity verification
Experiments:
– BANCA database:
• 52 persons divided into two groups (G1 and G2)
• 3 recording conditions
• 8 recordings per person (4 client accesses, 4 impostor accesses)
• Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
– Scores:
• 4 scores per access (PCA face, SIFT face, speech, synchrony)
• Score fusion based on an RBF-SVM (hyperplane learned on G1 and tested on G2, and conversely)
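The fusion step feeds the 4-dimensional score vector (PCA face, SIFT face, speech, synchrony) to an SVM with an RBF kernel. A minimal sketch of the decision function only, where the support vectors, dual coefficients, bias, and `gamma` are all assumed to come from training on one BANCA group:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two 4-dimensional score vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vecs, dual_coefs, bias, gamma=0.5):
    """RBF-SVM decision value: sum_i alpha_i * y_i * K(s_i, x) + b.
    Accept the access claim when the value is positive."""
    k = np.array([rbf_kernel(s, x, gamma) for s in support_vecs])
    return float(k @ dual_coefs + bias)
```

With the hyperplane learned on G1, each G2 access yields one fused decision value (and conversely), matching the cross-group protocol above.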
Audiovisual identity verification
Experiments: