TRANSCRIPT
Audio-Visual Speech Processing
Gérard Chollet, Hervé Bredin, Thomas Hueber, Rémi Landais, Patrick Perrot, Leila Zouari
NOLISP, Paris, May 23rd 2007
Page 8 NOLISP 2007, Paris, 23 May 2007
Audiovisual identity verification
Compulsory for:
– Homeland and corporate security: restricted access areas, …
– Secure computer login
– Secure on-line signing of contracts (e-Commerce)
Audiovisual identity verification
Available features
– Face / facial features (lips, eyes): Face modality
– Speech: Speech modality
– Speech synchrony: Synchrony modality
Audiovisual identity verification
Face modality
– Detection:
• Generative models (MPT toolbox)
• Temporal median filtering
• Eye detection within faces
– Normalization: geometry + illumination
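The two normalization steps on the slide can be sketched in numpy; this is a minimal illustration, not the presenters' implementation, and the canonical inter-ocular distance of 40 pixels is an assumed parameter:

```python
import numpy as np

def normalize_illumination(face):
    """Illumination normalization by histogram equalization of a
    grayscale face image with integer values in 0-255."""
    hist = np.bincount(face.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])   # map the CDF to [0, 1]
    return (cdf[face] * 255).astype(np.uint8)

def eye_alignment_params(left_eye, right_eye, target_dist=40.0):
    """Geometric normalization parameters: rotation angle and scale
    that bring the eye axis horizontal at a canonical (assumed)
    inter-ocular distance."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.arctan2(dy, dx)              # rotate eyes onto the x-axis
    scale = target_dist / np.hypot(dx, dy)  # scale to canonical distance
    return angle, scale
```

The angle and scale would then feed a similarity transform that warps the face crop before feature extraction.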
Audiovisual identity verification
Face modality
– Selection:
• Keep only the most reliable detection results
• Based on the distance Rel between a detected zone and its projection onto the eigenfaces space
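The reliability distance Rel described above can be read as a "distance from face space": project the detected zone onto the eigenfaces, reconstruct it, and measure the residual. A minimal numpy sketch, assuming the eigenfaces are stored as orthonormal rows:

```python
import numpy as np

def reliability(face_vec, mean_face, eigenfaces):
    """Distance between a detected zone and its projection onto the
    eigenfaces space; a small value means a face-like, reliable detection.
    `eigenfaces` rows are assumed orthonormal."""
    centered = face_vec - mean_face
    coeffs = eigenfaces @ centered            # project onto face space
    reconstruction = eigenfaces.T @ coeffs    # back-project
    return float(np.linalg.norm(centered - reconstruction))
```

Detections whose reliability exceeds a threshold would be discarded before verification.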
Audiovisual identity verification
Face modality:
– Two verification strategies within a single comparison framework
• Global = Eigenfaces:
– Computation of a set of directions (eigenfaces) defining a projection space
– Two faces are compared with respect to their projections onto the eigenfaces space
– Training data: BIOMET (130 persons) + BANCA (30 persons)
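The eigenfaces strategy above can be sketched with an SVD of the centered training faces; this is a generic illustration, not the exact system trained on BIOMET and BANCA:

```python
import numpy as np

def learn_eigenfaces(train, k):
    """Learn k directions (eigenfaces) defining a projection space.
    `train` is (n_faces, n_pixels); returns the mean face and k
    orthonormal eigenface rows."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def face_distance(f1, f2, mean, eigenfaces):
    """Compare two faces via their projections on the eigenfaces space."""
    p1 = eigenfaces @ (f1 - mean)
    p2 = eigenfaces @ (f2 - mean)
    return float(np.linalg.norm(p1 - p2))
```

A small distance between projections supports a client decision; the threshold would be tuned on the training data.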
Audiovisual identity verification
Face modality:
• SIFT descriptors:
– Keypoint extraction
– Keypoint representation: a 128-dimensional vector (gradient orientation histograms, …) + a 4-dimensional position vector
SIFT descriptor (dim 128)
Position (x, y) + scale + orientation (dim 4)
Audiovisual identity verification
Face modality:
• SVD-based matching method:
– Compares two videos V1 and V2
– Exclusivity principle: one-to-one correspondences between
» faces (global)
» descriptors (local)
– Principle:
» Computation of a proximity matrix between faces or descriptors
» Extraction of good pairings (made easy by SVD computation)
– Scores:
» One matching score between global representations
» One matching score between local representations
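The proximity-matrix-plus-SVD pairing step can be sketched in the spirit of the Scott and Longuet-Higgins correspondence method, which the slide appears to follow: build a Gaussian proximity matrix, replace its singular values by ones, and keep entries that dominate both their row and their column (enforcing the exclusivity principle). The bandwidth `sigma` is an assumed parameter:

```python
import numpy as np

def svd_pairings(features_a, features_b, sigma=1.0):
    """One-to-one pairings between two feature sets via SVD of a
    Gaussian proximity matrix; returns the pairings and a matching
    score (mean proximity over retained pairs)."""
    d2 = ((features_a[:, None, :] - features_b[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * sigma ** 2))        # proximity matrix
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    p = u @ vt                                 # singular values set to 1
    pairs = [(i, j)
             for i in range(p.shape[0]) for j in range(p.shape[1])
             if p[i, j] == p[i].max() and p[i, j] == p[:, j].max()]
    score = float(np.mean([g[i, j] for i, j in pairs])) if pairs else 0.0
    return pairs, score
```

Run once on face representations for the global score and once on SIFT descriptors for the local score.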
Audiovisual identity verification
Speech modality:
– GMM-based approach:
• One world model
• Each speaker model is derived from the world model by MAP adaptation
• Speech verification score: derived from the likelihood ratio
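The three bullets above are the classic GMM-UBM recipe; a minimal numpy sketch with diagonal covariances, MAP adaptation of the means only, and an assumed relevance factor of 16:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Average log-likelihood of frames x under a diagonal-covariance GMM."""
    d = x.shape[1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    sq = ((x[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = np.log(weights) + log_norm - 0.5 * sq   # per-frame, per-component
    m = log_comp.max(axis=1, keepdims=True)            # stable log-sum-exp
    return float((m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))).mean())

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Derive a speaker model from the world model by MAP adaptation
    of the component means (relevance factor is an assumed value)."""
    d = x.shape[1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    sq = ((x[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = np.log(weights) + log_norm - 0.5 * sq
    post = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)            # component posteriors
    n = post.sum(axis=0)                               # soft frame counts
    ex = (post.T @ x) / np.maximum(n, 1e-10)[:, None]  # posterior-weighted means
    alpha = (n / (n + relevance))[:, None]             # adaptation factor
    return alpha * ex + (1 - alpha) * means

def verification_score(x, world, speaker_means):
    """Speech verification score: log-likelihood ratio between the
    speaker-adapted model and the world model."""
    w, m, v = world
    return gmm_loglik(x, w, speaker_means, v) - gmm_loglik(x, w, m, v)
```

A positive score supports the claimed identity; the decision threshold would be set on development data.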
Audiovisual identity verification
Synchrony modality:
– Principle: synchrony between lips and speech carries identity information
– Process:
• Computation of a synchrony model (co-inertia analysis, CoIA) for each person, based on DCT features (visual signal) and MFCC features (speech signal)
• Comparison of the test sample with the synchrony model
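Co-inertia analysis finds pairs of projection directions that maximize the covariance between the two projected feature streams, which reduces to an SVD of their cross-covariance matrix. A minimal sketch, with a synchrony score taken (as an assumption) to be the correlation on the first CoIA axis:

```python
import numpy as np

def coia(video_feats, audio_feats, k=1):
    """Co-inertia analysis: k pairs of directions maximizing the
    covariance between projected video (e.g. DCT) and audio (e.g. MFCC)
    streams, via SVD of the cross-covariance matrix."""
    v = video_feats - video_feats.mean(axis=0)
    a = audio_feats - audio_feats.mean(axis=0)
    cross = v.T @ a / len(v)                  # cross-covariance matrix
    u, _, wt = np.linalg.svd(cross)
    return u[:, :k], wt[:k].T                 # video / audio projection bases

def synchrony_score(video_feats, audio_feats, v_axes, a_axes):
    """Correlation between the two projected streams on the first axis;
    high correlation means audio and video are synchronous for this model."""
    pv = (video_feats - video_feats.mean(axis=0)) @ v_axes[:, 0]
    pa = (audio_feats - audio_feats.mean(axis=0)) @ a_axes[:, 0]
    return float(np.corrcoef(pv, pa)[0, 1])
```

A test sample scored against a client's CoIA axes should correlate strongly only if the lip and speech streams move together as they did in enrollment.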
Audiovisual identity verification
Experiments:
– BANCA database:
• 52 persons divided into two groups (G1 and G2)
• 3 recording conditions
• 8 recordings per person (4 client accesses, 4 impostor accesses)
• Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
– Scores:
• 4 scores per access (PCA face, SIFT face, speech, synchrony)
• Score fusion based on an RBF-SVM (hyperplane learned on G1 and tested on G2, and conversely)
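The fusion step feeds the 4-dimensional score vector (PCA face, SIFT face, speech, synchrony) to an SVM with an RBF kernel. A minimal sketch of the decision function only, where the support vectors, dual coefficients, bias, and `gamma` are all assumed to come from training on one BANCA group:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two 4-dimensional score vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vecs, dual_coefs, bias, gamma=0.5):
    """RBF-SVM decision value: sum_i alpha_i * y_i * K(s_i, x) + b.
    Accept the access claim when the value is positive."""
    k = np.array([rbf_kernel(s, x, gamma) for s in support_vecs])
    return float(k @ dual_coefs + bias)
```

With the hyperplane learned on G1, each G2 access yields one fused decision value (and conversely), matching the cross-group protocol above.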
Audiovisual identity verification
Experiments: