
Research activities at AUTH related to dialogue detection

Ioannis Pitas, Constantine Kotropoulos, Nikos Nikolaidis

WP6 e-team: Audiovisual Understanding

AIIA Lab, Department of Informatics, Aristotle University of Thessaloniki

Outline

• Introduction
• Dialogue detection concept: cross-correlation of indicator functions
• Speaker turn detection based on speech and visual cues (mouth activity)
• Frontal face detection; facial feature detection (e.g. mouth)
• One-two speaker detection
• Speaker clustering based on speech and visual cues
• Fingerprinting


Indicator functions and their cross-correlation (1)

A dialogue between two persons from the movie “Secret Window” [Dialogue 1].

[Figure: indicator functions $I_A(n)$ and $I_B(n)$ plotted against $n$, and their cross-correlation $c_{AB}(d)$ plotted against $d$.]


Indicator functions and their cross-correlation (2)

A scene without a dialogue between two persons.

[Figure: indicator functions $I_A(n)$ and $I_B(n)$ plotted against $n$, and their cross-correlation $c_{AB}(d)$ plotted against $d$.]
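
The cross-correlation of two indicator functions can be sketched in Python as follows. This is a minimal illustration, assuming that $I_A(n)$ and $I_B(n)$ are binary sequences (1 when the person is active at index $n$, 0 otherwise) and that $c_{AB}(d)$ is a mean-removed, normalised cross-correlation. The normalisation, the function name `cross_correlation`, and the toy sequences are assumptions made for illustration; the slides do not specify them.

```python
import math


def cross_correlation(ia, ib, max_lag):
    """Mean-removed, normalised cross-correlation c_AB(d) for |d| <= max_lag.

    ia, ib: equal-length sequences of 0/1 indicator values (assumed format).
    Returns a dict mapping each lag d to the correlation value.
    """
    n = len(ia)
    if n != len(ib):
        raise ValueError("indicator functions must have equal length")
    mean_a = sum(ia) / n
    mean_b = sum(ib) / n
    a = [v - mean_a for v in ia]
    b = [v - mean_b for v in ib]
    # Normalising constant so that values lie in [-1, 1].
    norm = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    result = {}
    for d in range(-max_lag, max_lag + 1):
        total = 0.0
        for k in range(n):
            if 0 <= k + d < n:  # only overlapping samples contribute
                total += a[k] * b[k + d]
        result[d] = total / norm if norm > 0 else 0.0
    return result


# Toy dialogue (hypothetical data): persons A and B alternate every 5 samples.
turn = 5
ia = ([1] * turn + [0] * turn) * 4
ib = ([0] * turn + [1] * turn) * 4
c = cross_correlation(ia, ib, max_lag=10)
```

For these alternating toy sequences the correlation is strongly negative at lag 0 and peaks at lags of ±5, the turn length. A scene in which the two indicator functions do not alternate would not be expected to show such a pattern.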


Speaker Turn Detection

Audio segmentation aims at finding acoustic events within an audio stream. Speaker turn detection is a special case of speaker segmentation.

It is an important step in the pre-processing of speech in order to implement audio indexing or speaker tracking.

Usually, no prior knowledge about the speakers is assumed.

[Figure: audio stream divided into Speaker 1 and Speaker 2 segments.]


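MODEL BASED SEGMENTATION: DISTBIC

Contrast the hypothesis of no speaker turn, $Z \sim N(\mu_Z, \Sigma_Z)$, against the hypothesis of a speaker turn, $X \sim N(\mu_X, \Sigma_X)$ and $Y \sim N(\mu_Y, \Sigma_Y)$. Here $X$ contains $N_X$ vectors, $Y$ contains $N_Y$ vectors, and $Z$ contains $N_Z = N_X + N_Y$ vectors.

$$\Delta ML(i) = \frac{N_Z}{2}\log|\Sigma_Z| - \frac{N_X}{2}\log|\Sigma_X| - \frac{N_Y}{2}\log|\Sigma_Y|$$

BIC CRITERION

$$\Delta BIC(i) = -\Delta ML(i) + \lambda P < 0 \;\Rightarrow\; \text{speaker turn}$$

where $P$ is the penalty term and $\lambda$ its weight.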
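
The BIC test can be sketched in Python for the simplest case of scalar features. This is a minimal illustration, not the authors' implementation. It assumes one-dimensional features, so each covariance determinant reduces to a variance, and it assumes the penalty used in the DISTBIC literature, $P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N_Z$ with $d = 1$, which the slides do not state. The names `ml_variance` and `delta_bic` and the toy data are hypothetical.

```python
import math


def ml_variance(samples):
    """Maximum-likelihood (biased) variance estimate of a scalar sequence."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / n


def delta_bic(x, y, lam=1.0):
    """Delta BIC for a candidate speaker turn between windows x and y.

    Assumes scalar (1-D) features; a negative value suggests a speaker turn.
    """
    z = list(x) + list(y)
    n_x, n_y, n_z = len(x), len(y), len(z)
    var_x, var_y, var_z = ml_variance(x), ml_variance(y), ml_variance(z)
    if min(var_x, var_y, var_z) <= 0.0:
        raise ValueError("each window needs non-zero variance")
    # Delta ML(i): one Gaussian for Z versus separate Gaussians for X and Y.
    delta_ml = 0.5 * (n_z * math.log(var_z)
                      - n_x * math.log(var_x)
                      - n_y * math.log(var_y))
    dim = 1  # feature dimension (assumption of this sketch)
    penalty = 0.5 * (dim + dim * (dim + 1) / 2) * math.log(n_z)
    return -delta_ml + lam * penalty


# Toy data (hypothetical): two windows with clearly different means.
pattern = [0.0, 0.1, -0.1, 0.05, -0.05]
window_a = pattern * 4
window_b = [v + 5.0 for v in pattern] * 4

turn_score = delta_bic(window_a, window_b)     # different "speakers"
no_turn_score = delta_bic(window_a, window_a)  # same "speaker" twice
```

With these toy windows `turn_score` is negative, which signals a turn, and `no_turn_score` is positive. Real systems use multi-dimensional acoustic feature vectors, so the variances become full covariance determinants.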


Frontal face images at quartet and octet resolution

[Figure: original image, quartet image, octet image.]


Face detection based on corners

The figures show the 3 possible feature point set configurations, each having 100 feature points. They differ in the minimum distance allowed between feature points. In general, small inter-feature-point distances cause the feature points to concentrate in a few areas and lead to poor face detection. The minimum allowed distance is a parameter of the training procedure.
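
The effect of the minimum-distance parameter can be illustrated with a short Python sketch. The slides do not describe how the feature points are selected, so this is a generic greedy scheme and not the authors' procedure: candidates are visited in order of a hypothetical corner strength, and a candidate is kept only if it lies at least `min_distance` away from every point already chosen. The function name `select_feature_points` and the toy candidates are assumptions.

```python
def select_feature_points(candidates, num_points, min_distance):
    """Greedily pick up to num_points points that are mutually at least
    min_distance apart, visiting candidates from strongest to weakest.

    candidates: list of (x, y, strength) tuples (hypothetical format).
    """
    chosen = []
    min_sq = min_distance ** 2
    for x, y, _ in sorted(candidates, key=lambda c: c[2], reverse=True):
        far_enough = all((x - cx) ** 2 + (y - cy) ** 2 >= min_sq
                         for cx, cy in chosen)
        if far_enough:
            chosen.append((x, y))
            if len(chosen) == num_points:
                break
    return chosen


# Toy candidates (hypothetical): two tight clusters plus one isolated point.
candidates = [(0, 0, 0.9), (1, 0, 0.8), (10, 0, 0.7), (10, 1, 0.6), (20, 0, 0.5)]

spread = select_feature_points(candidates, num_points=3, min_distance=5)
clustered = select_feature_points(candidates, num_points=3, min_distance=0)
```

With a minimum distance of 5 the three selected points are spread across the image, while a minimum distance of 0 lets two of the three fall in the same cluster, which is the concentration effect described above.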


Face detection Receiver Operating Characteristic (ROC) curves

• For the SVM-based face detection, the best results were obtained with the sigmoidal kernel. Best equal error rate (EER): 4.5%.

• The maximum likelihood detection commits few false alarms. For a false alarm rate (FAR) in [5.2%, 5.67%], the false rejection rate (FRR) drops quickly from 6.1% to 0.7%.
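
The error rates quoted above can be made concrete with a small Python sketch. It assumes that the detector outputs a score per test window, that higher scores mean "face", and that a window is accepted when its score reaches the threshold. The function names and the toy scores are hypothetical, and the EER is approximated by scanning the observed scores as thresholds rather than interpolating the ROC curve.

```python
def far_frr(genuine, impostor, threshold):
    """False alarm rate and false rejection rate at a given threshold.

    genuine: scores of true face windows; impostor: scores of non-face windows.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical data, not results from the slides).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, eer_threshold = equal_error_rate(genuine, impostor)
```

Sweeping the threshold over its whole range and plotting FRR against FAR gives the ROC curve; the EER is the point on that curve where the two rates coincide.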


One/Two Speaker Detection

Two-speaker detection (NIST 2002): Best EER 16.2 %

Kajarekar, Adami, Hermansky, 2003

One-speaker detection (NIST 2002): Best EER 7.1 %


Frontal face authentication


Fingerprinting
