1
Multimodal Group Action Clustering in
Meetings
Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud
IDIAP Research Institute, Switzerland
2
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
3
Meetings: Sequences of Actions

Meetings are commonly understood as sequences of events or actions:
- meeting agenda: a prior sequence of discussion points, presentations, decisions to be made, etc.
- meeting minutes: a posterior sequence of key phases of the meeting, summarised discussions, decisions made, etc.
We aim to investigate the automatic structuring of meetings as sequences of meeting actions.
The actions are multimodal in nature: speech, gestures, expressions, gaze, written text, use of devices, laughter, etc.
In general, these action sequences are due to the group as a whole, rather than a particular individual.
4
Structuring of Meetings
A meeting is modelled as a continuous sequence of group actions taken from a mutually exclusive and exhaustive set:
V = { V1, V2, V3, …, VN }
[Figure: a meeting timeline segmented into a sequence of group actions, e.g. V1 V4 V5 V1 V6 V2 V3 V3.]
Example group-action lexica:
- Group actions based on interest level: Engaged, Neutral, Disengaged.
- Group actions based on tasks: Brainstorming, Decision Making, Information Sharing.
- Group actions based on turn-taking: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
5
Previous Work
Recognition of meeting actions:
- Supervised, single-layer approaches.
- Investigated different multi-stream HMM variants, with streams modelling modalities or individuals.

Layered HMM approach:
- A first-layer HMM models Individual Actions (I-HMM); a second-layer HMM models Group Actions (G-HMM).
- Showed improvement over the single-layer approach.

Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
6
Why Clustering?

Unsupervised action clustering instead of supervised action recognition.

High-level semantic group actions are difficult to:
- define: what action lexica are appropriate?
- annotate: in general, temporal boundaries are not precise.

Clustering allows us to find the natural structure of a meeting, and may help us better understand the data.
7
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
8
Single-layer HMM Framework
Single-layer HMM: the audio-visual features from each participant and the group-level features are concatenated into one large vector that defines the observation space.
Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
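As a concrete illustration of this baseline, here is a minimal sketch of single-layer training, assuming hmmlearn's GaussianHMM; the feature dimensions and random data are illustrative stand-ins for the features listed on the feature-extraction slide:

```python
# Minimal single-layer sketch: concatenate per-person and group-level
# features into one large observation vector and fit a single ergodic HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

T = 3000                                                   # frames in one meeting
person_feats = [np.random.randn(T, 9) for _ in range(4)]   # 4 participants (illustrative dims)
group_feats = np.random.randn(T, 4)                        # device-region features (illustrative)

X = np.hstack(person_feats + [group_feats])                # shape (T, 40)

model = GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(X)                  # unsupervised EM training
states = model.predict(X)     # one hidden state per frame: a segmentation
```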
9
Two-layer HMM Framework
Two-layer HMM: by defining a proper set of individual actions, we decompose the group action recognition problem into two layers, from individual to group. Both layers use ergodic HMMs or extensions.
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
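A minimal sketch of the decomposition, assuming one small supervised GaussianHMM per individual action (the I-HMM layer) whose windowed log-likelihoods become the observations of the group layer; the window size and model sizes are illustrative assumptions, not the paper's settings:

```python
# Two-layer sketch: supervised I-HMMs score each person's features over
# sliding windows; the per-action scores (not the raw features) are what
# the G-HMM layer sees, giving a much smaller observation space.
import numpy as np
from hmmlearn.hmm import GaussianHMM

I_ACTIONS = ["speaking", "writing", "idle"]        # individual-action lexicon

def train_i_hmms(train_segments):
    """train_segments: dict mapping action -> list of (T_i, D) feature arrays."""
    models = {}
    for action, segs in train_segments.items():
        X, lengths = np.vstack(segs), [len(s) for s in segs]
        m = GaussianHMM(n_components=2, covariance_type="diag", n_iter=10)
        m.fit(X, lengths)                          # supervised: one HMM per action
        models[action] = m
    return models

def i_layer_outputs(models, person_feats, win=20):
    """One log-likelihood per action model per window: the G-HMM observations."""
    out = []
    for t in range(0, len(person_feats) - win + 1, win):
        chunk = person_feats[t:t + win]
        out.append([models[a].score(chunk) for a in I_ACTIONS])
    return np.array(out)                           # (n_windows, n_actions)
```

Stacking these outputs across the four participants, together with the group-device features, would yield the observation sequence for the unsupervised G-HMM.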
10
Advantages

1. Compared with the single-layer HMM, smaller observation spaces.
2. The individual-layer HMM (I-HMM) is person-independent, so a well-estimated model can be trained with much more data.
3. The group-layer HMM (G-HMM) is less sensitive to variations in the low-level audio-visual features.
4. Different combination systems can easily be explored.
[Figure: audio-visual features feed the supervised I-HMM layer, whose outputs feed the unsupervised G-HMM layer.]
11
Models for I-HMM

Early Integration (Early Int.): a standard HMM is trained on the combined audio-visual features.

Multi-stream HMM (MS-HMM): combines audio-only and visual-only streams. Each stream is trained independently; the final classification is based on the fusion of the outputs of both modalities by estimating their joint occurrence.

Asynchronous HMM (A-HMM): allows asynchrony between the audio and visual streams.

Please refer to:
S. Dupont et al., "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, pp. 141-151, Sep. 2000.
S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition", in NIPS 2003.
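The difference between the integration schemes can be sketched as follows; the linear log-likelihood weighting shown for MS-HMM is a common decision-level fusion rule and is only a stand-in for the exact estimator used in the paper:

```python
# Scoring one audio-visual segment under the two integration schemes.
# The models are assumed to be trained hmmlearn-style HMMs with .score().
import numpy as np

def early_integration_score(av_model, audio, video):
    # Early Int.: a single HMM trained on the concatenated AV features.
    return av_model.score(np.hstack([audio, video]))

def multi_stream_score(audio_model, video_model, audio, video, w=0.5):
    # MS-HMM (simplified): streams trained independently, fused at the
    # decision level with a weight w on the audio stream.
    return w * audio_model.score(audio) + (1.0 - w) * video_model.score(video)
```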
12
Models for G-HMM: Clustering

Assume an unknown segmentation and an unknown number of clusters.

[Figure: candidate segmentations with different numbers of clusters (1 to 6), scored by likelihood.]

Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", in IEEE ASRU Workshop 2003.
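In the spirit of Ajmera et al.'s algorithm, a hedged sketch of the merging step: two clusters are merged whenever a single model of the same total complexity explains their combined data at least as well, so no likelihood threshold is needed. The use of sklearn GMMs and the component counts are illustrative assumptions:

```python
# Threshold-free agglomerative clustering sketch (after Ajmera et al.).
import numpy as np
from sklearn.mixture import GaussianMixture

def total_log_lik(X, n_components):
    gmm = GaussianMixture(n_components=n_components).fit(X)
    return gmm.score(X) * len(X)        # .score() is per-sample; rescale to total

def merge_clusters(clusters, g=5):
    """clusters: list of (T_i, D) feature arrays, one per initial cluster."""
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                Xi, Xj = clusters[i], clusters[j]
                X = np.vstack([Xi, Xj])
                # The merged model keeps the same total complexity (2*g
                # components), so comparing likelihoods directly is fair.
                if total_log_lik(X, 2 * g) >= total_log_lik(Xi, g) + total_log_lik(Xj, g):
                    clusters = clusters[:i] + clusters[i + 1:j] + clusters[j + 1:] + [X]
                    merged = True
                    break
            if merged:
                break
    return clusters                      # the number of clusters is inferred
```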
13
Linking Two Layers (1)
14
Linking Two Layers (2)
Normalization
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
15
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
16
Data Collection

Scripted meeting corpus:
- 30 meetings for training, 29 for testing.
- Each meeting is 5 minutes long, with 4 participants.
- 3 cameras, 12 microphones.
- Each meeting was 'scripted' as a sequence of actions.
- Available at http://mmm.idiap.ch/

[Figure: the IDIAP meeting room.]

Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
17
Audio-Visual Feature Extraction

Person-specific audio-visual features:
- Audio: seat-region audio activity, speech pitch, speech energy, speech rate.
- Visual: head vertical centroid, head eccentricity, right hand centroid, right hand angle, right hand eccentricity, head and hand motion.

Group-level audio-visual features:
- Audio: audio activity from the whiteboard region, audio activity from the screen region.
- Visual: mean difference from the whiteboard, mean difference from the projector screen.

[Figure: views from camera 1, camera 2, and camera 3.]
18
Action Lexicon

Group Actions = Individual Actions + Group Devices
(Group actions can be treated as a combination of individual actions plus the states of group devices; a sketch of such a mapping is given below.)

Individual actions: Idle, Writing, Speaking.
Group devices: Projector screen, Whiteboard.
Group actions: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
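To make the "group actions = individual actions + group devices" idea concrete, a purely hypothetical rule-based mapping is sketched below; in the actual system this relation is learned by the G-HMM rather than hand-written:

```python
# Hypothetical illustration only: how individual actions and device states
# could combine into a group action from the lexicon above.
def group_action(actions, projector_on, whiteboard_on):
    """actions: one of "speaking", "writing", "idle" per participant."""
    speakers = actions.count("speaking")
    writers = actions.count("writing")
    if projector_on and speakers == 1:
        return "presentation + note-taking" if writers else "presentation"
    if whiteboard_on and speakers == 1:
        return "whiteboard + note-taking" if writers else "whiteboard"
    if speakers >= 2:
        return "discussion"
    if speakers == 1:
        return "monologue + note-taking" if writers else "monologue"
    return "note-taking"

print(group_action(["speaking", "writing", "idle", "idle"], True, False))
# -> presentation + note-taking
```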
19
Example

[Figure: timeline of one scripted meeting. Each person's individual actions are marked S (Speaking), W (Writing), or blank (Idle), together with projector and whiteboard use; the resulting group actions are Monologue1 + Note-taking, Discussion, Presentation + Note-taking, and Whiteboard + Note-taking.]
20
Performance Measures
We use the "purity" concept to evaluate results:
- Average action purity (aap): how well is one action limited to only one cluster?
- Average cluster purity (acp): how well is one cluster limited to only one action?
- aap and acp are combined into a single measure K.

Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", in IEEE ASRU Workshop 2003.
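A sketch of the evaluation, assuming a frame-level contingency table n[i][j] counting how many frames of action i fall in cluster j; following Ajmera et al.'s speaker-clustering formulation, K is taken here to be the geometric mean of acp and aap (an assumption, as the slide does not give the formula):

```python
import numpy as np

def purity_scores(n):
    """n: (n_actions, n_clusters) frame counts."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    # Purity of each cluster: sum_i n_ij^2 / n_.j (high if one action dominates).
    acp = ((n**2).sum(axis=0) / n.sum(axis=0)).sum() / N
    # Purity of each action: sum_j n_ij^2 / n_i. (high if one cluster dominates).
    aap = ((n**2).sum(axis=1) / n.sum(axis=1)).sum() / N
    K = np.sqrt(acp * aap)               # assumed combination rule
    return acp, aap, K

# Perfect clustering: each action maps to exactly one cluster.
print(purity_scores([[10, 0], [0, 5]]))  # -> (1.0, 1.0, 1.0)
```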
21
Results
Clustering individual meetings (true number of clusters: 3.93):

Method                         N       K (%)
Single-layer HMM   Visual      8.72    50.6
                   Audio       3.03    58.6
                   AV          4.10    65.7
Two-layer HMM      Visual      6.20    56.8
                   Audio       3.10    63.7
                   Early Int.  3.59    70.1
                   MS-HMM      4.17    71.8
                   A-HMM       3.51    73.8

Clustering meeting collections (true number of clusters: 8):

Method                         N       K (%)
Single-layer HMM   Visual     16.33    30.6
                   Audio       3.16    50.6
                   AV          6.73    62.1
Two-layer HMM      Visual     11.67    38.2
                   Audio       3.50    56.7
                   Early Int. 10.60    68.3
                   MS-HMM      7.28    69.8
                   A-HMM       7.10    72.2

(N = number of clusters found; K = combined purity measure.)
22
Results
[Figure: example clustering results for an individual meeting and for the entire meeting collection.]
23
Conclusions
Structuring of meetings as a sequence of group actions.

We proposed a layered HMM framework for group action clustering: a supervised individual layer and an unsupervised group layer.

Experiments showed:
- the advantage of using both audio and visual modalities.
- better performance using the layered HMM.
- clustering gives a meaningful segmentation into group actions.
- clustering yields consistent labels when done across multiple meetings.
24
Future Work
Clustering:
- Investigating different sets of individual actions.
- Handling variable numbers of participants across or within meetings.

Related:
- Joint training of the layers in the supervised two-layer HMM.
- Defining new sets of group actions, e.g. based on interest level.

Data collection:
- In the scope of the AMI project (www.amiproject.org), we are currently collecting a 100-hour corpus of natural meetings to facilitate further research.
27
Linking Two Layers (1)
Hard decision: the individual action model with the highest probability outputs a value of 1, while all other models output 0.

Soft decision: the probabilities of the individual action models are output as input features to the G-HMM.

Example for one participant: soft decision (0.7, 0.1, 0.2) vs. hard decision (1, 0, 0).
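A minimal sketch of both linking strategies, reproducing the (0.7, 0.1, 0.2) example above:

```python
import numpy as np

def soft_decision(likelihoods):
    p = np.asarray(likelihoods, dtype=float)
    return p / p.sum()                       # normalized probabilities for the G-HMM

def hard_decision(likelihoods):
    out = np.zeros(len(likelihoods))
    out[int(np.argmax(likelihoods))] = 1.0   # winner-take-all
    return out

print(soft_decision([0.7, 0.1, 0.2]))        # [0.7 0.1 0.2]
print(hard_decision([0.7, 0.1, 0.2]))        # [1. 0. 0.]
```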
28
Results
Two clustering cases: clustering individual meetings, and clustering the entire meeting collection. The baseline system is the single-layer HMM.

Clustering individual meetings:
Method             K
Two-layer HMM      0.738
Single-layer HMM   0.657

Clustering the entire meeting collection:
Method             K
Two-layer HMM      0.722
Single-layer HMM   0.621