1
Multimodal Group Action Clustering in
Meetings
Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud
IDIAP Research Institute, Switzerland
2
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
3
Meetings: Sequences of Actions

Meetings are commonly understood as sequences of events or actions:
- meeting agenda: a prior sequence of discussion points, presentations, decisions to be made, etc.
- meeting minutes: a posterior sequence of key phases of the meeting, summarised discussions, decisions made, etc.
We aim to investigate the automatic structuring of meetings as sequences of meeting actions.
The actions are multimodal in nature: speech, gestures, expressions, gaze, written text, use of devices, laughter, etc.
In general, these action sequences are due to the group as a whole, rather than a particular individual.
4
Structuring of Meetings
A meeting is modelled as a continuous sequence of group actions taken from a mutually exclusive and exhaustive set:
V = { V1, V2, V3, …, VN }
[Figure: a meeting timeline segmented into a sequence of group actions, e.g. V1 V4 V5 V1 V6 V2 V3 V3.]
Example group-action lexica:
- Group actions based on interest level: Engaged, Neutral, Disengaged.
- Group actions based on tasks: Brainstorming, Decision Making, Information Sharing.
- Group actions based on turn-taking: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
5
Previous Work
Recognition of meeting actions:
- Supervised, single-layer approaches.
- Investigated different multi-stream HMM variants, with streams modelling modalities or individuals.

Layered HMM approach:
- A first-layer HMM models Individual Actions (I-HMM); a second-layer HMM models Group Actions (G-HMM).
- Showed improvement over the single-layer approach.

Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
6
Why Clustering?

Unsupervised action clustering instead of supervised action recognition.

High-level semantic group actions are difficult to:
- define: what action lexica are appropriate?
- annotate: in general, temporal boundaries are not precise.

Clustering allows us to find the natural structure of a meeting, and may help us better understand the data.
7
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
8
Single-layer HMM Framework
Single-layer HMM: the audio-visual features from each participant and the group-level features are concatenated into one large vector that defines the observation space.
Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
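As a concrete illustration of this baseline, here is a minimal sketch of single-layer training, assuming hmmlearn's GaussianHMM; the feature dimensions and random data are illustrative stand-ins for the features listed on the feature-extraction slide:

```python
# Minimal single-layer sketch: concatenate per-person and group-level
# features into one large observation vector and fit a single ergodic HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

T = 3000                                                   # frames in one meeting
person_feats = [np.random.randn(T, 9) for _ in range(4)]   # 4 participants (illustrative dims)
group_feats = np.random.randn(T, 4)                        # device-region features (illustrative)

X = np.hstack(person_feats + [group_feats])                # shape (T, 40)

model = GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(X)                  # unsupervised EM training
states = model.predict(X)     # one hidden state per frame: a segmentation
```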
9
Two-layer HMM Framework
Two-layer HMM: by defining a proper set of individual actions, we decompose the group action recognition problem into two layers, from individual to group. Both layers use ergodic HMMs or extensions.
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
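A minimal sketch of the decomposition, assuming one small supervised GaussianHMM per individual action (the I-HMM layer) whose windowed log-likelihoods become the observations of the group layer; the window size and model sizes are illustrative assumptions, not the paper's settings:

```python
# Two-layer sketch: supervised I-HMMs score each person's features over
# sliding windows; the per-action scores (not the raw features) are what
# the G-HMM layer sees, giving a much smaller observation space.
import numpy as np
from hmmlearn.hmm import GaussianHMM

I_ACTIONS = ["speaking", "writing", "idle"]        # individual-action lexicon

def train_i_hmms(train_segments):
    """train_segments: dict mapping action -> list of (T_i, D) feature arrays."""
    models = {}
    for action, segs in train_segments.items():
        X, lengths = np.vstack(segs), [len(s) for s in segs]
        m = GaussianHMM(n_components=2, covariance_type="diag", n_iter=10)
        m.fit(X, lengths)                          # supervised: one HMM per action
        models[action] = m
    return models

def i_layer_outputs(models, person_feats, win=20):
    """One log-likelihood per action model per window: the G-HMM observations."""
    out = []
    for t in range(0, len(person_feats) - win + 1, win):
        chunk = person_feats[t:t + win]
        out.append([models[a].score(chunk) for a in I_ACTIONS])
    return np.array(out)                           # (n_windows, n_actions)
```

Stacking these outputs across the four participants, together with the group-device features, would yield the observation sequence for the unsupervised G-HMM.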
10
Advantages

1. Compared with the single-layer HMM, smaller observation spaces.
2. The individual-layer HMM (I-HMM) is person-independent, so a well-estimated model can be trained with much more data.
3. The group-layer HMM (G-HMM) is less sensitive to variations in the low-level audio-visual features.
4. Different combination systems can easily be explored.
[Figure: audio-visual features feed the supervised I-HMM layer, whose outputs feed the unsupervised G-HMM layer.]
11
Models for I-HMM

Early Integration (Early Int.): a standard HMM is trained on the combined audio-visual features.

Multi-stream HMM (MS-HMM): combines audio-only and visual-only streams. Each stream is trained independently; the final classification is based on the fusion of the outputs of both modalities by estimating their joint occurrence.

Asynchronous HMM (A-HMM): allows asynchrony between the audio and visual streams.

Please refer to:
S. Dupont et al., "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, pp. 141-151, Sep. 2000.
S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition", in NIPS 2003.
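The difference between the integration schemes can be sketched as follows; the linear log-likelihood weighting shown for MS-HMM is a common decision-level fusion rule and is only a stand-in for the exact estimator used in the paper:

```python
# Scoring one audio-visual segment under the two integration schemes.
# The models are assumed to be trained hmmlearn-style HMMs with .score().
import numpy as np

def early_integration_score(av_model, audio, video):
    # Early Int.: a single HMM trained on the concatenated AV features.
    return av_model.score(np.hstack([audio, video]))

def multi_stream_score(audio_model, video_model, audio, video, w=0.5):
    # MS-HMM (simplified): streams trained independently, fused at the
    # decision level with a weight w on the audio stream.
    return w * audio_model.score(audio) + (1.0 - w) * video_model.score(video)
```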
12
Models for G-HMM: Clustering

Assume an unknown segmentation and an unknown number of clusters.

[Figure: candidate segmentations with different numbers of clusters (1 to 6), scored by likelihood.]

Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", in IEEE ASRU Workshop 2003.
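In the spirit of Ajmera et al.'s algorithm, a hedged sketch of the merging step: two clusters are merged whenever a single model of the same total complexity explains their combined data at least as well, so no likelihood threshold is needed. The use of sklearn GMMs and the component counts are illustrative assumptions:

```python
# Threshold-free agglomerative clustering sketch (after Ajmera et al.).
import numpy as np
from sklearn.mixture import GaussianMixture

def total_log_lik(X, n_components):
    gmm = GaussianMixture(n_components=n_components).fit(X)
    return gmm.score(X) * len(X)        # .score() is per-sample; rescale to total

def merge_clusters(clusters, g=5):
    """clusters: list of (T_i, D) feature arrays, one per initial cluster."""
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                Xi, Xj = clusters[i], clusters[j]
                X = np.vstack([Xi, Xj])
                # The merged model keeps the same total complexity (2*g
                # components), so comparing likelihoods directly is fair.
                if total_log_lik(X, 2 * g) >= total_log_lik(Xi, g) + total_log_lik(Xj, g):
                    clusters = clusters[:i] + clusters[i + 1:j] + clusters[j + 1:] + [X]
                    merged = True
                    break
            if merged:
                break
    return clusters                      # the number of clusters is inferred
```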
13
Linking Two Layers (1)
14
Linking Two Layers (2)
Normalization
Please refer to:
D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", in IEEE Workshop on Event Mining, CVPR 2004.
15
Outline
Meetings: Sequences of Actions
Why Clustering?
Layered HMM Framework
Experiments
Conclusion and Future Work
16
Data Collection

Scripted meeting corpus:
- 30 meetings for training, 29 for testing.
- Each meeting is 5 minutes long, with 4 participants.
- 3 cameras, 12 microphones.
- Each meeting was 'scripted' as a sequence of actions.
- Available at http://mmm.idiap.ch/

[Figure: the IDIAP meeting room.]

Please refer to:
I. McCowan et al., "Modeling human interactions in meetings", in ICASSP 2003.
17
Audio-Visual Feature Extraction

Person-specific audio-visual features:
- Audio: seat-region audio activity, speech pitch, speech energy, speech rate.
- Visual: head vertical centroid, head eccentricity, right hand centroid, right hand angle, right hand eccentricity, head and hand motion.

Group-level audio-visual features:
- Audio: audio activity from the whiteboard region, audio activity from the screen region.
- Visual: mean difference from the whiteboard, mean difference from the projector screen.

[Figure: views from camera 1, camera 2, and camera 3.]
18
Action Lexicon

Group Actions = Individual Actions + Group Devices
(Group actions can be treated as a combination of individual actions plus the states of group devices; a sketch of such a mapping is given below.)

Individual actions: Idle, Writing, Speaking.
Group devices: Projector screen, Whiteboard.
Group actions: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking.
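To make the "group actions = individual actions + group devices" idea concrete, a purely hypothetical rule-based mapping is sketched below; in the actual system this relation is learned by the G-HMM rather than hand-written:

```python
# Hypothetical illustration only: how individual actions and device states
# could combine into a group action from the lexicon above.
def group_action(actions, projector_on, whiteboard_on):
    """actions: one of "speaking", "writing", "idle" per participant."""
    speakers = actions.count("speaking")
    writers = actions.count("writing")
    if projector_on and speakers == 1:
        return "presentation + note-taking" if writers else "presentation"
    if whiteboard_on and speakers == 1:
        return "whiteboard + note-taking" if writers else "whiteboard"
    if speakers >= 2:
        return "discussion"
    if speakers == 1:
        return "monologue + note-taking" if writers else "monologue"
    return "note-taking"

print(group_action(["speaking", "writing", "idle", "idle"], True, False))
# -> presentation + note-taking
```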
19
Example

[Figure: timeline of one scripted meeting. Each person's individual actions are marked S (Speaking), W (Writing), or blank (Idle), together with projector and whiteboard use; the resulting group actions are Monologue1 + Note-taking, Discussion, Presentation + Note-taking, and Whiteboard + Note-taking.]
20
Performance Measures
We use the "purity" concept to evaluate results:
- Average action purity (aap): how well is one action limited to only one cluster?
- Average cluster purity (acp): how well is one cluster limited to only one action?
- aap and acp are combined into a single measure K.

Please refer to:
J. Ajmera et al., "A robust speaker clustering algorithm", in IEEE ASRU Workshop 2003.
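A sketch of the evaluation, assuming a frame-level contingency table n[i][j] counting how many frames of action i fall in cluster j; following Ajmera et al.'s speaker-clustering formulation, K is taken here to be the geometric mean of acp and aap (an assumption, as the slide does not give the formula):

```python
import numpy as np

def purity_scores(n):
    """n: (n_actions, n_clusters) frame counts."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    # Purity of each cluster: sum_i n_ij^2 / n_.j (high if one action dominates).
    acp = ((n**2).sum(axis=0) / n.sum(axis=0)).sum() / N
    # Purity of each action: sum_j n_ij^2 / n_i. (high if one cluster dominates).
    aap = ((n**2).sum(axis=1) / n.sum(axis=1)).sum() / N
    K = np.sqrt(acp * aap)               # assumed combination rule
    return acp, aap, K

# Perfect clustering: each action maps to exactly one cluster.
print(purity_scores([[10, 0], [0, 5]]))  # -> (1.0, 1.0, 1.0)
```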
21
Results
Clustering individual meetings (true number of clusters: 3.93):

Method                         N       K (%)
Single-layer HMM   Visual      8.72    50.6
                   Audio       3.03    58.6
                   AV          4.10    65.7
Two-layer HMM      Visual      6.20    56.8
                   Audio       3.10    63.7
                   Early Int.  3.59    70.1
                   MS-HMM      4.17    71.8
                   A-HMM       3.51    73.8

Clustering meeting collections (true number of clusters: 8):

Method                         N       K (%)
Single-layer HMM   Visual     16.33    30.6
                   Audio       3.16    50.6
                   AV          6.73    62.1
Two-layer HMM      Visual     11.67    38.2
                   Audio       3.50    56.7
                   Early Int. 10.60    68.3
                   MS-HMM      7.28    69.8
                   A-HMM       7.10    72.2

(N = number of clusters found; K = combined purity measure.)
22
Results
[Figure: example clustering results for an individual meeting and for the entire meeting collection.]
23
Conclusions
Structuring of meetings as a sequence of group actions.

We proposed a layered HMM framework for group action clustering: a supervised individual layer and an unsupervised group layer.

Experiments showed:
- the advantage of using both audio and visual modalities.
- better performance using the layered HMM.
- clustering gives a meaningful segmentation into group actions.
- clustering yields consistent labels when done across multiple meetings.
24
Future Work
Clustering:
- Investigating different sets of individual actions.
- Handling variable numbers of participants across or within meetings.

Related:
- Joint training of the layers in the supervised two-layer HMM.
- Defining new sets of group actions, e.g. based on interest level.

Data collection:
- In the scope of the AMI project (www.amiproject.org), we are currently collecting a 100-hour corpus of natural meetings to facilitate further research.
27
Linking Two Layers (1)
Hard decision: the individual action model with the highest probability outputs a value of 1, while all other models output 0.

Soft decision: the probabilities of the individual action models are output as input features to the G-HMM.

Example for one participant: soft decision (0.7, 0.1, 0.2) vs. hard decision (1, 0, 0).
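A minimal sketch of both linking strategies, reproducing the (0.7, 0.1, 0.2) example above:

```python
import numpy as np

def soft_decision(likelihoods):
    p = np.asarray(likelihoods, dtype=float)
    return p / p.sum()                       # normalized probabilities for the G-HMM

def hard_decision(likelihoods):
    out = np.zeros(len(likelihoods))
    out[int(np.argmax(likelihoods))] = 1.0   # winner-take-all
    return out

print(soft_decision([0.7, 0.1, 0.2]))        # [0.7 0.1 0.2]
print(hard_decision([0.7, 0.1, 0.2]))        # [1. 0. 0.]
```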
28
Results
Two clustering cases: clustering individual meetings, and clustering the entire meeting collection. The baseline system is the single-layer HMM.

Clustering individual meetings:
Method             K
Two-layer HMM      0.738
Single-layer HMM   0.657

Clustering the entire meeting collection:
Method             K
Two-layer HMM      0.722
Single-layer HMM   0.621