modeling individual and group actions in meetings with layered hmms
![Page 1: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/1.jpg)
modeling individual and group actions in meetings with layered HMMs
dong zhang, daniel gatica-perez
samy bengio, iain mccowan, guillaume lathoud
idiap research institute
martigny, switzerland
![Page 2: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/2.jpg)
meetings as sequences of actions
– human interaction
• similar/complementary roles
• individuals constrained by group
– agenda: prior sequence
• discussion points
• presentations
• decisions to be made
– minutes: posterior sequence
• key phases
• summarized discussions
• decisions made
![Page 3: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/3.jpg)
the goal: recognizing sequences of meeting actions
[Figure: timeline of parallel meeting views. Discussion Phase: Presentation, Group Discussion. Topic: Whether, Budget. Group Interest Level: High, Neutral. Group Task: Information Sharing, Decision Making.]
meeting views
group-level actions = meeting actions
![Page 4: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/4.jpg)
our work: two-layer HMMs
• decompose the recognition problem
• both layers use HMMs
– individual action layer: I-HMM: various models
– group action layer: G-HMM
![Page 5: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/5.jpg)
our work in detail
1. definition of meeting actions
2. audio-visual observations
3. action recognition
4. results
D. Zhang et al, “Modeling Individual and Group Actions in Meetings with Layered HMMs”, IEEE CVPR Workshop on Event Mining, 2004.
N. Oliver et al, ICMI 2002.
I. McCowan et al, ICASSP 2003, PAMI 2005.
![Page 6: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/6.jpg)
1. defining meeting actions
• multiple parallel views
– technology-based: what can we recognize?
– application-based: respond to user needs
– psychology-based: coding schemes from social psychology
• each view a set of actions
A = { A1, A2, A3, A4, …, AN }
• actions in a set
– consistent: one view, answering one question
– mutually exclusive
– exhaustive
![Page 7: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/7.jpg)
multi-modal turn-taking
• describes the group discussion state
A = { ‘discussion’, ‘monologue’ (x4), ‘white-board’, ‘presentation’, ‘note-taking’, ‘monologue + note-taking’ (x4), ‘white-board + note-taking’, ‘presentation + note-taking’ }
• individual actions
I = { ‘speaking’, ‘writing’, ‘idle’ }
• actions are multi-modal in nature
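The two action sets above can be written out explicitly. The sketch below assumes that "(x4)" means one variant per participant in a four-person meeting and that variants are named `monologue1` to `monologue4` (the example slide uses "Monologue1"). The function and constant names are illustrative only.

```python
N_PARTICIPANTS = 4  # assumption: "(x4)" means one variant per participant


def group_action_lexicon(n_participants=N_PARTICIPANTS):
    """Enumerate the group action set A from the slide."""
    monologues = [f"monologue{i}" for i in range(1, n_participants + 1)]
    combinable = [*monologues, "white-board", "presentation"]
    # Each combinable action also exists in a "+ note-taking" variant.
    with_notes = [f"{action} + note-taking" for action in combinable]
    return ["discussion", *combinable, "note-taking", *with_notes]


GROUP_ACTIONS = group_action_lexicon()
INDIVIDUAL_ACTIONS = ["speaking", "writing", "idle"]

if __name__ == "__main__":
    for index, name in enumerate(GROUP_ACTIONS):
        print(index, name)
```

Under these assumptions the group lexicon has 14 entries, while each participant is described by only 3 individual actions.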
![Page 8: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/8.jpg)
example
[Figure: example timeline. Rows: Person 1 to Person 4 (individual actions marked S and W), Presentation Used, Whiteboard Used, and Group Action (Monologue1 + Note-taking, Discussion, Presentation + Note-taking, Whiteboard + Note-taking).]
![Page 9: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/9.jpg)
2. audio-visual observations
audio
• 12 channels, 48 kHz
• 4 lapel microphones
• 1 microphone array
video
• 3 CCTV cameras
all synchronized
![Page 10: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/10.jpg)
multimodal feature extraction: audio
• microphone array
– speech activity (SRP-PHAT)
• seats
• presentation/whiteboard area
– speech/silence segmentation
• lapel microphones
– speech pitch
– speech energy
– speaking rate
![Page 11: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/11.jpg)
multimodal feature extraction: video
• head + hands blobs
– skin colour models (GMM)
– head position
– hands position + features (eccentricity, size, orientation)
– head + hands blob motion
• moving blobs from background subtraction
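As a rough illustration of the blob descriptors listed above, the sketch below computes size, position, orientation and eccentricity of one blob from a binary mask using second-order image moments. The slides do not give the exact definitions, so the formulas here (moment-based orientation, eccentricity from the eigenvalues of the pixel covariance) are assumptions, and `blob_features` is a hypothetical helper name.

```python
import numpy as np


def blob_features(mask):
    """Describe one blob given a 2-D boolean mask (True = blob pixel)."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    size = int(xs.size)
    if size == 0:
        raise ValueError("mask contains no blob pixels")
    cx, cy = xs.mean(), ys.mean()
    # Central second-order moments of the pixel coordinates.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    # Angle of the major axis, in radians, relative to the x axis.
    orientation = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    # Eigenvalues of the 2x2 covariance matrix (major and minor spread).
    spread = np.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_major = 0.5 * (mu20 + mu02 + spread)
    lam_minor = max(0.5 * (mu20 + mu02 - spread), 0.0)
    eccentricity = float(np.sqrt(1.0 - lam_minor / lam_major)) if lam_major > 0 else 0.0
    return {
        "size": size,
        "centroid": (float(cx), float(cy)),
        "orientation": float(orientation),
        "eccentricity": eccentricity,
    }
```

Blob motion could then be approximated as the difference between centroids in consecutive frames.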
![Page 12: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/12.jpg)
3. recognition with two-layer HMM
• each layer trained independently
• trained as in ASR (Torch)
• simultaneous segmentation and recognition
• compared with single-layer HMM
– smaller observation spaces
– I-HMM trained with much more data
– G-HMM less sensitive to feature variations
– combinations can be explored
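"Simultaneous segmentation and recognition" can be illustrated with Viterbi decoding: the best state path labels every frame, and runs of equal labels give the segment boundaries. The sketch below is deliberately simplified and assumes one HMM state per action and precomputed frame log-likelihoods. It is not the Torch-based setup mentioned on the slide, and all names are illustrative.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B):
    """Most likely state path.

    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) transition log-probabilities, indexed [from, to]
    log_B:  (T, S) per-frame log-likelihood of each state
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def segments(path):
    """Collapse a frame-level path into (label, start, end) runs, end exclusive."""
    runs, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            runs.append((int(path[start]), start, t))
            start = t
    return runs
```

In a layered setup, the same decoding idea would be applied once per participant at the individual layer and once at the group layer, each over its own smaller observation space.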
![Page 13: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/13.jpg)
models for I-HMM
• early integration
– all observations concatenated
– correlation between streams
– frame-synchronous streams
• asynchronous (Bengio, NIPS 2002)
– a and v streams with single state sequence
– states emit on one or both streams, given a sync variable
– inter-stream asynchrony
• multi-stream (Dupont, TMM 2000)
– HMM per stream (a or v), trained independently
– decoding: weighted likelihoods combined at each frame
– little inter-stream asynchrony
– multi-band and a-v ASR
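The difference between early integration and multi-stream decoding can be sketched in a few lines. Early integration concatenates the audio and visual feature vectors of each frame. Multi-stream decoding keeps separate models and combines their per-frame likelihoods with stream weights, which in the log domain is a weighted sum. The function names and the weights passed by the caller are illustrative assumptions. The asynchronous HMM is not sketched here.

```python
import numpy as np


def early_integration(audio_feats, video_feats):
    """Concatenate frame-synchronous streams: (T, Da) and (T, Dv) -> (T, Da + Dv)."""
    audio_feats = np.asarray(audio_feats, dtype=float)
    video_feats = np.asarray(video_feats, dtype=float)
    if audio_feats.shape[0] != video_feats.shape[0]:
        raise ValueError("streams must have the same number of frames")
    return np.concatenate([audio_feats, video_feats], axis=1)


def combine_streams(loglik_audio, loglik_video, w_audio, w_video):
    """Weighted per-frame combination of stream log-likelihoods, shape (T, S).

    Equivalent to multiplying likelihoods raised to the stream weights.
    """
    loglik_audio = np.asarray(loglik_audio, dtype=float)
    loglik_video = np.asarray(loglik_video, dtype=float)
    if loglik_audio.shape != loglik_video.shape:
        raise ValueError("stream log-likelihoods must have the same shape")
    return w_audio * loglik_audio + w_video * loglik_video
```

The combined scores can then be used as frame log-likelihoods in a standard decoder. How the weights are chosen is left open here.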
![Page 14: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/14.jpg)
linking the two layers
• hard decision
i-action model with highest probability outputs 1; all other models output 0.
• soft decision
outputs probability for each individual action model
Audio-visual features
HD: (1, 0, 0)
SD: (0.9, 0.05, 0.05)
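The two linking strategies can be sketched as follows, assuming the individual-action layer yields one log-likelihood per action model per frame. The softmax-style normalization used for the soft decision, and the concatenation of the participants' vectors into one group-layer observation, are assumptions made for illustration. The slides only show the resulting vectors.

```python
import numpy as np


def soft_decision(log_likelihoods):
    """Normalize per-model scores (T, n_actions) into probabilities per frame."""
    ll = np.atleast_2d(np.asarray(log_likelihoods, dtype=float))
    ll = ll - ll.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(ll)
    return probs / probs.sum(axis=1, keepdims=True)


def hard_decision(log_likelihoods):
    """Output 1 for the best-scoring model and 0 for all others, per frame."""
    ll = np.atleast_2d(np.asarray(log_likelihoods, dtype=float))
    out = np.zeros_like(ll)
    out[np.arange(ll.shape[0]), ll.argmax(axis=1)] = 1.0
    return out


def group_observations(per_person_outputs):
    """Stack each participant's (T, n_actions) vectors side by side (assumed layout)."""
    return np.concatenate(list(per_person_outputs), axis=1)


if __name__ == "__main__":
    scores = np.log(np.array([[0.9, 0.05, 0.05]]))
    print("HD:", hard_decision(scores))
    print("SD:", soft_decision(scores))
```

With four participants and three individual actions, the assumed layout gives a 12-dimensional vector per frame from the individual layer.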
![Page 15: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/15.jpg)
4. experiments: data + setup
• 59 meetings (30/29 train/test)
• four participants, five minutes per meeting
• scripts
– schedule of actions
– natural behavior
• features: 5 frames/s
mmm.idiap.ch
![Page 16: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/16.jpg)
performance measures
• individual actions: frame error rate (FER)
• group actions: action error rate (AER)
– Subs: number of substituted actions
– Del: number of deleted actions
– Ins: number of added actions
– Total actions: number of target actions
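The AER formula itself did not survive on this slide. Given the four quantities listed, the usual definition, analogous to word error rate in ASR, would be AER = (Subs + Del + Ins) / Total actions × 100%. That reading is assumed in the sketch below. The counts come from a minimum-edit-distance alignment of the recognized action sequence against the target sequence, and FER is taken as the percentage of mislabelled frames. The function names are illustrative.

```python
def alignment_counts(reference, hypothesis):
    """Return (subs, dels, ins) of a minimum-edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # Each cell holds (total_errors, subs, dels, ins).
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, k = dp[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (c, s, d, k)
            else:
                diagonal = (c + 1, s + 1, d, k)
            c, s, d, k = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, k)
            c, s, d, k = dp[i][j - 1]
            insertion = (c + 1, s, d, k + 1)
            dp[i][j] = min(diagonal, deletion, insertion)
    return dp[n][m][1:]


def action_error_rate(reference, hypothesis):
    """AER in percent, under the assumed WER-style definition."""
    subs, dels, ins = alignment_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / len(reference)


def frame_error_rate(reference_frames, hypothesis_frames):
    """FER in percent: share of frames whose label differs from the target."""
    if len(reference_frames) != len(hypothesis_frames):
        raise ValueError("frame sequences must have equal length")
    wrong = sum(r != h for r, h in zip(reference_frames, hypothesis_frames))
    return 100.0 * wrong / len(reference_frames)
```

Because insertions count as errors, AER under this definition can exceed 100% when the recognizer outputs many spurious actions.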
![Page 17: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/17.jpg)
results: individual actions
[results table not recovered; surviving annotations: (0.8, 0.2), 43000 frames, (0.2-2.2s)]
• visual-only, audio-only, audio-visual
• asynchronous effects between modalities
• accuracy: speaking: 96.6%, writing: 90.8%, idle: 81.5%
![Page 18: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/18.jpg)
results: group actions
• multi-modality outperforms single modalities
• two-layer HMM outperforms single-layer HMM for a-only, v-only and a-v
• best model: A-HMM (asynchronous HMM)
• soft decision slightly better than hard decision
8% improvement, significant at 96% level
![Page 19: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/19.jpg)
action-based meeting structuring
![Page 20: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/20.jpg)
conclusions
• structuring meetings as sequences of meeting actions
– layered HMMs successful for recognition
– turn-taking patterns: useful for browsing
– public dataset, standard evaluation procedures
• open issues
– less training data (unsupervised, ACM MM 2004)
– other relevant actions (interest-level, ICASSP 2005)
– other features (words, emotions)
– efficient models for many interacting streams
![Page 21: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/21.jpg)
Linking Two Layers (1)
![Page 22: modeling individual and group actions in meetings with layered HMMs](https://reader035.vdocument.in/reader035/viewer/2022070403/568139ae550346895da14922/html5/thumbnails/22.jpg)
Linking Two Layers (2)
Normalization
Please refer to: D. Zhang et al., “Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework”, IEEE Workshop on Event Mining, CVPR, 2004.