Learning to Separate Object Sounds by Watching Unlabeled Video

Ruohan Gao¹, Rogerio Feris², Kristen Grauman¹

¹The University of Texas at Austin, ²IBM Research

Sight and Sound, CVPR 2018 workshop, June 2018




Listening to learn


Goal: a repertoire of objects and their sounds

Challenge: a single audio channel usually mixes sounds of multiple objects

[Figure: objects labeled with their sounds: woof, meow, ring, clatter]


Visually-guided audio source separation

Traditional approach:
• Detect low-level correlations within a single video
• Learn from clean single audio source examples

[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]



Learning to separate object sounds

Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources

[Figure: unlabeled video is disentangled into object sound models (Violin, Dog, Cat)]


Our approach: training

Deep multi-instance multi-label learning (MIML) to disentangle which visual objects make which sounds

[Figure: training pipeline. Unlabeled video → visual frames → visual predictions (ResNet-152 objects) → top visual detections (e.g., Guitar, Saxophone). Audio → non-negative matrix factorization → audio basis vectors. Audio basis vectors and top visual detections → MIML.]

Output: group of audio basis vectors per object class
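The two audio-side steps of training can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say which NMF objective or network architecture is used, so the sketch assumes Euclidean multiplicative updates for NMF and stands in a single linear scorer with max-pooling over bases for the MIML network. The names `nmf`, `pool_bag`, and `class_weights` are hypothetical.

```python
import numpy as np


def nmf(V, n_bases, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative spectrogram V (freq x time) as W @ H.

    Assumption: Euclidean objective with multiplicative updates.
    W (freq x n_bases) holds the audio basis vectors,
    H (n_bases x time) holds their activations over time.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_bases)) + eps
    H = rng.random((n_bases, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def pool_bag(W, class_weights):
    """MIML-style pooling over a bag of basis vectors.

    Each basis vector (an instance) is scored for every object class,
    then scores are max-pooled over bases to give one video-level
    score per class. class_weights (freq x n_classes) is a
    hypothetical stand-in for a trained network.
    """
    scores = W.T @ class_weights           # (n_bases, n_classes)
    return scores, scores.max(axis=0)      # per-basis, video-level


# Usage on a synthetic "spectrogram" built from 8 random bases.
rng = np.random.default_rng(1)
V = rng.random((64, 8)) @ rng.random((8, 100))
W, H = nmf(V, n_bases=8)
per_basis, video_level = pool_bag(W, rng.standard_normal((64, 3)))
print(W.shape, H.shape, per_basis.shape, video_level.shape)
```

In this sketch the video-level scores would be trained against the top visual detections, and the per-basis scores indicate which bases to group under which object class.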


Our approach: inference

Given a novel video, use discovered object sound models to guide audio source separation.

[Figure: inference pipeline. Novel video → frames → visual predictions (ResNet-152 objects) → detected objects (Violin, Piano) → initialize audio basis matrix with violin bases and piano bases. Audio → semi-supervised source separation using NMF → estimate activations → violin sound, piano sound.]
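The inference step can be sketched as NMF with a known basis matrix. This is a rough sketch under assumptions the slides do not confirm: the per-object bases are held fixed, only the activations are estimated (Euclidean multiplicative updates), and each source is recovered with a soft ratio mask. The function name `separate_with_fixed_bases` and the dictionary layout are hypothetical.

```python
import numpy as np


def separate_with_fixed_bases(V, bases_per_object, n_iter=200, eps=1e-9, seed=0):
    """Split a mixture spectrogram V (freq x time) into per-object parts.

    bases_per_object maps an object name (e.g. "violin") to its
    basis matrix (freq x k). Assumption: bases stay fixed and only
    the activations H are updated.
    """
    names = list(bases_per_object)
    # Stack the bases of all detected objects into one basis matrix.
    W = np.concatenate([bases_per_object[n] for n in names], axis=1)
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)

    full = W @ H + eps
    separated, start = {}, 0
    for n in names:
        k = bases_per_object[n].shape[1]
        part = W[:, start:start + k] @ H[start:start + k]
        # Soft mask: share of the reconstruction explained by this object.
        separated[n] = V * (part / full)
        start += k
    return separated


# Usage with synthetic bases for two objects.
rng = np.random.default_rng(3)
bases = {"violin": rng.random((16, 2)) + 0.1, "piano": rng.random((16, 3)) + 0.1}
mixture = np.concatenate(list(bases.values()), axis=1) @ (rng.random((5, 40)) + 0.1)
sources = separate_with_fixed_bases(mixture, bases)
print({name: s.shape for name, s in sources.items()})
```

Because the masks sum to (almost exactly) one, the separated spectrograms add back up to the mixture; turning them into waveforms would additionally need the mixture phase and an inverse STFT, which is outside this sketch.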


Results

Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video

Baseline: M. Spiertz, Source-filter based clustering for monaural blind source separation. International Conference on Digital Audio Effects, 2009


Failure cases


Results

Visually-aided audio source separation (SDR)

Visually-aided audio denoising (NSDR)

Our method achieves large gains, and it also has the capability to match the separated source to meaningful acoustic objects in the video.
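For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them. It is a simplified illustration and an assumption on my part: SDR is taken as the plain energy ratio between the reference signal and the residual error (the BSS Eval toolkit uses a more permissive decomposition), and NSDR as the SDR improvement of the estimate over the unprocessed mixture. The slides do not state which exact variant was used.

```python
import numpy as np


def sdr(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB (simplified energy-ratio form)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))


def nsdr(reference, estimate, mixture):
    """Normalized SDR: gain of the estimate over using the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


# Usage: an estimate with 10% amplitude error scores about 20 dB.
ref = np.array([1.0, 0.0, -1.0, 0.0])
mix = ref + 1.0                      # reference plus a constant interferer
print(sdr(ref, 0.9 * ref))           # approximately 20 dB
print(nsdr(ref, 0.9 * ref, mix))     # positive: better than the mixture
```

Higher is better for both; a positive NSDR means the processed signal is closer to the clean reference than the noisy input was.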


Concurrent Work on Audio-Visual Source Separation

⎼ Owens & Efros, Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018

⎼ Ephrat et al., Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, 2018

⎼ Zhao et al., The Sound of Pixels, 2018

⎼ Afouras et al., The Conversation: Deep Audio-Visual Speech Enhancement, 2018

We learn from uncurated multi-object multi-source videos, and study a diverse set of object categories.


Conclusion

⎼ Learn object sound models from unlabeled videos to perform audio-visual source separation

⎼ Integrate localized object detections and motion
