Learning to Separate Object Sounds by Watching Unlabeled Video
Ruohan Gao1, Rogerio Feris2, Kristen Grauman1
1The University of Texas at Austin, 2IBM Research
Sight and Sound workshop, CVPR 2018, June 2018
Listening to learn
woof meow ring
Goal: a repertoire of objects and their sounds
Challenge: a single audio channel usually mixes the sounds of multiple objects
Listening to learn
clatter
Visually-guided audio source separation
Traditional approach:
• Detect low-level correlations within a single video
• Learn from clean single-audio-source examples
[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]
audio
visual
Learning to separate object sounds
Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources
[Figure: unlabeled video → disentangle → object sound models (violin, dog, cat)]
Deep multi-instance multi-label learning (MIML) disentangles which visual objects make which sounds. For each unlabeled video, non-negative matrix factorization (NMF) decomposes the audio into basis vectors, while a ResNet-152 object classifier applied to the visual frames yields the top visual detections; MIML maps the audio bases to the detected object labels.
Output: a group of audio basis vectors per object class (e.g., guitar, saxophone)
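The NMF decomposition mentioned above can be sketched concretely. Below is a minimal numpy implementation using the standard Euclidean multiplicative-update rules (the function name and toy data are illustrative, not from the paper): a magnitude spectrogram V is factorized into frequency basis vectors W and temporal activations H.

```python
import numpy as np

def nmf(V, k, n_iter=300, eps=1e-9, seed=0):
    """Factorize nonnegative V (freq x time) into W (freq x k) basis
    vectors and H (k x time) activations via multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

# Toy "spectrogram": an exact mixture of two spectral patterns over time.
rng = np.random.default_rng(1)
true_W = np.abs(rng.normal(size=(64, 2)))
true_H = np.abs(rng.normal(size=(2, 100)))
V = true_W @ true_H
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.4f}")
```

In the approach each training video's audio is decomposed this way, and the resulting basis vectors become the "instances" fed to the MIML network.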
Our approach: training
[Diagram: unlabeled video → audio (NMF basis vectors) + frames (visual predictions) → MIML → per-object sound bases (violin bases, piano bases). For a novel video, the bases of the detected objects (e.g., violin, piano) initialize the audio basis matrix and the activations are estimated to recover each sound (violin sound, piano sound).]
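The multi-instance multi-label step can be illustrated with a simple stand-in for the deep MIML network (this linear scorer and the names below are hypothetical, not the paper's architecture): each audio basis vector in a video's "bag" gets a per-class score, max-pooling over instances gives a bag-level score, and a sigmoid yields multi-label probabilities that are trained against the visual object predictions.

```python
import numpy as np

def miml_forward(bag, Wc, bc):
    """Bag of instances (n_instances x d) -> per-class bag probabilities.
    Max-pooling over instances encodes the multi-instance assumption:
    the bag is positive for a class if any instance is."""
    inst_scores = bag @ Wc + bc            # (n_instances, n_classes)
    bag_scores = inst_scores.max(axis=0)   # (n_classes,)
    return 1.0 / (1.0 + np.exp(-bag_scores))  # sigmoid -> multi-label

# Toy example: 10 audio basis vectors (instances), 4 object classes.
rng = np.random.default_rng(0)
bag = rng.normal(size=(10, 16))
Wc = rng.normal(size=(16, 4))
bc = np.zeros(4)
p = miml_forward(bag, Wc, bc)
print(p)
```

Training such a model with a multi-label cross-entropy loss against the ResNet-152 visual labels is what lets the audio bases be grouped by object class without any audio annotation.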
Our approach: inference
Given a novel video, use the discovered object sound models to guide audio source separation.
[Diagram: visual predictions (ResNet-152 objects) select the relevant object bases; semi-supervised source separation using NMF recovers each source.]
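The semi-supervised NMF separation can be sketched as follows (a minimal numpy illustration, with hypothetical names): the stacked per-object bases are held fixed, only the activations are estimated on the novel mixture, and each source is then reconstructed with a Wiener-style soft mask so the estimates sum back to the mixture.

```python
import numpy as np

def separate(V, W_list, n_iter=500, eps=1e-9, seed=0):
    """Semi-supervised NMF: W (stacked per-object bases) stays fixed;
    only activations H are estimated on the novel mixture V. Each
    object's source is recovered with a Wiener-style soft mask."""
    W = np.hstack(W_list)                        # (freq, total bases)
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update H only
    parts, j = [], 0
    for Wi in W_list:                            # per-object reconstruction
        k = Wi.shape[1]
        parts.append(Wi @ H[j:j + k])
        j += k
    total = sum(parts) + eps
    return [V * (Si / total) for Si in parts]    # soft masks sum to V

# Toy mixture of two "objects" whose bases are known.
rng = np.random.default_rng(2)
W1 = np.abs(rng.normal(size=(64, 3)))
W2 = np.abs(rng.normal(size=(64, 3)))
S1 = W1 @ np.abs(rng.normal(size=(3, 50)))
S2 = W2 @ np.abs(rng.normal(size=(3, 50)))
est1, est2 = separate(S1 + S2, [W1, W2])
err1 = np.linalg.norm(est1 - S1) / np.linalg.norm(S1)
print(f"relative error on source 1: {err1:.3f}")
```

With W fixed, the objective is convex in H, which is why separation at test time is fast and only needs the detected objects' bases.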
Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video
Results
Baseline: M. Spiertz, Source-filter based clustering for monaural blind source separation. International Conference on Digital Audio Effects, 2009
Failure cases
Visually-aided audio source separation (SDR)
Results
Our method achieves large gains, and it also matches each separated source to a meaningful acoustic object in the video.
Visually-aided audio denoising (NSDR)
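The SDR and NSDR numbers reported on these slides can be illustrated with a simplified metric sketch (this treats the entire residual as distortion, unlike the full BSS Eval decomposition; NSDR is the SDR improvement of the estimate over the raw mixture):

```python
import numpy as np

def sdr(est, ref, eps=1e-12):
    """Signal-to-distortion ratio in dB (simplified BSS Eval variant)."""
    return 10 * np.log10((ref ** 2).sum() / (((ref - est) ** 2).sum() + eps))

def nsdr(est, mix, ref):
    """Normalized SDR: how much the estimate improves on the mixture."""
    return sdr(est, ref) - sdr(mix, ref)

# Toy signals: a clean target corrupted by interfering noise.
rng = np.random.default_rng(3)
ref = np.sin(np.linspace(0, 8 * np.pi, 1000))
noise = 0.5 * rng.normal(size=1000)
mix = ref + noise
est = ref + 0.1 * noise          # a partially-denoised estimate
gain = nsdr(est, mix, ref)
print(f"NSDR: {gain:.1f} dB")
```

Published results typically use the full BSS Eval toolkit (e.g., `mir_eval.separation`), which additionally separates interference and artifact terms.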
Concurrent Work on Audio-Visual Source Separation
⎼ Owens & Efros, Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018
⎼ Ephrat et al., Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, 2018
⎼ Zhao et al., The Sound of Pixels, 2018
⎼ Afouras et al., The Conversation: Deep Audio-Visual Speech Enhancement, 2018
We learn from uncurated multi-object, multi-source videos, and study a diverse set of object categories.
Conclusion
⎼ Learn object sound models from unlabeled videos to perform audio-visual source separation
⎼ Integrate localized object detections and motion