Learning to Separate Object Sounds by Watching Unlabeled Video
Ruohan Gao1, Rogerio Feris2, Kristen Grauman1
1The University of Texas at Austin, 2IBM Research
Sight and Sound workshop, CVPR 2018, June 2018
Listening to learn
woof meow ring
Goal: a repertoire of objects and their sounds
Challenge: a single audio channel usually mixes the sounds of multiple objects
Listening to learn
clatter
Visually-guided audio source separation
Traditional approach:
• Detect low-level correlations within a single video
• Learn from clean single-audio-source examples
[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]
audio
visual
Learning to separate object sounds
Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources
[Figure: unlabeled video → disentangle → object sound models (violin, dog, cat)]
Deep multi-instance multi-label learning (MIML) disentangles which visual objects make which sounds. For each unlabeled video, non-negative matrix factorization (NMF) decomposes the audio into basis vectors, while a ResNet-152 object classifier applied to the visual frames yields the top visual detections; MIML maps the audio bases to the detected object labels.
Output: a group of audio basis vectors per object class (e.g., guitar, saxophone)
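The NMF decomposition mentioned above can be sketched concretely. Below is a minimal numpy implementation using the standard Euclidean multiplicative-update rules (the function name and toy data are illustrative, not from the paper): a magnitude spectrogram V is factorized into frequency basis vectors W and temporal activations H.

```python
import numpy as np

def nmf(V, k, n_iter=300, eps=1e-9, seed=0):
    """Factorize nonnegative V (freq x time) into W (freq x k) basis
    vectors and H (k x time) activations via multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

# Toy "spectrogram": an exact mixture of two spectral patterns over time.
rng = np.random.default_rng(1)
true_W = np.abs(rng.normal(size=(64, 2)))
true_H = np.abs(rng.normal(size=(2, 100)))
V = true_W @ true_H
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.4f}")
```

In the approach each training video's audio is decomposed this way, and the resulting basis vectors become the "instances" fed to the MIML network.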
Our approach: training
[Diagram: unlabeled video → audio (NMF basis vectors) + frames (visual predictions) → MIML → per-object sound bases (violin bases, piano bases). For a novel video, the bases of the detected objects (e.g., violin, piano) initialize the audio basis matrix and the activations are estimated to recover each sound (violin sound, piano sound).]
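The multi-instance multi-label step can be illustrated with a simple stand-in for the deep MIML network (this linear scorer and the names below are hypothetical, not the paper's architecture): each audio basis vector in a video's "bag" gets a per-class score, max-pooling over instances gives a bag-level score, and a sigmoid yields multi-label probabilities that are trained against the visual object predictions.

```python
import numpy as np

def miml_forward(bag, Wc, bc):
    """Bag of instances (n_instances x d) -> per-class bag probabilities.
    Max-pooling over instances encodes the multi-instance assumption:
    the bag is positive for a class if any instance is."""
    inst_scores = bag @ Wc + bc            # (n_instances, n_classes)
    bag_scores = inst_scores.max(axis=0)   # (n_classes,)
    return 1.0 / (1.0 + np.exp(-bag_scores))  # sigmoid -> multi-label

# Toy example: 10 audio basis vectors (instances), 4 object classes.
rng = np.random.default_rng(0)
bag = rng.normal(size=(10, 16))
Wc = rng.normal(size=(16, 4))
bc = np.zeros(4)
p = miml_forward(bag, Wc, bc)
print(p)
```

Training such a model with a multi-label cross-entropy loss against the ResNet-152 visual labels is what lets the audio bases be grouped by object class without any audio annotation.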
Our approach: inference
Given a novel video, use the discovered object sound models to guide audio source separation.
[Diagram: visual predictions (ResNet-152 objects) select the relevant object bases; semi-supervised source separation using NMF recovers each source.]
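The semi-supervised NMF separation can be sketched as follows (a minimal numpy illustration, with hypothetical names): the stacked per-object bases are held fixed, only the activations are estimated on the novel mixture, and each source is then reconstructed with a Wiener-style soft mask so the estimates sum back to the mixture.

```python
import numpy as np

def separate(V, W_list, n_iter=500, eps=1e-9, seed=0):
    """Semi-supervised NMF: W (stacked per-object bases) stays fixed;
    only activations H are estimated on the novel mixture V. Each
    object's source is recovered with a Wiener-style soft mask."""
    W = np.hstack(W_list)                        # (freq, total bases)
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update H only
    parts, j = [], 0
    for Wi in W_list:                            # per-object reconstruction
        k = Wi.shape[1]
        parts.append(Wi @ H[j:j + k])
        j += k
    total = sum(parts) + eps
    return [V * (Si / total) for Si in parts]    # soft masks sum to V

# Toy mixture of two "objects" whose bases are known.
rng = np.random.default_rng(2)
W1 = np.abs(rng.normal(size=(64, 3)))
W2 = np.abs(rng.normal(size=(64, 3)))
S1 = W1 @ np.abs(rng.normal(size=(3, 50)))
S2 = W2 @ np.abs(rng.normal(size=(3, 50)))
est1, est2 = separate(S1 + S2, [W1, W2])
err1 = np.linalg.norm(est1 - S1) / np.linalg.norm(S1)
print(f"relative error on source 1: {err1:.3f}")
```

With W fixed, the objective is convex in H, which is why separation at test time is fast and only needs the detected objects' bases.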
Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video
Results
Baseline: M. Spiertz, Source-filter based clustering for monaural blind source separation. International Conference on Digital Audio Effects, 2009
Failure cases
Visually-aided audio source separation (SDR)
Results
Our method achieves large gains, and it also matches each separated source to a meaningful acoustic object in the video.
Visually-aided audio denoising (NSDR)
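The SDR and NSDR numbers reported on these slides can be illustrated with a simplified metric sketch (this treats the entire residual as distortion, unlike the full BSS Eval decomposition; NSDR is the SDR improvement of the estimate over the raw mixture):

```python
import numpy as np

def sdr(est, ref, eps=1e-12):
    """Signal-to-distortion ratio in dB (simplified BSS Eval variant)."""
    return 10 * np.log10((ref ** 2).sum() / (((ref - est) ** 2).sum() + eps))

def nsdr(est, mix, ref):
    """Normalized SDR: how much the estimate improves on the mixture."""
    return sdr(est, ref) - sdr(mix, ref)

# Toy signals: a clean target corrupted by interfering noise.
rng = np.random.default_rng(3)
ref = np.sin(np.linspace(0, 8 * np.pi, 1000))
noise = 0.5 * rng.normal(size=1000)
mix = ref + noise
est = ref + 0.1 * noise          # a partially-denoised estimate
gain = nsdr(est, mix, ref)
print(f"NSDR: {gain:.1f} dB")
```

Published results typically use the full BSS Eval toolkit (e.g., `mir_eval.separation`), which additionally separates interference and artifact terms.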
Concurrent Work on Audio-Visual Source Separation
⎼ Owens & Efros, Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018
⎼ Ephrat et al., Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, 2018
⎼ Zhao et al., The Sound of Pixels, 2018
⎼ Afouras et al., The Conversation: Deep Audio-Visual Speech Enhancement, 2018
We learn from uncurated multi-object, multi-source videos, and study a diverse set of object categories.
Conclusion
⎼ Learn object sound models from unlabeled videos to perform audio-visual source separation
⎼ Integrate localized object detections and motion