![Page 1: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/1.jpg)
“ Pixels that Sound ”
Find pixels that correspond (correlate !?) to sound
Kidron, Schechner, Elad, CVPR 2005
![Page 2: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/2.jpg)
Audio-Visual Analysis: Applications
• Lip reading – detection of lips (or person)
Slaney, Covell (2000)
Bregler, Konig (1994)
• Analysis and synthesis of music from motion
Murphy, Andersen, Jensen (2003)
• Source separation based on vision
Li, Dimitrova, Li, Sethi (2003)
Smaragdis, Casey (2003)
Nock, Iyengar, Neti (2002)
Fisher, Darrell, Freeman, Viola (2001)
Hershey, Movellan (1999)
• Tracking
Vermaak, Gangnet, Blake, Pérez (2001)
• Biological systems
Gutfreund, Zheng, Knudsen (2002)
![Page 3: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/3.jpg)
Problem: Different Modalities
camera
microphone
audio-visual analysis
Visual data
25 frames/sec
Each frame: 576 x 720 pixels
Audio data
44.1 kHz, few bands
Not stereophonic
Kidron, Schechner, Elad, Pixels that Sound
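To make the mismatch between the two modalities concrete, here is a small arithmetic sketch using only the figures quoted on this slide. The variable names are illustrative and not from the talk.

```python
# Illustrative arithmetic only; the figures are the ones quoted on the slide.
FRAME_RATE_HZ = 25
FRAME_HEIGHT, FRAME_WIDTH = 576, 720
AUDIO_RATE_HZ = 44_100

# Number of visual measurements in a single frame.
pixels_per_frame = FRAME_HEIGHT * FRAME_WIDTH

# Number of raw audio samples that fall inside one video frame period.
audio_samples_per_frame = AUDIO_RATE_HZ // FRAME_RATE_HZ

# Visual measurements per second of video.
pixel_values_per_second = pixels_per_frame * FRAME_RATE_HZ

print(pixels_per_frame)         # 414720
print(audio_samples_per_frame)  # 1764
print(pixel_values_per_second)  # 10368000
```

Each frame carries hundreds of thousands of pixel values, while the audio is summarized by only a few band features per frame, so the visual side dominates the number of unknown weights.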
![Page 4: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/4.jpg)
Previous Work
• Pointwise correlation
Nock, Iyengar, Neti (2002)
Hershey, Movellan (1999)
Ill-posed (lack of data)
• Canonical Correlation Analysis (CCA)
Smaragdis, Casey (2003)
Li, Dimitrova, Li, Sethi (2003)
Slaney, Covell (2000)
Cluster of pixels - linear superposition
• Mutual Information (MI)
Fisher et al. (2001)
Cutler, Davis (2000)
Bregler, Konig (1994)
Not typical
highly complex
![Page 5: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/5.jpg)
CCA
Video: Pixel #1, Pixel #2, Pixel #3 → Projection
Audio: Band #1, Band #2 → Projection
Optimal visual components
![Page 6: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/6.jpg)
Visual Projection
1D variable
Projection
Video features
• Pixel intensities
• Transform coefficients (wavelet)
• Image differences
v
![Page 7: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/7.jpg)
Audio Projection
1D variable
Projection
Audio features
• Average energy per frame
• Transform coefficients per frame
a
![Page 8: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/8.jpg)
Canonical Correlation
Video, Audio
Representation
Projections (per time window)
Random variables (time dependent)
Correlation coefficient
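As a rough illustration of this slide, the sketch below projects video and audio features onto 1D variables and measures their correlation coefficient over a time window. The function name `projection_correlation`, the weight vectors `w_v` and `w_a`, and the synthetic data are all assumptions made for the example; the slide's own notation did not survive extraction.

```python
import numpy as np

def projection_correlation(V, A, w_v, w_a):
    """Correlation coefficient between the 1D projections V @ w_v and A @ w_a.

    V: (frames, pixels) visual features, A: (frames, bands) audio features.
    """
    v = V @ w_v
    a = A @ w_a
    v = v - v.mean()  # remove the temporal mean of each projection
    a = a - a.mean()
    return float(v @ a / (np.linalg.norm(v) * np.linalg.norm(a)))

# Synthetic data (hypothetical): pixel 2 follows audio band 0, the rest is noise.
rng = np.random.default_rng(0)
n_frames, n_pixels, n_bands = 200, 5, 2
A = rng.standard_normal((n_frames, n_bands))
V = rng.standard_normal((n_frames, n_pixels))
V[:, 2] = A[:, 0] + 0.1 * rng.standard_normal(n_frames)

w_v = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # select the "sounding" pixel
w_a = np.array([1.0, 0.0])                 # select the matching band
rho = projection_correlation(V, A, w_v, w_a)
print(round(rho, 3))
```

Choosing the weights by hand only works in a toy case; CCA searches for the pair of weight vectors that maximizes this coefficient.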
![Page 9: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/9.jpg)
CCA Formulation
yield an eigenvalue problem:
Knutsson, Borga, Landelius (1995)
Canonical correlation: equivalent to the largest eigenvalue
Projections: equivalent to the corresponding eigenvectors
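The eigenvalue formulation can be sketched in a well-posed toy setting (many frames, few pixels). This is the standard textbook form of CCA, in which the largest eigenvalue equals the squared canonical correlation; it is an assumption that this matches the equations on the slide, which were lost. The function name `cca_first_pair` is hypothetical.

```python
import numpy as np

def cca_first_pair(V, A):
    """First canonical correlation and weight vectors via an eigenproblem."""
    V = V - V.mean(axis=0)
    A = A - A.mean(axis=0)
    n = V.shape[0]
    C_vv = V.T @ V / n  # visual covariance
    C_aa = A.T @ A / n  # audio covariance
    C_va = V.T @ A / n  # cross-covariance
    # Solve C_vv^{-1} C_va C_aa^{-1} C_av w_v = rho^2 w_v
    M = np.linalg.solve(C_vv, C_va) @ np.linalg.solve(C_aa, C_va.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(eigvals.real))
    w_v = eigvecs[:, k].real
    w_a = np.linalg.solve(C_aa, C_va.T @ w_v)  # matching audio weights
    rho = float(np.sqrt(max(eigvals[k].real, 0.0)))
    return rho, w_v, w_a

# Synthetic data (hypothetical): one pixel follows one audio band.
rng = np.random.default_rng(0)
n_frames, n_pixels, n_bands = 500, 6, 2
A = rng.standard_normal((n_frames, n_bands))
V = rng.standard_normal((n_frames, n_pixels))
V[:, 3] = A[:, 1] + 0.1 * rng.standard_normal(n_frames)

rho, w_v, w_a = cca_first_pair(V, A)
empirical = float(np.corrcoef(V @ w_v, A @ w_a)[0, 1])
print(round(rho, 3), round(empirical, 3))
```

Note that the sketch inverts the visual covariance through `np.linalg.solve`, which only works when there are more frames than pixels.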
![Page 10: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/10.jpg)
Visual Data
t (frames)
Spatial location (pixel intensities)
![Page 11: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/11.jpg)
Rank Deficiency
t (frames)
Spatial location (pixel intensities)
![Page 12: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/12.jpg)
Estimation of Covariance
Rank deficient
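A quick numerical check of the rank deficiency, as a sketch with made-up sizes (30 frames, 500 pixels; real frames have far more pixels): the empirical covariance of the pixels cannot have rank higher than the number of frames minus one once the temporal mean is removed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 500  # hypothetical sizes: far fewer frames than pixels

V = rng.standard_normal((n_frames, n_pixels))
V_centered = V - V.mean(axis=0)            # remove the temporal mean per pixel
C_vv = V_centered.T @ V_centered / n_frames  # (pixels x pixels) covariance estimate

rank = int(np.linalg.matrix_rank(C_vv))
print(C_vv.shape, rank)  # rank is at most n_frames - 1, far below n_pixels
```

A matrix of size 500 x 500 with rank below 30 is singular, so the inverse needed by the CCA eigenproblem does not exist.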
![Page 13: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/13.jpg)
Ill-Posedness
Prior solutions:
• Use many more frames → poor temporal resolution.
• Aggressive spatial pruning → poor spatial resolution.
• Trivial regularization
Impossible to invert !!!
![Page 14: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/14.jpg)
A General Problem
Small amount of data
The problem is ILL-POSED
Overfitting is likely
Large number of weights
![Page 15: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/15.jpg)
An Equivalent Problem
Minimizing
Maximizing
![Page 16: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/16.jpg)
Single Audio Band
(The denominator is non-zero)
Minimizing
Known data
A has a single column, and
![Page 17: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/17.jpg)
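Time samples: a(1), a(2), ..., a(30)
Full correlation if V w = a, where V holds the visual features per frame, a = [a(1), a(2), ..., a(30)] is the audio feature at times t_i, and w is the vector of pixel weights
Underdetermined system!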
![Page 18: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/18.jpg)
Detected correlated pixels
“Out of clutter, find simplicity.
From discord, find harmony.”
Albert Einstein
![Page 19: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/19.jpg)
Sparse Solution
• Non-convex
• Exponential complexity
ℓ0-norm minimum
![Page 20: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/20.jpg)
The ℓ1-norm criterion
• Sparse
• Convex
• Polynomial complexity
in common situations
ℓ1-norm minimum
Donoho, Elad (2005)
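A minimal sketch of why the ℓ1 criterion is tractable: minimizing the ℓ1 norm of the weights subject to a linear system can be written as a linear program by splitting the weights into positive and negative parts. The function name `min_l1_solution`, the symbols `V`, `a`, `w`, and the synthetic data are assumptions for illustration, and SciPy's `linprog` stands in for whatever solver the authors used.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Minimize ||w||_1 subject to V @ w = a, posed as a linear program.

    Write w = p - q with p, q >= 0; then ||w||_1 = sum(p) + sum(q) at the optimum.
    """
    n_pixels = V.shape[1]
    c = np.ones(2 * n_pixels)   # objective: sum(p) + sum(q)
    A_eq = np.hstack([V, -V])   # V @ (p - q) = a
    res = linprog(c, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n_pixels] - res.x[n_pixels:]

# Synthetic underdetermined system (hypothetical): 30 frames, 100 pixels,
# with only two pixels actually driving the audio signal.
rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 100
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[10, 60]] = [1.0, -2.0]
a = V @ w_true

w_l1 = min_l1_solution(V, a)
print(np.flatnonzero(np.abs(w_l1) > 1e-6))  # indices of the active pixels
```

The solution satisfies the constraints exactly and its ℓ1 norm never exceeds that of any other feasible weight vector, including the sparse one used to generate the data.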
![Page 21: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/21.jpg)
The Minimum Norm Solution
Energy spread
ℓ2-norm minimum
Solving using ℓ2-norm (pseudo-inverse, SVD, QR)
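The energy spread of the minimum ℓ2-norm solution can be seen in a small sketch using NumPy's pseudo-inverse. The symbols `V`, `a`, `w` and the synthetic data are assumptions for illustration: the audio signal is generated by two pixels only, yet the minimum-norm weights are non-zero almost everywhere.

```python
import numpy as np

# Synthetic underdetermined system (hypothetical): 30 frames, 100 pixels,
# with only two pixels actually driving the audio signal.
rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 100
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[10, 60]] = [1.0, -2.0]
a = V @ w_true

# Minimum l2-norm solution of V @ w = a via the pseudo-inverse.
w_l2 = np.linalg.pinv(V) @ a

n_active = int(np.sum(np.abs(w_l2) > 1e-6))
print(n_active)                                        # how many pixels get weight
print(np.linalg.norm(w_l2) <= np.linalg.norm(w_true))  # True: smaller energy
```

The constraints are met and the energy is minimal, but the weight is smeared over many pixels, so the solution does not localize the source.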
![Page 22: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/22.jpg)
Linear programming
Fully correlated
Sparse
No parameters to tweak
Polynomial
Audio-visual events
Maximum correlation: Eigenproblem
Minimum objective function G
![Page 23: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/23.jpg)
Multiple Audio Bands - Solution
ℓ1-ball
Non-convex constraint
• Convex
• Linear
The optimization problem:
![Page 24: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/24.jpg)
ℓ1 ball
Multiple Audio Bands
Optimization over each face is:
S1
S2
S3 S4
No parameters to tweak
•
• Each face: linear programming
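The face-by-face idea can be sketched as follows. The slide's optimization problem did not survive extraction, so this assumes the form: minimize the ℓ1 norm of the visual weights `w_v` subject to `V @ w_v = A @ w_a` and the ℓ1 norm of the audio weights `w_a` equal to 1. The unit-norm constraint is non-convex, but on each face of the ℓ1 ball (one sign pattern of `w_a`) it becomes linear, so each face is a linear program. The function name and data are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def sparse_av_weights(V, A):
    """Solve one LP per face of the l1 ball of w_a and keep the best face."""
    n_frames, n_pixels = V.shape
    n_bands = A.shape[1]
    best = None
    for signs in itertools.product((1.0, -1.0), repeat=n_bands):
        s = np.array(signs)  # sign pattern defining this face
        # Variables: p, q >= 0 (w_v = p - q) and u >= 0 (w_a = s * u).
        c = np.concatenate([np.ones(2 * n_pixels), np.zeros(n_bands)])
        correlation_rows = np.hstack([V, -V, -A * s])  # V w_v - A w_a = 0
        norm_row = np.concatenate([np.zeros(2 * n_pixels), np.ones(n_bands)])
        A_eq = np.vstack([correlation_rows, norm_row[None, :]])
        b_eq = np.concatenate([np.zeros(n_frames), [1.0]])  # sum(u) = 1
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success and (best is None or res.fun < best[0]):
            w_v = res.x[:n_pixels] - res.x[n_pixels:2 * n_pixels]
            w_a = s * res.x[2 * n_pixels:]
            best = (res.fun, w_v, w_a)
    if best is None:
        raise RuntimeError("no face produced a feasible solution")
    return best[1], best[2]

# Synthetic data (hypothetical): 30 frames, 100 pixels, 2 audio bands.
rng = np.random.default_rng(2)
V = rng.standard_normal((30, 100))
A = rng.standard_normal((30, 2))
w_v, w_a = sparse_av_weights(V, A)
print(np.sum(np.abs(w_a)))  # close to 1 by construction
```

The number of faces grows as 2 to the power of the number of bands, which stays small when the audio is represented by only a few bands. Faces with opposite sign patterns give mirrored solutions, so half of them could be skipped.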
![Page 25: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/25.jpg)
Sharp & Dynamic, Despite Distraction
Frame 9 Frame 42 Frame 68
Frame 115 Frame 146 Frame 169
![Page 26: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/26.jpg)
Frame 51
Frame 106
Frame 83
Frame 177
• Sparse
• Localization on the proper elements
• False alarm – temporally inconsistent
• Handling dynamics
Performing in Audio Noise
![Page 27: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/27.jpg)
ℓ2-norm: Energy Spread
Movie #1 Movie #2
Frame 83 Frame 146
![Page 28: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/28.jpg)
ℓ1-norm: Localization
Movie #1 Movie #2
Frame 83 Frame 146
![Page 29: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/29.jpg)
The “Chorus Ambiguity”
Who’s talking?
Synchronized talk
Not unique (ambiguous)
Possible solutions:
• Left
• Right
• Both
![Page 30: “ Pixels that Sound ” Find pixels that correspond (correlate !?) to sound](https://reader037.vdocument.in/reader037/viewer/2022103006/56813515550346895d9c6a27/html5/thumbnails/30.jpg)
The “Chorus Ambiguity”
-norm-norm
feature 1
feature 2
feature 1
feature 2
Both