gabriel kreiman ://cbmm.mit.edu/.../bmm_11aug2020_kreiman.pdf · virtual bmm 2020...
TRANSCRIPT
![Page 1: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/1.jpg)
Recurrent computations to the rescue
Scan to download papers+data+code
Gabriel Kreiman
http://klab.tch.harvard.edu
Virtual BMM 2020
![Page 2: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/2.jpg)
What is there?
Who are they?
Where are
they?
What are they doing?
What is their relationship?
What are they looking
at?
What happened
before?
Where is Obama’s
foot
Search for all
the mirrors
How many shoes are
there?
An image is worth a million words
![Page 3: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/3.jpg)
State of the Art in Image Captioning
Kreiman and Serre, 2020Beyond the feedforward sweep
![Page 4: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/4.jpg)
Visual cognition: a sequence of routines*
1. Extract initial sensory map à Call VisualSampling
What is there?
What are they doing?
What are they looking
at?What will
happen next?
Where is Obama’s
foot?Where
are they?
Search for all
the mirrors
* Shimon Ullman. Visual Routines. 1984
Divide et impera: break down task into simpler, reusable visual routines
2. Propose image gist à Call FastPeriphAssess
3. Propose foveal objects à Call FovealRecognition
4. Inference à Call PatternCompletion
5. Working memory à Call Visual Buffer
6. Task-dependent sampling à Call EyeMovement
7. Response à Call DecisionMaking
![Page 5: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/5.jpg)
Standing on the shoulders of giants:Mesoscopic connectivity of the primate visual system
Felleman and Van Essen. Cerebral Cortex 1991
Retina
Primary visual cortex
~ Deep convolutional neural networks
Krizhevsky et al, NIPS 2012
![Page 6: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/6.jpg)
Visual cognition: a sequence of routines*
1. Extract initial sensory map à Call VisualSampling Retina
What is there?
What are they doing?
What are they looking
at?What will
happen next?
Where is Obama’s
foot?Where
are they?
Search for all
the mirrors
* Shimon Ullman. Visual Routines. 1984
Divide et impera: break down task into simpler, reusable visual routines
2. Propose image gist à Call FastPeriphAssess V1, V2, V4
3. Propose foveal objects à Call FovealRecognition ITC
4. Inference à Call PatternCompletion Ventral Stream
5. Working memory à Call Visual Buffer PFC
6. Task-dependent sampling à Call EyeMovement Dorsal Stream
7. Response à Call DecisionMaking PFC
![Page 7: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/7.jpg)
Outline
Each additional processing step takes ~10 to 15 ms
Schmolesky et al 1998
![Page 8: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/8.jpg)
Visual cognition: a sequence of routines*
1. Extract initial sensory map à Call VisualSampling Retina ~50 ms
What is there?
What are they doing?
What are they looking
at?What will
happen next?
Where is Obama’s
foot?Where
are they?
Search for all
the mirrors
* Shimon Ullman. Visual Routines. 1984
Divide et impera: break down task into simpler, reusable visual routines
2. Propose image gist à Call FastPeriphAssess V1, V2, V4 ~100 ms
3. Propose foveal objects à Call FovealRecognition ITC ~150 ms
4. Inference à Call PatternCompletion Ventral Stream ~200 ms
5. Working memory à Call Visual Buffer PFC ~300 ms
6. Task-dependent sampling à Call EyeMovement Dorsal Stream ~300 ms
7. Response à Call DecisionMaking PFC ~300-1000 ms
![Page 9: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/9.jpg)
Outline
1. Pattern completion: making inferences from partial information
2. Putting vision in context
3. Active sampling: extracting information via eye movements
![Page 10: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/10.jpg)
Outline
1. Pattern completion: making inferences from partial information
2. Putting vision in context
3. Active sampling: extracting information via eye movements
Tang et al, Neuron 2014Tang et al, PNAS 2018
Martin Schrimpf
HanlinTang
Bill Lotter
![Page 11: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/11.jpg)
Pattern completion is a hallmark of intelligence
A
1, 2, 3, 5, 7, 11,
Even though it was raining heavily, John decided to go out without an…
V_S_A_ R_C_G_I_I_N
I
13
Visual Recognition
Umbrella
Also: Reading a storySocial interactions
C E G
![Page 12: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/12.jpg)
Visual recognition of occluded objects
![Page 13: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/13.jpg)
Visual inference from partial information
![Page 14: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/14.jpg)
Strong robustness to limited visibility
Tang et al, Neuron 2014Tang et al, PNAS 2018
Robust recognition despite heavy occlusion
It “costs” ~50 ms extra to visually complete patterns
![Page 15: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/15.jpg)
Peeking inside the human brain
![Page 16: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/16.jpg)
Neural selectivity is retained despite strong occlusion
Tang et al, Neuron 2014Tang et al, PNAS 2018
Neural mechanisms of pattern completion
Neural responses are delayed by ~50 ms
![Page 17: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/17.jpg)
Rumination through horizontal connections
1. Behavior: 50 ms cost.
2. Neurophysiology: 50 ms delay
3. Neuroanatomy: slow horizontal connection
![Page 18: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/18.jpg)
Neurobiological recipe: add horizontal recurrent connections to deep convolutional networks
![Page 19: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/19.jpg)
Outline
1. Pattern completion: making inferences from partial information
2. Putting vision in context
3. Active sampling: extracting information via eye movements
Mengmi ZhangMartin Schrimpf
Zhang et al, CVPR 2020
![Page 20: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/20.jpg)
The role of context in vision: Example 1
![Page 21: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/21.jpg)
The role of context in vision: Example 2
++
![Page 22: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/22.jpg)
Investigating the role of context in visual recognition
![Page 23: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/23.jpg)
What, where, and when of contextual modulation
![Page 24: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/24.jpg)
Context enhances visual recognition
![Page 25: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/25.jpg)
Attention Predictionon context𝛼!"
�̂�!Feature Extraction
using VGG16(weights shared)
𝐼#
ℎ!
Predicted Class Labels
“pizza”
ℎ$
𝐼"
𝑎#
𝑎"
𝛼$"
Attention Predictionon object
𝛼!# 𝛼$#
LSTM1 �̂�$ LSTM2
“cake”
ℎ! ℎ$Predicted
Class Labels( '𝑧!#, '𝑧!") ( '𝑧$#, '𝑧$")
Time
CATNet: Context-aware Attention Two-Stream Network
![Page 26: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/26.jpg)
Context enhances visual recognition in the model
![Page 27: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/27.jpg)
What, where, and when of contextual modulation
![Page 28: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/28.jpg)
Outline
1. Pattern completion: making inferences from partial information
2. Putting vision in context
3. Active sampling: extracting information via eye movements
Mengmi Zhang
Zhang et al, Nature Communications 2018
![Page 29: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/29.jpg)
High-resolution fovea, low-resolution periphery
Paper
![Page 30: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/30.jpg)
We are legally blind outside the fovea
DO NOT MOVE YOUR EYES!
![Page 31: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/31.jpg)
Eye movements are critical for scene understanding
![Page 32: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/32.jpg)
Four key properties of visual search
1. Selectivity[Distinguishing target from distractors]
2. Invariance [Finding target irrespective of changes in appearance]
3. Efficiency [Rapid search, avoiding exhaustive exploration]
4. Generalization [No training required]
Waldo, Wally,CharlieWalter
![Page 33: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/33.jpg)
Active sampling during visual search
Zhang et al, 2018
Mengmi Zhang
![Page 34: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/34.jpg)
Invariant Visual Search Network (IVSN)
VGG16 (Simonyan et al 2014)Visual search circuitry: e.g. Bichot et al (2015)
Scan to download paper+code+data
Zhang et al. Nature Communications 2018
![Page 35: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/35.jpg)
Neurally inspired model captures sampling behavior
Finding any Waldo: zero-shot invariant and efficient visual search.Zhang et al, 2018
![Page 36: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/36.jpg)
Summary
Attractor-based recurrent neural networks can perform pattern completion
Top-down signals can guide eye movements during visual search
Recurrent connections and eccentricity dependence help incorporate context
Divide and Conquer Strategy: A sequence of flexible, reusable, interchangeable visual routines
Algorithm discovery: neuroscience + behavior + computational models
Scan to download papers+data+code
http://[email protected]
![Page 37: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/37.jpg)
Recurrent computations to the rescue
Scan to download papers+data+code
Gabriel Kreiman
http://klab.tch.harvard.edu
Virtual BMM 2020
![Page 38: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/38.jpg)
A long and exciting way forward
Microsoft Captioning Bothttps://www.captionbot.ai/
Kreiman and Serre, 2020Beyond the feedforward sweep
![Page 39: Gabriel Kreiman ://cbmm.mit.edu/.../BMM_11Aug2020_Kreiman.pdf · Virtual BMM 2020 gabriel.kreiman@tch.harvard.edu. What is there? Who are they? Where are they? What are they doing?](https://reader036.vdocument.in/reader036/viewer/2022081614/5fc1a290f78c4907305e31f9/html5/thumbnails/39.jpg)
The what, where, and when of contextual modulation