nips2009: understand visual scenes - part 1
TRANSCRIPT
![Page 1: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/1.jpg)
Understanding Visual Scenes
Antonio Torralba
Computer Science and Artificial Intelligence Laboratory (CSAIL) Department of Electrical Engineering and Computer Science
![Page 2: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/2.jpg)
![Page 3: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/3.jpg)
![Page 4: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/4.jpg)
DUCK
DUCK
GRASS
PERSON
TREE
LAKE
BENCH
PERSON
PERSON PERSON
DUCK
PATH
SKY
SIGN
![Page 5: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/5.jpg)
A VIEW OF A PARK ON A NICE SPRING DAY
![Page 6: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/6.jpg)
PERSON FEEDING DUCKS IN THE PARK
PEOPLE WALKING IN THE PARK
DUCKS LOOKING FOR FOOD
Do not feed the ducks sign
![Page 7: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/7.jpg)
DUCKS ON TOP OF THE GRASS
PEOPLE UNDER THE SHADOW OF THE TREES
![Page 8: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/8.jpg)
Why do we care about recognition? Perception of function: We can perceive the 3D shape, texture,
material properties, without knowing about objects. But, the concept of category encapsulates also information
about what can we do with those objects.
“We therefore include the perception of function as a proper –indeed, crucial- subject for vision science”, from Vision Science, chapter 9, Palmer.
![Page 9: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/9.jpg)
The perception of function • Direct perception (affordances): Gibson
Flat surface Horizontal Knee-high …
Sittable upon
Chair Chair
Chair?
Flat surface Horizontal Knee-high …
Sittable upon
Chair
• Mediated perception (Categorization)
![Page 10: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/10.jpg)
Direct perception Some aspects of an object function can be
perceived directly • Functional form: Some forms clearly
indicate to a function (“sittable-upon”, container, cutting device, …)
Sittable-upon Sittable-upon
Sittable-upon
It does not seem easy to sit-upon this…
![Page 11: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/11.jpg)
Scenes, as objects, also have affordances
![Page 12: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/12.jpg)
The function of the scene
![Page 13: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/13.jpg)
![Page 14: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/14.jpg)
Direct perception Some aspects of an object function can be
perceived directly • Observer relativity: Function is observer
dependent From http://lastchancerescueflint.org
![Page 15: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/15.jpg)
Limitations of Direct Perception Objects of similar structure might have very different functions
Not all functions seem to be available from direct visual information only.
![Page 16: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/16.jpg)
Limitations of Direct Perception
![Page 17: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/17.jpg)
Object detection and recognition Short overview of current approaches
![Page 18: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/18.jpg)
Object recognition Is it really so hard?
This is a chair
Find the chair in this image Output of normalized correlation
![Page 19: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/19.jpg)
Object recognition Is it really so hard?
Find the chair in this image
Pretty much garbage Simple template matching is not going to make it
![Page 20: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/20.jpg)
So, let’s make the problem simpler: Block world
Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
![Page 21: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/21.jpg)
Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
Binford and generalized cylinders
Recognition by components
Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
Introduced in computer vision by A. Pentland, 1986.
![Page 22: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/22.jpg)
Families of recognition algorithms Bag of words models Voting models
Constellation models Rigid template models
Sirovich and Kirby 1987 Turk, Pentland, 1991 Dalal & Triggs, 2006
Fischler and Elschlager, 1973 Burl, Leung, and Perona, 1995
Weber, Welling, and Perona, 2000 Fergus, Perona, & Zisserman, CVPR 2003
Viola and Jones, ICCV 2001 Heisele, Poggio, et. al., NIPS 01
Schneiderman, Kanade 2004 Vidal-Naquet, Ullman 2003
Shape matching Deformable models
Csurka, Dance, Fan, Willamowski, and Bray 2004 Sivic, Russell, Freeman, Zisserman, ICCV 2005
Berg, Berg, Malik, 2005 Cootes, Edwards, Taylor, 2001
![Page 23: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/23.jpg)
Feret dataset, 1996 DARPA
The face age
• The representation and matching of pictorial structures Fischler, Elschlager (1973). • Face recognition using eigenfaces M. Turk and A. Pentland (1991). • Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995) • Graded Learning for Object Detection - Fleuret, Geman (1999) • Robust Real-time Object Detection - Viola, Jones (2001) • Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001) • ….
![Page 24: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/24.jpg)
Face detection
![Page 25: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/25.jpg)
Haar-like filters and cascades Viola and Jones, ICCV 2001
The average intensity in the block is computed with four sums independently of the block size.
Also Fleuret and Geman, 2001
![Page 26: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/26.jpg)
Generic objects: Edge based descriptors
Gavrila, Philomin, ICCV 1999 Papageorgiou & Poggio (2000)
J. Shotton, A. Blake, R. Cipolla. PAMI 2008.
Opelt, Pinz, Zisserman, ECCV 2006
![Page 27: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/27.jpg)
Histograms of oriented gradients
Shape context Belongie, Malik, Puzicha, NIPS 2000 SIFT, D. Lowe, ICCV 1999
![Page 28: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/28.jpg)
Histograms of oriented gradients Dalal & Trigs, 2006
x Not a person
x person
![Page 29: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/29.jpg)
Adding parts Felzenszwalb, McAllester, Ramanan. 2008.
![Page 30: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/30.jpg)
Felzenszwalb, McAllester, Ramanan. 2008.
Adding parts
![Page 31: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/31.jpg)
Felzenszwalb, McAllester, Ramanan. 2008.
![Page 32: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/32.jpg)
Evaluation of performance
The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object is it trying to detect?
Before plotting and ROC or precision-recall curves…
![Page 33: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/33.jpg)
What object is detector trying to detect?
The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object is it trying to detect?
![Page 34: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/34.jpg)
What object is detector trying to detect?
The detector challenge: by looking at the output of a detector on a random set of images, can you guess which object is it trying to detect?
Table detector
![Page 35: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/35.jpg)
1 2
3
4
5
6
7
1. chair, 2. table, 3. road, 4. road, 5. table, 6. car, 7. keyboard.
![Page 36: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/36.jpg)
Some symptoms of standard approaches
![Page 37: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/37.jpg)
![Page 38: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/38.jpg)
![Page 39: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/39.jpg)
![Page 40: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/40.jpg)
![Page 41: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/41.jpg)
Scenes rule over objects
3D percept is driven by the scene, which imposes its ruling to the objects
![Page 42: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/42.jpg)
Scene recognition The gist of the scene
![Page 43: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/43.jpg)
Mary Potter (1976) Mary Potter (1975, 1976) demonstrated that during a rapid sequential visual presentation (100 msec per image), a novel picture is instantly understood and observers seem to comprehend a lot of visual information
![Page 44: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/44.jpg)
Demo : Rapid image understanding
Instructions: 9 photographs will be shown for half a second each. Your task is to memorize these pictures
By Aude Oliva
![Page 45: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/45.jpg)
![Page 46: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/46.jpg)
![Page 47: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/47.jpg)
![Page 48: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/48.jpg)
![Page 49: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/49.jpg)
![Page 50: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/50.jpg)
![Page 51: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/51.jpg)
![Page 52: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/52.jpg)
![Page 53: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/53.jpg)
![Page 54: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/54.jpg)
![Page 55: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/55.jpg)
Which of the following pictures have you seen ?
If you have seen the image clap your hands once
If you have not seen the image do nothing
Memory Test
![Page 56: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/56.jpg)
Have you seen this picture ?
![Page 57: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/57.jpg)
NO
![Page 58: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/58.jpg)
Have you seen this picture ?
![Page 59: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/59.jpg)
NO
![Page 60: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/60.jpg)
Have you seen this picture ?
![Page 61: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/61.jpg)
NO
![Page 62: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/62.jpg)
Have you seen this picture ?
![Page 63: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/63.jpg)
NO
![Page 64: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/64.jpg)
Have you seen this picture ?
![Page 65: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/65.jpg)
Yes
![Page 66: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/66.jpg)
Have you seen this picture ?
![Page 67: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/67.jpg)
NO
![Page 68: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/68.jpg)
You have seen these pictures
You were tested with these pictures
![Page 69: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/69.jpg)
The gist of the scene
In a glance, we remember the meaning of an image and its global layout but some objects and details are forgotten
![Page 70: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/70.jpg)
From objects to scenes
Image I
Local features L L L L
O2 O2 O2 O2
S SceneType 2 {street, office, …}
O1 O1 O1 O1 Object localization
Riesenhuber & Poggio (99); Vidal-Naquet & Ullman (03); Serre & Poggio, (05); Agarwal & Roth, (02), Moghaddam, Pentland (97), Turk, Pentland (91),Vidal-Naquet, Ullman, (03) Heisele, et al, (01), Agarwal & Roth, (02), Kremp, Geman, Amit (02), Dorko, Schmid, (03) Fergus, Perona, Zisserman (03), Fei Fei, Fergus, Perona, (03), Schneiderman, Kanade (00), Lowe (99)
![Page 71: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/71.jpg)
What makes scenes different?
Different objects, different spatial layout
Floor
Door
Light
Wall Wall Door
Ceiling
Painting
Fireplace armchair armchair
Coffee table
Door Door
Ceiling Lamp
mirror mirror wall
Door
wall
wall
painting
Bed Side-table
Lamp
phone alarm
carpet
![Page 72: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/72.jpg)
What makes scenes different?
Different objects, similar spatial layout
Window
sink dishes faucet
cabinet
cabinet
counter
cabinet
mirror shelves mirror
ceiling
wall
towel sink sink
cabinet cabinet
faucet soap counter
counter beer
mirror glasses
stool
lamp lamp
![Page 73: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/73.jpg)
What makes scenes different?
Similar objects, different spatial layout
table
table
table
ceiling
wall window window
bottle
chair chair
chair
chair
chair chair
chair chair
chair chair
chair
chair
table
table
table
table
table table
table table
table
table table
table
chair chair
chair chair
floor floor
wall
ceiling window
window
![Page 74: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/74.jpg)
What makes scenes different?
Similar objects, different spatial layout
table
table
table
ceiling
wall window window
bottle
chair chair
chair
chair
chair chair
chair chair
chair chair
chair
chair
table
table
table
table
table table
table table
table
table table
table
chair chair
chair
floor floor
wall
ceiling window
window
chair
![Page 75: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/75.jpg)
What makes scenes different?
Similar objects, and similar spatial layout
seat seat
seat seat
seat seat
seat seat
window window window
ceiling cabinets cabinets
seat seat
seat seat
seat seat
seat seat window window
ceiling cabinets cabinets
seat seat seat seat
seat seat seat seat
seat seat seat seat
seat seat seat seat
screen
ceiling
wall column
Different lighting, different materials, very specific object categories
![Page 76: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/76.jpg)
What can be an alternative to objects?
![Page 77: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/77.jpg)
Scene emergent features “Recognition via features that are not those of individual objects but “emerge” as objects are brought into relation to each other to form a scene.” – Biederman 81
From “on the semantics of a glance at a scene”, Biederman, 1981
![Page 78: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/78.jpg)
Examples of scene emergent features
Suggestive edges and junctions Simple geometric forms
Blobs Textures
![Page 79: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/79.jpg)
Ensemble statistics Ariely, 2001, Seeing sets: Representation by statistical properties Chong, Treisman, 2003, Representation of statistical properties
Alvarez, Oliva, 2008, 2009, Spatial ensemble statistics
Conclusion: observers had more accurate representation of the mean than of the individual members of the set.
![Page 80: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/80.jpg)
From scenes to objects
S SceneType 2 {street, office, …}
Image
Local features L L L
I
O1 O1 O1 O1 O2 O2 O2 O2
L• Scene emergent • Ensemble statistics • Global features
GObject localization
![Page 81: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/81.jpg)
How far can we go without objects?
S SceneType 2 {street, office, …}
Image
Local features L L L
I
L
• Scene emergent • Ensemble statistics • Global features
G
![Page 82: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/82.jpg)
Global image descriptors
![Page 83: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/83.jpg)
Global image descriptors
Sivic et. al., ICCV 2005 Fei-Fei and Perona, CVPR 2005
Bag of words Spatially organized textures
Non localized textons
S. Lazebnik, et al, CVPR 2006 Walker, Malik. Vision Research 2004 …
M. Gorkani, R. Picard, ICPR 1994 A. Oliva, A. Torralba, IJCV 2001
… R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Computing Surveys, vol. 40, no. 2, pp. 5:1-60, 2008.
![Page 84: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/84.jpg)
Gist descriptor
8 orientations 4 scales x 16 bins 512 dimensions
• Apply oriented Gabor filters over different scales • Average filter energy in each bin
Similar to SIFT (Lowe 1999) applied to the entire image M. Gorkani, R. Picard, ICPR 1994; Walker, Malik. Vision Research 2004; Vogel et al. 2004; Fei-Fei and Perona, CVPR 2005; S. Lazebnik, et al, CVPR 2006; …
Oliva and Torralba, 2001
![Page 85: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/85.jpg)
Textons Filter bank K-means (100 clusters)
Walker, Malik, 2004
Malik, Belongie, Shi, Leung, 1999
![Page 86: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/86.jpg)
Bag of words & spatial pyramid matching
S. Lazebnik, et al, CVPR 2006
Sivic, Zisserman, 2003. Visual words = Kmeans of SIFT descriptors
![Page 87: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/87.jpg)
The 15-scenes benchmark
Bedroom Suburb
Industrial Kitchen
Living room Coast Forest
Highway
Building facade
Mountain Open country Street
Skyscrapers
Office
Store
Oliva & Torralba, 2001 Fei Fei & Perona, 2005 Lazebnik, et al 2006
![Page 88: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/88.jpg)
Scene recognition 100 training samples per class
SVM classifier in both cases Human performance
![Page 89: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/89.jpg)
Large Scale Scene Recognition Nature Urban Indoor
~1,000 categories
>130,000 images
>12,000 fully annotated images
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 90: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/90.jpg)
![Page 91: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/91.jpg)
Performance with 400 categories
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 92: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/92.jpg)
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
Abbey
Airplane cabin
Airport terminal
Alley
Amphitheater
Training images
![Page 93: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/93.jpg)
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
Abbey
Airplane cabin
Airport terminal
Alley
Amphitheater
Training images Correct classifications
![Page 94: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/94.jpg)
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
Abbey
Airplane cabin
Airport terminal
Alley
Amphitheater
Monastery Cathedral Castle
Toy shop Van Discotheque
Subway Stage Restaurant
Restaurant patio
Courtyard Canal
Harbor Coast Athletic field
Training images Correct classifications Miss-classifications
![Page 95: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/95.jpg)
Example of three different scenes RIVER
BEACH
VILLAGE
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 96: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/96.jpg)
But they are all part of the same picture
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 97: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/97.jpg)
But they are all part of the same picture
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 98: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/98.jpg)
Scene detection
Xiao, Hays, Ehinger, Oliva, Torralba; maybe 2010
![Page 99: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/99.jpg)
Categories or a continuous space?
Check poster by Malisiewicz, Efros
![Page 100: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/100.jpg)
Categories or a continuous space? From the city to the mountains in 10 steps
![Page 101: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/101.jpg)
Spatial envelope: a continuous space of scenes
Degree of Openness
Deg
ree
of E
xpan
sion
Highway Street City centre Tall Building
Oliva & Torralba, 2001
![Page 102: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/102.jpg)
Coast Countryside Forest Mountain
Deg
ree
of R
ugge
dnes
s
Degree of Openness
Spatial envelope: a continuous space of scenes
Oliva & Torralba, 2001
![Page 103: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/103.jpg)
Context for object recognition
![Page 104: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/104.jpg)
Who needs context anyway? We can recognize objects even out of context
Banksy
![Page 105: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/105.jpg)
Look‐Alikes by Joan Steiner
Even in high resolution, we can not shut down contextual processing and it is hard to recognize the true identities of the elements that compose this scene.
![Page 106: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/106.jpg)
Why is context important? • Changes the interpretation of an object (or its function)
• Context defines what an unexpected event is
![Page 107: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/107.jpg)
Objects and Scenes
Biederman’s violaPons (1981):
![Page 108: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/108.jpg)
Global precedence Forest Before Trees: The Precedence of Global Features in Visual Perception Navon (1977)
![Page 109: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/109.jpg)
Scene recognition without object recognition
S
g
Scene
Scene features
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 110: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/110.jpg)
An integrated model of Scenes, Objects, and Parts
Ncar
S
g
Scene
Scene gist
features
0
0
1
1
5
5
N
P(Ncar | S = street)
P(Ncar | S = park) N
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 111: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/111.jpg)
Object retrieval: scene features vs. detector Results using the keyboard detector alone
Results using both the detector and the global scene features
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 112: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/112.jpg)
The layered structure of scenes
p(x2|x1) p(x)
In a display with multiple targets present, the location of one target constraints the ‘y’ coordinate of the remaining targets, but not the ‘x’ coordinate.
Assuming a human observer standing on the ground
Torralba, Oliva, Castelhano, Henderson. 2006
![Page 113: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/113.jpg)
Context driven object detection
Zcar Ncar
S
g
Scene
Scene gist
features
0 1 5
P(Ncar | S = street)
N
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 114: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/114.jpg)
An integrated model of Scenes, Objects, and Parts
p(d | F=1) = N(d | µ1, σ1) p(d | F=0) = N(d | µ0, σ0)
We train a multiview car detector.
xcari dcar
i
car Fi
N=4
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 115: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/115.jpg)
An integrated model of Scenes, Objects, and Parts
Zcar Ncar
S
g
Scene
Scene gist
features xcar
i dcari
car Fi
M=4
Murphy, Torralba, Freeman; NIPS 2003. Torralba, Murphy, Freeman, CACM 2010.
![Page 116: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/116.jpg)
Two tasks
![Page 117: NIPS2009: Understand Visual Scenes - Part 1](https://reader037.vdocument.in/reader037/viewer/2022100309/554e73bab4c90545698b4bb0/html5/thumbnails/117.jpg)