relja arandjelović and andrew zisserman · 2014. 10. 28. · visual vocabulary with a semantic...



Visual vocabulary with a semantic twist

Relja Arandjelović and Andrew Zisserman

Visual Geometry Group, Department of Engineering Science, University of Oxford

Motivation and objectives

Semantic vocabulary

Results

Fast Semantic Segmentation via Soft Segments (FSSS)

(paper within a paper)

• Standard large scale instance retrieval:

- Usually based on matching local descriptors, e.g. (Root)SIFT

- Not distinctive enough

- Can't "see the big picture"

• SemanticSIFT:

- Matching: utilize local image semantic content

[Figure panels: before, after]

• Suppose we have pixel-wise semantic segmentation into C classes

• Assign a "semantic word" to a local image patch:

- The patch contains semantic class c if it contains at least one pixel of a class c

- Number of possible semantic words: Ks = 2^C - 1 (every non-empty subset of the C classes)

- For our choice: {sky, flora, other} (C=3) there are Ks=7 semantic words: {sky}, {flora}, {other}, {sky, flora}, {sky, other}, {flora, other}, {sky, flora, other}
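The semantic-word assignment above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `CLASSES`, `semantic_word`, `semantic_word_id` and `all_semantic_words` are hypothetical, and the bitmask encoding of a word into an integer id is an assumption made here for convenience.

```python
from itertools import combinations

# The three classes chosen on the poster (C = 3).
CLASSES = ("sky", "flora", "other")


def semantic_word(patch_labels):
    """Return the set of classes that own at least one pixel of the patch."""
    present = frozenset(patch_labels)
    if not present:
        raise ValueError("a patch must contain at least one pixel")
    if not present <= set(CLASSES):
        raise ValueError("unexpected class label in patch")
    return present


def semantic_word_id(word):
    """Encode a semantic word as an integer in 0 .. Ks-1 (assumed bitmask scheme)."""
    mask = sum(1 << CLASSES.index(c) for c in word)
    return mask - 1  # the empty set (mask 0) cannot occur, so shift down by one


def all_semantic_words():
    """Enumerate every non-empty subset of CLASSES; there are 2^C - 1 of them."""
    return [frozenset(s)
            for r in range(1, len(CLASSES) + 1)
            for s in combinations(CLASSES, r)]


if __name__ == "__main__":
    # A patch straddling the skyline: mostly sky, a few flora pixels.
    word = semantic_word(["sky", "sky", "flora", "sky"])
    print(sorted(word), semantic_word_id(word))
    print(len(all_semantic_words()))
```

A patch that straddles a sky/flora boundary therefore gets the word {sky, flora}, which is distinct from both {sky} and {flora}.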

Matching

Product vocabulary

Feature removal

• Patches can match only if their semantic words are identical

• Win #1: Increases precision due to stricter matching
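As a rough illustration of the stricter matching rule, the sketch below adds the semantic-word test on top of a visual-word test. The `Feature` type and `can_match` function are hypothetical names, and using plain visual-word equality as the descriptor test is a simplifying assumption (the baseline on the poster uses Hamming Embedding).

```python
from typing import FrozenSet, NamedTuple


class Feature(NamedTuple):
    """A local feature reduced to its quantised labels (illustrative only)."""
    visual_word: int
    semantic_word: FrozenSet[str]


def can_match(a: Feature, b: Feature) -> bool:
    """Two features may match only if both their visual and semantic words agree."""
    return a.visual_word == b.visual_word and a.semantic_word == b.semantic_word


if __name__ == "__main__":
    on_building = Feature(42, frozenset({"other"}))
    on_tree = Feature(42, frozenset({"flora"}))
    # Same visual word, different semantics: rejected by the semantic test.
    print(can_match(on_building, on_tree))
    print(can_match(on_building, Feature(42, frozenset({"other"}))))
```

The first pair would have been accepted by visual words alone; rejecting it is where the precision gain comes from.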

• SemanticSIFT vocabulary: product vocabulary of the visual and semantic vocabularies; size K=Ksemantic x Kvisual

• Large scale retrieval: ranking via inverted index which exploits bag-of-words sparsity

- Larger vocabulary => shorter posting lists => fewer items to traverse during scoring => faster retrieval

• Win #2: Faster retrieval due to the larger (product) vocabulary
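The effect of the product vocabulary on posting-list length can be seen on a toy inverted index. This is a sketch under stated assumptions: the flattening `semantic_id * k_visual + visual_id`, the function names, and the three-image database are all made up for illustration and are not taken from the authors' system.

```python
from collections import defaultdict


def build_inverted_index(images, k_visual, use_semantic):
    """Map each word to the list of images containing it.

    images: dict of image_id -> list of (semantic_id, visual_id) pairs.
    With use_semantic=True the word is an index into the product vocabulary.
    """
    index = defaultdict(list)
    for image_id, features in images.items():
        words = set()
        for semantic_id, visual_id in features:
            if use_semantic:
                words.add(semantic_id * k_visual + visual_id)
            else:
                words.add(visual_id)
        for word in sorted(words):
            index[word].append(image_id)
    return index


def entries_traversed(index, query_words):
    """Count posting-list entries touched when scoring a query."""
    return sum(len(index.get(w, ())) for w in set(query_words))


if __name__ == "__main__":
    k_visual = 10  # toy visual vocabulary size
    images = {
        "a": [(0, 5)],
        "b": [(1, 5)],
        "c": [(1, 5), (2, 9)],
    }
    visual_index = build_inverted_index(images, k_visual, use_semantic=False)
    product_index = build_inverted_index(images, k_visual, use_semantic=True)
    # Query feature: visual word 5 with semantic word id 1.
    print(entries_traversed(visual_index, [5]))
    print(entries_traversed(product_index, [1 * k_visual + 5]))
```

In the visual-only index the posting list for word 5 holds all three images, while the product index splits it by semantic word, so the query touches fewer entries.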

• For a specific task: some features are not useful, or even detrimental

• Can remove features a priori known to be irrelevant

• Win #3: Reduced storage (RAM) costs
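Feature removal amounts to filtering features by semantic word before indexing. The sketch below is illustrative only: the poster does not say here which semantic words are discarded, so treating sky-only patches as irrelevant is an assumption, and `IRRELEVANT_WORDS` and `remove_irrelevant` are hypothetical names.

```python
# Assumption for illustration: patches containing only sky carry no
# information about a landmark and can be dropped before indexing.
IRRELEVANT_WORDS = {frozenset({"sky"})}


def remove_irrelevant(features, irrelevant_words=IRRELEVANT_WORDS):
    """Keep only features whose semantic word is not in the irrelevant set.

    features: list of (semantic_word, visual_id) pairs, where semantic_word
    is a frozenset of class names.
    """
    return [f for f in features if f[0] not in irrelevant_words]


if __name__ == "__main__":
    features = [
        (frozenset({"sky"}), 3),
        (frozenset({"other"}), 8),
        (frozenset({"sky", "other"}), 5),
    ]
    kept = remove_irrelevant(features)
    print(len(features), "->", len(kept))
```

Features that are never indexed need no storage, which is the source of the RAM saving.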

Win-Win-Win

• Testing on Oxford 5k and 105k datasets, training on Paris6k

• Baseline: Hamming Embedding + burstiness

• Over 5 random seeds: +1.2%

• Baseline with 7x larger visual vocabulary, Oxford 5k: 54.9%

• Expected speedup for an average query for Oxford 105k and SoftSemanticSIFT: 38.4%

Mean average precision (mAP)

Empirical speedup for the 55 Oxford queries

• State-of-the-art semantic segmentation methods take minutes per image

• We introduce a new method which takes 7 seconds on a single CPU in MATLAB for a 500x500 pixel image

• Code available: www.robots.ox.ac.uk/~vgg/software/fast_semantic_segmentation

• Idea:

- Start with fast soft-segmentation method by Leordeanu et al. ECCV 2012 (takes 1.7s)

- To handle segmentation uncertainty: introduce an "unknown" class and allow it to match all classes

- Minimize an energy which stimulates agreement between soft-segments and similar pixels, taking into account soft-segment unary potentials
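The "unknown" class in the list above can be illustrated as a wildcard in the matching test. This is a minimal sketch, assuming the uncertainty is expressed at the level of a patch's semantic word; how unknown pixels propagate to patches is not specified here, and `UNKNOWN` and `soft_semantic_match` are hypothetical names.

```python
UNKNOWN = "unknown"  # sentinel for a patch whose semantics are uncertain


def soft_semantic_match(word_a, word_b):
    """Semantic test in which the unknown word is compatible with every word."""
    if word_a == UNKNOWN or word_b == UNKNOWN:
        return True
    return word_a == word_b


if __name__ == "__main__":
    sky = frozenset({"sky"})
    flora = frozenset({"flora"})
    print(soft_semantic_match(sky, flora))    # confident and different
    print(soft_semantic_match(sky, UNKNOWN))  # uncertain, so not rejected
```

The design choice is conservative: an uncertain segmentation never vetoes a match, so segmentation errors cost speed rather than recall.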

• Results:

- Stanford background dataset: 78% @ 3.7s / image

- State-of-the-art: Lempitsky et al. (2011): 81.9% @ minutes per image due to using globalPb

- Tighe & Lazebnik (2010): 77.5% @ 10 min / image

no geometric verification

False matches based on SIFT that are removed by semantic filtering