visual recognition with humans in the...

Serge Belongie UC San Diego

Visual Recognition with Humans in the Loop

Peter Welinder Pietro Perona

Steve Branson Catherine Wah Boris Babenko Florian Schroff

http://www.cse.ucsd.edu

Outline

• Visipedia project overview
• Relevant Work
• Birds-200 dataset
• “Visual 20 Questions” game
• Results
• Discussion

What Is Visipedia?

http://en.wikipedia.org/wiki/Bird

• The visual counterpart to Wikipedia
• A user-generated encyclopedia of visual knowledge
• An effort to associate articles with large quantities of well-organized, intuitive visual concepts

Motivation

• People will willingly label or organize certain images if:
  – They are interested in a particular subject matter
  – They have the appropriate expertise

Ring-tailed lemur Thruxton Jackaroo

Motivation

• Construct a more comprehensive and intuitive knowledge base of visual objects
• Provide services like better text-to-image search and image-to-article search

Populating Visipedia

• Populate Wikipedia articles with more visual data using large quantities of unlabeled data on the web

World wide web Visipedia

Related Work: Systems

• Botanist’s Field Guide [Belhumeur et al. ’08]
• Oxford Flowers [Nilsback & Zisserman ’08]
• STONEFLY9 [Martínez-Muñoz et al. ’09]
• omoby [IQEngines.com ’10]
• 20 Questions game [20q.net]
• ReCAPTCHA [von Ahn et al. ’08]
• Wikimedia Commons


Related Work: Methods

• Relevance Feedback • Active Learning • Expert Systems • Decision Trees • Feature Sharing & Taxonomies • Parts & Attributes • Crowdsourcing & Human Computation


Attribute-Based Classification

• Train classifiers on attributes instead of objects
• Attributes are shared by different object classes
• Attributes provide the ingredients necessary to recognize each object class

Lampert et al. 2009 Farhadi et al. 2009

Wikimedia Commons

• Multiple ways of organizing sub-categories and visual information

• Sub-categories or clusters are represented by some exemplar image

http://commons.wikimedia.org/wiki/Dog

Motivation (Computer Vision Perspective)

• Need for more training data – Beyond the capacity of any one research group – Better quality control

• Need for more realistic data – Let people define what tasks are important – Study tightly-related categories

Dealing With a Large Number of Related Classes

• Standard classification methods fail because:
  – Only a small number of training examples per class is available
  – Variation between classes is small
  – Variation within a class is often still high

Brewer’s Sparrow Vesper Sparrow

Birds-200 Dataset

6033 images over 200 bird species

Image Harvesting

• Flickr: text search on species name
• MTurk: presence/absence and bounding boxes


Attribute Labeling

• Attributes from whatbird.com
• 25 visual attributes -> 288 binary attributes
  – similar to a “dichotomous key” in biology
• MTurk interface
  – {guessing, probably, definitely}
• 5x redundancy factor
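The labeling scheme above can be sketched in a few lines. This is only an illustration: the attribute and value names, the `::` naming scheme, and the certainty weights are assumptions, not values from the slides, and the slides do not say how the 5 redundant answers were actually combined.

```python
# Assumed weights for the three certainty levels (hypothetical, not from the slides).
CERTAINTY_WEIGHT = {"guessing": 0.25, "probably": 0.5, "definitely": 1.0}


def to_binary_attributes(attribute, values):
    """Expand one multi-valued visual attribute into one binary attribute per value."""
    return [f"{attribute}::{value}" for value in values]


def aggregate_votes(votes):
    """Combine redundant worker answers for one binary attribute.

    votes: list of (is_present, certainty) pairs, one per worker.
    Returns the certainty-weighted fraction of "present" answers in [0, 1].
    """
    total = sum(CERTAINTY_WEIGHT[certainty] for _, certainty in votes)
    yes = sum(CERTAINTY_WEIGHT[certainty] for present, certainty in votes if present)
    return yes / total if total else 0.5


# Hypothetical attribute with two values, labeled by 5 workers (5x redundancy).
binary_names = to_binary_attributes("belly_color", ["yellow", "black"])
votes = [
    (True, "definitely"),
    (True, "probably"),
    (False, "guessing"),
    (True, "definitely"),
    (False, "probably"),
]
score = aggregate_votes(votes)
```

Weighting by certainty lets a confident answer count for more than a guess; a plain majority vote would be the simpler alternative.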


Attribute-Based Classification

• Number of attributes is less than number of classes

• Attribute classification tasks might be easier

• Makes it easier to incorporate human knowledge
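One way to picture attribute-based classification is to score each class by how well the predicted attributes match that class's attribute signature. The sketch below assumes independent attributes and uses hypothetical attribute names and signatures; it is not necessarily the model used in this work.

```python
import math


def class_log_scores(attribute_probs, class_signatures):
    """Score classes from attribute classifier outputs.

    attribute_probs[a]: estimated probability that attribute a is present in the image.
    class_signatures[name]: list of 0/1 flags saying which attributes the class has.
    Assumes attributes are independent given the class (a simplifying assumption).
    """
    eps = 1e-6  # keep log() finite for probabilities of exactly 0 or 1
    scores = {}
    for name, signature in class_signatures.items():
        total = 0.0
        for p, present in zip(attribute_probs, signature):
            p = min(max(p, eps), 1.0 - eps)
            total += math.log(p if present else 1.0 - p)
        scores[name] = total
    return scores


# Hypothetical attributes: [belly_striped, head_black].
signatures = {
    "Pine Warbler": [0, 0],
    "Cape May Warbler": [1, 1],
    "Kentucky Warbler": [0, 1],
}
scores = class_log_scores([0.1, 0.8], signatures)
best = max(scores, key=scores.get)
```

Two attribute classifiers separate three classes here, which is the point of the first bullet: attribute classifiers are shared, so fewer of them are needed than per-class classifiers.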

www.whatbird.com


MTurker Label Certainty

MTurker Feedback

• “These hits were fun. Will you be posting more of them anytime soon? Thanks!”
• “These are Beautiful birds and I am enjoying this hit collection”
• “I really enjoy doing your hits, they are fun and interesting. Thanks.”
• “Love doing these because I'm a bird watcher.”
• “the birds are so cute..hope u can send more kind of birds”
• “I haven't really studied birds, but doing these HITs has made me realize just how beautiful they are. It has also made me aware of the many different types of birds. Thank you”
• “I REALLY LOVE THE COLOR OF THE BIRDS.”
• “Thank you for providing this job. The fact that the images are beautiful to look at make it a lot more enjoyable to do!”
• “Enjoyable to do.”
• Hourly Wage ≈ $1.25


Visual 20 Questions


• “Computer Vision” module = Vedaldi’s VLFeat • VQ Geometric Blur, color/gray SIFT spatial pyramid • Multiple Kernel Learning • Per-Class 1-vs-All SVM • 15 training examples per bird species • Choose question to maximize expected Information Gain
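The last bullet, choosing the question that maximizes expected information gain, can be sketched as follows. The function names, the question names, and the per-class answer probabilities are assumptions made for illustration; the sketch treats every question as binary.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(class_probs, p_yes_given_class):
    """Expected reduction in class entropy from asking one binary question.

    class_probs[c]: current probability of class c.
    p_yes_given_class[c]: probability that a user answers "yes" if the class is c.
    """
    p_yes = sum(pc * py for pc, py in zip(class_probs, p_yes_given_class))
    gain = entropy(class_probs)
    outcomes = (
        (p_yes, p_yes_given_class),
        (1.0 - p_yes, [1.0 - py for py in p_yes_given_class]),
    )
    for p_answer, likelihoods in outcomes:
        if p_answer > 0:
            posterior = [pc * l / p_answer for pc, l in zip(class_probs, likelihoods)]
            gain -= p_answer * entropy(posterior)
    return gain


def choose_question(class_probs, questions):
    """Pick the question (a key of `questions`) with the largest expected gain."""
    return max(questions, key=lambda q: expected_information_gain(class_probs, questions[q]))


# Four equally likely classes and three hypothetical questions.
class_probs = [0.25, 0.25, 0.25, 0.25]
questions = {
    "has_yellow_belly": [1.0, 1.0, 0.0, 0.0],
    "has_curved_beak": [1.0, 0.0, 0.0, 0.0],
    "is_a_bird": [1.0, 1.0, 1.0, 1.0],
}
chosen = choose_question(class_probs, questions)
```

A question that splits the remaining probability mass evenly gains the most, and a question every class answers the same way gains nothing.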

General Observations

• User Responses are Stochastic
• Computer Vision Reduces Manual Labor
• User Responses Drive Up Performance
• Computer Vision Improves Overall Performance
• Different Questions are Asked w/ and w/o Computer Vision
• Recognition is not Always Successful


[Figure: w/o Computer Vision. x-axis: Number of Binary Questions Asked; y-axis: Percent Classified Correctly; curves: Deterministic Users, MTurk Users, MTurk Users + Model]

• User Responses are Stochastic
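Stochastic responses can be handled by treating each answer as noisy evidence rather than a hard constraint. The sketch below is a plain Bayes update under assumed numbers: the prior stands in for a computer vision estimate (or a uniform distribution without one), and the per-class "yes" probabilities are hypothetical.

```python
def update_posterior(prior, p_yes_given_class, answer_is_yes):
    """Bayes update of class probabilities after one binary answer.

    prior[c]: class probability before the answer.
    p_yes_given_class[c]: probability that a user says "yes" when the true class is c.
    Values strictly between 0 and 1 model users who sometimes answer wrongly.
    """
    likelihood = [p if answer_is_yes else 1.0 - p for p in p_yes_given_class]
    unnormalized = [pr * l for pr, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        return list(prior)  # the answer rules out every class; fall back to the prior
    return [u / total for u in unnormalized]


# Hypothetical three-class example.
vision_prior = [0.5, 0.3, 0.2]
noisy_user = [0.9, 0.2, 0.1]          # users are usually, not always, right
deterministic_user = [1.0, 0.0, 0.0]  # users never contradict the class description

after_noisy_yes = update_posterior(vision_prior, noisy_user, answer_is_yes=True)
after_hard_no = update_posterior(vision_prior, deterministic_user, answer_is_yes=False)
```

With the deterministic model, one wrong answer removes the true class for good; with the noisy model it only lowers its probability, so later answers can still recover it.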

[Figure: w/ Computer Vision. y-axis: Percent Classified Correctly; curves: No CV, 1-vs-all, Attribute]

• Computer Vision Reduces Manual Labor

[Figure: w/ Computer Vision (cont’d). y-axis: Percent of Testset Images; legend: No CV (11.11), 1-vs-all (6.64), Attribute (6.43)]

• User Responses Drive Up Performance

• Computer Vision Improves Overall Performance
• Different Questions are Asked w/ and w/o Computer Vision

• Recognition is not Always Successful

Indigo Bunting Blue Grosbeak

Future Work

• More Birds! More Categories!
• Attribute Induction
• Incorporate Part Localization
• Partner with Wikimedia Foundation

Project Website

• Database, harvesting software, etc – http://visipedia.org


• (extra slides follow)


Part Labels

• Part diagrams give some indication of the spatial configuration of parts, but people will do this only for a small number of images

Object Localization and Shared Parts

• Training a classifier with latent variables (Dollar et al. 2008, Felzenszwalb et al. 2008)

• Latent variables are things like the pose and location of parts

• Objects in related domains share the same types of parts and poses

Shared Parts and Attributes

[Figure: example images (Pine Warbler, Cape May Warbler, Kentucky Warbler, Hornet) and Attribute and Part Detectors (Yellow, Beak, Black, Striped, Belly)]


Pine Warbler, Cape May Warbler, Kentucky Warbler

• Train part and attribute classifiers from class descriptions:
  – Part locations $z_i^{\text{belly}}$, $z_i^{\text{head}}$, $z_i^{\text{beak}}$ in image $x_i$ are latent variables

• Pine Warbler: Belly: solid, yellow; Head: yellow; Beak: all-purpose
• Cape May Warbler: Belly: striped, yellow, black; Head: black; Beak: all-purpose
• Kentucky Warbler: Belly: solid, yellow; Head: black; Beak: all-purpose


Pine Warbler, Cape May Warbler, Kentucky Warbler

• Training examples for each part/attribute span across different bird classes
  – For each Cape May Warbler image $x_i$

• Pine Warbler: Belly: solid, yellow; Head: yellow; Beak: all-purpose
• Cape May Warbler: Belly: striped, yellow, black; Head: black; Beak: all-purpose
• Kentucky Warbler: Belly: solid, yellow; Head: black; Beak: all-purpose

Objects are More than Class Labels

[Figure: example images (Pine Warbler, Cape May Warbler, Kentucky Warbler, Hornet) and Attribute and Part Detectors (Yellow, Beak, Black, Striped, Belly)]

• Represent objects as parts and attributes
• Model relationships between classes
• Pool training examples from different object classes
• Define building blocks useful to detect new object classes

Classification Using Multiple Pathways

• Arrange recognition tasks into multiple “pathways”

[Diagram: Image; Indoor vs. Outdoor; Graphic vs. Real Image.
Bird Pathway: Bird Detector; Species Detectors; Parts, Pose, Attributes.
Face Pathway: Face Detector; Face Recognition; Parts, Attributes.
Text Pathway: Text Detector; Text Reading.]


• Place redundant calculations in earlier pathways
• Transfer information from easier tasks to harder ones
• Cascade classification tasks to avoid unnecessary computations


• Pathway components:
  – A domain: a set of object classes that often have similar parts or attributes
  – Takes as input an image and information extracted from earlier pathways
  – Algorithms useful for extracting attributes and information relevant to the domain
  – Outputs estimated attributes and part locations
  – Invokes other pathways as necessary
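A cascade of pathways could be organized roughly as below. Everything here is a hypothetical stand-in: `run_pathways`, the toy detectors, and the dictionary used as an "image" only illustrate the control flow (cheap detector first, domain analysis only when it fires, results passed on to later pathways).

```python
def run_pathways(image, pathways):
    """Run each pathway's analysis only if its detector fires.

    pathways: list of (name, detector, analyzer) tuples, ordered from general to specific.
    Both callables receive the image and the context gathered by earlier pathways.
    """
    context = {}
    for name, detector, analyzer in pathways:
        if detector(image, context):
            context[name] = analyzer(image, context)
    return context


# Toy stand-ins: a real system would call trained detectors on pixel data.
toy_pathways = [
    (
        "bird",
        lambda img, ctx: img.get("has_bird", False),
        lambda img, ctx: {"species": img.get("species")},
    ),
    (
        "text",
        lambda img, ctx: img.get("has_text", False),
        lambda img, ctx: {"reading": img.get("text")},
    ),
]

result = run_pathways({"has_bird": True, "species": "Indigo Bunting"}, toy_pathways)
```

The text pathway's analyzer never runs for this image, which is the saving the cascade is meant to provide.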

Clustering and Near Duplicate Detection

Raw Beef Cooked Beef Cow Diagrams

• Improve presentation of data by suppressing duplicate, redundant, or similar images

Clustering and Near Duplicate Detection

• Use similarity metrics in different feature spaces, e.g. bag of words, color histograms, and GIST, together with standard methods for clustering and near-duplicate detection
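As a minimal sketch of near-duplicate suppression in one feature space, the code below compares normalized color histograms with histogram intersection and greedily keeps one image per group. The threshold and the toy histograms are assumptions; the slides do not specify a particular metric or method.

```python
def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two histograms that each sum to 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def suppress_near_duplicates(histograms, threshold=0.9):
    """Greedily keep an image only if it is not too similar to any image already kept.

    Returns the indices of the kept images, in input order.
    """
    kept = []
    for i, h in enumerate(histograms):
        if all(histogram_intersection(h, histograms[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Three toy 3-bin color histograms; the second is nearly identical to the first.
histograms = [
    [0.5, 0.5, 0.0],
    [0.5, 0.45, 0.05],
    [0.0, 0.1, 0.9],
]
kept = suppress_near_duplicates(histograms)
```

The same loop works for bag-of-words or GIST descriptors once a suitable similarity function is swapped in.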

Image Registration

• Bring unlabeled images into correspondence with a labeled one using some matching function, e.g. an affine or perspective transformation or shape matching
• Transfer labels from labeled images to unlabeled ones
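For the affine case, label transfer can be sketched as: fit the transform from three point correspondences, then map the labeled points through it. The correspondences and the `beak` label below are made up for illustration, and a practical system would fit many noisy matches by least squares instead of exactly three.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("points are collinear; the affine map is not determined")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(src, dst):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three (x, y) correspondences."""
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [x for x, _ in dst])
    row_y = solve3(m, [y for _, y in dst])
    return row_x, row_y


def transfer_labels(affine, labeled_points):
    """Map named points from the labeled image into the unlabeled image."""
    (a, b, c), (d, e, f) = affine
    return {
        name: (a * x + b * y + c, d * x + e * y + f)
        for name, (x, y) in labeled_points.items()
    }


# Hypothetical correspondences: the unlabeled image is scaled by 2 and shifted by (2, 3).
affine = fit_affine([(0, 0), (1, 0), (0, 1)], [(2, 3), (4, 3), (2, 5)])
transferred = transfer_labels(affine, {"beak": (0.5, 0.5)})
```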

Visual Knowledge

• Associate categories with predictions of which visual attributes are most representative

Sacred Ibis Glossy Ibis

Curved Beak

Interactive Labeling Systems

• Speed up the population of image examples for some category:
  – Active learning: intelligently query labeling tasks while incrementally training a category classifier
  – Relevance feedback: use labeled images to re-rank unlabeled images by relevance to a category

• Semi-supervised segmentation methods, e.g. GrabCut
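The two query strategies above differ mainly in which unlabeled image they surface next. A minimal sketch, assuming a category classifier that already outputs a probability per unlabeled image (the image ids and scores are hypothetical): active learning asks about the image the classifier is least sure of, while relevance feedback shows the most likely matches first.

```python
def most_uncertain(unlabeled_scores):
    """Active learning by uncertainty sampling: pick the score closest to 0.5.

    unlabeled_scores: dict mapping image id to estimated p(category | image).
    """
    return min(unlabeled_scores, key=lambda image_id: abs(unlabeled_scores[image_id] - 0.5))


def rerank_by_relevance(unlabeled_scores):
    """Relevance feedback: order unlabeled images from most to least likely match."""
    return sorted(unlabeled_scores, key=unlabeled_scores.get, reverse=True)


# Hypothetical classifier outputs for three unlabeled images.
scores = {"img_a": 0.95, "img_b": 0.55, "img_c": 0.10}
next_query = most_uncertain(scores)
ranking = rerank_by_relevance(scores)
```

After each new label, the classifier would be retrained and the scores recomputed before the next query.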

Combining Knowledge From Text and Images

• Leverage article text and link structure in Wikipedia articles

Connecting Knowledge Between Article Text and Images

• Use article text and link structure to predict class attributes, taxonomical structure, and object parts

Connecting Knowledge Between Article Text and Images

• Add “links” between images and article text
