compact descriptors for visual search

Compact Descriptors 4 Visual Search

Danilo Pau

Senior Principal Engineer

Senior Member of Technical Staff

SMIEEE

SI/CVRP

STMicroelectronics/AST

Courtesy: M. Funamizu

Agenda

• Visual Search: Context

• MPEG initiative on Visual Search

• Compact Descriptors for Visual Search

• Implementation

• Use Cases

• Visual Search Evolution: Moving Pictures and 3D

• Questions and Answers


15/01/2013



Visual Search Context

• Millions of images and videos continue to be uploaded to remote servers all over the world

• Each day on Facebook 300 million photos are uploaded

• roughly 58 photos uploaded each second

• One hour of video uploaded to YouTube every second



Content Based Image Recognition

• CBIR covers the concept of search that analyzes the actual content in the image, rather than relying on metadata.

• The development of this concept incorporated many algorithms and techniques from fields such as statistics, pattern recognition and computer vision.

• CBIR attracted a lot of attention and, after many years of research, it has expanded towards the marketplace.

• CBIR’s application in the mobile market is called Mobile Visual Search

• Visual Search is the capability to initiate a search using an image as a query that captures a rigid object

• The market potential of mobile visual search covers any mobile device with a camera (phones, tablets and hybrids).


CBIR vs QR Codes

• Quick Response (QR) codes are a type of two-dimensional barcode.

• The code is scanned by the mobile imager to produce a URL address for redirection and browsing.

• QR codes are being used by 6.2% of smartphone users in the USA


Lots of Existing Applications

• Google’s Goggles

• Nokia’s Point and Find

• oMoby

• Like.com

• Kooaba

• Moodstocks

• Snaptell

• pixlinQ

• Bing

Existing Apps Use JPEG

• Existing applications use the mobile imager to send JPEG-compressed queries

[Diagram: the mobile device sends JPEG images to a remote server with a database; the server returns the visual search result]

An Example of Visual Search

[Figure panels: Query, Interest Point Description, Descriptor Pairing, Inliers. Courtesy Telecom Italia]

The Rise of Compressed Descriptors

• Alternatively send “compact features” extracted from raw images

• For example, Scale Invariant Feature Transform (SIFT) visual descriptors

• Consider 1200 descriptors per frame, each one 128 bytes plus 4 bytes for coordinates, at 30 fps → network load of nearly 38 Mbit/s → unacceptable
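The 38 Mbit/s figure follows directly from the numbers on the slide. A minimal sketch of the arithmetic (the function and constant names are illustrative, not taken from the presentation):

```python
# Raw SIFT payload per frame and the resulting bit rate,
# using only the figures quoted on the slide.
NUM_DESCRIPTORS = 1200    # descriptors per frame
DESCRIPTOR_BYTES = 128    # one byte per SIFT element
COORDINATE_BYTES = 4      # keypoint coordinates
FRAMES_PER_SECOND = 30


def payload_bytes_per_frame():
    """Bytes needed to send all descriptors of one frame, uncompressed."""
    return NUM_DESCRIPTORS * (DESCRIPTOR_BYTES + COORDINATE_BYTES)


def bitrate_mbit_per_s():
    """Network load in Mbit/s (1 Mbit = 10**6 bits)."""
    return payload_bytes_per_frame() * FRAMES_PER_SECOND * 8 / 1e6


print(payload_bytes_per_frame())       # 158400 bytes per frame
print(f"{bitrate_mbit_per_s():.1f}")   # 38.0 Mbit/s
```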

[Chart: size in KB (axis 0 to 160) of a VGA image sent as JPEG High, JPEG Low, and SIFT]

Systems Considered

• Instead of sending images (a),

• the application can send compact descriptors (b)

• and even perform the search locally (c).

Previous Attempts

• Hashing
  • Locality Sensitive Hashing [Yeo et al., 2008]
  • Similarity Sensitive Coding [Torralba et al., 2008]
  • Spectral Hashing [Weiss et al., 2008]

• Transform Coding
  • Karhunen-Loève Transform [Chandrasekhar et al., 2009]
  • ICA-based Transform [Narozny et al., 2008]

• Vector Quantization
  • Product Quantization [Jegou et al., 2010]
  • Tree Structured Vector Quantization [Nistér et al., 2006]

• Alternative to SIFT
  • Compressed Histogram of Gradients [Chandrasekhar et al., 2011]




Is a Standard on Visual Search Needed?

• Reduce load on wireless networks carrying visual search-related information

• Ensure interoperability of visual search applications and databases

• Enable hardware support for descriptor extraction and matching in mobile devices

• Enable a high level of performance for implementations conformant to the standard

• Simplify design of descriptor extraction and matching for visual search applications

What is a suitable standardization body?

• Informal title:
  • Moving Picture Experts Group (MPEG)

• Formal title:
  • ISO/IEC JTC1 SC29 WG11 (Coding of Moving Pictures and Audio)

• Parent SDOs:
  • ISO: International Organization for Standardization
  • IEC: International Electrotechnical Commission
  • JTC 1: Joint Technical Committee One
  • SC29: Subcommittee 29: Coding of Audio, Picture, Multimedia and Hypermedia Information

• Members: National Bodies (25 voting, 16 observers)

[Diagram: hierarchy JTC 1, SC29, WG11 (MPEG)]



CDVS : Scope

• Descriptor extraction process needed to ensure interoperability.

• Bitstream of compact descriptors

[Diagram: pipeline with blocks Query Image, Descriptor extraction, Descriptor bitstream, Descriptor matching, Geometric verification, Database, and List of results; a "Standard" label marks the standardized part]

Requirements

• Robustness
  • High matching accuracy shall be achieved at least for images of textured rigid objects, landmarks, and printed documents.
  • The matching accuracy shall be robust to changes in vantage points, camera parameters, lighting conditions, as well as in the presence of partial occlusions.

• Sufficiency
  • Descriptors shall be self-contained, in the sense that no other data are necessary for matching.

• Compactness
  • Shall minimize lengths/size of image descriptors

• Scalability
  • Shall allow adaptation of descriptor lengths to support the required performance level and database size.
  • Shall enable design of web-scale visual search applications and databases.

How to achieve robustness

• Image content is transformed into visual features with coordinates that are invariant to illumination, scale, rotation, affine and perspective transforms

Types of invariance

• Illumination

• Scale

• Rotation

• Affine Transform

• Full Perspective


Compactness

[Chart: size in KB (axis 0 to 160) of a VGA image sent as JPEG High, JPEG Low, and SIFT, compared with compact descriptors of 512 B, 1 KB, 2 KB, 4 KB, 8 KB and 16 KB]

Extraction Pipeline

[Diagram: pipeline from Image to Compact descriptors with blocks Resizing, SIFT DoG, Keypoint selection, Local Description Extraction, Encoding, Coordinate coding, Arithmetic coding, and SCFV Descriptor; the H Mode branch uses Transform & SQ, the S Mode branch uses MSVQ encoding]

• H-Mode uses SQ encoding (256 B)

• S-Mode uses MSVQ encoding (38 KB)

• Both modes use SCFV (49 KB)

Properties of SIFT

David Lowe’s local descriptor detection and extraction (1999-2004)

Extraordinarily robust matching technique

• Can handle changes in viewpoint
  • Up to about 30 degrees of out-of-plane rotation

• Can handle significant changes in illumination
  • Sometimes even day vs. night

• Lots of code available: http://www.vlfeat.org (BSD license)

Pyramid of DoG

[Figure: pyramid of Difference-of-Gaussians (DoG) images, organized as Octave 1 to Octave n, each with Scale 1 to Scale m]

Actual Interest Point Detector Output

Building a Descriptor

• Take a 16x16 patch window around the detected interest point

• Subdivide the patch into a 4x4 grid of sub-patches

• For each sub-patch, create an 8-bin histogram over edge orientations, weighted by gradient magnitude

• These lead to a 4x4x8 = 128 element vector → the SIFT descriptor
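The steps above can be sketched in plain Python. This is a simplified illustration, not Lowe's full method: it assumes a 16x16 grayscale patch given as a list of lists, uses central differences with clamped borders, and omits the Gaussian weighting, the interpolation between bins and the rotation to the dominant orientation. The function name is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Build a 4x4x8 = 128 element orientation histogram from a 16x16 patch."""
    size, cell, n_bins = 16, 4, 8
    if len(patch) != size or any(len(row) != size for row in patch):
        raise ValueError("expected a 16x16 patch")

    def pixel(r, c):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return patch[min(max(r, 0), size - 1)][min(max(c, 0), size - 1)]

    cells_per_side = size // cell
    histograms = [[0.0] * n_bins for _ in range(cells_per_side ** 2)]
    for r in range(size):
        for c in range(size):
            gx = pixel(r, c + 1) - pixel(r, c - 1)
            gy = pixel(r + 1, c) - pixel(r - 1, c)
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            # Map the angle in [0, 2*pi) to one of 8 orientation bins.
            b = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
            cell_index = (r // cell) * cells_per_side + (c // cell)
            histograms[cell_index][b] += magnitude  # weighted by magnitude

    vector = [v for hist in histograms for v in hist]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


# Usage: a horizontal intensity ramp puts all energy in orientation bin 0.
ramp = [[float(c) for c in range(16)] for _ in range(16)]
descriptor = sift_like_descriptor(ramp)
print(len(descriptor))
```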

[Figure: angle histogram over 0 to 2π]

Key point selection

• Basic idea: inlier features behave differently, in a statistical sense, from outlier features.

• A relevance value is computed for each key point, taking into account distance from center, scale, orientation, peak, mean and variance of the SIFT descriptor.
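One way to picture relevance-based selection is to score each key point and keep the highest-scoring ones up to a budget. The sketch below is an assumption for illustration only: the scoring functions are made up, they cover just three of the listed characteristics (distance from center, scale, peak), and the slide does not say how the individual factors are combined.

```python
import math
from dataclasses import dataclass


@dataclass
class Keypoint:
    x: float
    y: float
    scale: float
    peak: float


def relevance(kp, width, height):
    """Hypothetical relevance: product of three scores in [0, 1]."""
    cx, cy = width / 2, height / 2
    max_dist = math.hypot(cx, cy)
    center_score = 1.0 - math.hypot(kp.x - cx, kp.y - cy) / max_dist
    scale_score = kp.scale / (1.0 + kp.scale)   # larger scale scores higher
    peak_score = kp.peak / (1.0 + kp.peak)      # stronger peak scores higher
    return center_score * scale_score * peak_score


def select_keypoints(keypoints, width, height, budget):
    """Keep the `budget` key points with the highest relevance."""
    ranked = sorted(keypoints,
                    key=lambda kp: relevance(kp, width, height),
                    reverse=True)
    return ranked[:budget]


# Usage: a strong, central key point is preferred over a weak corner one.
candidates = [Keypoint(5, 5, 1.0, 0.1), Keypoint(320, 240, 4.0, 2.0)]
print(select_keypoints(candidates, 640, 480, budget=1))
```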

Local Descriptor Compression: H Mode

• Main idea is to generate a compressed descriptor from uncompressed SIFT by
  • Simple linear combinations of histograms
  • Scalar quantisation of resultant values
  • Adaptive arithmetic coding

• Main benefits
  • Very low computational complexity
  • Negligible memory requirements
  • Highly scalable
  • Allows for very efficient matching and retrieval
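The first two H mode steps can be sketched as follows. The specific linear combinations (differences between opposite orientation bins) and the threshold are assumptions chosen for illustration, not the combinations defined for CDVS, and the adaptive arithmetic coding stage is left out.

```python
def transform_cell(hist):
    """Hypothetical linear combinations of one 8-bin cell histogram."""
    return [hist[i] - hist[i + 4] for i in range(4)]


def quantize(value, threshold):
    """Scalar quantisation of one value to a ternary symbol."""
    if value > threshold:
        return 1
    if value < -threshold:
        return -1
    return 0


def compress_descriptor(sift, threshold=0.05):
    """Map a 128-element SIFT vector to 64 ternary symbols."""
    if len(sift) != 128:
        raise ValueError("expected a 128-element descriptor")
    symbols = []
    for cell in range(16):
        hist = sift[cell * 8:(cell + 1) * 8]
        symbols.extend(quantize(v, threshold) for v in transform_cell(hist))
    return symbols


# Usage: two non-zero bins produce two non-zero symbols.
example = [0.0] * 128
example[0] = 0.5     # cell 0, bin 0
example[12] = 0.5    # cell 1, bin 4
print(compress_descriptor(example)[:8])
```

Each step is a handful of additions and comparisons per element, which is consistent with the low complexity and negligible memory claimed on the slide.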

Vector Quantizer Scheme: S-Mode

Location Encoding

• Histogram Map: The positions of the nonzero bins are encoded as binary words by scanning columns, and the words are compressed by arithmetic coding.

• Histogram Count: The number of coordinates in the nonzero bins is encoded iteratively, by specifying first which bins contain more than 1 key point, then which among these contain more than 2 key points, and so forth
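The two structures can be sketched as below, assuming key point coordinates are first quantised onto a grid of square blocks. The block size, grid dimensions and function names are illustrative assumptions, and the arithmetic coding of the resulting bits is omitted.

```python
from collections import Counter


def location_histogram(points, block_size):
    """Count key points per (column, row) block of the spatial grid."""
    return Counter((int(x) // block_size, int(y) // block_size)
                   for x, y in points)


def histogram_map(hist, cols, rows):
    """One binary word per column: 1 where the bin holds any key point."""
    return [[1 if hist.get((c, r), 0) > 0 else 0 for r in range(rows)]
            for c in range(cols)]


def histogram_count_layers(hist, cols, rows):
    """Layer k flags which of the surviving bins hold more than k key points."""
    active = [(c, r) for c in range(cols) for r in range(rows)
              if hist.get((c, r), 0) > 0]
    layers, level = [], 1
    while active:
        flags = [1 if hist[b] > level else 0 for b in active]
        layers.append(flags)
        active = [b for b, f in zip(active, flags) if f]
        level += 1
    return layers


# Usage on a 16x16 image split into 2x2 blocks of 8 pixels.
points = [(1, 1), (2, 3), (9, 1), (9, 2), (10, 3), (5, 14)]
hist = location_histogram(points, block_size=8)
print(histogram_map(hist, cols=2, rows=2))
print(histogram_count_layers(hist, cols=2, rows=2))
```

A decoder can recover each count as 1 plus the number of layers in which the bin was flagged.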



Extraction times

• SIFT interest point detection and feature extraction make the biggest contribution to extraction time

• Global descriptors are as complex as interest point detection

• Local descriptor and coordinate encoding are very fast

15/01/2013 Quantitative evaluation of CDVS extraction and pairwise matching



Mobile Visual Search: Music CDs

[Figure: Query, Stream Music]

Visual Search: eReaders, Printers

[Diagram labels: Paper-copy, Snapshot, Initiate Visual Search, Send Compact Query, Mass Storage, Multimedia Content Retrieval From the cloud, Selective quality & content printing]

[Diagram labels: Content Augmentation, Augmentation 3D models and markers, Transmission of markers and 3D models, 2D / 3D Rendering, Augmentation Rendering, Composition of augmentations and image]

News Finder: Still Pictures - Visual Search


Applications and Use Cases from the Broadcaster's Point of View

• Logo Detection

• Interactive Fruition

Courtesy RAI

Automotive 3D Top View

[Diagram: an ECU and four cameras (Cam)]

Moving Pictures Visual Search

Courtesy Telecom Design



Inter Predicted Descriptors


• Desirable Properties:
  • An inter descriptor coded in a compact visual stream
  • Expressed in terms of one or more temporally neighboring descriptors.
  • The "inter" part of the term refers to the use of Inter Frame Prediction.
  • Designed to achieve higher compression rates and/or better precision-recall performance
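A minimal sketch of what expressing a descriptor in terms of a temporally neighboring one could look like: the current descriptor is coded as a reference index plus a residual against the closest descriptor of the previous frame. This is an illustration under assumed conventions (integer descriptors, nearest-neighbour reference choice, no entropy coding), not the scheme used in the presentation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_inter(current, references):
    """Code `current` as (index of best reference, element-wise residual)."""
    best = min(range(len(references)),
               key=lambda i: squared_distance(current, references[i]))
    residual = [c - r for c, r in zip(current, references[best])]
    return best, residual


def decode_inter(best, residual, references):
    """Rebuild the descriptor from the reference and the residual."""
    return [r + d for r, d in zip(references[best], residual)]


# Usage: descriptors from the previous frame act as references.
previous_frame = [[10, 20, 30, 40], [200, 180, 160, 140]]
current = [11, 20, 29, 40]
index, residual = encode_inter(current, previous_frame)
print(index, residual)
print(decode_inter(index, residual, previous_frame))
```

When consecutive frames are similar, the residual values stay small, which is what makes them cheaper to code than the descriptor itself.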

3D Mobile Devices Will Surpass 148 Million in 2015

• Advances in 3D technology are very fast

• Industry adoption opens new opportunities → 3D Visual Search

• From In-Stat studies:
  • ~30% of all handheld game consoles will be 3D by 2015.
  • 3D mobile devices will increase demand for image sensors by 130%.
  • In 2012, notebooks will be the first 3D-enabled mobile devices to reach 1 million units.
  • By 2014, 18% of all tablets will be 3D.

• Nintendo, Fuji, GoPro, Sony, ViewSonic, LG, Origin, Toshiba, Fujitsu, HP, ASUS, Lenovo, Dell, Alienware, HTC and Sharp are focusing on autostereoscopic mobile technologies


[Figure: Microsoft Kinect, Asus Xtion, Google 3D Warehouse, LG Optimus 3D P920, LG Optimus Pad, HTC EVO 3D, Sharp Aquos SH-12C, 3DS by Nintendo]

3D Object Recognition with Kinect


http://www.youtube.com/watch?v=eRW1zG_aONk

Courtesy: CV laboratory University of Bologna

SHOT: Unique Signatures of Histograms for Local Surface Description
