

ImageSpirit: Verbal Guided Image Parsing

Ming-Ming Cheng¹  Shuai Zheng¹  Wen-Yan Lin²  Vibhav Vineet²

Paul Sturgess²  Nigel Crook²  Niloy J. Mitra³  Philip H. S. Torr¹

¹University of Oxford  ²Oxford Brookes University  ³University College London

Abstract

Humans describe images in terms of nouns and adjectives while algorithms operate on images represented as sets of pixels. Bridging this gap between how we humans would like to access images versus their typical representation is the goal of image parsing. In this paper we propose treating nouns as object labels and adjectives as visual attributes. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive time) solution to this problem. Using the extracted labels as handles, our system empowers a user to automatically refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that can possibly be used to interact with new generation devices (e.g., smart phones, Google Glass, consoles, living room devices). We demonstrate our system on a large number of real-world images with varying complexity. To help understand the tradeoffs compared to traditional mouse based interactions, results are reported for both a large scale quantitative evaluation and a user study.

1. Introduction

Humans perceive image scenes in terms of nouns (e.g. chair) and adjectives (e.g. textured). In contrast, pixels form a different representation for computers. In order to bridge this gap between human perception and machine representation, we propose an efficient system that allows users to produce high quality image parsing results by verbally instructing the software¹. Such a scheme enables hands-free parsing of an image into pixel-wise object-attribute labels. The output can be directly consumed by new devices on which mouse interaction is difficult to accommodate, such as smart phones and Google Glass. Such an interaction modality not only enriches how we interact with images, but also provides an important capability for the many applications where non-touch interaction is crucial (e.g. in hospitals, doctors tend to access images without touch).

¹ The full paper has been accepted to ACM Transactions on Graphics (ACM TOG) [1]. Shuai Zheng is the joint first author.

Figure 1. User interface of the verbal-guided image parsing system (labeling thumbnail view).

We face three technical challenges in developing such a verbal guided² image parsing system: (i) words are concepts that are difficult to translate into pixel-wise meaning; (ii) the overall system must be controlled using only verbal cues; and (iii) the system must respond at interactive rates. To address the first challenge, we treat nouns as objects and adjectives as attributes. Using training data, we obtain a pixel-wise hypothesis for each object and attribute, e.g. Fig. 2(a). Technically, these are integrated through a novel multi-label factorial conditional random field (CRF) that jointly estimates both object and attribute segmentations, as seen in Fig. 2(b). This joint segmentation approach provides verbal handles to the underlying image. Furthermore, our modeling of the symbiotic relation between attributes and objects results in a higher quality parsing than prior object-only segmentation techniques [3, 2]. Our second technical challenge, verbal control, is also addressed by the joint object and attribute CRF. We use adjectives in the user commands as automatic attribute predictions, and the correlation between adjectives (attributes) and nouns (objects) lets the two mutually reinforce each other.

² We use the term verbal as shorthand to indicate word-based, i.e. nouns and adjectives. In this paper, we focus on semantic image parsing rather than natural language processing.
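To make the joint formulation concrete, a schematic factorial energy over per-pixel object labels o_i and binary attribute indicators a_i^k can be sketched as follows (the notation here is ours; the full paper [1] defines the exact potentials):

E(\mathbf{o},\mathbf{a}) = \sum_i \psi_i^{O}(o_i) + \sum_{i,k} \psi_{i,k}^{A}(a_i^{k}) + \sum_{i,k} \psi_{i,k}^{OA}(o_i, a_i^{k}) + \sum_{i<j} \psi_{ij}^{O}(o_i, o_j) + \sum_{i<j}\sum_k \psi_{ij}^{A}(a_i^{k}, a_j^{k})

Here the unary terms \psi^{O} and \psi^{A} come from the learned per-pixel hypotheses (Fig. 2(a)), \psi^{OA} encodes object-attribute correlations (e.g. bed and cotton), and the pairwise terms are dense Gaussian potentials of the form used in [2]. In this view, verbal commands act by re-weighting the relevant \psi^{OA} and unary terms.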



[Figure 2 panels: (a) Inputs: an image and learned weak hypotheses [4] (per-pixel cues such as cotton, glass, textured, wall, floor, cabinet, ...); (b) Automatic scene parsing results; (c) Natural language guided parsing.]

Figure 2. Given a scene image, our system generates multiple weak object/attribute cues (a). Using a novel multi-label CRF, we generate a per-pixel object and attribute labeling (b). Based on this output, additional verbal guidance: ‘Refine the cotton bed in center-middle’, ‘Refine the white bed in center-middle’, ‘Refine the glass picture’, ‘Correct the wooden white cabinet in top-right to window’ allows re-weighting of CRF terms to generate, at interactive rates, a high quality scene parsing result (c). Best viewed in color.

This allows the user to intuitively incorporate a high level understanding of the current image and quickly find discriminative visual attributes to improve scene parsing. For example, in Fig. 2(c), given verbal input such as ‘glass picture’, our algorithm can re-weight the CRF terms for both glass and picture to provide a good quality ‘picture’ segment boundary. Finally, we show that our joint CRF formulation can be factorized. This permits the use of efficient filtering based techniques [2] to perform inference at interactive speed.
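As a rough illustration of this filtering view of inference, the sketch below implements one mean-field update for a dense CRF in the spirit of [2]. It is a minimal sketch rather than the paper's implementation: a separable spatial Gaussian blur stands in for the bilateral (permutohedral lattice) filtering used in practice, and all function names are ours.

import numpy as np
from scipy.ndimage import gaussian_filter

def mean_field_step(unary, q, compat, sigma=3.0):
    # unary:  (H, W, L) negative log unary potentials
    # q:      (H, W, L) current per-pixel label marginals
    # compat: (L, L) label compatibility matrix
    # Message passing: Gaussian-filter each label channel of q.
    msg = np.stack([gaussian_filter(q[..., l], sigma)
                    for l in range(q.shape[-1])], axis=-1)
    # Compatibility transform mixes messages across labels.
    pairwise = msg @ compat
    # Local update followed by a per-pixel softmax.
    logits = -unary - pairwise
    logits -= logits.max(axis=-1, keepdims=True)
    expq = np.exp(logits)
    return expq / expq.sum(axis=-1, keepdims=True)

Because each update reduces to filtering, the factorial model can push both the object field and each attribute field through the same machinery, which is what makes interactive rates feasible.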

2. System Summary

After the user loads an image, our system automatically assigns an object class label (noun) and a set of attribute labels (adjectives) to each pixel. Using these results, our system selects a subset of objects and attributes that are most related to the image (see also Fig. 1). These coarse segments provide the bridge between image pixels and verbal commands. Given the various segments, the user can use his/her high level knowledge about the image to strengthen or weaken various object and attribute classes. For example, the initial results in Fig. 1 might prompt the user to realize that the bed is missing from the segmentation but the ‘cotton’ attribute is well defined. Thus, the simple command ‘Refine the cotton bed in center-middle’ will strengthen the association between cotton and bed, allowing a better segmentation of the bed. Note that the final object boundary does not follow the original attribute segments because verbal information is incorporated as soft cues which are interpreted by the CRF within the context of the other information.
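A command like the one above decomposes into attribute words, an object word, and a position word. The sketch below shows one hypothetical way such word spotting could be done (the vocabularies and function name are ours, not the system's actual grammar); the parsed triple would then boost the corresponding object-attribute correlation and unary terms inside the indicated region.

import re

# Illustrative vocabularies; the real system's object/attribute
# lists come from its training data.
OBJECTS = {"bed", "picture", "cabinet", "window", "wall", "floor"}
ATTRIBUTES = {"cotton", "glass", "wooden", "white", "glossy", "textured"}
POSITIONS = {"top-left", "top-right", "center-middle", "bottom-middle"}

def parse_refine_command(command):
    # Split into lowercase word tokens, keeping hyphenated positions.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)?", command.lower())
    attrs = [t for t in tokens if t in ATTRIBUTES]
    obj = next((t for t in tokens if t in OBJECTS), None)
    pos = next((t for t in tokens if t in POSITIONS), None)
    return attrs, obj, pos

print(parse_refine_command("Refine the cotton bed in center-middle"))
# -> (['cotton'], 'bed', 'center-middle')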

Once objects have been semantically segmented, it becomes straightforward to manipulate them using verb-based commands such as move, change, etc. As a demonstration of this concept, we encapsulate a series of rule-based image processing commands needed to execute an action, allowing hands-free image manipulation.
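For instance, a verb-to-operation table could encapsulate such rule-based edits, as in the hypothetical sketch below (the names and the crude pixel shift are ours; a real pipeline would also inpaint the vacated region):

import numpy as np

def lower_object(image, mask, dy=30):
    # Paste the masked object dy rows lower; the vacated region is
    # left untouched in this sketch rather than being inpainted.
    out = image.copy()
    ys, xs = np.nonzero(mask)
    out[np.clip(ys + dy, 0, image.shape[0] - 1), xs] = image[ys, xs]
    return out

# Verb -> rule-based operation, extendable with 'move', 'change', etc.
COMMANDS = {"lower": lower_object}

def execute(verb, image, mask, **kwargs):
    return COMMANDS[verb](image, mask, **kwargs)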

(a) Object deformation

Figure 3. Verbal guided image manipulation applications. The commands used are: ‘Refine the glossy monitor’ and ‘Make the wooden cabinet lower’.

3. Experiments

We first evaluate our system on an attribute-augmented NYU RGB image dataset. Splitting the data into training and test sets, we observe an improvement using our automated object/attribute segmentation approach over the state-of-the-art algorithms [3, 2]. Moreover, our system provides critical verbal handles for refinement and subsequent edits, leading to a significant (30%) improvement when verbal interaction is allowed. Empirically, we find that our interactive joint image parsing is better aligned with human perception than previous non-interactive, disjoint approaches. Further, we find that our indoor scene parsing system works well on images downloaded from the Internet using ‘bedroom’ as a search word.

References

[1] M.-M. Cheng, S. Zheng, W.-Y. Lin, V. Vineet, P. Sturgess, N. Crook, N. Mitra, and P. Torr. ImageSpirit: Verbal guided image parsing. ACM Transactions on Graphics (ACM TOG), 2014.

[2] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.

[3] L. Ladický, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.

[4] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2009.