

MOBILE OBJECT RECOGNITION USING MULTI-SENSOR INFORMATION FUSION IN URBAN ENVIRONMENTS

Katrin Amlacher, Patrick Luley, Gerald Fritz, Alexander Almer and Lucas Paletta

JOANNEUM RESEARCH Forschungsgesellschaft mbH, Institute of Digital Image Processing, Wastiangasse 6, 8010 Graz, Austria

ABSTRACT

Mobile vision services have recently been proposed for the support of urban nomadic users. A major issue for the performance of such a service, which involves indexing into a huge collection of reference images, is ambiguity in the visual information. We propose to exploit geo-information in association with visual features to restrict the search to a local context. In a mobile image retrieval task of urban object recognition, we determine object hypotheses from (i) mobile image based appearance and (ii) GPS based positioning, and investigate the performance of Bayesian information fusion with respect to a geo-referenced image database (TSG-20). Results from geo-referenced image capture in an urban scenario show a significant increase in recognition accuracy (> 10%) when the geo-contextual information is used, in contrast to omitting it.

Index Terms— Object recognition, multi-sensor information fusion, geo-services, urban scenarios

1. INTRODUCTION

Mobile object recognition and visual positioning have recently been proposed in terms of mobile vision services for the support of urban nomadic users. The performance of these services depends strongly on the uncertainty in the visual information. Covering large urban areas with naive approaches would require referring to a huge number of reference images and, consequently, to highly ambiguous features.

Previous work on mobile vision services primarily advanced the state of the art in computer vision methodology for application in urban scenarios. [1] provided a first innovative attempt at building identification, proposing local affine features for object matching. [2] introduced image retrieval methodology for indexing visually relevant information from the web for mobile location recognition. Subsequent attempts [3, 4, 5] advanced the methodology further towards highly robust building recognition; however, the contribution of geo-information to the performance of the vision service has not yet been investigated.

In this paper we propose to exploit contextual information from geo-services in order to cut the visual search space down to a subset of all object hypotheses available in the large urban area. Geo-information in association with visual features makes it possible to restrict the search to a local context. We extract object hypotheses in the local context from (i) mobile image based appearance and (ii) GPS based positioning, and investigate the performance of Bayesian information fusion with respect to a reference database (TSG-20). Results from experimental tracks and image capture in an urban scenario show a significant increase in recognition accuracy (> 10%; Sec. 4) when the geo-contextual information is used.

2. URBAN OBJECT DETECTION AND RECOGNITION

Urban image based recognition provides the technology for both object awareness and positioning. Outdoor geo-referencing still relies mainly on satellite based signals; problems arise when the user enters urban canyons, where the availability of satellite signals decreases dramatically due to various shadowing effects. Cell identification is not treated here because of its large positioning error. Alternative localization concepts, such as INS or markers that would need to be distributed massively across the urban area, are economically not affordable. In the following, we propose a system for mobile image retrieval. For image based urban object recognition, we briefly describe and make use of the methodology presented in [5].

Mobile recognition system In the first stage, the user captures an image of an object of interest in the field of view, and a software client initiates wireless data submission to the server. Assuming that a GPS receiver is available, the mobile device reads the current position estimate and sends it together with the image to the server.

In the second stage, the web service reads the message and analyzes the geo-referenced image. Based on the current quality of service and the given decision for object detection


and identification, the server prepares the associated annotation information from the content database and sends it back to the client for visualization.

Informative features for recognition Research on visual object detection has recently focused on the development of local interest operators [6, 7] and the integration of local information into object recognition. The SIFT (Scale-Invariant Feature Transform) descriptor [7] is widely used for its robust matching despite the viewpoint, illumination and scale changes in the object image captures, which is mandatory for mobile vision services. The Informative Features Approach (i-SIFT [5]) uses local density estimation to determine the posterior entropy, making local information content explicit with respect to object discrimination.

The information content of a posterior distribution is determined with respect to given task specific hypotheses. In contrast to costly global optimization, one expects that it is sufficiently accurate to estimate the local information content from the posterior distribution within a sample test point's local neighborhood in descriptor space. One is primarily interested in the information content of any sample local descriptor $d_i$ in descriptor space $D$, $d_i \in \mathbb{R}^{|D|}$, with respect to the task of object recognition, where $o_i$ denotes an object hypothesis from a given object set $S_O$. For this, one needs to estimate the entropy $H(O|d_i)$ of the posterior distribution $P(o_k|d_i)$, $k = 1 \dots \Omega$, where $\Omega$ is the number of instantiations of the object class variable $O$. The Shannon conditional entropy is $H(O|d_i) \equiv -\sum_k P(o_k|d_i) \log P(o_k|d_i)$. One approximates the posteriors at $d_i$ using only samples $g_j$ inside a Parzen window of a local neighborhood $\epsilon$, $||d_i - g_j|| \leq \epsilon$, $j = 1 \dots J$. Fig. 1 depicts discriminative descriptors in an entropy-coded representation of local SIFT features $d_i$. From discriminative descriptors one proceeds to entropy thresholded object representations, providing increasingly sparse representations with increasing recognition accuracy, in terms of storing only those descriptors that are relevant for classification purposes, i.e., those $d_i$ with $H(O|d_i) \leq H_\Theta$.
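As a concrete illustration, the Parzen-window entropy estimate described above can be sketched as follows. This is a minimal sketch, not the authors' implementation; the function name, the uniform window, and the maximum-entropy fallback for an empty neighborhood are our assumptions.

```python
import numpy as np

def local_posterior_entropy(d_i, sample_descriptors, sample_labels,
                            num_objects, epsilon):
    """Estimate H(O|d_i) from labeled samples inside an epsilon-ball
    around d_i.

    A uniform Parzen window is assumed: every sample g_j with
    ||d_i - g_j|| <= epsilon contributes one vote to its object label.
    """
    dists = np.linalg.norm(sample_descriptors - d_i, axis=1)
    neighbors = sample_labels[dists <= epsilon]
    if neighbors.size == 0:
        # No evidence in the neighborhood: assume maximum entropy.
        return float(np.log(num_objects))
    # Posterior P(o_k | d_i) approximated by normalized label counts.
    counts = np.bincount(neighbors, minlength=num_objects).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

Descriptors with an entropy at or below the threshold H_Θ would then be the ones kept in the sparse object representation.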

Attentive object detection and recognition Detection tasks require the rejection of images whenever they do not contain any objects of interest. For this we estimate the entropy of the posterior distribution, obtained from a normalized histogram of the object votes, and reject images with posterior entropies above a predefined threshold. The proposed recognition process is characterized by an entropy driven selection of image regions for classification, and a voting operation.
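The rejection rule can be made concrete in a few lines; this is a hypothetical sketch in which `votes` stands for the per-object counts accumulated by the voting stage and the entropy threshold is a free parameter.

```python
import numpy as np

def is_background(votes, entropy_threshold):
    """Reject a query image as background when the entropy of the
    normalized object-vote histogram exceeds a predefined threshold."""
    votes = np.asarray(votes, dtype=float)
    if votes.sum() == 0:
        return True  # no votes at all: nothing recognizable
    p = votes / votes.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return bool(entropy > entropy_threshold)
```

A near-uniform vote histogram (high entropy, no object dominates) is rejected, while a peaked histogram (low entropy) passes on to recognition.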

3. GEO-INDEXED OBJECT RECOGNITION

Geo-services provide access to information about a local context that is stored in a digital city map. Map information in terms of map features is indexed via a current estimate of the user position that can be derived from satellite based signals

Fig. 1. Concept for recognition from informative local descriptors. (I) First, standard SIFT descriptors are extracted within the test image. (II) Decision making analyzes the descriptor voting for the MAP decision. (III) In i-SIFT attentive processing, a decision tree estimates the SIFT specific entropy, and only informative descriptors are attended for decision making (II).

(GPS), dead-reckoning devices, and so on. The map features can provide geo-contextual information in terms of the location of points of interest, objects of traffic infrastructure, road structure, and shop information.

Fig. 2. Extraction of object hypotheses from geo-services. (Left to right) Within a local spatial neighborhood (geo-focus), distances to the points of interest are determined, weighted by an exponential function and normalised to result in a distribution over object hypotheses.

Geo-services for object hypotheses In previous work [8] we already emphasised the general relevance of geo-services for mobile vision services, such as mobile object recognition. However, the contribution of positioning to recognition was treated merely on a conceptual level, and the contribution of the geo-services to the performance of geo-indexed object recognition was not quantitatively assessed. Fig. 2 depicts a novel methodology to introduce geo-service based object hypotheses. (i) A geo-focus is first defined with respect to a radius of expected position accuracy in the city map. (ii) Distances between the user position and the points of interest (e.g., tourist sight buildings) within the geo-focus are estimated. (iii) The distances are then weighted according to a normal density function,
$p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$.
By investigating different values for $\sigma$, assuming $(\Sigma_{ij}) = \delta_{ij}\sigma_j^2$, we can tune the impact of distances on the weighting of object hypotheses. (iv) Finally, the weighted distances are normalised and determine the confidence values of the individual object hypotheses.
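Steps (i)-(iv) can be sketched as follows, assuming an isotropic Σ = σ²I and planar metric map coordinates; the function name and the uniform fallback when no POI lies inside the geo-focus are our assumptions, not part of the paper.

```python
import numpy as np

def geo_hypotheses(user_pos, poi_positions, radius, sigma):
    """Distribution over object hypotheses from geo-services.

    (i)   keep POIs inside the geo-focus (radius of expected accuracy),
    (ii)  compute distances from the user position to each POI,
    (iii) weight them with an isotropic Gaussian (Sigma = sigma^2 * I),
    (iv)  normalize the weights to a probability distribution.
    """
    d = np.linalg.norm(np.asarray(poi_positions, dtype=float)
                       - np.asarray(user_pos, dtype=float), axis=1)
    w = np.where(d <= radius, np.exp(-0.5 * (d / sigma) ** 2), 0.0)
    total = w.sum()
    if total == 0:
        # Nothing inside the geo-focus: fall back to a uniform prior.
        return np.full(len(d), 1.0 / len(d))
    return w / total
```

Varying `sigma` reproduces the trade-off discussed in Sec. 4: a small σ concentrates confidence on nearby POIs, while a large σ approaches a uniform distribution over the hypotheses in focus.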

Bayesian information fusion Distributions over object hypotheses from vision (Sec. 2) and geo-services are then integrated via Bayesian decision fusion. Although an analytic investigation of both visual and position signal based information would likely reveal statistical dependency between the corresponding random variables, we assume it is sufficient here to pursue a naive Bayes approach for the integration of the hypotheses (in order to get a rapid estimate of the contribution of geo-services to mobile vision services) by

$P(o_k \mid y_{i,v}, x_{i,g}) = p(o_k \mid y_{i,v}) \, p(o_k \mid x_{i,g})$, (1)

where the indices $v$ and $g$ mark information from the image ($y$) and positioning ($x$), respectively.
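Eq. (1) can be implemented, under the naive Bayes assumption, as an element-wise product of the two posteriors. Renormalizing the product so the fused values again sum to one is our addition for illustration; Eq. (1) states only the unnormalized product.

```python
import numpy as np

def fuse(p_vision, p_geo):
    """Naive-Bayes fusion of vision- and geo-based object posteriors:
    multiply element-wise, then renormalize to a distribution."""
    fused = np.asarray(p_vision, dtype=float) * np.asarray(p_geo, dtype=float)
    total = fused.sum()
    if total == 0:
        # Contradictory inputs with no overlap: fall back to uniform.
        return np.full(len(fused), 1.0 / len(fused))
    return fused / total
```

For example, when vision is ambiguous between two objects, a geo-prior concentrated on one of them resolves the tie after fusion.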

4. EXPERIMENTS

The overall goal of the experiments was to determine and quantify the contribution of geo-services to object recognition in urban environments. The performance in detection and recognition of objects of interest in the query images, with respect to a given reference image database and a given methodology (TSG-20 [5]), was compared to identical processing that additionally uses geo-information and information fusion for the integration of object hypotheses (Sec. 3).

User scenario In the application scenario, we imagine a tourist exploring a foreign city for tourist sights (e.g., buildings). He is equipped with a mobile device with built-in GPS and can send image based queries to a server using UMTS or WLAN based connectivity. The server performs geo-indexed object recognition and is expected to respond with tourist relevant annotation if a point of interest is identified.

Fig. 3. Airborne image of the test site with user geo-track (blue), query image captures (red), and points of interest (POIs, yellow). Query image and GPS based position estimate are sent to the server, which responds with annotation.

Hardware and Image Databases In the experiments we used an ultra-mobile PC (Sony Vaio UMPC VGN-UX1XN) with 1.3 MPixel image captures. Reference imagery [5] of the building objects of the TSG-20 database1, at 640 × 480 resolution, was captured with a camera-equipped mobile phone (Nokia 6230) and contains changes in 3D viewpoint, partial occlusions, scale changes from varying exposure distances, and various illumination changes. For each object we selected 2 images, taken at a viewpoint change of ≈ ±30° and at similar distance to the object, for training to determine the i-SIFT based object representation (Sec. 2). 2 additional views were taken for test purposes (40 test images in total).

Query Image Databases For the evaluation of background detection we used a dataset of 120 query images, containing only images of buildings and street sides without TSG-20 objects. Another dataset was acquired with the UMPC; it consists of seven images per TSG-20 object from different viewpoints, captured on different days under different weather conditions.

Results on object detection and recognition In the first evaluation stage, each individual image query was evaluated for vision based object detection and recognition (Sec. 2), then regarding the extraction of geo-service based object hypotheses (Sec. 3), and finally with respect to Bayesian decision fusion on the individual probability distributions (Sec. 3).

Detection is a pre-processing step to recognition that prevents geo-services from supporting confidences for objects that are not in the query image. Preliminary experiments resulted in low performance (TP rate 89.2%, FP rate 20.1%); however, geo-indexed object recognition then finds more correct hypotheses.

Fig. 4 depicts sample query images together with the corresponding distributions over object hypotheses from vision, from geo-services, and after information fusion. The results demonstrate significant increases in the confidences of the correct object hypotheses. The evaluation of the complete database of image queries about TSG-20 objects (Fig. 5) shows a decisive advantage of taking geo-service based information into account, in contrast to purely vision based object recognition. While vision based recognition is on a low level (≈ 84%, probably due to low sensor quality), an exponentially weighted spatial enlargement of the scope on object hypotheses with geo-services increased the recognition accuracy up to ≈ 97%. With increasing σ, an increasing number of object hypotheses is taken into information fusion, and the performance finally drops back to the vision based recognition performance (uniform distribution over the geo-service based object hypotheses).

Choice of image database The reason to use the TSG-20 [5] database was to make use of geo-referenced training images. Other publicly available building image databases, such as ZuBuD [1], do not provide geo-coordinates for our framework. See [5] for a detailed performance comparison between vision based approaches.

1http://dib.joanneum.at/cape/TSG-20/


(a)

(b)

(c)

Fig. 4. Integration of object hypotheses from (a) vision and (b) geo-services into a (c) fused distribution. Examples with sample input images demonstrate clear increases in the confidence of the correct object hypothesis and therefore a significant improvement in the performance of the mobile vision service (Fig. 5).

5. CONCLUSIONS

In this work we investigated the contribution of geo-contextual information to the improvement of performance in visual object detection and recognition. We argued that geo-information provides a focus on the local object context that enables a meaningful selection of expected object hypotheses, and we showed that the performance of urban object recognition can be significantly improved. This work is relevant for a multitude of mobile vision services and for geo-indexed image retrieval, enabling higher accuracy and more robust mobile applications.

Future work will investigate the application of geo-services in association with larger geo-referenced urban image databases, e.g., using the Tele Atlas image database (in preparation), in order to examine the described effects in more detail.

6. ACKNOWLEDGMENTS

This work is supported in part by the European Commission funded project MOBVIS under grant number FP6-511051 and by the FWF Austrian National Research Network on Cognitive Vision under sub-project S9104-N04.

Fig. 5. Experimental results on the complete test set with geo-referenced query imagery for UMPC based camera sensors: integrated recognition accuracy (OR+Geo) with vision (OR) and geo-services (Geo) based information under variation of the distance parameter σ (Sec. 3).

7. REFERENCES

[1] H. Shao, T. Svoboda, and L. van Gool, “HPAT indexing for fast object/scene recognition based on local appearance,” in Proc. International Conference on Image and Video Retrieval, CIVR 2003, Chicago, IL, 2003, pp. 71–80.

[2] T. Yeh, K. Tollmar, and T. Darrell, “Searching the web with mobile images for location recognition,” in Proc. IEEE Computer Vision and Pattern Recognition, CVPR 2004, Washington, DC, 2004, pp. 76–81.

[3] R. Maree, P. Geurts, J. Piater, and L. Wehenkel, “Decision trees and random subwindows for object recognition,” in ICML Workshop on Machine Learning Techniques for Processing Multimedia Content (MLMM 2005), 2005.

[4] S. Obdrzalek and J. Matas, “Sub-linear indexing for large scale object recognition,” in Proc. British Machine Vision Conference, 2005, vol. 1, pp. 1–10.

[5] G. Fritz, C. Seifert, and L. Paletta, “A mobile vision system for urban object detection with informative local descriptors,” in Proc. IEEE 4th International Conference on Computer Vision Systems, ICVS, New York, NY, January 2006.

[6] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” in Proc. Computer Vision and Pattern Recognition, CVPR 2003, Madison, WI, 2003.

[7] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[8] P. Luley, L. Paletta, A. Almer, M. Schardt, and J. Ringert, “Geo-services and computer vision for object awareness in mobile system applications,” in Proc. 3rd Symposium on LBS and Cartography, Springer, 2005, pp. 61–64.