
CollabAR: Edge-assisted Collaborative Image Recognition for Mobile Augmented Reality

Zida Liu, Duke University, [email protected]
Guohao Lan, Duke University, [email protected]
Jovan Stojkovic, University of Belgrade, [email protected]
Yunfan Zhang, Duke University, [email protected]
Carlee Joe-Wong, Carnegie Mellon University, [email protected]
Maria Gorlatova, Duke University, [email protected]

ABSTRACT
Mobile Augmented Reality (AR), which overlays digital content on the real-world scenes surrounding a user, is bringing immersive interactive experiences where the real and virtual worlds are tightly coupled. To enable seamless and precise AR experiences, an image recognition system that can accurately recognize the object in the camera view with low system latency is required. However, due to the pervasiveness and severity of image distortions, an effective and robust image recognition solution for mobile AR is still elusive. In this paper, we present CollabAR, an edge-assisted system that provides distortion-tolerant image recognition for mobile AR with imperceptible system latency. CollabAR incorporates both distortion-tolerant and collaborative image recognition modules in its design. The former enables distortion-adaptive image recognition to improve the robustness against image distortions, while the latter exploits the 'spatial-temporal' correlation among mobile AR users to improve recognition accuracy. We implement CollabAR on four different commodity devices, and evaluate its performance on two multi-view image datasets. Our evaluation demonstrates that CollabAR achieves over 96% recognition accuracy for images with severe distortions, while reducing the end-to-end system latency to as low as 17.8ms for commodity mobile devices.

CCS CONCEPTS
• Computing methodologies → Distributed algorithms; Mixed / augmented reality.

KEYWORDS
Edge computing, collaborative augmented reality, image recognition.

1 INTRODUCTION
Mobile augmented reality (AR), which overlays digital content with the real world around a user, has recently jumped from the pages of science fiction novels into the hands of consumers. To enable a seamless contextual AR experience, an effective image recognition system that can accurately recognize objects in the camera view of the mobile device with imperceptible latency is required [1, 2]. While this may come across as a solved problem given the recent advancements in deep neural networks (DNNs) [3, 4] and mobile offloading [1, 5, 6], there are three practical aspects that have been largely overlooked in existing solutions.

First, in real-world mobile AR scenarios, a large proportion of images taken by the smartphone or the head-mounted AR device contains distortions. For instance, images frequently contain motion blur caused by the motion of the user [1, 7]. Gaussian blur appears when the camera is de-focusing, or is used in an underwater AR scenario [8]. Gaussian white noise is also inevitable in low illumination conditions [9]. Indeed, using two different commodity mobile AR devices, our measurement study (Section 3.3) indicates that over 70% of the images can be corrupted by distortions in different practical scenarios. Given the high distortion proportion, simply filtering out the distorted images, as has been suggested in prior work [1, 2, 10], can hardly solve the problem.

Second, for image recognition, applying DNNs trained on large-scale datasets, e.g., ImageNet [11] and Caltech-256 [12], directly to mobile AR images that contain severe multiple distortions is difficult, and often results in dramatic performance degradation [9, 13, 14]. This deficiency results from the domain adaptation problem, where distorted images fail to share the same feature distribution with the clear training images [13, 15, 16]. Indeed, when testing distorted images with a pre-trained MobileNetV2 [4] network on the Caltech-256 dataset, we observe that even a small amount of distortion in the images can significantly affect the recognition accuracy (Section 3.2). Although deblurring methods and spatial filters [17] can be used to reduce distortions, they also remove the fine-scaled image details that are useful for image recognition.

Lastly, to achieve imperceptible recognition latency for resource-constrained mobile AR devices, a number of works have explored computation offloading [1, 5, 18] and approximate computation reuse [6, 19] to reduce the system overhead. Although state-of-the-art solutions have reduced the end-to-end system latency to below 33ms [5, 18], such performance is achieved in distortion-free or distortion-modest environments. The challenge of mitigating the trade-off between recognition accuracy and system latency for mobile AR with severe multiple distortions has not been addressed.

To fill this gap, we present CollabAR, an edge-assisted Collaborative image recognition framework for mobile Augmented Reality. Due to recent advances in AR development platforms, AR applications have become accessible on a wide range of mobile devices. We have seen many dedicated AR devices, such as Microsoft HoloLens [20] and Magic Leap [21], as well as Google ARCore [22] and Apple ARKit [23] SDKs that work on a wide range of mobile phones and tablets. The pervasive deployment of mobile AR will offer numerous opportunities for multi-user collaboration [10, 24]. Moreover, mobile users that are in close proximity usually exhibit a strong correlation in both device usage and spatial-temporal context [25]. The requested image recognition tasks are usually correlated temporally or spatially. Thus, unlike conventional image recognition systems where individual users independently complete their tasks to recognize the same object [1, 5, 18], CollabAR embraces the concept of collaborative image recognition in its design, and exploits the temporally and spatially correlated images captured by the users to improve the image recognition accuracy.

Bringing this high-level concept into a holistic system requires overcoming several challenges. First, we need to address the domain adaptation problem [15] caused by the image distortions. As DNNs can adapt to a particular distortion type, but not to multiple types at the same time [9, 15], we need to adaptively select, at runtime, a DNN that has been fine-tuned to recognize a specific type of distorted image. We introduce the distortion-tolerant image recognizer, which incorporates an image distortion classifier and a set of dedicated recognition experts. CollabAR employs the image distortion classifier to identify the most significant distortion contained in the image, and triggers one of the dedicated recognition experts to handle the distorted image. This distortion-adaptive selection of a recognizer ensures a lower feature domain mismatch between the image and the DNN, and better robustness against distortions.

Second, to enable collaborative image recognition, we need to identify a set of spatially and temporally correlated images that contain the same object. Prior works rely on image feature vectors [19] or feature maps [6] to identify temporally correlated images (e.g., continuous frames in a video). However, image features are vulnerable to distortions, which results in high lookup errors in practical AR scenarios. CollabAR introduces the anchor-based pose estimation and the spatial-temporal image lookup modules to efficiently identify correlated images.

Third, as the correlated images suffer from various distortions with different severity levels, they lead to heterogeneity in the recognition confidence and accuracy. Conventional multi-view image recognition systems [26, 27] lack a reliable metric to differentiate this heterogeneity, and are not able to optimally aggregate the multi-view results [28]. In this work, we propose Auxiliary-assisted Multi-view Ensemble Learning (AMEL) to dynamically aggregate the heterogeneous recognition results of the correlated images.

Lastly, with all the system building blocks in mind, we aim to provide imperceptible system latency (i.e., below 33ms to ensure 30fps continuous recognition) without sacrificing accuracy. The targeted low system latency leaves more time and space for the many other computation-intensive tasks (e.g., virtual object rendering) that run on mobile AR devices. We explore several design options (i.e., the selection of DNN models, and what to offload versus what to process locally) and conduct a comprehensive system profiling to guide the final optimized implementation. We propose an edge-assisted system framework to mitigate the trade-off between recognition accuracy and system latency.

In this paper, we present CollabAR, a collaborative image recognition system for mobile AR. Our main contributions are:

• We propose the distortion-tolerant image recognizer to resolve the domain adaptation problem caused by image distortions. Compared to conventional methods, the proposed solution improves the recognition accuracy by 20% to 60% for different settings.

• To further boost the recognition accuracy, CollabAR embraces collaboration opportunities among mobile AR devices, and exploits the spatial-temporal correlation among images for multi-view image recognition. CollabAR introduces the anchor-based image lookup module to effectively identify the spatially and temporally correlated images, and the auxiliary-assisted multi-view ensemble learning framework to dynamically aggregate the heterogeneous multi-view results. Compared to single-view-based recognition, the proposed method can improve the recognition accuracy by 16% to 20% in scenarios with severe multiple distortions.

• We implement CollabAR on four different commodity devices, and evaluate its performance on two multi-view image datasets, one public and one collected by ourselves. The evaluation demonstrates that CollabAR achieves over 96% recognition accuracy for images that contain severe multiple distortions, while reducing the end-to-end system latency to as low as 17.8ms.

The research artifacts, including our own collected multi-view multiple-distortion image dataset (Section 7), and the source code for both the distortion-tolerant image recognizer and the auxiliary-assisted multi-view ensemble learning framework, are available at https://github.com/CollabAR-Source/.

2 RELATED WORK
As a holistic image recognition system for multi-user mobile AR, CollabAR builds on a body of prior work in mobile AR, distortion-tolerant image recognition, and distributed image recognition.

Image Recognition for Mobile AR. Before overlaying the rendered virtual objects on the view of a user, mobile AR systems leverage image recognition to identify the object appearing in the view. For this purpose, image retrieval [1, 18], image localization [2], and DNN-based image recognition methods [5] are widely used in prior work. Image retrieval-based solutions, such as OverLay [1] and Jaguar [18], exploit salient feature descriptors to match the image with the annotated images stored in the database. However, salient descriptors are vulnerable to image distortions; a large proportion of images cannot be correctly recognized and is simply filtered out [1]. Image localization-based methods require the server to maintain an annotated image dataset of the service area [29], which is computationally heavy and involves a large volume of data [2]. Lastly, DNN-based solutions [5] perform well with pristine images, but even a small amount of distortion can lead to a dramatic performance drop [9, 13]. Instead of filtering out the distorted images, CollabAR incorporates the distortion-tolerant image recognizer to enable image recognition for mobile AR systems.

Distortion-tolerant Image Recognition: Recent efforts have been made by the computer vision community in resolving the negative impacts of image distortion on DNNs [9, 13, 14, 16]. Dodge et al. [9] propose a mixture-of-experts-based model, in which recognition experts are fine-tuned to handle a specific image distortion, and their outputs are aggregated using a gating network. In [13], Ghosh et al. propose a similar master-slave CNN architecture for distorted image recognition. Borkar et al. [14] propose an objective metric to identify the most distortion-susceptible convolutional filters in the CNNs. Correction units are added in the network to correct the outputs of the identified filters. However, prior works mainly focus on the single-view scenario, and fail when the image contains multiple distortions or the distortion level is high [9, 13, 14]. In contrast, CollabAR can dynamically aggregate the spatially and temporally correlated images to improve recognition accuracy in multiple-distortion scenarios.

Distributed Image Recognition: Using distributed information collected from multi-camera setups for image and object recognition is a widely studied topic. For instance, in DDNNs [27], distributed deep neural networks are deployed over the cloud, the edge, and end devices for joint object recognition. By leveraging the geographical diversity of a multi-camera network, DDNNs improves the object recognition accuracy. To improve the accuracy of distortion-tolerant pill image recognition, MobileDeepPill [30] relies on a multi-CNN model to collectively capture the shape, color, and imprint characteristics of pills. Kestrel [31] incorporates a video analytics system that detects and tracks vehicles across heterogeneous multi-camera networks. These existing systems only perform well in distortion-free [27] or distortion-modest environments [30]. They are not able to recognize the image if it contains severe or multiple distortions. Moreover, many of them suffer from high end-to-end latency (i.e., 200ms) [31].

3 BACKGROUND AND CHALLENGES
3.1 Image Distortions
In this paper, we consider three different types of image distortion that commonly appear in mobile AR scenarios: motion blur, Gaussian blur, and Gaussian noise.

Motion blur (MBL): images taken by the smartphone or the head-mounted AR set camera frequently contain motion blur caused by the motion of the user [1]. Given a pristine image F(x, y) with pixel coordinates x and y, the corresponding motion-blurred image, F_MBL(x, y), can be modeled as F_MBL(x, y) = F(x, y) ∗ H(l, a) [16], where H(l, a) is the non-uniform motion blur kernel that characterizes the length (l) and orientation (a) of the motion when the camera shutter is open, and ∗ is the convolution operator. The distortion level of F_MBL(x, y) is determined by the length of the blur kernel l, while the angle a defines the direction in which the motion blur affects the image. Figure 1(a) shows the pristine image (l = 0) and four motion-blurred images with different kernel lengths (the angle a is randomly selected). Motion blur dramatically reduces the sharpness of the image and distorts its spatial features.

Gaussian blur (GBL): Gaussian blur appears when the camera is de-focusing or the image is taken underwater (e.g., in underwater AR [8]) or in a foggy environment [17]. For a pristine image F(x, y), the corresponding Gaussian-blurred image, F_GBL(x, y), can be modeled as F_GBL(x, y) = F(x, y) ∗ G_σ(x, y) [17], where G_σ(x, y) is the two-dimensional circularly symmetric centralized Gaussian kernel. The distortion level of F_GBL(x, y) is determined by the aperture size k of the circular Gaussian kernel. Figure 1(b) shows an example of images containing different levels of Gaussian blur. Note that k must be odd to maintain the circularly symmetric property.

Gaussian noise (GN): noise is also inevitable in images. Possible sources of noise include poor illumination conditions [9], digital zooming, and the use of a low quality image sensor [32]. In most cases, noise can be modeled as zero-mean additive Gaussian noise [32]. Given a pristine image F(x, y), the corresponding noisy image is expressed as F_GN(x, y) = F(x, y) + N(x, y), where N(x, y) is the additive noise which follows a zero-mean Gaussian distribution. The distortion level of F_GN(x, y) is decided by the standard deviation σ_GN. As shown in Figure 1(c), the Gaussian noise adds random variation in both the brightness and the color of the image.

Figure 1: Examples of images containing different types of distortions at different severity levels: (a) motion blur with blur kernel length l, (b) Gaussian blur with aperture size k, and (c) Gaussian noise with variance σ²_GN.

Figure 2: Impacts of image distortion on the recognition performance of MobileNetV2 with the Caltech-256 dataset.
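For concreteness, the three distortion models above can be synthesized as in the following minimal NumPy/OpenCV sketch; the exact kernel construction details and function names are our own assumptions, not taken from the paper.

```python
# Minimal sketch of the three distortion models in Section 3.1.
# Assumptions: OpenCV/NumPy conventions; kernel construction details are ours.
import cv2
import numpy as np

def motion_blur(img, l, a):
    """F_MBL = F * H(l, a): convolve with a length-l line kernel at angle a (degrees)."""
    if l <= 1:
        return img.copy()
    kernel = np.zeros((l, l), dtype=np.float32)
    kernel[l // 2, :] = 1.0 / l                           # horizontal line of length l
    rot = cv2.getRotationMatrix2D((l / 2 - 0.5, l / 2 - 0.5), a, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (l, l))          # rotate to orientation a
    kernel /= max(kernel.sum(), 1e-6)
    return cv2.filter2D(img, -1, kernel)

def gaussian_blur(img, k):
    """F_GBL = F * G_sigma: circularly symmetric Gaussian kernel, odd aperture size k."""
    return cv2.GaussianBlur(img, (k, k), 0) if k > 1 else img.copy()

def gaussian_noise(img, var):
    """F_GN = F + N: zero-mean additive Gaussian noise with variance `var` (pixels in [0, 1])."""
    noisy = img.astype(np.float32) / 255.0 + np.random.normal(0.0, np.sqrt(var), img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)
```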

3.2 Impact of Distortion on Image Recognition
Deep neural networks (DNNs), such as AlexNet [3] and MobileNetV2 [4], have shown state-of-the-art performance in image recognition. However, as DNN models are usually trained and tested on images that are pristine, e.g., ImageNet [11], directly applying them to mobile AR images that contain severe multiple distortions often leads to dramatic performance degradation [14, 17, 33]. This deficiency results from the domain adaptation problem, where real-world distorted images fail to share the same feature distribution with the pristine training images [13, 15, 16].

As a case study to quantify this impact, we investigate the image recognition accuracy of MobileNetV2 on distorted images. We pre-train the MobileNetV2 using the images from the Caltech-256 [12] dataset, and synthesize distorted images using the methods introduced in Section 3.1. For each type of distortion, we adjust the distortion level by configuring the corresponding parameter. The synthesized images are used as the test dataset. The results are shown in Figure 2. We observe that even a small amount of distortion in the images can significantly affect the recognition accuracy. For instance, the accuracy of MobileNetV2 drops from 73.6% to 33.0% with a blur kernel length of 10, even though distortion at this level does not hinder the human eye's ability to recognize the object [34] (e.g., see the example in Figure 1(a)).
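A hedged sketch of how such a sweep could be scripted is shown below; the fine-tuned Caltech-256 model, the test data, and the reported numbers are placeholders, and motion_blur() refers to the Section 3.1 sketch above.

```python
# Sketch of the Section 3.2 case study: accuracy of a pre-trained MobileNetV2
# versus motion-blur kernel length. `model`, `images`, and `labels` are
# hypothetical placeholders (a classifier fine-tuned on Caltech-256).
import numpy as np
import tensorflow as tf

def evaluate_under_motion_blur(model, images, labels, kernel_lengths=(0, 10, 20, 30, 40)):
    results = {}
    for l in kernel_lengths:
        # Distort every test image with kernel length l; l = 0 keeps it pristine.
        distorted = np.stack([motion_blur(img, l, np.random.uniform(0, 180))
                              for img in images]) if l > 0 else images
        x = tf.keras.applications.mobilenet_v2.preprocess_input(distorted.astype(np.float32))
        preds = model.predict(x, verbose=0).argmax(axis=1)
        results[l] = float((preds == labels).mean())
    return results  # e.g., accuracy per kernel length, as plotted in Figure 2
```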


Table 1: Measurements of image distortions in real-world mobile AR scenarios. A large proportion of images suffers from severe distortions.

Distortion                        Setting                     Nokia 7.1           Magic Leap One
Gaussian noise (σ²_GN ≥ 0.003)    Dark room (7 lux)           617/1558 = 39.6%    infeasible
                                  Camera zoom-in (509 lux)    1755/2185 = 97.2%   infeasible
Motion blur (l ≥ 5)               Corridor (178 lux)          2681/3452 = 77.6%   3760/3776 = 99.6%
                                  Sunny outdoor (9873 lux)    16/2687 = 0.5%      957/2766 = 34.6%
Gaussian blur (k ≥ 5)             Foggy                       762/935 = 81.6%     infeasible
                                  Underwater                  1250/1524 = 82.0%   infeasible

3.3 Image Distortion for Real-world Mobile AR
Below, we quantify the extent of image distortion in real-world mobile AR scenarios. We use two commodity AR devices, the Nokia 7.1 smartphone and the Magic Leap One head-mounted AR set, to record videos (at 30 frames per second) in different environments, and measure the proportion of recorded frames that suffer from severe image distortions.

First, to quantify Gaussian noise, we use the Nokia 7.1 to record videos in environments with poor illumination conditions, i.e., an indoor dark room (with 7 lux). In addition, to study Gaussian noise resulting from digital zooming, we conduct an experiment in an indoor environment at daytime (with 509 lux), and turn on the built-in digital zoom function of the Nokia 7.1 to record videos of objects that are three meters away. We cannot record video using the Magic Leap One as its image and video recording service is unavailable when the light condition is low, and it does not support image zooming. Second, to quantify motion blur, we ask one user to hold the Nokia 7.1 in her hand and ask another user to wear the Magic Leap One. They record videos using the device front-camera in both an indoor corridor (with 178 lux) and a sunny outdoor (with 9873 lux) environment while walking at a normal pace. Lastly, to quantify Gaussian blur, we put the Nokia 7.1 in a waterproof case, and record videos in both underwater and foggy environments.¹ The collected images comprise the MobileDistortion dataset, which is available at https://github.com/CollabAR-Source/Distortion.

To determine if an image suffers from severe distortion, we use distortion levels σ²_GN = 0.003, l = 5, and k = 5 as the thresholds for the three types of distortion, respectively. The thresholds correspond to the distortion levels that lead to a 10% accuracy drop in the experiments shown in Figure 2. The results are summarized in Table 1. The illumination of the environment is measured using the smartphone light sensor that is adjacent to the smartphone camera. First, we apply the image spatial quality estimator [35] to quantify Gaussian noise. In poor illumination conditions, over 39% of the images recorded by the Nokia 7.1 suffer from severe Gaussian noise. Moreover, when using digital zooming, 97% of the images suffer from severe Gaussian noise due to the additive noise introduced in digital image processing. Second, we apply the partial-blur image detector [36] to quantify motion blur. As shown in Table 1, for the Magic Leap One, 99.6% and 34.6% of the recorded images suffer from severe motion blur in the corridor and the outdoor environments, respectively. For the Nokia 7.1, 77.6% and 0.5% of the images contain severe motion blur in these two environments. We can notice that there is less motion blur in the sunny outdoor environment (9873 lux) than in the corridor (178 lux): as the camera exposure time is shorter with higher illumination, and the impact of motion on the image is proportional to the exposure time, there is less motion blur in high illumination environments. Lastly, we apply the local binary pattern-based blur detector [37] to detect Gaussian blur. The results show that 82% and 81.6% of the images suffer from severe Gaussian blur in the underwater and foggy environments, respectively. Our measurements highlight the pervasiveness and severity of image distortions in real-world scenarios. Given the high distortion proportion, simply filtering out the distorted images, as has been suggested in prior work [1, 2, 10], can hardly solve the distortion problem for practical AR applications.

¹ We do not conduct this experiment with a Magic Leap One as we are unaware of the availability of waterproof cases for it.

Figure 3: System architecture of CollabAR.

4 SYSTEM OVERVIEW
The architecture of CollabAR is shown in Figure 3. It incorporates three major components: (1) the distortion-tolerant image recognizer, (2) the correlated image lookup module, and (3) the auxiliary-assisted multi-view ensembler. They are deployed on the edge server to mitigate the trade-off between recognition accuracy and end-to-end system latency (detailed system measurements are given in Section 8.4). In addition to the server, the client runs an anchor-based pose estimation module which leverages the Cloud anchors provided by Google ARCore [22] to track the location and orientation of the mobile device.

When the user takes an image of the object, the mobile client sends the image along with the IDs of the anchors in the camera view to the server. Having received the data, the server marks the image with a unique image ID and a timestamp. The image is used as the input for the distortion-tolerant image recognizer, while the anchor IDs and the timestamp are used as references for spatial-temporal image lookup. In CollabAR, anchor IDs and timestamps are stored in the anchor-time cache in the format of <imageID, {anchorIDs}, timestamp>, while the recognition results of the images are stored in the results cache in the format of <imageID, inferenceResult> for future reuse and aggregation. In our design, we do not share information among the clients. Instead, clients communicate directly with the edge server to ensure high recognition accuracy and low end-to-end latency.
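A minimal sketch of the two server-side caches, using the record fields named above; the Python types and container choices are our own assumptions.

```python
# Sketch of the anchor-time cache and results cache record formats.
# Only the record fields come from the paper; the types are ours.
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class AnchorTimeEntry:
    image_id: str
    anchor_ids: Set[str]       # cloud anchors visible in the camera view
    timestamp: float           # server-side arrival time (seconds)

@dataclass
class ResultEntry:
    image_id: str
    inference_result: List[float]   # per-class probability vector from the expert

anchor_time_cache: Dict[str, AnchorTimeEntry] = {}
results_cache: Dict[str, ResultEntry] = {}

def store(image_id: str, anchor_ids: Set[str], timestamp: float, probs: List[float]) -> None:
    anchor_time_cache[image_id] = AnchorTimeEntry(image_id, anchor_ids, timestamp)
    results_cache[image_id] = ResultEntry(image_id, probs)
```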

Distortion-tolerant image recognizer: the distortion-tolerant image recognizer incorporates an image distortion classifier (Section 5.1) and four recognition experts (Section 5.2) to resolve the domain adaptation problem [15] caused by the image distortions. As DNNs can adapt to a particular distortion, but not to multiple distortions at the same time [9, 15], we need to identify the most significant distortion in the image, and adaptively select a DNN that is dedicated to the detected distortion. As shown in Figure 3, for a new image received from the client, the distortion classifier detects whether the image is pristine or contains any type of distortion (i.e., motion blur, Gaussian blur, or Gaussian noise). Then, based on the detected distortion type, one of the four recognition experts is triggered for the recognition. Each of the recognition experts is a Convolutional Neural Network (CNN) that has been fine-tuned to recognize a specific type of distorted images. The inference result of the expert is sent to the auxiliary-assisted multi-view ensembler for aggregation, and is stored in the results cache for reuse.

Correlated image lookup: to enable collaborative image recognition, the correlated image lookup module (Section 6.1) searches the anchor-time cache to find previous images that are spatially and temporally correlated with the new one. The anchor IDs are used as the references to identify the spatial correlation, while the timestamps are used to determine the temporal correlation. The correlated images could be images captured by different users or continuous image frames obtained by a single device. The IDs of the correlated images are forwarded to the multi-view ensembler.

Auxiliary-assisted multi-view ensembler: based on the identified image IDs, the multi-view ensembler retrieves the inference results of the correlated images from the results cache. Together with the recognition result of the new image, the retrieved results are aggregated by the multi-view ensemble learning module to improve the recognition accuracy. Specifically, as the correlated images suffer from various distortions with different severity levels, they are unequal in quality and lead to heterogeneity in the recognition accuracy. CollabAR leverages Auxiliary-assisted Multi-view Ensemble Learning (AMEL) (Section 6.2) to dynamically aggregate the heterogeneous results provided by the correlated images. Below, we describe the design of CollabAR in detail.

5 DISTORTION-TOLERANT RECOGNIZER
5.1 Image Distortion Classifier
Different types of distortion have distinct impacts on the frequency domain features of the original images. For instance, as discussed in Section 3.1, Gaussian blur can be considered as applying a two-dimensional circularly symmetric centralized Gaussian convolution to the original image in the spatial domain. This is equivalent to applying a Gaussian-weighted, circularly shaped low pass filter to the image, which filters out the high frequency components in the original image. Similarly, motion blur can be considered as a Gaussian-weighted, directional low pass filter that smooths the original image along the direction of the motion [7]. Lastly, the Fourier transform of additive Gaussian noise is approximately constant over the entire frequency domain, whereas the original image contains mostly low frequency information [38]. Hence, an image with severe Gaussian noise will have higher signal energy in the high frequency components.

Figure 4 gives the Fourier spectrum of images containing different distortions. The spectrum is shifted such that the DC-value is displayed in the center of the image. For any point in the spectrum, the farther away it is from the center, the higher is its corresponding frequency. The brightness of the point in the spectrum indicates the energy of the corresponding frequency component in the image. We can see that the pristine image contains components of all frequencies, and most of the high energy components are in the lower frequency domain. Differently, Gaussian blur removes the high frequency components of the image in all directions. The remaining two dominating lines in the spectrum, one passing vertically and one passing horizontally through the center of the spectrum, correspond to the vertical and horizontal patterns in the original image. Similarly, motion blur removes the high frequency components in the image along the direction of motion. Thus, in addition to the vertical and horizontal lines, we can see a strong diagonal line which is perpendicular to the direction of the motion blur. Lastly, Gaussian noise introduces more high energy components in the high frequency domain. We leverage these distinct frequency domain patterns for distortion classification.

Figure 4: The Fourier spectrum of images containing different distortions. Different distortions have distinct impacts on the frequency domain features of the images. The patterns are independent of the semantic content in the image.

Figure 5: The pipeline of the image distortion classification.

Figure 5 shows the pipeline of the image distortion classification. First, we convert the original RGB image into grayscale using the standard Rec. 601 luma coding method [39]. Then, we apply the two-dimensional discrete Fourier transform (DFT) to obtain the Fourier spectrum of the grayscale image, and shift the zero-frequency component to the center of the spectrum. The centralized spectrum is used as the input for the distortion classifier. As shown in Table 2, the distortion classifier adopts a shallow CNN architecture which consists of three convolutional layers (conv1, conv2, and conv3), two pooling layers (pool1 and pool2), one flatten layer, and one fully connected layer (fc). This shallow design avoids over-fitting and ensures low computation latency.

Table 2: The architecture of the image distortion classifier.
Layer    Size In            Size Out           Filter
conv1    224 × 224 × 3      111 × 111 × 32     3 × 3, 2
conv2    111 × 111 × 32     109 × 109 × 16     3 × 3, 1
pool1    109 × 109 × 16     54 × 54 × 16       2 × 2, 2
conv3    54 × 54 × 16       54 × 54 × 1        1 × 1, 1
pool2    54 × 54 × 1        27 × 27 × 1        2 × 2, 2
flatten  27 × 27 × 1        729                -
fc       729                4                  -
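A minimal TensorFlow/Keras sketch of this pipeline is given below. It assumes 'valid' padding and ReLU activations (which reproduce the layer sizes in Table 2) and a spectrum replicated to three channels to match the 224 × 224 × 3 input; these details are our assumptions, not stated in the paper.

```python
# Sketch of the distortion-classification pipeline: Rec. 601 grayscale -> 2D DFT ->
# centered log-magnitude spectrum -> shallow CNN (layer sizes follow Table 2).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def to_centered_spectrum(rgb):
    """rgb: 224x224x3 uint8 image (resized beforehand)."""
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114  # Rec. 601 luma
    spec = np.fft.fftshift(np.fft.fft2(gray))          # DFT with the DC component centered
    mag = np.log1p(np.abs(spec)).astype(np.float32)    # log magnitude for numerical range
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
    return np.repeat(mag[..., None], 3, axis=-1)       # replicate to 3 channels (assumption)

def build_distortion_classifier(num_classes=4):
    return models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Conv2D(32, 3, strides=2, activation="relu"),  # conv1 -> 111x111x32
        layers.Conv2D(16, 3, strides=1, activation="relu"),  # conv2 -> 109x109x16
        layers.MaxPooling2D(2, strides=2),                   # pool1 -> 54x54x16
        layers.Conv2D(1, 1, strides=1, activation="relu"),   # conv3 -> 54x54x1
        layers.MaxPooling2D(2, strides=2),                   # pool2 -> 27x27x1
        layers.Flatten(),                                    # 729
        layers.Dense(num_classes, activation="softmax"),     # pristine / MBL / GBL / GN
    ])
```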

5.2 Recognition Experts
Based on the output of the distortion classifier, one of the four dedicated recognition experts is selected for the image recognition. We consider either the lightweight MobileNetV2 or AlexNet as the base DNN model for the recognition experts. Although deeper models (e.g., ResNet [40] and VGGNet [41]) can achieve higher accuracy on pristine images, they are still susceptible to image distortions [9, 42]. Recent studies have shown more than a 40% accuracy drop on ResNet [42] and VGGNet [9] when the images contain distortion. In contrast, CollabAR ensures better robustness against distortions while maintaining a lower computation latency. We initialize all the CNN layers with the values trained on the full ImageNet dataset. Then, to fine-tune the experts on a target dataset, we add a fully connected layer as the last layer, with the number of units corresponding to the number of classes in the target dataset.

In the fine-tuning process, we first train the pristine expert, Expert_P, using pristine images in the target dataset. Then, the three distortion experts, i.e., the motion blur expert Expert_MBL, the Gaussian blur expert Expert_GBL, and the Gaussian noise expert Expert_GN, are initialized with the weights of Expert_P, and fine-tuned using the corresponding distorted images. During the fine-tuning, half of the images in the mini-batch are pristine and the other half are distorted with a random distortion level. This ensures better robustness against variations in the distortion level (i.e., it helps in minimizing the effect of domain-induced changes in feature distribution [15]) and helps the CNNs learn features from both pristine and distorted images [43]. For Expert_MBL, the distortion level is configured by varying the blur kernel length l and the motion angle a. Their values are randomly selected from the uniform distributions l ∼ U(0, 40) and a ∼ U(0, 180). Similarly, for Expert_GBL, the distortion level is controlled by the aperture length k ∼ U(0, 41). Lastly, for Expert_GN, the distortion level is decided by the variance of the additive Gaussian noise σ²_GN ∼ U(0, 0.04).
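As an illustration of this fine-tuning recipe, the sketch below samples a distortion type and level from the uniform ranges above and mixes pristine and distorted images half-and-half in each mini-batch. Applying the distortion itself would reuse the synthesis helpers sketched in Section 3.1; all function names here are ours.

```python
# Sketch of the half-pristine / half-distorted mini-batch construction used to
# fine-tune the recognition experts. Parameter ranges follow the text above;
# apply_distortion() stands in for the Section 3.1 synthesis helpers.
import random
import numpy as np

def sample_distortion(kind):
    if kind == "MBL":
        return {"l": random.randint(1, 40), "a": random.uniform(0, 180)}
    if kind == "GBL":
        return {"k": random.randint(1, 20) * 2 + 1}   # odd aperture size, 3..41
    return {"var": random.uniform(0, 0.04)}           # "GN"

def make_batch(images, kind, batch_size, apply_distortion):
    idx = np.random.choice(len(images), batch_size, replace=False)
    batch = []
    for j, i in enumerate(idx):
        img = images[i]
        if j >= batch_size // 2:                      # second half of the batch is distorted
            img = apply_distortion(img, kind, sample_distortion(kind))
        batch.append(img)
    return np.stack(batch)
```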

6 COLLABORATIVE MULTI-VIEW IMAGE RECOGNITION

Mobile users that are in close proximity usually exhibit a strong correlation in both device usage and spatial-temporal contexts [19, 25]. This is especially true in mobile AR scenarios, where the requested image recognition tasks are usually correlated temporally (e.g., consecutive image frames from the same user [5, 6, 44]) and spatially (e.g., different users aim to recognize the same object [10, 19]). For instance, in a museum, visitors that are in close proximity may use the AR application to recognize and augment information about the same artwork. Although images are taken from different positions or at different times, they contain the same object [19]. CollabAR exploits the spatial-temporal correlation among the images to enable collaborative multi-view image recognition. Note that CollabAR is not limited to the multi-user scenario. It can also aggregate the consecutive image frames captured by a single user as long as they are spatially and temporally correlated.

6.1 Correlated Image Lookup

Figure 6: An example of correlated image lookup. Images of the same artwork are taken from different poses, view 1 and view 2, at different times, t1 and t2. The anchors in the image views are used to check the spatial correlation, and the timestamps are used to check the temporal correlation.

6.1.1 Anchor-based Pose Estimation. To identify the spatially and temporally correlated images, CollabAR exploits Google ARCore [22] to estimate the pose (the position and orientation) of the device when the image was taken. When the user is moving in space, ARCore leverages a process called concurrent odometry and mapping (COM), which aggregates the image frames taken by the camera and the inertial data captured by the device's motion sensors to simultaneously estimate the orientation and location of the mobile device relative to the world.

Specifically, CollabAR leverages the Cloud anchors [45] provided by ARCore as the references for pose estimation. An anchor is a virtual identifier which describes a fixed pose in the physical space. The numerical pose estimation of the anchor is updated by ARCore over time, which makes the anchor a stable pose reference against cumulative tracking errors [2, 46]. In our design, anchors are created and managed by the administrator in advance. When running the application, anchors in the space can be resolved by the users to establish a common spatial reference among them. As a toy example, as shown in Figure 6, three cloud anchors have been placed in space to identify three different poses. When the user is taking an image of the artwork from view 1 at time t1, the anchors in the camera view will be sent to the edge and associated with the timestamp t1. As shown previously in Figure 3, the anchors and the timestamps are stored in the anchor-time cache in the format of <imageID, {anchorIDs}, timestamp>. When a new image, image 2, is taken at time t2 from view 2, either by the same or by a different user, the spatial-temporal image lookup module searches the anchor-time cache to find previous images that are spatially and temporally correlated with the new one.

6.1.2 Spatial-temporal Image Lookup. For any image in the cache, image_Cached, it is considered as spatially correlated with the new image, image_New, if the two images satisfy:

{anchorIDs}_New ∩ {anchorIDs}_Cached ≠ ∅,   (1)

where {anchorIDs}_New and {anchorIDs}_Cached are the sets of anchors that appear in the views of image_New and image_Cached, respectively. If two images contain the same anchor in their views, they are spatially correlated. For instance, in Figure 6, image 1 contains the same anchor, anchor#2, that also appears in the view of image 2. Thus, they are spatially correlated. Similarly, the cached image is considered as temporally correlated to the new image if the images satisfy:

Δt = t_New − t_Cached  and  Δt ≤ T_fresh,   (2)

where t_New and t_Cached are the timestamps of the new and cached images, respectively. Δt is the freshness of image_Cached with respect to image_New. If the freshness Δt is within the freshness threshold, T_fresh, image_Cached is considered as temporally correlated with image_New. Note that the setting of the freshness threshold T_fresh varies with the application scenario. In cases where the object at a given position changes frequently, e.g., animals in a zoo, we should use a lower freshness threshold. Differently, in scenarios where the positions of the objects do not change very often, e.g., exhibited items in a museum or large appliances in a building, a higher freshness threshold can be tolerated. An image in the anchor-time cache is considered as spatially and temporally correlated to the new image if they satisfy Equations 1 and 2 at the same time. The IDs of the identified correlated images are forwarded to the multi-view ensembler for aggregation.
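A minimal sketch of this lookup over the anchor-time cache, checking Equation 1 (shared anchors) and Equation 2 (freshness); the cache layout mirrors the earlier sketch and all names are ours.

```python
# Sketch of the spatial-temporal image lookup (Equations 1 and 2). Cache entries
# mirror the <imageID, {anchorIDs}, timestamp> records described above.
from typing import Dict, List, Set

def lookup_correlated(new_anchor_ids: Set[str],
                      new_timestamp: float,
                      anchor_time_cache: Dict[str, dict],
                      t_fresh: float) -> List[str]:
    """Return IDs of cached images that share an anchor (Eq. 1) and are fresh (Eq. 2)."""
    correlated = []
    for image_id, entry in anchor_time_cache.items():
        spatially_correlated = bool(new_anchor_ids & entry["anchor_ids"])   # Eq. 1
        dt = new_timestamp - entry["timestamp"]
        temporally_correlated = 0 <= dt <= t_fresh                          # Eq. 2
        if spatially_correlated and temporally_correlated:
            correlated.append(image_id)
    return correlated

# Example: an image sharing anchor "a2" and taken within the last 30 seconds.
cache = {"img1": {"anchor_ids": {"a1", "a2"}, "timestamp": 100.0}}
print(lookup_correlated({"a2", "a3"}, 120.0, cache, t_fresh=30.0))  # ['img1']
```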

6.2 Auxiliary-assisted Multi-view Ensemble Learning

In image recognition, ensembling the results of multiple base classifiers is a widely used approach to improve the overall recognition accuracy [27, 47]. Following this trend, CollabAR aggregates the recognition results of the spatially and temporally correlated images to improve the recognition accuracy of the current image.

As shown previously in Figure 3, using the image IDs identified by the correlated image lookup module as the key, the multi-view ensembler retrieves the results of the correlated images from the results cache. This is done by retrieving any stored record in the results cache, i.e., <imageID, inferenceResult>, whose imageID matches one of the identified images. As shown in Figure 7, assuming that m − 1 correlated images are identified, the inference result of the current image (the probability vector P_m) is aggregated with that of the m − 1 correlated images (P_1, ..., P_{m−1}) by the ensembler.

Figure 7: The architecture of the Auxiliary-assisted Multi-view Ensemble Learning (AMEL).

However, given the heterogeneity of the m images (i.e., the images are captured from different angles, and suffer from different distortions with different distortion levels), the images lead to unequal recognition performance. To quantify their performance and help the ensembler in aggregating the results dynamically, auxiliary features [28, 48] can be used. We propose to use the normalized entropy as the auxiliary feature. The normalized entropy S_k measures the recognition quality and the confidence of the distortion-tolerant image recognizer on the recognition of the k-th image instance. S_k can be calculated from the probability vector as follows:

S_k(P_k) = − Σ_{i=1}^{|C|} (p_i log p_i) / log|C|,   (3)

where P_k = {p_1, ..., p_|C|} is the probability vector of the k-th base learner on the image instance, and |C| is the number of object categories in the dataset. The value of the normalized entropy is between 0 and 1. A value close to 0 indicates that the distortion-tolerant image recognizer is confident about its recognition of the image instance, whereas a value close to 1 means that it is not confident.

In AMEL, the set of normalized entropies, (S_1, ..., S_m), is used as the weights in averaging the heterogeneous outputs of the spatially and temporally correlated images. Specifically, given the current image and the m − 1 correlated images, the final aggregated recognition output P can be expressed as P = Σ_{i=1}^{m} (1 − S_i) P_i, where P_i is the probability vector of the i-th image and (1 − S_i) is the associated weight. This allows AMEL to dynamically adjust the weights in aggregating the m images during the classification. As the images result in unequal recognition accuracy, AMEL is more robust against this variance than standard averaging methods, which would assign equal weights to the multi-view images during the aggregation [26]. We evaluate AMEL in Section 8.3.3.
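A minimal NumPy sketch of Equation 3 and the entropy-weighted aggregation P = Σ_i (1 − S_i) P_i described above (the function names are ours):

```python
# Sketch of AMEL's auxiliary feature (normalized entropy, Eq. 3) and the
# entropy-weighted aggregation of the m probability vectors.
import numpy as np

def normalized_entropy(p, eps=1e-12):
    """S_k in [0, 1]: 0 = confident prediction, 1 = uniform (unconfident) prediction."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def amel_aggregate(prob_vectors):
    """prob_vectors: list of m per-class probability vectors of the correlated images."""
    weights = np.array([1.0 - normalized_entropy(p) for p in prob_vectors])
    aggregated = sum(w * np.asarray(p) for w, p in zip(weights, prob_vectors))
    return int(np.argmax(aggregated)), aggregated

# Example: a confident view dominates an unconfident one.
views = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]
print(amel_aggregate(views))
```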

7 MULTI-VIEW MULTI-DISTORTION IMAGE DATASET

We create a Multi-View Multi-Distortion image Dataset (MVMDD) to study the impact of image distortion on multi-view AR. Our dataset includes a pristine image set and an augmented distortion image set. The details are summarized in Table 3.

Table 3: Summary of the MVMDD dataset. It has 32,400 images in total, including 1,296 pristine and 31,104 distorted images that are generated using data augmentation.

Pristine image set
  Object categories          6
  Number of views            6
  Background complexity      2
  Size of object in image    3
  Number of instances        6
  Total pristine images      6 × 6 × 2 × 3 × 6 = 1,296
Augmented image set
  Types of distortion        3
  Distortion levels          8
  Total augmented images     1,296 × 3 × 8 = 31,104
Total images                 32,400

Pristine image set: we collect pristine images (i.e., images without distortion) using a commodity Nokia 7.1 smartphone. Six categories of everyday objects are considered: cup, phone, bottle, book, bag, and pen. Each category has six instances. For each instance, images are taken from six different views (six different angles with a 60° angle difference between any two adjacent views), two different background complexity levels (a clear white table background and a noisy background containing other non-target objects), and three distances. We adjust the distance between the camera and the object such that the sizes of the object in the images are different. For the three distances, the object occupies approximately the whole, half, and one-tenth of the total area of the image. The resolution of the images is 3024 × 4032. As summarized in Table 3, we collect 6 × 6 × 6 × 3 × 2 = 1,296 pristine images in total. Figure 8 provides examples of the collected images.

Figure 8: Examples of the pristine images that are collected in our MVMDD dataset.

Data augmentation for image distortion: to ensure robustness against image distortion, a large volume of distorted images is required to train the DNNs. However, in practice, it is difficult to collect a large number of images with different types of distortions. To address this challenge, we apply the three distortion models introduced in Section 3.1 to synthesize images with different distortions. We consider eight severity levels for each distortion type. Specifically, for motion blur, we adjust the motion blur kernel length l from 5 to 40, with a step size of 5, to create eight levels of blur. Similarly, we configure the aperture size of the 2D circularly symmetric centralized Gaussian function from 6 to 41, with a step size of 5, for the Gaussian blur, and change the variance of the zero-mean Gaussian distribution from 0.005 to 0.04, with a step size of 0.005, for the Gaussian noise. As summarized in Table 3, we generate 1,296 × 3 × 8 = 31,104 distorted images in total. The dataset is publicly available at https://github.com/CollabAR-Source/MVMDD.

8 EVALUATION
8.1 Experimental Setup
8.1.1 Implementation. The client of CollabAR is implemented on Android smartphones. We leverage the Google ARCore SDK to realize the anchor-based pose estimation module introduced in Section 6.1.1. The edge server is a desktop with an Intel i7-8700K CPU and an Nvidia GTX 1080 GPU. We realize the server of CollabAR using the Python Flask framework. The distortion-tolerant image recognizer and the multi-view ensembler are implemented using Keras 2.3 on top of the TensorFlow 2.0 framework. The client and the server communicate through the HTTP protocol.
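To illustrate the client-server exchange described above, here is a minimal Flask endpoint sketch; the route name, request field names, and the placeholder recognizer are our own assumptions for illustration, not the authors' API.

```python
# Sketch of an edge-server endpoint that receives an image plus the anchor IDs
# in the camera view and returns a recognition result. All names are illustrative.
import time
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
anchor_time_cache = {}   # imageID -> ({anchorIDs}, timestamp)
results_cache = {}       # imageID -> inference result (probability vector)

def recognize_image(image_bytes):
    # Placeholder for the distortion classifier + expert selection of Section 5.
    return [0.25, 0.25, 0.25, 0.25]

@app.route("/recognize", methods=["POST"])
def recognize():
    image_bytes = request.files["image"].read()
    anchor_ids = set(filter(None, request.form.get("anchor_ids", "").split(",")))
    image_id, timestamp = str(uuid.uuid4()), time.time()

    probs = recognize_image(image_bytes)
    anchor_time_cache[image_id] = (anchor_ids, timestamp)   # <imageID, {anchorIDs}, timestamp>
    results_cache[image_id] = probs                          # <imageID, inferenceResult>
    # The correlated-image lookup and AMEL aggregation (Section 6) would run here.
    return jsonify({"image_id": image_id, "probabilities": probs})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```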

8.1.2 Benchmark datasets. Four datasets are considered in the evaluation: Caltech-256 [12], MIRO [49], MobileDistortion (collected in Section 3.3), and MVMDD (collected in Section 7). The Caltech-256 dataset has 257 categories of objects with more than 80 instances per category. The MIRO dataset has 12 categories of objects with ten instances per category. For each instance, it contains multi-view images that are captured from ten elevation angles and 16 azimuth angles. In the evaluation, we randomly select six distinct angles to represent six different views. Our own MVMDD dataset has six categories of objects with six instances per category. Each object instance is captured from six views and three different distances, with two different backgrounds. In addition to the original pristine images, for both Caltech-256 and MIRO, we apply the data augmentation methods (Section 3.1) to generate new sets of distorted images. Lastly, we also prepare 300 image instances for each distortion type from the MobileDistortion image set we collected in Section 3.3. We use it to examine the distortion classifier on distorted images in real-world mobile AR scenarios.

8.2 Distortion Classifier Performance
First, we evaluate the performance of the image distortion classifier using all four benchmark datasets. We perform 3-fold cross validation on each of the four datasets.

The confusion matrix and the accuracy of the image distortion classifier are shown in Figure 9. The classifier achieves 95.5%, 94.2%, 92.9%, and 93.3% accuracy on the MobileDistortion, Caltech-256, MIRO, and MVMDD datasets, respectively. Moreover, the confusion matrix indicates that most of the classification errors happen in differentiating motion blur (MBL) and Gaussian blur (GBL). As shown in Figure 9, for the four different datasets, 15.3%, 3.9%, 18.1%, and 15.3% of the GBL-distorted images are misclassified as MBL. In our experiments we observe that MBL and GBL can be misclassified as each other regardless of the distortion level. This is due to the similarity between these two distortions. Specifically, MBL and GBL have a similar impact on the image spectrum and result in misclassification when the direction of the motion blur is parallel or perpendicular to the central horizontal line of the image. In addition, we also notice that when the distortion level is low, i.e., l < 10 for MBL and σ²_GN < 0.01 for GN, the distorted images are sometimes misclassified as pristine images. This is intuitive, as the impact of the image distortion on the Fourier spectrum is limited when the distortion level is low, and thus results in a spectrum feature similar to that of pristine images. Overall, on all four image datasets, the distortion classifier achieves 94% accuracy on average.

Figure 9: Confusion matrix and accuracy of the image distortion classifier on different datasets: (a) MobileDistortion, (b) Caltech-256, (c) MIRO, and (d) MVMDD. On average, the distortion classifier achieves 94% accuracy on both real-world and synthesized distorted images.

8.3 Image Recognition Accuracy
Below, we evaluate the image recognition accuracy of CollabAR. As Caltech-256 does not contain multi-view images, we use MIRO and MVMDD as the datasets. We first examine the performance of the distortion-tolerant image recognizer, followed by the evaluation of CollabAR in multi-view and multiple-distortion scenarios. We perform 3-fold cross validation in this experiment.

8.3.1 Performance of Distortion-tolerant Image Recognizer. We examine the recognition accuracy of both MobileNet-based and AlexNet-based implementations. We consider the single-distortion scenario and gradually increase the distortion level to investigate its impact on the recognition accuracy.

The results are shown in Figure 10. First, regardless of the distortion type, the MobileNet-based implementation outperforms the AlexNet-based one by 15% and 40% on MVMDD and MIRO, respectively. The improvement can be attributed to the use of the depthwise separable convolution as an efficient building block for image feature extraction, and the use of a linear bottleneck for better feature transformation [4]. Second, MobileNet-based recognition experts are more robust against image distortions. With the highest examined distortion level, i.e., l = 40 for MBL, k = 41 for GBL, and σ²_GN = 0.04 for GN, we achieve 94.4% and 86.9% accuracy on average across the three distortions on MVMDD and MIRO, respectively. However, we still experience a modest accuracy drop when the distortion level is high. With the highest examined distortion level, CollabAR suffers 5% and 9% drops in accuracy on average on the two datasets, respectively. The accuracy drop will become much more severe when the image contains multiple distortions, but it can be resolved with the help of multi-view collaboration.

Figure 10: Accuracy of the distortion-tolerant image recognizer when the image contains (a) motion blur, (b) Gaussian blur, and (c) Gaussian noise, respectively. The MobileNet-based implementation outperforms the AlexNet-based one by 15% and 40% on the MVMDD and MIRO datasets, respectively.

8.3.2 Performance of Multi-view Collaboration. Below, we evaluate the performance of CollabAR in the multi-view scenario, where spatially and temporally correlated images are aggregated to improve the single-view accuracy. Moreover, we investigate how multi-view collaboration can help in multiple-distortion cases.

Figure 11: Accuracy of CollabAR in the multi-view single-distortion scenario on the (a) MIRO and (b) MVMDD datasets.

Setup: we apply the MobileNet-based implementation for the recognition experts, as it demonstrates a higher accuracy than the AlexNet-based one. To microbenchmark the end-to-end system design, we evaluate CollabAR with two design options: (1) with and without the image distortion classifier, and (2) with and without the auxiliary feature for the multi-view ensembler. We consider up to six views to be aggregated for collaborative recognition.

Multi-view Single-distortion: in the single-distortion case, the image of each view contains one of the three distortions (i.e., MBL, GBL, or GN). We set the image distortion to a high level (i.e., 30 < l < 40 for MBL, 31 < k < 41 for GBL, and 0.03 < σ²_GN < 0.04 for GN) to evaluate CollabAR with the most severe distortions.

The performance is shown in Figure 11. First, with the image distortion classifier, CollabAR can correctly select the recognition expert that is dedicated to the distortion in the examined image, which significantly improves the recognition accuracy. Across different numbers of aggregated views, CollabAR with the distortion classifier improves the overall accuracy by 20% to 60% compared to the case without a distortion classifier. Second, with the auxiliary feature, CollabAR can dynamically adjust the weights in the multi-view aggregation, and improves the overall accuracy by up to 5% and 14% on MIRO and MVMDD, respectively. Lastly, the accuracy increases with the number of aggregated views: when CollabAR incorporates both the distortion classifier and the auxiliary feature, the overall accuracy improves by 15% and 7% with six views aggregated, on MIRO and MVMDD, respectively.
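The weighted aggregation described above can be pictured with the following hedged sketch: each view contributes a softmax probability vector, and a per-view weight (here a scalar stand-in for the learned auxiliary feature used by the multi-view ensembler) scales its contribution before summing. The class names, weights, and probabilities are purely illustrative.

```python
# Illustrative weighted multi-view aggregation; not the exact AMEL ensembler.
import numpy as np

def aggregate_views(probs, weights):
    """probs: (n_views, n_classes) softmax outputs; weights: (n_views,) per-view scores."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                      # normalize the view weights
    fused = (w[:, None] * np.asarray(probs)).sum(axis=0)
    return int(np.argmax(fused)), fused

# Example: three views, the first heavily distorted and hence down-weighted.
probs = [[0.30, 0.40, 0.30],   # bad view, nearly uninformative
         [0.10, 0.80, 0.10],   # good view
         [0.15, 0.70, 0.15]]   # good view
label, fused = aggregate_views(probs, weights=[0.1, 0.5, 0.4])
print(label, fused)            # 1, [0.14, 0.72, 0.14]
```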

Multi-view Multi-distortion: below, we examine CollabAR, with both the distortion classifier and the auxiliary feature, in the multiple-distortion scenario. We consider four multiple-distortion combinations: MBL+GN, GBL+GN, GBL+MBL, and GBL+MBL+GN.

Figure 12 shows the performance of CollabAR given different distortion combinations in the testing image. The distortion is set to a modest level (i.e., 5 < l < 15 for MBL, 6 < k < 16 for GBL, and 0.005 < σ²_GN < 0.015 for GN). First, we observe a significant drop in the accuracy compared to that in the single-distortion scenario. This is expected: since the dedicated recognition experts are fine-tuned on images with specific types of distortions, multiple distortions will cause a mismatch in the feature distribution between the corrupted testing images and the dedicated recognition experts, and will thus result in a performance drop. Fortunately, the accuracy improves with the number of views aggregated. For different combinations of image distortions, the accuracy can improve by up to 16.5% and 20.3% on MIRO and MVMDD, respectively.

Figure 12: Accuracy in the multi-view multiple-distortion scenario. With six views aggregated, the accuracy improves by up to 16.5% and 20.3% on MIRO and MVMDD, respectively.

Figure 13: Accuracy improvement of CollabAR in the multi-view multi-distortion scenario given different distortion combinations and distortion levels (i.e., by configuring l and k from 5 to 40 with a step size of 10). With six views aggregated, the overall accuracy is improved by 9.5% on average across different scenarios.

Figure 13 shows the accuracy improvements of the multi-view aggregation given different distortion combinations and distortion levels. Overall, with six views aggregated, the overall recognition accuracy can be improved by 9.5% on average. Specifically, the improvements become modest when the distortion level is high. This is reasonable: if all the aggregated images suffer from severe distortions, simply aggregating the multi-view results would not improve the overall performance by much. One exception in our experiment is 'GBL+MBL': our distortion-tolerant image recognizer is already robust against 'GBL+MBL' when the distortion level is low (as shown in Figure 12), and thus we achieve a higher improvement when the distortion level of 'GBL+MBL' is higher. In practical mobile AR scenarios, we believe that the probability of all the correlated images suffering severe multiple distortions is fairly low. Thus, as will be shown in the following section, CollabAR achieves over 96% accuracy as long as there is one low-distortion image in the multi-view collaboration.

8.3.3 Advantages of AMEL. Below, we investigate the performance of CollabAR in heterogeneous multi-view multiple-distortion scenarios, where images from different views are diverse in distortion type and distortion level. We define a 'good view' if the image is pristine, and a 'bad view' if the image contains high distortion levels (i.e., 30 < l < 40 for MBL, 31 < k < 41 for GBL, and 0.03 < σ²_GN < 0.04 for GN). We also demonstrate the advantages of the proposed auxiliary-assisted multi-view ensemble over the conventional ensemble method without the auxiliary feature.

Figure 14: CollabAR accuracy on the MVMDD dataset with and without the auxiliary feature. The images contain multiple distortions and are diverse in the distortion level.

Figure 14 compares the performance of CollabAR on the MVMDD dataset with and without the auxiliary feature. CollabAR always achieves a higher accuracy with the auxiliary feature. The improvement is most pronounced when there are more 'bad views' than 'good views' in the multi-view aggregation (e.g., '1g2b', '1g3b', '2g3b', and '2g4b', where '1g2b' refers to 'one good view and two bad views'). For instance, as shown in Figure 14(d), in a scenario where the image contains three types of distortion, the auxiliary feature can improve the performance by 8% on average and up to 26% in a more diverse case ('1g3b'). Overall, CollabAR achieves 96% and 88% accuracy on average, with and without the auxiliary feature, respectively. We achieve similar results on the MIRO dataset.

8.4 System Profiling
Below, we present a comprehensive profiling of CollabAR in terms of computation and communication latency. We consider two deployments of CollabAR: (1) the whole system is deployed on the mobile client, and (2) the edge-assisted design in which the computation-intensive recognition pipeline is deployed on the server. We implement CollabAR on four different platforms. In addition to the edge server, we also consider three different commodity smartphones, Nokia 7.1, Google Pixel 2 XL, and Xiaomi 9, as the mobile clients. The three smartphones are heterogeneous in both computation and communication performance. We leverage TensorFlow Lite [50] as the framework when implementing CollabAR on the smartphones.

8.4.1 Computation Latency. To examine the computation latency of CollabAR, we break its image recognition pipeline down into two processing components: (1) the image distortion classifier and (2) the AMEL image recognition. Two possible implementations of AMEL, i.e., AlexNet-based and MobileNetV2-based, are considered. We run 30 trials of the end-to-end image recognition pipeline and report the average time consumed by each of the components. Table 4 shows the computation latency (in ms) when we deploy the system on different hardware platforms.


Table 4: The computation latency (in ms) of CollabAR when deployed on different hardware platforms. The edge-assisted design achieves the lowest overall latency of 8.1ms.

Processing component | Processing unit | Edge server | Nokia 7.1 | Pixel 2 XL | Xiaomi 9
AMEL (AlexNet) | CPU | 48.0 | 255.4 | 189.9 | 124.2
AMEL (AlexNet) | GPU | 6.8 | 331.0 | 127.5 | 71.2
AMEL (MobileNetV2) | CPU | 28.1 | 108.4 | 81.9 | 33.7
AMEL (MobileNetV2) | GPU | 4.7 | 46.9 | 26.0 | 33.1
Distortion classifier | CPU | 4.4 | 29.3 | 22.33 | 13.1
Distortion classifier | GPU | 3.4 | 20.3 | 8.7 | 8.4
Lowest overall latency | CPU | 32.5 | 137.7 | 104.2 | 46.8
Lowest overall latency | GPU | 8.1 | 67.2 | 34.7 | 41.5

First, regarding the latency of AMEL, the MobileNet-based design always achieves lower latency than the AlexNet-based one for the same hardware platform and processing unit. This is because MobileNetV2 is more lightweight and is well-optimized for running on resource-limited devices [51]. Second, the edge server achieves the lowest overall computation latency of 8.1ms when using its GPU for the computation. Our edge-assisted design achieves a more-than-fourfold speedup over the best examined mobile platform. Although the latency on individual mobile platforms can be reduced when more efficient DNN models and more powerful mobile GPUs are used, the edge-assisted design ensures low computation latency even when mobile devices are heterogeneous in computation power.
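The per-component timing procedure described above (repeated trials, averaged per component) could be reproduced along the lines of the following hedged sketch using the TensorFlow Lite Python interpreter; the .tflite file names are placeholders, not files shipped with CollabAR.

```python
# Sketch of per-component latency profiling with the TensorFlow Lite interpreter,
# mirroring the 30-trial averaging described above.
import time
import numpy as np
import tensorflow as tf

def mean_latency_ms(model_path, trials=30):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # random input of the right shape
    times = []
    for _ in range(trials):
        interpreter.set_tensor(inp["index"], dummy)
        start = time.perf_counter()
        interpreter.invoke()                                    # one forward pass
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(times))

# e.g., mean_latency_ms("distortion_classifier.tflite"), mean_latency_ms("amel_expert.tflite")
```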

8.4.2 Communication Latency. Communication latency becomes a factor when we deploy the recognition process on the edge. There are three phases involved in a single end-to-end image recognition:

• Network delay: the client transmits either the resized image or the original image to the edge server, and the server sends the recognition result back to the client.

• Image resizing: the resizing process is needed when the client decides to transmit the resized image. It converts the original image from 3,024×4,032×3 to the size of 224×224×3.

• Image encoding: the encoding compresses the raw image using lossless JPEG compression.

In our experiment, the client and the server are connected through a single-hop Wi-Fi network in our lab. We measure the latency when using both the 2.4GHz and 5GHz wireless channels. Table 5 shows the latency of the mobile clients when communicating with the edge server. First, because of its higher data rate, the 5GHz channel achieves a lower latency than the 2.4GHz channel. Second, compared to resizing and sending the resized image, sending the original image directly to the server increases the latencies in both network delay and image encoding. Overall, with more advanced wireless chips, Pixel 2 XL and Xiaomi 9 achieve similar round-trip communication latencies of 13.4ms and 9.7ms, respectively, while Nokia 7.1 has the highest latency of 24.6ms. The communication latency is largely affected by the wireless network condition; our current measurement is conducted under a modest background traffic load. To address the high communication latency in congested networks, we can compress the images with a higher ratio. For instance, one can apply a higher degree of compression to parts of the image that are less likely to contain important features [5].
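The resizing and encoding phases above can be timed with a short sketch like the one below (a hedged illustration with a synthetic stand-in frame; the numbers will of course differ from Table 5, which was measured on the phones themselves).

```python
# Sketch of timing the client-side resizing and JPEG encoding steps:
# an original frame of 3,024x4,032x3 reduced to 224x224x3, then JPEG-encoded.
import time
import cv2
import numpy as np

frame = np.random.randint(0, 256, (3024, 4032, 3), dtype=np.uint8)   # stand-in camera frame

t0 = time.perf_counter()
resized = cv2.resize(frame, (224, 224))
t1 = time.perf_counter()
ok, payload = cv2.imencode(".jpg", resized)                          # bytes sent to the edge server
t2 = time.perf_counter()

print(f"resize: {(t1 - t0) * 1e3:.1f} ms, encode: {(t2 - t1) * 1e3:.1f} ms, payload: {payload.size} bytes")
```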

Table 5: Communication latency (in ms) for different clients. The results are averaged over 30 trials. For each recognition request, the clients take 10ms to 25ms in communication.

Content | Setting | Nokia 7.1 | Pixel 2 XL | Xiaomi 9
Image resizing | - | 4.1 | 2.5 | 2.3
Image encoding | Resized image | 10.3 | 4.6 | 2.4
Image encoding | Original image | 722.5 | 348.1 | 182.5
Network delay (Resized image) | 2.4GHz | 14.3 | 6.6 | 5.0
Network delay (Resized image) | 5GHz | 10.2 | 6.3 | 5.0
Network delay (Original image) | 2.4GHz | 120.7 | 60.3 | 32.0
Network delay (Original image) | 5GHz | 91.2 | 29.1 | 30.4
Lowest round-trip latency | - | 24.6 | 13.4 | 9.7

8.4.3 End-to-end Latency. Putting it all together, when executing on the edge, the end-to-end system latency is 32.7ms, 21.5ms, and 17.8ms for Nokia 7.1, Pixel 2 XL, and Xiaomi 9, respectively. The results indicate that our edge-assisted design can provide accurate image recognition without sacrificing the end-to-end system latency. Overall, for all three commodity smartphones that we have examined, the edge-assisted design allows us to achieve 30fps continuous image recognition.
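For reference, these end-to-end figures are consistent with summing each client's lowest round-trip communication latency (Table 5) and the lowest edge computation latency (Table 4, 8.1ms on the GPU), and all three remain below the per-frame budget that 30fps imposes:

```latex
\begin{align*}
T_{\text{Nokia 7.1}}  &= 24.6 + 8.1 = 32.7\ \text{ms}\\
T_{\text{Pixel 2 XL}} &= 13.4 + 8.1 = 21.5\ \text{ms}\\
T_{\text{Xiaomi 9}}   &= 9.7 + 8.1 = 17.8\ \text{ms}\\
T_{\text{budget}}     &= 1000/30 \approx 33.3\ \text{ms}
\end{align*}
```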

9 DISCUSSION
In this section, we discuss our limitations and outline potential solutions for future work.

Real-world image distortions. Currently, CollabAR is evaluated with distorted images that are synthesized using simple distortion models (namely, the zero-mean additive Gaussian noise model for Gaussian noise [32], the non-uniform motion blur kernel for Motion blur [16], and the two-dimensional Gaussian kernel for Gaussian blur [17]). However, there is a non-negligible gap between artificial and real-world distortions [42, 52]. Inevitably, this gap may result in poor recognition performance when the system is applied to 'in-the-wild' AR scenarios. To alleviate this problem, instead of modeling the image distortions using simple noise models, one can leverage a Generative Adversarial Network (GAN) to generate more realistic image distortions using real-world distorted images as input [52–54]. The distorted images generated by the GAN can then be used to train CollabAR and make it more robust to real-world distortions.

Mobile device heterogeneity. By aggregating multi-view images, CollabAR achieves a large improvement in recognition accuracy. However, the current evaluation is performed on multi-view images taken with a single type of device (i.e., Nokia 7.1). In practice, owing to heterogeneity in the underlying hardware and signal processing pipelines, different mobile devices show variations in their image outputs for the same physical object or scene [55]. For an image recognition system, this device heterogeneity may lead to feature mismatch among images captured by different devices. For future work, a potential solution is to leverage cycle-consistent adversarial networks [53, 56] to learn the translation function among different mobile devices, and to use it to reduce the domain shift caused by the device heterogeneity.

Impact of image offloading on recognition. During image offloading, the information loss due to the image resizing and compression processes (e.g., JPEG compression) may also lead to accuracy degradation [42, 54]. However, its impact on CollabAR is small. In our design, since all images are processed with the same offloading procedure, they suffer the same information loss during resizing and compression. The domain shift and feature mismatch between the training and testing images are largely eliminated, and thus the accuracy loss due to image offloading is negligible.

10 CONCLUSION
In this paper, we presented CollabAR, an edge-assisted collaborative image recognition system that enables distortion-tolerant image recognition with imperceptible system latency for mobile augmented reality. To achieve this goal, we proposed distortion-tolerant image recognition to improve robustness against real-world image distortions, and collaborative multi-view image recognition to improve the overall recognition accuracy. We implemented the end-to-end system on four different commodity devices and evaluated its performance on two multi-view image datasets. Our evaluation demonstrates that CollabAR achieves over 96% recognition accuracy for images with severe distortions, while reducing the end-to-end system latency to as low as 17.8ms.

ACKNOWLEDGMENTS
We would like to thank our anonymous reviewers and shepherd for their insightful comments and guidance on making the best possible representation of our work. This work was supported in part by the Lord Foundation of North Carolina and by NSF awards CSR-1903136 and CNS-1908051.

REFERENCES
[1] P. Jain, J. Manweiler, and R. Roy Choudhury, "OverLay: Practical mobile augmented reality," in ACM MobiSys, 2015.
[2] K. Chen, T. Li, H.-S. Kim, D. E. Culler, and R. H. Katz, "Marvel: Enabling mobile augmented reality with low energy and low latency," in ACM SenSys, 2018.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in IEEE CVPR, 2018.
[5] L. Liu, H. Li, and M. Gruteser, "Edge assisted real-time object detection for mobile augmented reality," in ACM MobiCom, 2019.
[6] M. Xu, M. Zhu, Y. Liu, X. Lin, and X. Liu, "DeepCache: Principled cache for mobile deep vision," in ACM MobiCom, 2018.
[7] H. Ji and C. Liu, "Motion blur identification from image gradients," in IEEE CVPR, 2008.
[8] "i-MARECULTURE," https://imareculture.eu.
[9] S. F. Dodge and L. J. Karam, "Quality robust mixtures of deep neural networks," IEEE Transactions on Image Processing, 2018.
[10] W. Zhang, B. Han, P. Hui, V. Gopalakrishnan, E. Zavesky, and F. Qian, "CARS: Collaborative augmented reality for socialization," in ACM HotMobile, 2018.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: A large-scale hierarchical image database," in IEEE CVPR, 2009.
[12] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[13] S. Ghosh, R. Shet, P. Amon, A. Hutter, and A. Kaup, "Robustness of deep convolutional neural networks for image degradations," in IEEE ICASSP, 2018.
[14] T. S. Borkar and L. J. Karam, "DeepCorrect: Correcting DNN models against image distortions," IEEE Transactions on Image Processing, 2019.
[15] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting visual category models to new domains," in Springer ECCV, 2010.
[16] J. Sun, W. Cao, Z. Xu, and J. Ponce, "Learning a convolutional neural network for non-uniform motion blur removal," in IEEE CVPR, 2015.
[17] J. Flusser, S. Farokhi, C. Höschl, T. Suk, B. Zitová, and M. Pedone, "Recognition of images degraded by Gaussian blur," IEEE Transactions on Image Processing, 2015.
[18] W. Zhang, B. Han, and P. Hui, "Jaguar: Low latency mobile augmented reality with flexible tracking," in ACM MM, 2018.
[19] P. Guo, B. Hu, R. Li, and W. Hu, "FoggyCache: Cross-device approximate computation reuse," in ACM MobiCom, 2018.
[20] "HoloLens," https://www.microsoft.com/en-us/hololens.
[21] "Magic Leap," https://www.magicleap.com/.
[22] "Google ARCore," https://developers.google.com/ar/.
[23] "Apple ARKit," https://developer.apple.com/documentation/arkit.
[24] X. Ran, C. Slocum, M. Gorlatova, and J. Chen, "ShareAR: Communication-efficient multi-user mobile augmented reality," in ACM HotNets, 2019.
[25] H. Verkasalo, "Contextual patterns in mobile service usage," Personal and Ubiquitous Computing, 2009.
[26] Z. Zhou, Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.
[27] S. Teerapittayanon, B. McDanel, and H.-T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in IEEE ICDCS, 2017.
[28] N. F. Rajani and R. J. Mooney, "Stacking with auxiliary features," in IJCAI, 2017.
[29] M. Labbe and F. Michaud, "Appearance-based loop closure detection for online large-scale and long-term operation," IEEE Transactions on Robotics, 2013.
[30] X. Zeng, K. Cao, and M. Zhang, "MobileDeepPill: A small-footprint mobile deep learning system for recognizing unconstrained pill images," in ACM MobiSys, 2017.
[31] H. Qiu, X. Liu, S. Rallapalli, A. J. Bency, K. Chan, R. Urgaonkar, B. Manjunath, and R. Govindan, "Kestrel: Video analytics for augmented multi-camera vehicle tracking," in IEEE IoTDI, 2018.
[32] W. Liu and W. Lin, "Additive white Gaussian noise level estimation in SVD domain for images," IEEE Transactions on Image Processing, 2012.
[33] Y. Zhou, S. Song, and N. Cheung, "On classification of distorted images with deep convolutional neural networks," in IEEE ICASSP, 2017.
[34] S. Dodge and L. Karam, "Human and DNN classification performance on images with quality distortions: A comparative study," ACM Transactions on Applied Perception, 2019.
[35] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Transactions on Image Processing, 2012.
[36] R. Liu, Z. Li, and J. Jia, "Image partial blur detection and classification," in IEEE CVPR, 2008.
[37] X. Yi and M. Eramian, "LBP-based segmentation of defocus blur," IEEE Transactions on Image Processing, 2016.
[38] A. Jain, Fundamentals of Digital Image Processing. Prentice Hall, 1989.
[39] C. Poynton, Digital Video and HD: Algorithms and Interfaces. Elsevier, 2012.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016.
[41] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[42] Z. Sun, M. Ozay, Y. Zhang, X. Liu, and T. Okatani, "Feature quantization for defending against distortion of images," in IEEE CVPR, 2018.
[43] X. Peng, J. Hoffman, Y. Stella, and K. Saenko, "Fine-to-coarse knowledge transfer for low-res image classification," in IEEE ICIP, 2016.
[44] Y. Li and W. Gao, "DeltaVR: Achieving high-performance mobile VR dynamics through pixel reuse," in ACM/IEEE IPSN, 2019.
[45] "Google Cloud Anchors," https://developers.google.com/ar/develop/developer-guides/anchors.
[46] F. Li, C. Zhao, G. Ding, J. Gong, C. Liu, and F. Zhao, "A reliable and accurate indoor localization method using phone inertial sensors," in ACM UbiComp, 2012.
[47] Y. Lin, T. Liu, and C. Fuh, "Local ensemble kernel learning for object category recognition," in IEEE CVPR, 2007.
[48] M. M. Derakhshani, S. Masoudnia, A. H. Shaker, O. Mersa, M. A. Sadeghi, M. Rastegari, and B. N. Araabi, "Assisted excitation of activations: A learning technique to improve object detectors," in IEEE CVPR, 2019.
[49] A. Kanezaki, Y. Matsushita, and Y. Nishida, "RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints," in IEEE CVPR, 2018.
[50] "TensorFlow Lite," https://www.tensorflow.org/lite.
[51] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[52] J. Chen, J. Chen, H. Chao, and M. Yang, "Image blind denoising with generative adversarial network based noise modeling," in IEEE CVPR, 2018.
[53] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE ICCV, 2017.
[54] A. Bulat, J. Yang, and G. Tzimiropoulos, "To learn image super-resolution, use a GAN to learn how to do image degradation first," in Springer ECCV, 2018.
[55] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool, "DSLR-quality photos on mobile devices with deep convolutional networks," in IEEE ICCV, 2017.
[56] A. Mathur, A. Isopoussu, F. Kawsar, N. Berthouze, and N. D. Lane, "Mic2Mic: Using cycle-consistent generative adversarial networks to overcome microphone variability in speech systems," in ACM/IEEE IPSN, 2019.