

Providing Guidance for Maintenance Operations Using Automatic Markerless Augmented Reality System

Hugo Álvarez∗ Iker Aguinaga† Diego Borro‡

CEIT and Tecnun (University of Navarra), Spain

ABSTRACT

This paper proposes a new real-time Augmented Reality based tool to help in disassembly for maintenance operations. This tool provides workers with augmented instructions to perform maintenance tasks more efficiently. Our prototype is a complete framework characterized by its capability to automatically generate all the necessary data from input based on untextured 3D triangle meshes, without requiring additional user intervention. An automatic offline stage extracts the basic geometric features. These are used during the online stage to compute the camera pose from a monocular image. Thus, we can handle the usual textureless 3D models used in industrial applications. A self-supplied and robust markerless tracking system that combines an edge tracker, a point based tracker and a 3D particle filter has also been designed to continuously update the camera pose. Our framework incorporates an automatic path-planning module. During the offline stage, the assembly/disassembly sequence is automatically deduced from the 3D model geometry. This information is used to generate the disassembly instructions for workers.

Index Terms: I.2.10 [Computing Methodologies]: Artificial Intelligence—Vision and Scene Understanding; I.4.8 [Computing Methodologies]: Image Processing and Computer Vision—Scene Analysis; J.6 [Computer Applications]: Computer-Aided Engineering—Computer-aided manufacturing (CAM)

1 INTRODUCTION

Augmented Reality (AR) is a set of technologies that enrich the way in which users experience the real world by embedding virtual objects in it that coexist and interact with real objects. This way, the user can be exposed to different environments and sensations in a safe and more realistic way. AR has been applied successfully to many areas, such as medicine [7], maintenance and repair [9], cultural heritage [8] and education [17].

To embed virtual objects in the real world, the camera pose must be known. The techniques that achieve this goal can be classified into two groups: marker tracking [5] and markerless tracking [9]. While the first group requires environment adaptation, since some patterns are added to the scene, the second option uses natural features that are already in the scene. However, markerless tracking generally requires user intervention to carry out a tedious training process.

In this paper, we describe a real-time automatic markerless AR disassembler for maintenance and repair operations. The system generates instructions indicating what the next step is and how to proceed. The instructions are represented graphically by superimposing a virtual representation of the next step on top of

∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]

the real view of the system being repaired. Thus, the next part to disassemble is virtually coloured in red and translated along the extraction path (Figure 8). In addition, a virtual arrow emphasizes the extraction direction. The online assistance provided can, therefore, reduce the time spent by technicians searching for information in paper based documentation, and thus, it enables faster and more reliable operation.

The main challenge addressed by this work is the automatic (without user intervention) generation of the suitable information to provide user feedback. The system receives as input a single untextured 3D triangle mesh for each component. The assembly/disassembly sequence is computed automatically from this input, finding collision-free extraction trajectories and the precedence relationship of the disassembly of the different parts composing the target system [2]. Additionally, some geometric features, such as edges and junctions, are also automatically extracted to identify and track the different components during their manipulation [3].

The rest of the paper is structured as follows. Section 2 presents the state of the art on the use of AR for assembly/disassembly tasks. Section 3 describes in detail our AR disassembler. Section 4 presents the results and tests performed on our prototype system. Finally, Section 5 presents the conclusions reached in this work and future research lines.

2 RELATED WORK

Several works [30, 38] demonstrate the benefits of AR based approaches applied to the assistance in maintenance and repair operations in terms of operation efficiency. Compared to Virtual Reality (VR), AR offers a more realistic experience because the user interacts with real objects, while VR manipulates virtual objects and is limited by the lack of suitable sensory feedback. Further, VR is oriented to training rather than guidance, improving the skills of users through multiple simulations, while AR can be used for both training and guidance. Because of that, many authors have addressed the problem of building an AR tool that provides guidance for maintenance operations. The main characteristics of some of these solutions are detailed below.

[45] describes an AR application for assembly guidance using a virtual interactive tool. The user controls different instruction messages with a pen and the assembly steps are displayed in the image. However, instructions are limited to 2D photographic images on the camera image border. The system does not perform 3D model tracking to offer 3D augmentation data. Similarly, [4] describes an AR maintenance tool based on an untracked system.

Other approaches [5, 10, 13, 24, 34, 41, 46] rely on marker based tracking systems to compute the camera position and orientation. Thus, instructions are registered in the real world with a high degree of realism. All these systems require the physical modification of the environment by adding patterns that are readily detectable in the image. These patterns can simply be a piece of paper [5, 10, 24, 41, 46], or more sophisticated infrared markers [13, 34]. They achieve high accuracy with low computational cost. However, the drawbacks of this kind of system are well known. They require environment adaptation, which is not always possible, and marker


and 3D model coordinate systems must be properly calibrated by the user. In addition, marker tracking systems can fail when the markers are partially occluded. This is a likely scenario in maintenance operations, since occlusions with the hands of the worker and tools are common. This problem is minimized by some authors using a multi camera system [13, 34] or a multi marker system [18].

Other alternatives use natural features to retrieve the 6 DOF of the camera. During an offline process, some views of the scenario are taken. These keyframes are used as input for a training process, where 2D-3D correspondences are established by back-projecting detected 2D image features into the known 3D model. During the online phase, 2D features are extracted from the camera image and matched to those that were indexed in the offline phase, thus obtaining the camera pose. Using this procedure, [9] describes an application for airplane maintenance, while [27] details a prototype for car repair guidance. These solutions do not modify the environment, but they need some user intervention to generate the training datasets. Furthermore, they are optimized for textured scenes, which do not always exist in industrial environments.

Some interesting solutions do not need the 3D model, but require high interactivity with users [31]. Some virtual annotations (disassembly instructions) are added manually to the scene by clicking on the image position where they should appear. This way, the system estimates the 3D location of the notes automatically and renders them correctly registered to the environment using SLAM techniques.

Compared to all these works, our approach proposes a more robust framework. It is able to compute automatically the assembly/disassembly sequence, which has not been addressed by any of the papers cited above, since this information is usually considered an input parameter given by the user. Additionally, our markerless tracking is based on untextured 3D model geometry, and all the tracking data is extracted automatically, without user intervention. This lets us overcome the problems outlined earlier: environment adaptation and textureless scenes.

3 AUTOMATIC MARKERLESS AR DISASSEMBLER

Figure 1 shows all the steps of the proposed framework. The automatic offline phase receives as input a 3D model of the system to maintain. This model is composed of several parts, each one with its own 3D triangle mesh and located in its assembly position (Figure 1). With this minimal information, the disassembly-planning module is able to extract the precedence order in which the parts should be placed or removed (Section 3.1). Furthermore, some geometric features, such as edges and junctions, are extracted to address the tracking problem (Section 3.2.1). These geometric features let us perform the 3D model recognition (Section 3.2.2) and tracking (Section 3.2.3) during the online phase. Within this phase, once the camera's 6 DOF have been computed, the system offers an augmented instruction, indicating how to assemble/disassemble the next part. Then, the users perform the corresponding operation and notify the system that they want to continue with the next step. Currently this notification is implemented as a simple keystroke, but it could be implemented with a more sophisticated voice command. Each step of this process is explained in depth in the next sections.

3.1 Disassembly Planning

The assembly or disassembly procedure for a product can be described by two main components: a precedence graph and the disassembly paths for each component [22]. The precedence graph describes the order of the operations as a graph. In this graph each node represents a component and it is connected to a set of other elements whose removal must precede the disassembly of the

Figure 1: Automatic AR Disassembler overview.

element represented by it. The second component, the disassembly path, represents the motion that is required to extract the component from the assembly.

The initial works on the generation of assembly and disassembly sequences started at the end of the 80's and early 90's [15]. The works on disassembly planning can be classified in two main groups (component based and product based planners) according to the input information used. In component based planning, this information is based on the description of the components of the product by means of their geometry, behaviour, type, etc. The algorithm is usually divided into two sub-problems: the localization of a direction which allows the local translation of the component, followed by a second step that validates this extraction direction by checking if the generated path is collision free. Some of the first works were based on the recursive partitioning of the assembly into subassemblies [42]. The second class are product based planners, which use more abstract input describing the product, such as the precedence relationships of the disassembly process. Their objective is the generation of optimal sequences. For this purpose these approaches use optimization algorithms such as Genetic Algorithms [26].

Due to the objectives of this work, we use a component based planner [1], whose outcome is the precedence graph and disassembly path for a system. Once this information is computed, the assembly/disassembly sequences can be obtained by traversing the graph from a target component and removing each node connected to it.

3.1.1 Model Format

Input 3D models are composed of a set of parts that must be disassembled. The 3D model of Figure 1, for example, is composed of 7 parts: a box, a lid, four small screws, and a cord connector similar to a big screw.


No texture information is required, as the input is limited to the geometry of the 3D model. Each part of the model is defined as a simple 3D triangle mesh, and all of them must be provided with respect to the same origin and located in their initial configuration, i.e., as if the model were assembled. It is noteworthy that although CAD software usually allows explicitly defining some relationships between the geometric positions of the different components of a system, these are not required by our system.
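For illustration, the input could be represented with a minimal structure like the Python sketch below. The `Part` and `Assembly` names are hypothetical and not taken from the paper; the only requirement reflected here is that every part is an untextured triangle mesh expressed in the common assembled frame.

```python
# Hypothetical representation of the input described above: each part is an
# untextured triangle mesh given in the common (assembled) coordinate frame.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Part:
    name: str
    vertices: np.ndarray   # (V, 3) float, expressed in the assembly frame
    triangles: np.ndarray  # (T, 3) int, indices into `vertices`

@dataclass
class Assembly:
    parts: List[Part] = field(default_factory=list)  # all meshes share one origin

def load_assembly(meshes: dict) -> Assembly:
    """Build an Assembly from {name: (vertices, triangles)} pairs."""
    return Assembly([Part(n, np.asarray(v, float), np.asarray(t, int))
                     for n, (v, t) in meshes.items()])
```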

3.1.2 Precedence Graph and Disassembly Path

To obtain the precedence graph and disassembly paths, we proceed heuristically to select different components, and then we try to generate a disassembly path for them. For most components in industrial settings, the extraction path can be represented by a single straight line. We use the contact information of the different components in order to find the set of free local extraction directions. Afterwards, we test the different directions for collisions in the extraction. Unfortunately, this procedure fails when the extraction path is more complex. In this case, we use a more robust path planning algorithm based on Rapidly-exploring Random Trees (RRT) [2]. This approach is more flexible but also more expensive computationally (see Section 4).

If an extraction path is found for a component, we can find which components need to be previously removed by testing the removal path against the set of all the components in their initial configuration. Any component that collides with positions in the extraction path needs to be removed prior to it. This way, we can build the complete precedence graph in a trial and error process.
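The following sketch illustrates this trial-and-error construction. The callables `find_free_direction` and `collides_along` are hypothetical stand-ins for the contact-based direction search and the swept collision test (or the RRT fallback) described above; they are not part of the paper.

```python
# Sketch of the trial-and-error precedence-graph construction described above.
def build_precedence_graph(assembly, find_free_direction, collides_along):
    precedence = {}          # part name -> set of parts that must be removed first
    paths = {}               # part name -> straight-line extraction direction
    for part in assembly.parts:
        others = [p for p in assembly.parts if p is not part]
        direction = find_free_direction(part, others)
        if direction is None:
            continue         # the real planner would fall back to RRT here
        paths[part.name] = direction
        # Every component colliding with the extraction path blocks this part.
        precedence[part.name] = {p.name for p in others
                                 if collides_along(part, direction, p)}
    return precedence, paths

def disassembly_sequence(precedence, target):
    """Traverse the graph: everything blocking `target` is removed first."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for blocker in precedence.get(name, ()):
            visit(blocker)
        order.append(name)
    visit(target)
    return order
```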

Notice that the assembly/disassembly sequence graph is computed only once for each input 3D model. The entire process is executed in a few seconds during the offline stage. This sequence is used by the Recognition and Tracking module to determine the next geometric configuration to recognize and track (Section 3.2).

3.2 Recognition and Tracking

The markerless tracking algorithm is divided into two stages: automatic offline processing and online tracking. In the offline stage, some geometric features, such as edges and junctions, are extracted automatically from the input 3D triangle mesh. In addition, a set of synthetic views of these geometric features are indexed in a database. This stage is executed only once per model. During the online stage, this data is used to calculate the camera pose. This stage consists of two different problems: recognition and tracking. The recognition problem tries to compute the camera pose without previous knowledge. The input camera image is matched to one of the synthetic views generated during the offline stage. Thus, it is used for both initialization and tracking failure recovery. The tracking problem, in turn, assumes that the information from the previous frame is available. It performs a frame-to-frame tracking of the 3D model edges extracted during the offline stage.

An input 3D model is composed of several components. Therefore, the presence or removal of a part modifies the geometric composition of the model (Figure 2). Because of this, for each assembly/disassembly step detected by the automatic disassembly-planning module, all the processes explained above are executed. Each geometric configuration that results from a part removal is treated as a separate 3D model, for which geometric features are extracted. Notice that although multiple disassembly sequences can be considered, only the sequential order set by the disassembly-planning module is used to get the collection of possible geometric configurations. In the same way, during the online execution the next step is known, and it is clear which geometric configuration should be loaded, since the sequence has a predefined order. This method increases the memory requirements, but it offers robust recognition and tracking results.

3.2.1 Geometric Feature Extraction

Given an input 3D triangle mesh (without texture), our geometric feature extraction module detects automatically (without user intervention) 3D sharp edges and junctions. While the 3D sharp edge extraction is inspired by [16, 44], as far as we know, the extraction of 3D junctions has not previously been addressed.

Figure 2: Geometric features for different geometric configurations.

3D sharp edges are defined as edges of neighbouring facets whose normals form an angle larger than a predefined threshold (40 degrees in our experiments). In turn, a junction is defined as a point where several sharp edges meet. This way, if two sharp edges share one vertex, and the angle that they form is close to 90 degrees, then a 3D junction is built. These types of junctions are known as L junctions.
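A minimal sketch of this extraction is given below, assuming the mesh is provided as numpy vertex and triangle arrays. The 40-degree dihedral threshold comes from the text; the 20-degree tolerance around 90 degrees for L junctions is an assumption made here for illustration.

```python
import numpy as np

def sharp_edges_and_junctions(vertices, triangles, dihedral_deg=40.0,
                              junction_tol_deg=20.0):
    """Detect 3D sharp edges (normals of neighbouring facets differ by more
    than dihedral_deg) and L junctions (two sharp edges sharing a vertex at
    roughly 90 degrees). Tolerance values are illustrative."""
    # Per-face normals.
    v0, v1, v2 = (vertices[triangles[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)
    n /= np.linalg.norm(n, axis=1, keepdims=True)

    # Map each undirected edge to the faces that use it.
    edge_faces = {}
    for f, tri in enumerate(triangles):
        for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
            edge_faces.setdefault((min(a, b), max(a, b)), []).append(f)

    sharp = []
    for (a, b), faces in edge_faces.items():
        if len(faces) == 2:
            cosang = np.clip(np.dot(n[faces[0]], n[faces[1]]), -1.0, 1.0)
            if np.degrees(np.arccos(cosang)) > dihedral_deg:
                sharp.append((a, b))

    # L junctions: two sharp edges meeting at a shared vertex near 90 degrees.
    incident = {}
    for a, b in sharp:
        incident.setdefault(a, []).append(b)
        incident.setdefault(b, []).append(a)
    junctions = []
    for v, nbrs in incident.items():
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                d1 = vertices[nbrs[i]] - vertices[v]
                d2 = vertices[nbrs[j]] - vertices[v]
                c = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
                ang = np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
                if abs(ang - 90.0) < junction_tol_deg:
                    junctions.append((v, nbrs[i], nbrs[j]))
    return sharp, junctions
```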

However, this technique also considers sharp edges and junctions that belong to the inner boundary of the model (Figure 3). These types of edges and junctions are always hidden by other parts of the model, and do not provide relevant information. Similarly, undesired sharp edges and junctions can be detected due to bad tessellation. Thus, a refinement step that uses the OpenGL pipeline has been developed to discard these outlier edges and junctions. This step is similar to that explained in [28], but rather than using a set of 2D views to build 3D edges, we proceed in reverse order, validating the quality of an initial estimation of 3D edges through a set of 2D views. Therefore, the model is rendered from different points of view, following a sphere path. Good results have been obtained with steps of altitude and azimuth of 30 degrees, which gives a total of 144 views. For each view, a 2D edge map is built, applying the Sobel operator to the depth buffer image that is generated by OpenGL when the 3D model triangle mesh is rendered, similar to [43]. This 2D edge map stores the projection of strong edges and junctions that are visible from the current point of view. On the other hand, visible 3D edges and junctions are extracted for the current view from the collection of 3D sharp edges and junctions using the OpenGL occlusion query. In a case where a 3D sharp edge is visible and its projection appears in the 2D edge map, it receives a vote. This vote scheme is analogous for junctions and repeated for all views. Finally, only the 3D sharp edges and junctions with a minimum number of votes are retained as inliers, since they are strong 3D edges and junctions that belong to the outer boundary. We have noticed that this refinement procedure is a necessary step to guarantee the robustness of the recognition and tracking procedures.
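The voting loop could look like the sketch below. The helpers `render_depth`, `visible_edges` and `project` are hypothetical stand-ins for the OpenGL depth rendering, occlusion queries and camera projection used in the paper, and the vote threshold is an assumption; only the 30-degree view sampling and the Sobel-on-depth edge map follow the text.

```python
import numpy as np
import cv2

def refine_by_voting(candidate_edges, render_depth, visible_edges, project,
                     min_votes=30, step_deg=30):
    """Keep only candidate 3D edges (hashable ids, e.g. vertex-index pairs)
    whose projection lands on a depth-buffer edge in enough synthetic views."""
    votes = {e: 0 for e in candidate_edges}
    for alt in range(-90, 90, step_deg):
        for az in range(0, 360, step_deg):
            depth = render_depth(alt, az)                  # float32 depth image
            # 2D edge map from the depth buffer (Sobel magnitude threshold).
            gx = cv2.Sobel(depth, cv2.CV_32F, 1, 0)
            gy = cv2.Sobel(depth, cv2.CV_32F, 0, 1)
            edge_map = np.hypot(gx, gy) > 1e-3
            for e in visible_edges(candidate_edges, alt, az):
                u, v = project(e, alt, az)                 # pixel of edge midpoint
                if 0 <= int(v) < edge_map.shape[0] and \
                        0 <= int(u) < edge_map.shape[1] and edge_map[int(v), int(u)]:
                    votes[e] += 1
    return [e for e, n in votes.items() if n >= min_votes]
```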

Figure 3: 3D sharp edges before and after refinement.


3.2.2 Recognition

3D model recognition in a monocular image consists of recovering the camera pose that defines the viewpoint from which the model is shown in the image. This task is made more difficult by the lack of knowledge about the previous frames. Several authors use an offline learning stage to solve this problem [9, 14, 27], where a set of keyframes are obtained to build a database of 2D-3D feature visual cues. Nevertheless, they require user intervention to get the training dataset, and some of them do not work well in textureless scenes. Because of this, we have developed a recognition method that automatically builds the training dataset, rendering the virtual 3D model from multiple points of view. It uses geometric properties of an input 3D model to get positive correspondences during the recognition process, similar to [20, 35, 36], which allows handling untextured environments. Compared to [20], both approaches use multiple camera pose hypotheses and a voting scheme based on geometric properties to select the correct pose, but our recognition method does not require the attachment of an inclination sensor to the camera, which in [20] simplifies the problem to 4 unknowns (camera azimuth and position) but reduces its application domain.

In the case of our recognition module, the only input parameters used are the intrinsic camera matrix and the geometric features extracted in the previous section. These 3D geometric features are matched to 2D image features, obtaining a set of potential camera poses, which are weighted using a contour similarity measure. Finally, the camera pose with the highest score is returned. A deeper explanation of this method can be found in [3], but we give a brief explanation here for the completeness of the paper.

During the automatic offline learning stage, several 2D synthetic views of these 3D features are generated by moving a virtual camera along a sphere. Thus, rotation ranges and steps for each axis are parameters that must be set. They can be set by the user or inferred from the disassembly-planning module. Since the next part of the assembly/disassembly is known, we can train only the views that put this component in front of the camera, ruling out other views. This is not a strict requirement, since the user will usually try to focus the camera on this part as an instinctive behaviour. Moreover, this action reduces the memory requirements and the computational cost, since there will be fewer keyframes to process. Assuming that the next part to process is put in front of the camera (with all the model completely visible and covering the entire image), good results have been obtained processing only those views that are in the [−45, 45] range for all axes. Additionally, angle steps that vary from 5 to 8 degrees have been used, which gives a total of 2000-5000 keyframes, depending on the complexity of the model and the maximum storage capacity.
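A naive way to enumerate these training views is sketched below, using independent X/Y/Z rotation steps; the exact sampling scheme is not specified here, so this only illustrates the order of magnitude.

```python
import itertools
import numpy as np

def keyframe_rotations(rng_deg=45, step_deg=6):
    """Enumerate virtual camera rotations for the training keyframes:
    all combinations of X/Y/Z rotations in [-rng_deg, +rng_deg]."""
    angles = np.arange(-rng_deg, rng_deg + 1, step_deg)
    for rx, ry, rz in itertools.product(angles, repeat=3):
        yield np.radians([rx, ry, rz])

# With a 6-degree step there are 16 angles per axis, i.e. 16**3 = 4096
# candidate views, in the same range as the 2000-5000 keyframes reported.
print(sum(1 for _ in keyframe_rotations()))
```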

Then, each virtual keyframe is processed individually. For each virtual keyframe the projection of the visible 3D sharp edges and junctions is stored. Notice that the visibility test is determined by the OpenGL occlusion query. Furthermore, each keyframe stores a set of junction bases. A junction basis is defined as a descriptor that retains the relative orientation of two 2D junctions (Figure 4). Given the 2D junctions i and j, the junction basis is computed as:

junction_basis(i, j) = (α_i, β_i, α_j, β_j, θ_ij, d_ij)    (1)

where α indicates the minimum angle in the clockwise direction between the branches of a junction, β reflects the smallest clockwise angle between the branches of a junction and the x axis, θ_ij is the angle between the v_ij vector and the x axis, d_ij = ||v_ij||_2, and the v_ij vector is defined by the centres of junctions i and j.

Once all virtual keyframes have been computed, all junction bases are indexed in a global hash table. Each record of the table points to the corresponding keyframe and junction basis, and it allows computing fast correspondences for an input junction basis.
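For illustration, Eq. (1) and the hash-table indexing could be implemented along the following lines. The angle conventions, quantization bin sizes and data layout are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def junction_basis(ji, jj):
    """Illustrative computation of Eq. (1). Each junction is a tuple
    (center_xy, branch1_dir, branch2_dir) with 2D branch direction vectors."""
    def alpha_beta(j):
        _, b1, b2 = j
        a1, a2 = (np.arctan2(-b[1], b[0]) % (2 * np.pi) for b in (b1, b2))
        alpha = (a2 - a1) % (2 * np.pi)
        alpha = min(alpha, 2 * np.pi - alpha)   # smaller angle between branches
        beta = min(a1, a2)                      # smaller branch angle w.r.t. x axis
        return alpha, beta
    ai, bi = alpha_beta(ji)
    aj, bj = alpha_beta(jj)
    v = np.asarray(jj[0], float) - np.asarray(ji[0], float)
    theta = np.arctan2(-v[1], v[0]) % (2 * np.pi)   # angle of v_ij w.r.t. x axis
    d = float(np.linalg.norm(v))                    # d_ij = ||v_ij||_2
    return (ai, bi, aj, bj, theta, d)

def hash_key(basis, angle_bin=np.radians(10), dist_bin=10.0):
    """Quantize a junction basis so similar bases share a bucket (bin sizes
    are illustrative)."""
    ai, bi, aj, bj, theta, d = basis
    return (int(ai / angle_bin), int(bi / angle_bin), int(aj / angle_bin),
            int(bj / angle_bin), int(theta / angle_bin), int(d / dist_bin))

def index_keyframes(keyframe_junction_pairs):
    """Offline indexing: hash table mapping key -> [(keyframe_id, basis), ...]."""
    table = {}
    for kf_id, (ji, jj) in keyframe_junction_pairs:
        b = junction_basis(ji, jj)
        table.setdefault(hash_key(b), []).append((kf_id, b))
    return table
```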

During the online stage, the junctions of the input camera image are extracted using the JUDOCA algorithm [21]. This algorithm

Figure 4: Junction basis definition.

considers each image pixel as a junction center candidate, and tries to identify the image edges that emanate from it. In a case where two image edges are considered as possible branches, a 2D L junction is returned. Next, with each pair of detected junctions a junction basis is created, and it is matched to the offline keyframes and junction bases indicated by the global hash table (Figure 5). As a result of this hashing step, each keyframe will have a set of possible junction basis correspondences. However, these correspondences could refer to different scales and positions of the same keyframe. Therefore, the correspondences of each keyframe are clustered, grouping those matches that have a similar scale proportion and point to the same image position. For each junction basis, the vector that points to the gravity center in the offline stage is applied to the current junction basis location. Thus, we get the image positions that junction bases point to for clustering. The scale proportion, on the other hand, is computed as:

s(junction_basis_i, junction_basis_j) = d_i / d_j    (2)

where d_i is the distance between the junction centres of junction_basis_i and d_j is the distance between the junction centres of junction_basis_j.

Afterwards, for each cluster the 2D affine transformation that maps the offline 2D junction centres to the current 2D junction centres is computed. For each cluster its matching score is obtained using the similarity measure described in [37]. The 2D edges extracted in the offline stage are transformed to the current camera image using the 2D affine transformation (Match Score step of Figure 5). Thus, the score is the sum of the dot products between offline edge gradients and current edge gradients at the corresponding image locations. Only the cluster with the highest score, provided it is greater than a predefined threshold (∼0.8 for our experiments), is returned. Finally, using the best 2D affine transformation, the camera pose is extrapolated. Since the 2D location of the edges in the current image is known, and the corresponding 3D values have been stored during the offline stage, this PnP problem can be efficiently solved using the non-iterative procedure developed by [23]. This method writes the coordinates of the n 3D points as a weighted sum of four virtual control points, simplifying the problem to the estimation of the coordinates of these control points in the camera reference frame. Additionally, the resulting camera pose can be refined applying the technique proposed by [39]. This technique uses an initial camera pose to match some 3D edge samples with those 2D image edges that are in the local neighbourhood of their projection. The final camera pose is based on the non-linear optimisation of the edge reprojection error.
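As a sketch of this last step, the pose could be recovered from the resulting 2D-3D correspondences with an off-the-shelf EPnP implementation, here OpenCV's, followed by a generic non-linear refinement; this stands in for the EPnP solver of [23] and the edge-based refinement of [39] and is not the authors' code.

```python
import numpy as np
import cv2

def pose_from_correspondences(pts3d, pts2d, K):
    """Recover the camera pose from n >= 4 2D-3D correspondences with EPnP,
    then refine it with a Levenberg-Marquardt reprojection-error minimisation.
    pts3d: (n, 3) model points, pts2d: (n, 2) image points, K: 3x3 intrinsics."""
    obj = np.asarray(pts3d, np.float64)
    img = np.asarray(pts2d, np.float64)
    K = np.asarray(K, np.float64)
    dist = np.zeros(4)                      # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, dist, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    rvec, tvec = cv2.solvePnPRefineLM(obj, img, K, dist, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)              # rotation matrix + translation
    return R, tvec
```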

3.2.3 Tracking

Given the intrinsic camera parameters and the data generated in the previous frame, the frame-to-frame tracking implemented for this work computes the extrinsic camera parameters, based on temporal coherence and the combination of multiple visual cues. See [11] for a detailed explanation about tracking methods that use different visual cues.


Figure 5: 3D Recognition pipeline during the online execution.

Edge Tracker

The 3D edges extracted in the offline stage are used to refine the camera pose of the current frame, similar to [39, 43]. Visible 3D edges are detected and projected into the current camera image using the OpenGL occlusion query and the previous camera pose. Then, for each projected edge a set of 2D edge samples are generated. Notice that the 3D coordinates of these samples are known. A local search along the edge normal is done to establish the presence of these samples in the current camera image. If an edge pixel is detected in the vicinity of an edge sample, then a 2D-3D correspondence is found. This procedure can provide multiple match hypotheses for each edge sample, which are handled with the robust Tukey estimator. A non-linear optimisation of the Tukey reprojection error is applied to get the final camera pose of the current frame.
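A minimal sketch of the per-sample search along the edge normal is shown below, assuming a binary edge map (e.g., from a Canny detector) and that the projection of the visible 3D edges has already been computed; the search length is an assumption.

```python
import numpy as np

def search_along_normal(edge_img, sample_px, normal, search_len=15):
    """1D search along the edge normal for the nearest image edge pixel.
    Returns the matched pixel or None. `edge_img` is a binary edge map."""
    n = np.asarray(normal, float)
    n /= np.linalg.norm(n)
    p = np.asarray(sample_px, float)
    h, w = edge_img.shape
    for step in range(search_len + 1):
        for s in ((0,) if step == 0 else (step, -step)):   # expand outwards
            q = np.rint(p + s * n).astype(int)
            if 0 <= q[1] < h and 0 <= q[0] < w and edge_img[q[1], q[0]]:
                return q
    return None

def edge_correspondences(edge_img, samples):
    """Collect 2D-3D matches for the robust (Tukey) pose optimisation.
    `samples` is a list of (point3d, projected_px, projected_normal)."""
    matches = []
    for X, px, nrm in samples:
        q = search_along_normal(edge_img, px, nrm)
        if q is not None:
            matches.append((X, q))
    return matches
```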

Feature Tracker

The recursive edge tracking presented above is fast and precise, but it does not support fast movements. To cope with abrupt camera motions, several authors [32, 39] have integrated an accurate 3D edge tracker with a robust feature tracker. It is well known that point based trackers can deal with rapid motions, since features are relatively easy to localize and match between frames. Therefore, we apply the best properties of the point based trackers presented in [6, 27]. Optical flow estimates the new feature locations, while the SIFT descriptor is used to get a more accurate position. The tracker can be divided into two main steps: 3D point generation and 3D point tracking. The 3D point generation step extracts the 2D features from the previous frame using the FAST operator [32]. Then, assuming that the pose of the previous frame is known, features that lie on the surface of the model are back-projected using barycentric coordinates. Although many features can appear on the surface of the model, we only retain a small subset of them due to real-time requirements (30 features in our experiments). This way, the features with the strongest response are selected first. Moreover, for each selected feature the simplified version of the SIFT descriptor [33] is stored. Compared to the original SIFT descriptor, this version uses a fixed size patch (25 pixels for our experiments) and only the strongest orientation is returned. These modifications, combined with parallel techniques, offer a reduction in computational cost at the expense of robustness. Nonetheless, the resulting robustness is enough for our tracking purposes, and the computational cost is decreased considerably, satisfying real-time requirements (see Section 4). Additionally, the 3D point generation step is executed in every frame, continuously refreshing the 3D point cloud to deal with appearing and disappearing points, due in general to rotations. During the 3D point tracking step, features selected by the 3D point generation step are tracked with KLT [25]. This provides an estimation of the current 2D position of these 3D points. To get the correct new 2D location, the current image is searched for a feature with a similar SIFT descriptor. Taking into account the temporal coherence between consecutive frames, this search is limited to the vicinity of each feature (a search window of 75 pixels for our experiments), while the matching of two SIFT descriptors follows the simple rule mentioned in [6]: two descriptors are matched if the Euclidean distance of the second nearest descriptor is significantly greater than the distance to the nearest descriptor. If no match is found, this 2D-3D correspondence is discarded. Finally, the new camera pose is updated with the remaining 2D-3D correspondences. Using the previous camera pose as an initial guess, a non-linear optimization of the point reprojection error is performed. More precisely, the pseudo-Huber robust cost function [27] is minimized to reduce the influence of outlier correspondences.
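Two pieces of this tracker are easy to illustrate: the KLT prediction and the nearest/second-nearest descriptor rule. The sketch below uses OpenCV's pyramidal Lucas-Kanade as a stand-in for the paper's KLT step, and the 0.8 ratio is an assumed value, not one reported by the authors.

```python
import numpy as np
import cv2

def track_points(prev_img, cur_img, prev_pts):
    """Predict the new 2D feature positions with pyramidal Lucas-Kanade [25]."""
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, np.float32(prev_pts).reshape(-1, 1, 2), None)
    return cur_pts.reshape(-1, 2), status.ravel().astype(bool)

def ratio_match(query_desc, cand_descs, ratio=0.8):
    """Nearest/second-nearest rule of [6]: accept the nearest descriptor only
    if it is significantly closer than the second nearest (ratio is assumed)."""
    d = np.linalg.norm(cand_descs - query_desc, axis=1)
    if len(d) == 0:
        return None
    if len(d) == 1:
        return 0
    order = np.argsort(d)
    return int(order[0]) if d[order[0]] < ratio * d[order[1]] else None
```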

Particle Filter

The combination of the edge tracker and point based tracker presented above offers good tracking results. However, the point based tracker is only available when a minimum set of features are detected on the surface of the model. This requirement is generally satisfied, since the placement or removal of parts favours the presence of corners. An example of this property can be observed in Figure 8, where assembled screws generate multiple salient points on the surface of the Box-1 model. Notice that this is not the case for some models (Matryoshka model in Figure 8), where not enough features are detected due to the homogeneity of the outer surface. Thus, we have incorporated a 3D particle filter that uses multiple cues to handle these cases. Particle filters are a robust method against non-static scenes in which multi-modality is likely [29]. They are divided into two stages: particle generation and particle evaluation. In the particle generation step, the previous camera pose is perturbed to generate multiple camera pose candidates (particles). Thus, the posterior density of the motion parameters is approximated by a set of particles. Then, the particle evaluation step is responsible for assigning a weight to each particle and selecting the correct one. In our experiments, particle generation uses a random walk motion model, while the particle evaluation uses the 3D junctions and edges extracted in the offline stage and the 3D points reconstructed by the point based tracker. The likelihood (w) of each particle (P_i) is inversely proportional to the reprojection error of the visible 3D junction centers (X_j) and back-projected 3D features (Y_j), taking into account how many pixels of the projection of visible 3D edges coincide with an image edge:

w(P_i) = (|V_i| / |Z_i|) / ( ∑_{j=1}^{n} C(δ(P_i, X_j)) + ∑_{j=1}^{m} C(δ(P_i, Y_j)) )    (3)

where |Z_i| is the number of pixels that result from the projection of the visible 3D edges with the camera matrix P_i provided by the


particle i, |V_i| is the number of Z_i pixels whose distance to the closest image edge point is below a predefined threshold, δ defines the reprojection error, C represents the pseudo-Huber cost function (with a threshold of ∼10 pixels for our experiments), n is the number of visible 3D junction centres and m is the number of back-projected 3D features.

To calculate the reprojection error, the current 2D location of the corresponding points must be known. Thus, to get the 2D location of the 3D junction centres in the current frame, those that were visible in the previous frame are tracked using KLT. Notice that the current 2D position of the back-projected 3D features has been tracked previously by the point based tracker, and it is not calculated again by the 3D particle filter. All these 2D translation estimations are also used to set the number of particles on demand (between 200 and 1000 in our experiments). If strong 2D translations are detected, the number of particles is increased, while in the absence of movement, few particles are needed.

Efficient distance transforms are used to obtain the ratio of inlier edge pixels (|V_i|/|Z_i|), similar to [19], but executing all the calculations on the CPU. Visible 3D edges are projected, and for each 2D edge sample, the distance to the closest edge pixel is obtained with a simple look-up in the image distance transform. Thus, Z_i is the set of all 2D edge samples, while V_i is the set of 2D edge samples whose distance to the closest edge pixel is below a predefined threshold. Furthermore, we consider the orientation of edges to discard false matches [12]. Thus, the current camera edge image is split into four orientation specific edge images, and for each edge image its corresponding distance transform is computed. In this way, an input 2D edge sample is only compared to the image distance transform indexed by its orientation.
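The edge term of Eq. (3) could be computed along the following lines with OpenCV distance transforms; the Canny thresholds, the inlier distance and the exact orientation binning are assumptions made for this sketch.

```python
import numpy as np
import cv2

def oriented_distance_transforms(gray, n_bins=4):
    """Split the edge image into orientation channels (following [12]) and
    compute a distance transform for each; thresholds are illustrative."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    edges = cv2.Canny(gray, 50, 150) > 0
    ori = (np.arctan2(gy, gx) % np.pi) / np.pi * n_bins   # values in [0, n_bins)
    dts = []
    for b in range(n_bins):
        mask = edges & (ori.astype(int) == b)
        # distanceTransform measures the distance to the nearest zero pixel,
        # so edge pixels of this orientation channel are set to 0.
        dts.append(cv2.distanceTransform((~mask).astype(np.uint8),
                                         cv2.DIST_L2, 3))
    return dts

def inlier_edge_ratio(dts, samples, n_bins=4, max_dist=4.0):
    """|V_i| / |Z_i| of Eq. (3): fraction of projected edge samples that lie
    close to an image edge of compatible orientation."""
    inliers = 0
    for (u, v), angle in samples:     # projected edge sample + its 2D orientation
        b = int((angle % np.pi) / np.pi * n_bins) % n_bins
        if dts[b][int(v), int(u)] < max_dist:
            inliers += 1
    return inliers / max(len(samples), 1)
```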

Further, to reduce the computational cost, visible 3D edges and junctions are pre-computed for a set of points of view along a 3D sphere. This increases the memory requirements, but decreases the execution time. In addition, particle annealing is applied to get correct results with a limited number of particles [29].

Integration of Multiple Trackers

The overall markerless tracking scheme is shown in Figure 6. The point based tracker is executed first to handle strong movements. If a minimum number of 3D points are back-projected (7 for our experiments), then the 3D point tracking step is executed and the camera pose is updated. Otherwise, the 3D particle filter is called to approximate the current camera pose. Afterwards, the edge tracker is executed to refine the camera pose, avoiding the possible error caused by a non-accurate reconstruction of the 3D point cloud or a non-accurate particle likelihood calculation. The point based tracker and the 3D particle filter offer a good initial guess of the correct camera pose, while the edge tracker calculates a precise one. It should be noted that both the point based tracker and the particle filter are executed on down-sampled resolution images to reduce the computation time. The edge tracker, meanwhile, is executed at the original scale to get more accuracy (Section 4). This procedure is executed every frame while the edge tracker works correctly. To determine the success of the edge tracker, the similarity measure explained in [37] is used. Thus, visible 3D edges are projected using the camera pose refined by the edge tracker, and the score is the sum of the dot products between the gradients of these projected samples and the image edge gradients at the corresponding locations. If this score is below a predefined threshold (0.7 for our experiments), then the edge tracker fails, and the system starts the recovery mode and returns to the recognition stage (Section 3.2.2).
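The per-frame decision logic of Figure 6 reduces to a few branches. In the sketch below the tracker objects are hypothetical interfaces wrapping the components described above; only the two thresholds (7 points, 0.7 score) come from the text.

```python
MIN_POINTS = 7               # minimum back-projected 3D points (from the paper)
EDGE_SCORE_THRESHOLD = 0.7   # edge-tracker success threshold (from the paper)

def track_frame(frame, state, point_tracker, particle_filter, edge_tracker,
                recognizer):
    """One iteration of the combined tracker (Figure 6). The tracker objects
    are hypothetical interfaces, not the paper's actual classes."""
    # 1. Coarse pose: point tracker if enough surface features, else particle filter.
    if state.num_backprojected_points >= MIN_POINTS:
        pose = point_tracker.update(frame, state.pose)
    else:
        pose = particle_filter.update(frame, state.pose)

    # 2. Accurate pose: edge-based refinement at full image resolution.
    pose, score = edge_tracker.refine(frame, pose)

    # 3. Verify with the gradient similarity measure; fall back to recognition.
    if score < EDGE_SCORE_THRESHOLD:
        pose = recognizer.recognize(frame)   # recovery mode (Section 3.2.2)
    state.pose = pose
    return state
```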

4 EXPERIMENTS

In this section we present the results of the proposed AR disassembler. The hardware setup consists of an Intel Core i7-860 at 2.80 GHz with 3 GB of RAM, equipped with a Logitech QuickCam Connect webcam.

Figure 6: Markerless tracking algorithm.

Table 1 presents the execution time of the Disassembly-Planning module (including the disassembly graph as well as the disassembly paths and sequences) for the set of 3D models shown in Figure 8. It is capable of obtaining the disassembly sequence of parts with different geometric properties in a few seconds. It can also find complex disassembly paths using the RRT algorithm (Section 3.1), as is required for the Gear-box example.

Table 1: Execution time for the Disassembly-Planning module.

Model        Disassembly steps   Time (sec)
Box-1        6                   1.5
Box-2        3                   0.69
Matryoshka   5                   1.6
Gear-box     2                   101.660 (RRT)
Elephant     20                  6.09

Table 2 displays the output of the Geometric Feature Extraction module for different geometric configurations.

The response of the 3D recognition module against the geometric configurations processed in Table 2 is presented in Table 3. A set of keyframes was taken for each model using a marker tracking system [40] to obtain these values. Thus, the real camera pose is known for these keyframes, and it provides us with the mechanism to compute the accuracy of the recognition. The automatic training of the 3D recognition module uses a range of [-45, 45] degrees applied for each axis, starting from the point of view that puts the next part to disassemble in front of the camera (Section 3.2.2). All tests were executed at 640x480 resolution, finding the first camera pose with high accuracy in a few hundred milliseconds.

To prove the benefits of our tracking method and compare the different tracking techniques we use, we have executed the same video sequence (640x480 resolution) with three different tracking configurations. The first one is the tracker presented in the previous section, combining a point based tracker, a particle filter and an edge tracker (Point + PF + Edges). The second one disables the particle filter (Point + Edges), and the last one disables the point based tracker (PF + Edges). The main goal of this comparison is


Table 2: Execution time for the Geometric Feature Extraction. The number of facets measures the complexity of the extraction procedure.

Model            Facets   Edges   L Junctions   Time (sec)
Box-1            3102     118     24            1.43
Box-2 (step 3)   40       21      32            1.33
Matryoshka       208      12      24            1.47
Gear-box         59658    172     59            26.84
Elephant         9084     72      80            3.65

Table 3: Accuracy and execution time for recognition.

Model            Mean Reprojection Error (pixels)   Time (ms)
Box-1            4.26                               99.66
Box-2 (step 3)   4.47                               63.16
Matryoshka       4.55                               107.52
Gear-box         6.57                               155.03
Elephant         4.72                               356.05

to show the advantages of using different types of tracking methods together, obtaining a more robust tracking and satisfying the real-time requirements. The point based tracker and the particle filter perform their calculations at down-sampled resolution (320x240) to reduce the computational time, while the edge tracker is executed at the original scale (640x480) to achieve high accuracy. Additionally, the particle filter is limited to a maximum of 1000 particles to control the total runtime. Similarly to the 3D recognition experiment, we have used a marker tracking system to obtain the real camera pose of each video frame. Furthermore, we have repeated the whole process with three different 3D models (Box-1, Matryoshka and Gear-box) to simulate multiple conditions. We have chosen these three models because they cover the majority of possible scenarios. The Box-1 model represents an industrial model with some corners on its surface, the Matryoshka model represents a model with a homogeneous outer surface, without salient features, and the Gear-box represents an industrial model with complex geometry. The results of these 3D models for the video sequence (provided with the paper) are shown in Figure 7. It is noteworthy that there is an increase in measurement error because

the video sequence consists of rapid camera movements that try to break the tracking, and because of the non-perfect calibration between the marker and model reference systems. Box-1 and Gear-box are tracked successfully with the three tracking configurations, while the Point + Edges option fails for the Matryoshka example after frame 110, due to a high reprojection error. The reason for this failure is that the Matryoshka model has a homogeneous outer surface where not many features are detected. Thus, in the absence of features the point based tracker fails, and the particle filter is the more robust tracker. Nevertheless, in the Box-1 example, where many features lie on the surface of the box due to the presence of screws, the point based tracker runs faster and has a lower reprojection error than the particle filter. For the Gear-box example, although the point based tracker is slower than the particle filter, its reprojection error is lower. In addition, we have observed during the experiments that the point based tracker can track fast camera movements that the particle filter option cannot. Thus, since it is very common to detect a minimum set of features on the surface of the model, the point based tracker is the best option for most examples. A tracker that combines both of them gets the best properties of each option, running fast and accurately for models with features on their surface, and offering high robustness in the absence of features.

The execution time of the Box-1, Matryoshka and Gear-box models for the video sequence using the (Point + PF + Edges) configuration is detailed in Table 4. It displays the execution time required by each tracking step, including the Image Processing step, which handles a large camera image (640x480). Although the Point Tracker step includes the computation of the SIFT descriptors parametrized with 4x4 regions and 8 orientations (4x4x8 = 128 bins), its execution time is low because of its simplifications (Section 3.2.3). Notice that the total execution time is kept within real-time limits.

Table 4: Execution time of each tracking step of the (Point + PF + Edges) configuration for the Box-1, Matryoshka and Gear-box models.

                     Box-1           Matryoshka      Gear-box
Time (ms)            Mean    Std.    Mean    Std.    Mean    Std.
Image Processing     7.29    0.7     7.29    0.57    7.41    0.62
Point Tracker        12.14   1.49    8.8     2.4     19.72   2.85
Particle Filter      0.78    0.1     11.51   7.73    1.47    0.41
Edge Tracker         2.47    0.76    3.65    0.8     4.58    0.55
Total                22.73   2.15    31.3    7.66    33.23   3.54

In Figure 8, some snapshots of the augmented instructions that our system provides are shown for the set of 3D models presented in Table 2. This set of models covers multiple geometric configurations, different types of surfaces, as well as disassembly steps of varying complexity. Figure 8 (a)-(f) shows the good response of our system against models with specular surfaces. In addition, Figure 8 (i) displays a satisfactory output when small parts are tracked. Figure 8 (j)-(n) demonstrates the robustness of the system for scenes where multiple parts with similar geometry are put in front of the camera. In fact, this also proves that the system will not be jeopardized by the tools that the worker places in the task-space, provided that objects and tools do not have the same shape. Likewise, the system copes with the Box-2 and Matryoshka models even though their recognition and tracking is hampered by their textureless and homogeneous outer surfaces. Figure 8 (p) indicates that our tracking can handle models with high complexity, combining both rectilinear and curvilinear edges. Notice that although the extraction path of the gear is composed of several movements in multiple directions (it can be clearly appreciated in the video file


Figure 7: Execution time and reprojection error of the markerless tracking for a video sequence with fast camera movements. (a) Box-1. (b) Matryoshka. (c) Gear-box.

attached to the paper), only the direction of the last extraction step is shown in Figure 8 (p). Finally, Figure 8 (q)-(v) proves the ability of the system to disassemble models composed of many parts, as well as to track small geometric configurations.

We have also observed some limitations in the recognition module. As the recognition of the model processes the entire image, complex scenarios with cluttered backgrounds may cause the detection of some false positives. This is noticeable when the user points to a keyboard, which has numerous edges, and the presence of an object is identified due to a positive edge matching. If this happens, a simple camera shake will be enough to lose the tracking and restart the recognition. Note that the tracking module can handle cluttered backgrounds because it combines multiple visual cues (points, edges and junctions) and uses temporal coherence assumptions, which impose restrictions on the search space. In order to reduce false positives during recognition, high matching score thresholds have been set, which discard hypotheses with poor matching scores, but require the model to be almost completely visible to get a positive detection. Another option would be to impose a restriction on colour similarity to validate the detection, which should be set during the offline phase. On the other hand, the recognition module relies on the existence of edges and junctions in the model, so its application domain is reduced to polyhedral objects. However, the system is oriented to textureless models, so if the model has no appreciable texture or geometric properties, then it would be difficult to address the recognition problem. Therefore, the inclusion of a module that handles objects with texture and no geometric properties would extend the domain of our system, making it fairly generic.

5 CONCLUSIONS AND FUTURE WORK

This paper presents an automatic AR disassembler oriented to maintenance and repair operations. The novelty of our system lies in its being a complete framework that is self-supplied. All the data is extracted automatically from a collection of 3D parts that are provided as untextured 3D triangle meshes. In addition, a Disassembly-Planning module is responsible for computing the assembly/disassembly sequence, which is obtained by finding collision-free trajectories. Moreover, a 3D recognition module based on geometric properties, such as junctions and edges, retrieves the first camera pose of the input untextured 3D

model. Thus, our AR disassembler can deal with scenes that lack texture. Furthermore, a robust markerless tracking that uses multiple techniques has been developed. This module combines an edge tracker, a point based tracker and a 3D particle filter to offer robust 3D tracking under multiple undesirable conditions, such as homogeneous surfaces. The proposed AR disassembler is a complete framework that runs in real-time and aims to facilitate the work of technicians, replacing the paper-based documentation.

A visual inspection that ensures the correctness of the assembly/disassembly task and provides guidance feedback is a problem that will be addressed by the authors in the future. A GPU implementation of the 3D particle filter and a set of user experiments to validate the usability of the system will also be explored further.

ACKNOWLEDGEMENTS

The contract of Hugo Álvarez is funded by the Government of the Basque Country within the Formación de Personal Investigador program that belongs to the Education, University and Research department.

REFERENCES

[1] I. Aguinaga, D. Borro, and L. Matey. Path planning techniques for the simulation of disassembly tasks. Assembly Automation, 27(3):207–214, 2007.
[2] I. Aguinaga, D. Borro, and L. Matey. Parallel RRT-based path planning for selective disassembly planning. The International Journal of Advanced Manufacturing Technology, 36:1221–1233, 2008.
[3] H. Alvarez and D. Borro. Junction assisted 3D pose retrieval of untextured 3D models in monocular images. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[4] K. Baird and W. Barfield. Evaluating the effectiveness of augmented reality displays for a manual assembly task. Virtual Reality, 4:250–259, 1999.
[5] M. Billinghurst, M. Hakkarainen, and C. Woodward. Augmented assembly using a mobile phone. In Proceedings of the 7th International Conference on Mobile and Ubiquitous Multimedia, pages 84–87, 2008.
[6] G. Bleser, Y. Pastarmov, and D. Stricker. Real-time 3D camera tracking for industrial augmented reality applications. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, pages 47–54, 2005.


[7] U. Bockholt, A. Bisler, M. Becker, W. Muller-Wittig, and G. Voss. Augmented reality for enhancement of endoscopic interventions. In Proceedings of IEEE Virtual Reality, pages 97–101, 2003.

[8] O. Choudary, V. Charvillat, R. Grigoras, and P. Gurdjos. MARCH: Mobile augmented reality for cultural heritage. In Proceedings of the 17th ACM International Conference on Multimedia, pages 1023–1024, 2009.

[9] F. De Crescenzio, M. Fantini, F. Persiani, L. Di Stefano, P. Azzari, and S. Salti. Augmented reality for aircraft maintenance training and operations support. IEEE Computer Graphics and Applications, 31(1):96–101, 2011.

[10] I. Farkhatdinov and J.-H. Ryu. Development of educational system for automotive engineering based on augmented reality. In International Conference on Engineering Education and Research, 2009.

[11] P. Fua and V. Lepetit. Vision Based 3D Tracking and Pose Estimation for Mixed Reality, chapter 1, page 122. Idea Group Publishing, 2007.

[12] B. Heisele and C. Rocha. Local shape features for object recognition. In International Conference on Pattern Recognition, pages 1–4, 2008.

[13] S. J. Henderson and S. Feiner. Evaluating the benefits of augmented reality for task localization in maintenance of an armored personnel carrier turret. In Proceedings of the 8th IEEE International Symposium on Mixed and Augmented Reality, pages 135–144, 2009.

[14] S. Hinterstoisser, V. Lepetit, S. Ilic, P. Fua, and N. Navab. Dominant orientation templates for real-time detection of texture-less objects. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'10), pages 2257–2264, June 2010.

[15] L. Homem de Mello and A. Sanderson. Representations of mechanical assembly sequences. IEEE Transactions on Robotics and Automation, 7(2):211–227, 1991.

[16] X. Jiao and M. T. Heath. Feature detection for surface meshes. In Proceedings of the 8th International Conference on Numerical Grid Generation in Computational Field Simulation, pages 705–714, 2002.

[17] M. Juan, E. Llop, F. Abad, and J. Lluch. Learning words using augmented reality. In IEEE 10th International Conference on Advanced Learning Technologies, pages 422–426, 2010.

[18] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, page 85, 1999.

[19] G. Klein and D. Murray. Full-3D edge tracking with a particle filter. In Proceedings of the British Machine Vision Conference, volume 3, pages 1119–1128, 2006.

[20] D. Kotake, K. Satoh, S. Uchiyama, and H. Yamamoto. A fast initialization method for edge-based registration using an inclination constraint. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), pages 1–10, November 2007.

[21] R. Laganiere and R. Elias. The detection of junction features in images. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 573–576, 2004.

[22] J.-C. Latombe. Motion planning: A journey of robots, molecules, digital actors, and other artifacts. International Journal of Robotics Research, 18(11):1119–1128, 1999.

[23] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, 2009.

[24] A. Liverani, G. Amati, and G. Caligiana. A CAD-augmented reality integrated environment for assembly sequence check and interactive validation. Concurrent Engineering, 12(1):67–77, 2004.

[25] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674–679, 1981.

[26] R. M. Marian, L. H. Luong, and K. Abhary. Assembly sequence planning and optimisation using genetic algorithms, Part I: Automatic generation of feasible assembly sequences. Applied Soft Computing, 2/3F:223–253, 2003.

[27] J. Platonov, H. Heibel, P. Meier, and B. Grollmann. A mobile markerless AR system for maintenance and repair. In IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 105–108, 2006.

[28] J. Platonov and M. Langer. Automatic contour model creation out of polygonal CAD models for markerless augmented reality. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), pages 1–4, November 2007.

[29] M. Pupilli and A. Calway. Real-time camera tracking using a particle filter. In Proceedings of the British Machine Vision Conference, pages 519–528, 2005.

[30] V. Raghavan, J. Molineros, and R. Sharma. Interactive evaluation of assembly sequences using augmented reality. IEEE Transactions on Robotics and Automation, 15(3):435–449, June 1999.

[31] G. Reitmayr, E. Eade, and T. W. Drummond. Semi-automatic annotations in unknown environments. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), pages 1–4, November 2007.

[32] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Tenth IEEE International Conference on Computer Vision, volume 2, pages 1508–1515, October 2005.

[33] J. Sanchez, H. Alvarez, and D. Borro. Towards real time 3D tracking and reconstruction on a GPU using Monte Carlo simulations. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, pages 185–192, 2010.

[34] R. Schoenfelder and D. Schmalstieg. Augmented reality for industrial building acceptance. In IEEE Virtual Reality Conference, pages 83–90, 2008.

[35] A. Selinger and R. C. Nelson. A perceptual grouping hierarchy for appearance-based 3D object recognition. Computer Vision and Image Understanding, 76:83–92, 1999.

[36] A. Shahrokni, L. Vacchetti, V. Lepetit, and P. Fua. Polyhedral object detection and pose estimation for augmented reality applications. In Proceedings of the Computer Animation (CA'02), pages 65–69, 2002.

[37] C. Steger. Occlusion, clutter, and illumination invariant object recognition. In International Archives of Photogrammetry and Remote Sensing, volume XXXIV, part 3A, pages 345–350, 2002.

[38] A. Tang, C. Owen, F. Biocca, and W. Mou. Comparative effectiveness of augmented reality in object assembly. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 73–80, 2003.

[39] L. Vacchetti, V. Lepetit, and P. Fua. Combining edge and texture information for real-time accurate 3D camera tracking. In Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 48–57, 2004.

[40] D. Wagner and D. Schmalstieg. ARToolKitPlus for pose tracking on mobile devices. In Proceedings of the 12th Computer Vision Winter Workshop, pages 139–146, 2007.

[41] J. Weidenhausen, C. Knoepfle, and D. Stricker. Lessons learned on the way to industrial augmented reality applications, a retrospective on ARVIKA. Computers and Graphics, 27(6):887–891, 2003.

[42] R. H. Wilson. On Geometric Assembly Planning. PhD thesis, Stanford University, 1992.

[43] H. Wuest, F. Wientapper, and D. Stricker. Adaptable model-based tracking using analysis-by-synthesis techniques. In Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns (CAIP'07), pages 20–27, August 2007.

[44] S. Yamakawa and K. Shimada. Polygon crawling: Feature-edge extraction from a general polygonal surface for mesh generation. In Proceedings of the 14th International Meshing Roundtable, pages 257–274, 2005.

[45] M. L. Yuan, S. K. Ong, and A. Y. C. Nee. Augmented reality for assembly guidance using a virtual interactive tool. International Journal of Production Research, 46(7):1745–1767, 2008.

[46] J. Zauner, M. Haller, A. Brandl, and W. Hartman. Authoring of a mixed reality assembly instructor for hierarchical structures. In Proceedings of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 237–246, 2003.


Figure 8: Examples of augmented instructions that our system displays for a set of different 3D models (Box-1, Steps 1–6; Box-2, Steps 1–3; Matryoshka, Steps 1–5; Gear-box, Steps 1–2; Elephant, Steps 1, 4, 8, 12, 16 and 20). Instructions are shown in red, while the 3D model that is being tracked is highlighted in green.
