Automation in Construction 20 (2011) 1211–1226

An object recognition, tracking, and contextual reasoning-based video interpretation method for rapid productivity analysis of construction operations

Jie Gong a, Carlos H. Caldas b

a Dept. of Construction, Southern Illinois University Edwardsville, IL 62025, USA
b Dept. of Civil, Architectural & Environmental Engineering, University of Texas at Austin, TX 78712, USA

Article history: Accepted 9 May 2011

Keywords: Construction productivity; Computer vision; Video interpretation; Automated data collection and analysis; Process measurement

Abstract

Measuring the process of construction operations for productivity improvement remains a difficult task for most construction companies due to the manual effort required in most activity measurement methods. This paper proposes and describes the elements, processes, and algorithms that comprise a computational and intelligent construction video interpretation method. A number of vision-based construction object recognition and tracking methods were evaluated to provide guidance for algorithm selection. A prototype system was developed to integrate the proposed video analysis processes and selected computer vision algorithms. Videos of construction operations were analyzed to validate the proposed method. Compared to the traditional manual construction video analysis method, the proposed method provides a semi-automated video interpretation method. The new method enabled the interpretation of these videos into productivity information, such as working processes, cycle times, and delays, with an accuracy that was comparable to manual analysis, without the limitations of on-site human observation.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

In construction projects, daily site operations often suffer from inefficiencies caused by a variety of incidents such as interruptions and waiting [1]. Our recent analysis of craft time utilization, based on data from a long-term work sampling study, indicated that only 45.5% of craft time is devoted to value-adding activities, and this percentage shows no sign of improvement over the last forty years [2]. Systematic measurement and analysis of construction operations constitute a foundational block of productivity improvement initiatives. To ensure the efficiency of construction processes, waste, manifesting itself in the form of waiting, idle, excessive travel and transporting time, etc., has to be measured. Traditional methods, such as work sampling, time studies, activity rating, and crew balance charts, are effective means to measure the process of construction operations [3]. But the significant manual effort required in these methods often makes them cost prohibitive for most contractors [4,5].

Recently, considerable research in the construction domain has been centered on information and sensing systems for automated onsite productivity data collection. These studies fall in two broad areas: (1) automated project progress tracking for measuring construction output [6–8]; and (2) automated resource utilization tracking for measuring construction input [9–12]. These studies demonstrated that it is now technologically feasible to gather a great volume and variety of ever-changing data during construction to support project managers in decision making.
Also, these studies, by focusing on measuring either input or output quantities, have implicitly adopted the concept that construction production is a conversion process rather than a flow process. Recent studies have argued that the industry's excessive focus on the conversion model is not favorable for productivity improvement [13]. Therefore, two pertinent issues with automated productivity data collection are the potential overflow of low-level data or information that can place an extra burden on project managers [14] and the lack of focus on the measurement of work flow in construction operations [13]. These issues present two major challenges to current sensor-based productivity data collection methods. First, the methods need to be intelligent at reasoning low-level data, such as sensor readings, into information relevant to productivity, such as time utilization and work sequences. Second, the methods should be capable of measuring key parameters of operation processes, such as working sequence and cycle time. To meet these challenges, intelligent sensor-enabled productivity data collection methods that can gather, process, and reason about productivity data in dynamic construction operation processes are of great need.

In this paper, we propose and validate a computational method that can interpret construction videos into productivity information, as an important step towards developing field-deployable intelligent vision systems for autonomous productivity assessment. The following sections of this paper outline the background of this research, describe the method development, and provide testing cases to demonstrate the validity of the developed method.




    2. Research background

Among other applications, videotaping is commonly employed as a traditional productivity data collection method, and its use has been under extensive research for a long period of time [15–21]. But the potential of videotaping has not been fully utilized in these studies. This is reflected in the existing process of applying videotaping as a productivity improvement method. As shown in Fig. 1, a common process of video-based productivity data collection essentially involves four parties, including a data collector, working crew and foreman, engineers, and managers at certain levels, and four sequential sub-processes, which are preparation, video recording, review and analysis, and implementation [3,17].

Notably, the majority of data collection efforts are taken by the data collector. A data collector typically makes the following preparations prior to actual recording: (1) plan the camera position; and (2) record relevant information to aid understanding during subsequent viewing. After videos have been recorded, a data collector can choose to conduct informal reviews with the workforce to brainstorm improvement methods. Then the findings are presented to managers to seek approval for implementation. A formal review and analysis session can be held with all involved parties to seek improvement methods. In such a process, it is evident that intensive manual effort is pervasive, particularly in the tasks of informal review, presenting findings, and formal analysis and review [4,22]. The delay and expense caused by intensive manual efforts can extend beyond individual tasks across the whole spectrum of the process map, causing high data collection expenses and diluting the effectiveness of such a program. Consequently, jobsite leadership frequently considers construction cameras as just a contractual tool for site surveillance, dispute resolution, and site demonstration.

Research studies on leveraging construction cameras for automated project progress tracking [8] and automated resource utilization tracking have recently emerged [22–25]. When compared with automated project progress tracking, automated resource utilization tracking has advanced at a slower pace, largely due to the technical difficulties of monitoring dynamic objects through computing methods. In essence, automated resource utilization tracking focuses on equipment hours and labor hours, since these are the two most common types of resources measured to assess productivity. In project control systems, equipment hours and labor hours are measured from a holistic point of view: both effective hours and non-effective hours are rolled into final work hours for project controls. To go beyond this level of detail, video monitoring methods can be used to automatically measure how these work hours have been spent, whether productively or not. To this end, such monitoring relies to a great extent on advanced pattern recognition and computer vision techniques within the construction context [26].

Recently, vision-based object recognition and tracking on construction jobsites has been the focus of a number of research studies in the construction domain. Some of these studies demonstrated the feasibility of using visual recognition and tracking methods to analyze the productivity of certain types of construction operations [11,22,25,30], while others have focused on assessing the performance of visual recognition and tracking algorithms on construction sites [23,27]. In [30], we proposed a video interpretation model that provides an overall schema for combining computer vision methods with operation process models for analyzing cyclic construction operations. In this paper, we first explain how to extend the model proposed in [30] to non-cyclic construction operations, and compare the video interpretation model with the traditional process shown in Fig. 1. Then, we describe various computer vision-based object recognition and tracking methods that were either implemented or developed to support the computational tasks in the refined model. The performances of these methods are evaluated in construction environments. Finally, we integrated these algorithms into a video processing module. The module is part of a software tool, which also provides mechanisms for defining prior construction knowledge and visualizing operation processes and interpretation results. The tool was used to analyze four operations to validate the refined model as well as to demonstrate the performance of various vision recognition and tracking algorithms.

Fig. 1. A process map of video-based productivity data collection.

3. A computational representation of video-based productivity analysis

To make this paper self-contained, this section first provides a brief description of the development of the video interpretation model for cyclic construction operations [30]. Then, the extension of this model to non-cyclic construction operations is described.

To automate the workflow shown in Fig. 1, a hierarchy of computational methods, including visual recognition and tracking, model-based reasoning, and video content organization, was proposed to gradually process video data into information relevant to productivity (Fig. 2). These methods accomplish two principal tasks: to replace the manual steps in the process map with computing methods, and to define a reasoning mechanism that can be used by computers to link these steps. The following paragraphs explain how the elements and processes in Fig. 2 accomplish these two tasks.

For the first task, a variety of computing methods can be employed to automate or support the processes in video-based construction productivity analysis. In general, there are four essential steps in manual construction video analysis after a construction video is recorded. They are: (1) recognizing what objects are in the video; (2) understanding what is happening in the video; (3) summarizing what happened in the video; and (4) reviewing what happened in the video. For a computer to accomplish these tasks, it would require intelligent methods that are capable of: (1) autonomously recognizing and tracking objects in the video; (2) intelligently reasoning about what is happening in the video; (3) efficiently summarizing the information; and (4) conveniently recalling what happened in the video. To this end, three general categories of methods, including computer vision methods, computer reasoning methods, and multimedia processing methods, can be connected to automate the processes in video-based construction productivity analysis. Within each of these categories, we choose visual recognition and tracking as the computer vision method, model-based reasoning as the computer reasoning method, and video content organization and video retrieval as the multimedia processing method (Fig. 2). This is because they

have frequently been used in similar studies in the field of computer vision [39,40].
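To make the hand-offs between these three categories concrete, the following structural sketch (with our own illustrative names, not the prototype's actual code) shows tracking output feeding model-based reasoning, whose results are then indexed for retrieval:

from dataclasses import dataclass, field

@dataclass
class Observation:
    # Output of visual recognition and tracking (Element 1)
    frame: int
    object_id: str
    bbox: tuple  # (x, y, w, h) in pixels

@dataclass
class Interpretation:
    # Output of model-based reasoning (Element 2)
    states: list = field(default_factory=list)     # (frame, object_id, state)
    events: list = field(default_factory=list)     # state-transition points
    scenarios: list = field(default_factory=list)  # rule-based abnormal situations

def interpret(observations: list[Observation]) -> Interpretation:
    # The reasoning layer would classify states, detect events, and
    # match user-defined scenario rules here (sketched further below).
    return Interpretation()

def index_for_retrieval(result: Interpretation) -> dict:
    # Video content organization (Element 3): map each state, event, and
    # scenario to its moment of occurrence so the related video segment
    # can be recalled later (Element 4).
    return {"states": result.states, "events": result.events, "scenarios": result.scenarios}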

Visual recognition and tracking uses computer vision methods to recognize and track construction objects of interest (Element 1 in Fig. 2). Meanwhile, the use of these methods is not merely an application: vision-based construction object recognition and tracking should leverage domain knowledge. The benefits of using domain knowledge include, but are not limited to, reducing the number of objects to be tracked and guiding computer algorithm selection. For example, there is typically a variety of resources present in a video of construction operations, but for the purpose of productivity analysis it is often not necessary to track all of them. The Method Productivity Delay Model [28], which can be considered one type of domain knowledge, can be employed as a rational way to determine the critical resources in an operation. The concepts of the production unit and the method's leading resource in this model provide a concise way to determine the objects to be tracked, therefore avoiding unnecessary recognition and tracking effort. In addition to leveraging domain knowledge, the visual recognition and tracking step also needs a set of supporting computer vision algorithms. The need to assess the performance of computer vision recognition and tracking algorithms is echoed in a number of recent studies [23,29]. This research also investigated a set of computer vision algorithms with a focus on characterizing their performance and designing a module to support the flexible configuration of a video processing pipeline. A dedicated section in this paper describes our findings on this perspective.

Model-based reasoning can be designed as a three-step process that represents a hierarchical reasoning procedure. These steps are state classification, event detection, and scenario recognition (Element 2 in Fig. 2). State classification determines resource utilization, which concerns the time utilized in different working states by construction resources. Event detection spots state transition points, therefore determining work sequences. Scenario recognition uses a set of user-defined rules to decide if there is any abnormal scenario in an operation based on the information recovered from state classification and event detection. Abnormal production scenarios are often related to inefficiency, which is a focal point of productivity improvement.

Fig. 2. A conceptual construction video interpretation method.
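As an illustrative sketch only, assuming simple per-frame state labels and a hypothetical maximum-idle rule, the event detection and scenario recognition steps could be prototyped as follows:

FPS = 30  # assumed video frame rate

def detect_events(frame_states):
    # Event = a state-transition point: (frame index, old state, new state)
    return [(i, frame_states[i - 1], s)
            for i, s in enumerate(frame_states)
            if i > 0 and s != frame_states[i - 1]]

def recognize_scenarios(frame_states, max_idle_seconds=60):
    # Scenario rule: flag any contiguous run of "idle" longer than the limit
    scenarios, run_start = [], None
    for i, s in enumerate(frame_states + ["end"]):
        if s == "idle" and run_start is None:
            run_start = i
        elif s != "idle" and run_start is not None:
            if (i - run_start) / FPS > max_idle_seconds:
                scenarios.append(("excessive_idle", run_start, i))
            run_start = None
    return scenarios

# frame_states would come from state classification of tracked objects,
# e.g. ["load", "load", "haul", "idle", ...] for an earthmoving truck.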

Video content organization is the process of organizing extracted information into user-friendly productivity information (Element 3 in Fig. 2). Video retrieval is a process of querying video data based on discovered productivity information to support cause finding (Element 4 in Fig. 2). Both steps stem from the fact that interpretation of construction videos from the productivity analysis perspective does not end with productivity information extraction. Although the information extracted can give a concise view of what happened in the video, the search for reasons why some situations happen goes well beyond the level of information extraction. Therefore, integration of extracted information with the original video contents is necessary. In this research, we index the moments of occurrence of states, events, and scenarios in the original video for later quick retrieval of related video segments. An information table is employed as a way of visualizing operation status as well as a way of retrieving video contents. The latter is achieved by linking each of the table cells to corresponding video segments.

The second task involves the design of a structured reasoning mechanism. In other words, the elements and processes described above should utilize a set of rules, heuristics, and logic to automatically carry out information extraction tasks. Human beings usually reason within a context: they utilize their knowledge and experience as background information when they process visual contents in a video. In the traditional process shown in Fig. 1, a data collector usually uses line sketches, notes, or even mental memories as means of recording prior knowledge such as task elements, work zone layout, and task sequences. Such knowledge can be formalized into three types of contextual information: semantic context (task elements), spatial context (work zone layout), and temporal context (task sequences). For cyclic construction operations, we propose to use Cyclone's notations [31] to describe the process of construction operations. It is also possible to use notations defined by Stroboscope [32]. In addition, process charts [3] that are frequently used in construction method studies can be used to represent non-cyclic construction operations. These two methods provide concise and user-friendly means to represent a data collector's contextual knowledge regarding task elements and task sequences. Furthermore, an operation's spatial context can be established by linking its task elements to spatial regions in video images. This is essentially a process of dividing an image into various work zones through image surface marking, each of them corresponding to different task elements. Therefore, when construction objects are recognized and tracked, their locations and trajectories are used to determine the state of the operation. Also, the sequence of operations as defined in a Cyclone model or a process chart determines the scheme of state transition, constituting the temporal context for detecting any sequence violation. Finally, each of those working states (task elements) can be assigned a time constraint, which is simply the maximum amount of time allowed to be spent within that specific state during each cycle. By doing so, Cyclone-based process models and process charts can be incorporated into the overall method as knowledge representations to support the video reasoning process (Fig. 2).

4. Computer vision algorithms for construction object recognition and tracking

    4.1. Construction object recognition


The algorithms that were implemented and tested include background subtraction methods, a color-based recognition method, and a method based on a cascade of simple features [33]. A set of construction videos was processed using these algorithms to characterize their performance. These algorithms were selected because they represent three popular approaches to detecting objects in images, and they are relatively easy to use for developing object recognition models. In general, color-based and Viola-Jones-based recognition methods provide two general approaches to training models for specific construction objects using a large number of image samples. Background subtraction methods generally use an online training process, meaning they use a portion of live video data as training samples to develop background and foreground models. They can be directly applied to construction videos to isolate objects of interest from background scenes.

Fig. 3. Testing video sequences for background subtraction methods.

4.1.1. Background subtraction methods

Considering the constantly changing nature of construction jobsites, three different types of background subtraction were implemented and tested. These methods include Mixture of Gaussians [34] for outdoor scenes that contain lighting changes, repetitive motions from clutter, and long-term scene changes; a Codebook-based method [35] for scenes containing complicated moving objects, which takes advantage of adaptive filtering concepts; and a Bayesian model-based method [36] for modeling backgrounds that contain both stationary and moving background objects and undergo both gradual and sudden once-off changes.
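As a point of reference, a minimal foreground segmentation loop can be set up with OpenCV's built-in Mixture-of-Gaussians subtractor. The Codebook- and Bayesian-model-based subtractors evaluated here are not part of core OpenCV, so MOG2 below merely stands in for the general approach; the file name and parameter values are assumptions:

import cv2

cap = cv2.VideoCapture("site_video.avi")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # fg_mask is 255 for foreground, 127 for shadow, 0 for background
    fg_mask = subtractor.apply(frame)
    # Suppress shadows and small noise before any downstream analysis
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN,
                               cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
cap.release()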

Two information retrieval measurements, recall and precision, are used to quantify how well each of the algorithms described above performs in a construction environment. The evaluation is based on how well the segmented foreground matches the ground truth that is obtained through manual extraction processes. Recall and precision are defined as:

\[ \text{Recall} = \frac{\text{Number of foreground pixels correctly identified by the algorithm}}{\text{Number of foreground pixels in ground truth}} \]    (1)

\[ \text{Precision} = \frac{\text{Number of foreground pixels correctly identified by the algorithm}}{\text{Number of foreground pixels detected by the algorithm}} \]    (2)
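These two measures can be computed directly from binary masks; the helper below is an illustrative sketch, where `pred` and `truth` are hypothetical boolean arrays of the same shape (True = foreground):

import numpy as np

def recall_precision(pred: np.ndarray, truth: np.ndarray) -> tuple[float, float]:
    tp = np.logical_and(pred, truth).sum()   # correctly identified foreground pixels
    recall = tp / max(truth.sum(), 1)        # Eq. (1): fraction of ground truth recovered
    precision = tp / max(pred.sum(), 1)      # Eq. (2): fraction of detections that are correct
    return float(recall), float(precision)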

Two construction video sequences are used as testing data sets to compare the performances of these algorithms.

The video sequences show a crew of workers installing formwork and an earthmoving operation, respectively (Fig. 3). Example segmentation results from the different algorithms and the ground truth provided by the manual extraction process are shown in Fig. 4. The recall and precision of the algorithms under investigation in both cases are summarized in Tables 1 and 2, respectively.

It can be noted that all algorithms showed a very low rate of recall. At the same time, the Codebook-based algorithm and the Bayesian model-based algorithm achieved a much higher rate of precision than the Gaussian Mixture-based method did. The low rate of recall means that only small parts of foreground objects are correctly segmented out. This may be acceptable in situations where foreground objects are the objects of interest and further differentiation of foreground objects is not needed. Low precision is a more serious problem since it may cause many false positives. False positives here mean that the algorithm mistakenly classifies background objects as foreground objects. Based on the performances of the algorithms on the above videos, it is safe to conclude that the Codebook-based and the Bayesian model-based methods are more suitable than the Gaussian Mixture-based method for use in construction scenarios. Overall, background subtraction is best suited for applications that have relatively static backgrounds. Constantly changing backgrounds or scenes that are shot with moving cameras present significant challenges to foreground segmentation.

4.1.2. Color-based recognition method

In this section, we developed a safety vest-based construction worker recognition model to demonstrate the overall procedure for developing color-based object recognition models and to characterize the performance of color-based object recognition methods. Construction workers are an integral part of construction operations. When analyzing construction operations, detection and tracking of construction workers might be required in many cases. When construction workers wear safety vests, which typically are neon orange or chartreuse, the detection and tracking of these construction workers can be fairly reliable.

Table 1. The performance of background subtraction algorithms on video sequence no. 1.

            Codebook             Gaussian mixture     Bayesian background
            Recall    Precision  Recall    Precision  Recall    Precision
Frame 1     51.4%     66.2%      65%       48%        39.6%     60.6%
Frame 2     53%       72.9%      57%       52%        29%       70.7%
Frame 3     54.2%     73.1%      57.3%     50.2%      26%       76.9%
Frame 4     55%       69%        50%       48.7%      27%       69%
Frame 5     59.3%     67%        46.9%     46%        25%       68%
Frame 6     59.6%     65%        48%       42%        22%       72.5%
Frame 7     57%       68%        53.7%     27%        23.8%     67.5%
Frame 8     66%       62%        55.7%     46%        27%       73.1%
Frame 9     68%       64%        52%       44.6%      30%       71.6%
Frame 10    60%       52.2%      45.7%     35.9%      27%       55.7%
Average     58.35%    69.54%     53.13%    44.04%     27.64%    68.56%

Fig. 4. Example background segmentation results.


To develop a reliable color-based object recognition model, it is important to understand that the color of an object captured in an image varies significantly with lighting conditions, view angles, glare, and so on. Development of a color-based object recognition model therefore often requires taking a probabilistic approach. In this research, we treat the problem of developing a safety vest recognition model as the problem of developing a Mixture of Gaussian models to represent the color distributions of safety vests and background scenes. A requirement for using a probabilistic approach is having a large number of samples, so the distribution can be reasonably estimated. A sample here is a pixel that either belongs to a safety vest or the background. In this case, these samples are randomly taken from a large collection of photos. These photos were taken at different jobsites under different conditions, so they capture the variability of the color values of pixels on a safety vest, which can be attributed to changing site conditions. Within these samples, the positive samples are from photos of chartreuse safety vests, and the negative samples are from photos of backgrounds. The number of samples is on the order of millions.

There are many color spaces that can be used to represent the values of sample pixels. RGB is a very common color space for representing color values, but in many cases other color spaces, such as Lab, Luv, and HSV, are preferred over RGB for modeling the color distributions of particular objects. This is because these color spaces generally produce a compact representation of color distributions instead of the widely scattered distributions that are a common consequence of the RGB color space. For instance, the samples of safety vest pixels are plotted in different color spaces in Fig. 5. In addition, in order to reduce computational complexity, it is not uncommon that only some of the color channels, such as only R and G in the RGB space, are used. In this research, the a and b channels in the Lab color space were used to represent the color distribution of chartreuse safety vests, since they give the most compact representation.

Given the color samples of chartreuse safety vests, the Expectation-Maximization (EM) method is used to iteratively estimate the parameters of Gaussian models that represent the colors of backgrounds and safety vests. We used two Gaussian models to represent the color of backgrounds and two Gaussian models to represent the color of chartreuse safety vests. The estimates gradually converge to the true parameters. The final parameters (centers and variances) for these Gaussian mixtures are shown in Fig. 6. The blue dots in the plot represent the sample pixels from the background. The contours of these Gaussian models are also plotted over the training samples to visualize the distributions. Given a new pixel from a photo, like the one shown in Fig. 6 (red dot), the distances of this pixel to the centers of the different Gaussian models are calculated using the Mahalanobis distance (Fig. 6). The lines a, b, c, and d in the figure can be thought of as the distances to these centers. This pixel will be classified as a safety vest only if the following condition is met:

\[ a + b > G\,(c + d) \]    (3)

Table 2. The performance of background subtraction algorithms on video sequence no. 2.

            Codebook             Gaussian mixture     Bayesian background
            Recall    Precision  Recall    Precision  Recall    Precision
Frame 1     11.9%     69%        24.6%     12.75%     21.9%     76%
Frame 2     18.8%     85.4%      22.4%     14.7%      19%       83.9%
Frame 3     17.3%     82.8%      19.5%     12.7%      19%       82.3%
Frame 4     9.5%      78.9%      15%       8.9%       17.8%     84.8%
Frame 5     9.5%      85.3%      11.7%     6.7%       10%       83.4%
Frame 6     11.3%     74.8%      12.2%     6.6%       8%        91.5%
Frame 7     10%       78.6%      11.2%     6.1%       6.8%      71.7%
Frame 8     11.5%     82.6%      10.9%     5.1%       5.4%      80.8%
Frame 9     17.3%     85.7%      11.3%     5.8%       4.1%      92%
Frame 10    10.7%     77.6%      11.2%     6.4%       3%        96%
Average     12.78%    80.07%     15.00%    8.58%      11.50%    84.24%

Fig. 5. Sample safety vest pixels in different color spaces.


In formula (3), G is the gain, which can take values of 1, 2, 3, and so on. When the value of G is increased, the false positive classification rate, which is the rate of classifying a pixel from the background as a safety vest, can be reduced, but at the cost of reducing the true positive rate. The gain essentially provides a means for controlling the false positive rate. In many situations, misclassification of a negative sample or a positive sample can have very different costs associated with the resulting decisions.
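The training and classification steps can be sketched with scikit-learn's EM-based GaussianMixture. The synthetic sample arrays, the noise parameters, and the gain value below are illustrative assumptions rather than the paper's data; the two-components-per-class structure follows the text:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for (a, b) values of labeled pixels in Lab space
vest_ab = rng.normal([-20, 60], 5, size=(5000, 2))
bg_ab = rng.normal([0, 0], 20, size=(5000, 2))

vest_gmm = GaussianMixture(n_components=2, covariance_type="full").fit(vest_ab)
bg_gmm = GaussianMixture(n_components=2, covariance_type="full").fit(bg_ab)

def mahalanobis(pixel_ab, mean, cov):
    d = pixel_ab - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def is_vest_pixel(pixel_ab, gain=2.0):
    # a, b: distances to the background centers; c, d: to the vest centers
    # (assumed assignment of the four distances named in the text)
    dists_bg = [mahalanobis(pixel_ab, mu, cov)
                for mu, cov in zip(bg_gmm.means_, bg_gmm.covariances_)]
    dists_vest = [mahalanobis(pixel_ab, mu, cov)
                  for mu, cov in zip(vest_gmm.means_, vest_gmm.covariances_)]
    # Eq. (3): vest only when background distances dominate vest distances
    return sum(dists_bg) > gain * sum(dists_vest)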

The trained model was tested on video data that were not used in the training process. Manipulating the gain G produces a receiver operating characteristic (ROC) curve for the trained safety vest model when it is used to classify training and testing data, as shown in Fig. 7. The ROC curve is essentially the plot of the false positive rate vs. the true positive rate when different classification criteria are used. Depending on the targeted false positive rate, the corresponding gain can be chosen. It is important to recognize that the model developed here only concerns classification at the pixel level. Knowing only the locations of pixels does not give a holistic representation of detected objects. To realize such purposes, the classified pixels need to be grouped together according to their connectivity to form connected regions or blobs. These regions or blobs correspond to the objects of interest in the scene. Connected component analysis, a well-established procedure for connecting adjacent pixels according to specified neighboring criteria, can be used to take on this task. Each of these connected regions or blobs then represents a construction worker. Fig. 8 shows the detection of construction workers based on the safety vest color model and connected component analysis.

Fig. 6. The converged Gaussian mixture models.

Fig. 7. The receiver operating characteristic curve for the trained model.

Fig. 8. Example worker detection based on the safety vest model.

4.1.3. A cascade of simple features for construction object recognition

A cascade of simple Haar features, also called the Viola-Jones classifier (Viola and Jones 2001), is an algorithm that was originally developed for face detection. The algorithm is built on well-known statistical boosting techniques. Basically, this method partitions video images into a set of overlapping windows at different scales; then a decision is made for each window about whether it contains a target object or not. This decision is based on a set of simple features, such as edge features and line features. The features that are distinctive to target objects are identified through the training process, and these features are then placed in a cascade form to speed up detection. Currently this method is widely used to develop models for pedestrians, traffic signs, and facial recognition applications because of its high performance in terms of speed and accuracy. The method is investigated in this research for its potential for developing recognition models for rigid construction objects, such as concrete buckets and concrete delivery trucks.

Training object recognition models with the Viola-Jones classifier requires a large quantity of images. Here, a large quantity of images means thousands of object examples and tens of thousands of non-object examples. A rule of thumb is that the more images are used for training, the more generic the final model and the lower the false positive rate it can achieve. We used this method in this research to develop object recognition models for concrete buckets and hard hats.

In the development of a hard hat recognition model, 3000 instances of hard hat samples cropped from images of construction workers were used as the positive data. The negative samples are 5000 images that do not contain workers with hard hats. Similarly, the concrete bucket detection model was trained from 1260 positive site images (containing concrete buckets) and 3000 negative site images (not containing concrete buckets). The positive images containing buckets were collected from three different jobsites and Google Images. The negative images in the dataset were collected from eight different jobsites and Google Images.

The training process can take on the order of days on a P4 2.4 GHz computer, depending on the targeted number of layers, accuracy, and speed of convergence. In both cases, a 20-layer classifier was trained, which means that the resulting classifier has 20 hierarchies. Once the training is finished, detection of objects in a 720×640 color image using the final model requires only 50 ms. The resulting classifiers were tested on testing data that were randomly left out of the training data set. The recognition accuracies are 87% and 84.7% for the hard hat model and the bucket model, respectively.

4.1.4. Discussion

In summary, the following conclusions can be drawn on the investigated object recognition methods. First, among the background subtraction methods that were investigated, the Gaussian Mixture-based method suffers from a high false positive rate. This is largely due to the fact that sudden changes, which the Gaussian Mixture method is not designed for, are frequent in construction environments. The downside of background subtraction methods is that they tend to identify all foreground objects without differentiating which foreground objects are objects of interest. This may work in controlled construction scenarios where unexpected construction objects are not frequently present. Second, in construction environments, object color, if the object of interest has a distinctive color, remains the most reliable visual cue for object recognition. Third, when large numbers of photos of the object of interest are available, the object recognition method based on the cascade of simple Haar features is reliable and fast.

    4.2. Construction object tracking

In this research, we evaluated two general types of tracking methods: Filtering and Data Association, and Target Representation and Localization. Filtering and Data Association is a top-down process that deals with the dynamics of the tracked object, awareness of scene priors, and evaluations of different hypotheses. Typical example algorithms are the Kalman Filter and the Particle Filter. Target Representation and Localization is a bottom-up process that copes with changes in the appearance of the target. A typical method in this category is Mean Shift [37]. In practice, the methods mentioned above are often used in combination. The following provides a brief description of each method.

4.2.1. Kalman filter-based tracking algorithms

The basic idea of a Kalman Filter is that, under a strong but reasonable set of assumptions, it should be possible, given a history of measurements of a system, to build a model for the state of the system that maximizes the a posteriori probability of those previous measurements. There are three assumptions required in the construction of a Kalman Filter: (1) the system is linear; (2) the noise that measurements are subject to is random; and (3) this noise is Gaussian in nature. In this research, we use a linear motion system to model the movement of detected objects in the video images. In the linear motion system, the system state vector is defined as:

\[ \mathbf{x} = \left[ x \;\; y \;\; w \;\; h \;\; dx \;\; dy \right]^T \]

where:

x — x coordinate of the object center (in pixels)
y — y coordinate of the object center (in pixels)
w — width of the object (in pixels)
h — height of the object (in pixels)
dx — x velocity of the object
dy — y velocity of the object

The transition matrix is defined as:

\[ F = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \]    (4)

The measurement matrix is defined as:

\[ H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix} \]    (5)

with the measurement vector \( \left[ x \;\; y \;\; w \;\; h \right]^T \). Therefore the whole system can be modeled as:

\[ \mathbf{x}_k = F \mathbf{x}_{k-1} + \mathbf{w}_k \]    (6)

\[ \mathbf{z}_k = H \mathbf{x}_k + \mathbf{v}_k \]    (7)

where w_k is the process noise, v_k is the measurement noise, and z_k is the measurement of the current position of the tracked objects. Beyond prediction, this formulation also provides means to filter or smooth the trajectories of detected objects.
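This linear-motion model maps directly onto OpenCV's Kalman filter. In the sketch below, the noise covariances and the example detection are illustrative assumptions, while F and H follow Eqs. (4) and (5):

import cv2
import numpy as np

kf = cv2.KalmanFilter(6, 4)  # state [x, y, w, h, dx, dy], measurement [x, y, w, h]
kf.transitionMatrix = np.array(
    [[1, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 1],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1]], dtype=np.float32)             # F, Eq. (4)
kf.measurementMatrix = np.eye(4, 6, dtype=np.float32)    # H, Eq. (5)
kf.processNoiseCov = np.eye(6, dtype=np.float32) * 1e-2       # w_k covariance (assumed)
kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1   # v_k covariance (assumed)

# Per frame: predict, then correct with a detected bounding box (x, y, w, h),
# e.g. from one of the recognition methods in Section 4.1.
prediction = kf.predict()
detection = np.array([[320.0], [240.0], [40.0], [80.0]], dtype=np.float32)  # hypothetical
estimate = kf.correct(detection)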

4.2.2. Particle filter algorithms

The major limitation of the above system is that it models a single hypothesis and thus cannot represent multimodal distributions. A unimodal Gaussian model is assumed to be the underlying model of the probability distribution for that hypothesis. Therefore, it is not possible to represent multiple hypotheses simultaneously using the Kalman Filter. This can be problematic in cases such as an object moving behind an occlusion: the object can continue at a constant speed, or may stop, or reverse direction. The Kalman Filter cannot represent these multiple possibilities other than by simply broadening the uncertainty associated with the distribution of the object's location. The Particle Filter starts with a set of uniformly distributed samples to represent multiple hypotheses. Each of these samples is associated with a weight that determines the probability of this particular sample being drawn from the distribution. These weights are updated in light of whatever new information (new measurements) has become available since the previous update. Because of this bootstrap sampling approach, such a set of particles/samples can represent multimodal distributions, yielding better performance in modeling complex dynamic systems. In this research, a particle or sample is a hypothesis of the position of an object, and each of these hypotheses has a weight that represents its likelihood of being the true position. The number of particles used in this research ranges from 200 to 1000, with more particles incurring more computational expense.

On the other hand, one of the potential problems with Particle Filters is that the particles can quickly become dispersed, with most of the particles carrying negligible weights. Increasing the number of particles can alleviate the problem, but at the expense of a high computational cost. A common workaround is to use the Mean Shift method as an intermediate step to herd the propagated particles to their nearby maxima, followed by a reweighting process. This is the essential working principle of the Mean Shift and Particle Filter (MSPF) object tracker.
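A bootstrap particle filter of this kind can be sketched in a few lines. The motion noise and the appearance-based `likelihood` function below are hypothetical placeholders for the implementation described above:

import numpy as np

rng = np.random.default_rng(0)
n_particles = 500
particles = rng.uniform([0, 0], [720, 640], size=(n_particles, 2))  # uniform initialization
weights = np.full(n_particles, 1.0 / n_particles)

def likelihood(positions, frame):
    # Placeholder: score each hypothesized position against the frame,
    # e.g. by color-histogram similarity to the target model.
    return np.ones(len(positions))

def step(frame):
    global particles, weights
    # Predict: propagate each hypothesis with random motion noise
    particles += rng.normal(0.0, 5.0, size=particles.shape)
    # Update: reweight hypotheses by how well they explain the new frame
    weights *= likelihood(particles, frame)
    weights /= weights.sum()
    # Resample when weights degenerate (most particles carrying negligible weight)
    if 1.0 / np.sum(weights ** 2) < n_particles / 2:
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles[:] = particles[idx]
        weights[:] = 1.0 / n_particles
    return weights @ particles  # weighted mean as the position estimate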

4.2.3. Mean shift algorithms

Instead of tracking objects based on their dynamics, another approach is to track objects based on the distribution of particular features belonging to each object. These features could be color, texture, and so on. Mean Shift is a robust method of finding local extrema in the density distribution of a data set. For a color image, it is equivalent to finding the peaks of the RGB distribution in local regions. Statistically, it is intended to find the mode of a distribution within a window that is specified on an image. In this sense, the mode-finding algorithm can be used to track moving objects in videos [37]. In practice, the initialization of this method requires the input of a search window that boxes the object to be tracked. User inputs or object recognition methods can provide such information. This process can also be further enhanced by incorporating Kalman Filters that can predict the new positions of objects.

In this study, the model developed based on Mean Shift is as follows. The target model \( \hat{q} \) from the input window (8-bit color image) is represented as a 3D histogram with 32768 bins (32 bins for the R channel × 32 bins for the G channel × 32 bins for the B channel):

\[ \text{target model: } \hat{q} = \{ \hat{q}_u \}_{u=1}^{32768}, \quad \sum_{u=1}^{32768} \hat{q}_u = 1 \]    (8)

And the target candidate at the subsequent frame is defined at location y (rows and columns in this case), represented as a 3D histogram as well:

\[ \text{target candidate: } \hat{p}(y) = \{ \hat{p}_u(y) \}_{u=1}^{32768}, \quad \sum_{u=1}^{32768} \hat{p}_u(y) = 1 \]    (9)

Now the task is to iteratively compute a new location y that maximizes the similarity between the target model and the target candidate. The similarity measure used in Mean Shift is a metric based on the Bhattacharyya coefficient, defined as:

\[ \hat{\rho}(y) \equiv \rho\left[ \hat{p}(y), \hat{q} \right] = \sum_{u=1}^{32768} \sqrt{ \hat{p}_u(y)\, \hat{q}_u } \]    (10)

Now if the assumption that only small changes occur in the location and appearance of the target in two consecutive frames holds, and in most cases it does, gradient-based algorithms can be used to iteratively compute the new position y that maximizes the Bhattacharyya coefficient.

Another important aspect of Mean Shift tracking is the introduction of a kernel function that regulates the similarity function. There are a number of reasons to use a kernel function. First, a histogram itself only carries spectral information and ignores spatial information; hence the similarity function is likely to display huge variations for adjacent locations on the image lattice, and it is difficult to apply gradient-based optimization procedures to such a function. Second, a properly defined kernel, such as an isotropic kernel that has a convex and monotonically decreasing kernel profile, can reduce the influence of boundary pixels by assigning smaller weights to them, therefore making the data more resilient to occlusion and interference from the background. As a result of incorporating a kernel to regulate the similarity function, the target model and the target candidates can be expressed as:

\[ \text{target model: } \hat{q}_u = C \sum_{i=1}^{n} k\left( \left\| x_i \right\|^2 \right) \delta\left[ b(x_i) - u \right] \]    (11)

\[ \text{target candidate: } \hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \delta\left[ b(x_i) - u \right] \]    (12)

where \( C = 1 \Big/ \sum_{i=1}^{n} k\left( \left\| x_i \right\|^2 \right) \), \( C_h = 1 \Big/ \sum_{i=1}^{n_h} k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \), k is the kernel function, b(x_i) denotes the histogram bin index of the pixel at location x_i, and δ is the Kronecker delta. The similarity function then becomes:

\[ \rho\left[ \hat{p}(y), \hat{q} \right] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{ \hat{p}_u(y_0)\, \hat{q}_u } + \frac{C_h}{2} \sum_{i=1}^{n_h} w_i\, k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \]    (13)

where

\[ w_i = \sum_{u=1}^{m} \sqrt{ \frac{\hat{q}_u}{\hat{p}_u(y_0)} }\, \delta\left[ b(x_i) - u \right] \]    (14)


By applying the Mean Shift procedure, the new location y_1 can be found according to this relation:

\[ y_1 = \frac{ \sum_{i=1}^{n_h} x_i w_i\, g\left( \left\| \frac{y_0 - x_i}{h} \right\|^2 \right) }{ \sum_{i=1}^{n_h} w_i\, g\left( \left\| \frac{y_0 - x_i}{h} \right\|^2 \right) } \]    (15)

where g(x) = −k′(x). In this study, the target or the object in the input window is represented by an ellipsoidal region, and an isotropic kernel with an Epanechnikov profile is used:

\[ k(x) = \begin{cases} \frac{1}{2} C_d^{-1} (d + 2)(1 - x) & \text{if } x \le 1 \\ 0 & \text{otherwise} \end{cases} \]    (16)
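For reference, a Mean Shift tracking loop can be assembled from OpenCV primitives. Note that OpenCV's histogram back-projection differs in detail from the kernel-weighted Bhattacharyya formulation above, and the file name, initial window, and bin layout below are assumptions:

import cv2
import numpy as np

cap = cv2.VideoCapture("site_video.avi")
ok, frame = cap.read()
x, y, w, h = 300, 200, 60, 120            # initial window, e.g. from a detector
roi = frame[y:y + h, x:x + w]
# Target model: 3D RGB histogram with 32 bins per channel (cf. Eq. (8))
target_hist = cv2.calcHist([roi], [0, 1, 2], None, [32, 32, 32], [0, 256] * 3)
cv2.normalize(target_hist, target_hist, 0, 255, cv2.NORM_MINMAX)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Likelihood image: per-pixel target-model probability (cf. Eq. (9))
    back_proj = cv2.calcBackProject([frame], [0, 1, 2], target_hist, [0, 256] * 3, 1)
    # Iterate the window toward the local mode of the likelihood image
    _, window = cv2.meanShift(back_proj, window, criteria)
cap.release()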

…ground subtraction method, is used to ensure consistency of object detection. The detected objects, including equipment and workers, are tracked in these videos. The videos include various challenges such as multiple instances of object paths crossing, frequently entering and exiting objects, object shadows, and occlusions (Fig. 9). We divided the tracking errors into two types. The first type is false negative mistakes, meaning losing track of objects; the second type is false positive mistakes, which include tracking objects that do not exist or incorrectly associating objects in the current frame with objects in the previous frame. The performance of the six tracking methods is summarized in Table 3. It can be noted that the tracking methods perform much worse in the second video data set than in the first
