


The CASE Dataset of Candidate Spaces for Advert Implantation

Soumyabrata Dev 1, Murhaf Hossari 1, Matthew Nicholson 1, Killian McCabe 1, Atul Nautiyal 1,
Clare Conran 1, Jian Tang 3, Wei Xu 3, and François Pitié 1,2 ∗†

1 The ADAPT SFI Research Centre, Trinity College Dublin
2 Department of Electronic & Electrical Engineering, Trinity College Dublin
3 Huawei Ireland Research Center, Dublin

Abstract

With the advent of faster internet services and the growth of multimedia content, we observe a massive increase in the number of online videos. Users generate this video content at an unprecedented rate, owing to the widespread use of smartphones and other hand-held video capturing devices. This creates immense potential for advertising and marketing agencies to create personalized content for users. In this paper, we attempt to assist video editors in generating augmented video content by proposing candidate spaces in video frames. We propose and release a large-scale dataset of outdoor scenes, along with manually annotated maps of candidate spaces. We also benchmark several deep-learning based semantic segmentation algorithms on this proposed dataset.

1 Introduction

Impressive technological advances have provided a platform to create and share on-demand video content over the internet. Consumers have an insatiable appetite for media, and artificial intelligence (AI)-based techniques assist us in understanding such media [3]. This creates an opportunity for marketing and advertisement agencies to generate personalized marketing products according to consumers' likes and dislikes. However, the current skip-ad generation has a limited attention span [1], and therefore new advertisement strategies need to be employed. One such technique is product placement, or embedded marketing, in which video editors artificially integrate new adverts into original scenes.

Online videos provide several opportunities for integrating new advertisements into the original footage. We developed a deep-learning based advert creation system [6] that can automatically identify the frames in a video that contain an advertisement [4]. Additionally, we also identify the specific frames in a video where new advertisements can be artificially augmented, conforming to subjective human judgement. It is important to select the candidate space in a frame in a systematic manner, such that the new advert can be augmented seamlessly and sensibly. For example, advertisement boards added by the side of the road, or onto empty wall spaces, are good candidates for advert integration. Figure 1 illustrates such a scenario, wherein candidate spaces are marked based on scene understanding. Such candidate spaces for new advertisements in a video frame are generally selected manually by video editors during the post-processing stage.

∗ Send correspondence to F. Pitié ([email protected]). † The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

(a) Floating candidate (b) Anchored candidate

Figure 1: Potential spaces for advertisement integration in outdoor scenes. The candidate spaces can either be floating across the scene, or anchored against walls in the scene.

In most cases, a video editor uses his or her understanding of the scene and proposes a new space for advert integration. In the literature, there are no related works that can automatically parse the scene in an image and propose a candidate space for integrating advertisements. We attempt to bridge this gap by releasing the first large-scale dataset of outdoor scenes with manually annotated candidate spaces.

The main contributions of this paper are as follows: (a) we propose and release the first large-scale dataset of candidate placements in outdoor scene images, and (b) we provide systematic and detailed benchmarking results of several popular deep-learning based segmentation frameworks on our proposed dataset. We believe that this dataset will greatly serve the multimedia and advertisement communities, and foster interesting research in this field. The rest of the paper is arranged as follows: Section 2 describes the dataset and its characteristics. We provide benchmarking results on the proposed dataset in Section 3. Finally, Section 4 concludes the paper and describes our future work.

2 Dataset

We refer to our dataset as the CASE dataset, which stands for CAndidate Spaces for advErt implantation 1. The images in this dataset are sourced from the Cityscapes dataset [2], which comprises street-view images. We randomly select 10,000 images from the Cityscapes dataset for annotation with potential advertisement placements.
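As an aside, a reproducible random selection of this kind is straightforward to script. The following is a minimal sketch, not the authors' actual selection script; the directory layout and the seed value are assumptions for illustration, while the `*_leftImg8bit.png` naming follows the standard Cityscapes release.

```python
import random
from pathlib import Path

# Hypothetical root of a local Cityscapes download.
CITYSCAPES_ROOT = Path("cityscapes/leftImg8bit")

# Collect all frames, then draw a fixed-seed sample of 10,000 for annotation.
all_images = sorted(CITYSCAPES_ROOT.rglob("*_leftImg8bit.png"))
random.seed(42)  # fixed seed so the selection is repeatable
selected = random.sample(all_images, 10000)

with open("case_selection.txt", "w") as f:
    f.writelines(f"{p}\n" for p in selected)
```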

1 The download link of the dataset is available here: https://www.adaptcentre.ie/case-studies/ai-based-advert-creation-system-for-next-generation-publicity



Figure 2: Representative images of the CASE dataset, along with manually annotated binary candidate maps for integrating new adverts.

2.1 Creation of candidate spaces

We are interested in learning candidate placements of objects within a scene. We have chosen to investigate placing regularly shaped billboards in street-view images. In our application case study, billboards are considered as rectangular areas that can be added to the sides of buildings, or fixed by the pathways and road sides, for the placement of posters or advertisements. We employ paid volunteers to manually go through each of our 10,000 selected images and mark a location where a billboard could plausibly be placed. This process is done carefully, such that the embedded advertisement matches the perspective of the scene. We use the VGG Image Annotator tool 2 for annotating the candidate spaces. The annotators mark each image in the dataset by selecting four points, which form either an anchored placement (against a wall or surface) or a floating placement in the scene.
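For readers who want to work with such annotations programmatically, the sketch below shows one way to extract the four corner points from a VIA JSON export. The key names (`regions`, `shape_attributes`, `all_points_x`/`all_points_y`) follow the polygon layout used by recent VIA versions; treat them as assumptions and adjust if your export differs.

```python
import json

def load_corners(via_json_path):
    """Return {filename: [(x, y), ...]} with the four annotated corners."""
    with open(via_json_path) as f:
        annotations = json.load(f)
    corners = {}
    for entry in annotations.values():
        # The dataset uses a single annotated placement per image,
        # so we read the first (only) region.
        region = entry["regions"][0]["shape_attributes"]
        xs, ys = region["all_points_x"], region["all_points_y"]
        corners[entry["filename"]] = list(zip(xs, ys))
    return corners
```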

Although there can be several potential spaces in which to embed a new advertisement, we restrict the annotators to a single candidate placement per image. This is intentional, so that each annotator chooses the best candidate space for integration according to his or her subjective judgment. Examples of good placements include, but are not limited to, placing an advert on a blank wall, beside existing billboards, hanging beside a lamppost, or over a window or group of windows. We do not annotate existing billboards or posters. This process is completed for all the selected images of our dataset. After the annotation process, we manually go through all the generated candidate spaces to ensure that the annotations are performed properly.

Figure 2 shows a few representative images of the CASE dataset, along with their manually annotated binary masks. The binary masks are generated from the four corners of the annotated billboard.
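The paper does not publish the mask-generation script, but rasterizing the annotated quadrilateral is a standard operation. Here is a minimal sketch of one plausible implementation using OpenCV; the example corner coordinates are made up, and the 1024×2048 frame size follows the standard Cityscapes resolution.

```python
import numpy as np
import cv2

def corners_to_mask(corners, height, width):
    """Fill the annotated quadrilateral with 1s on a zero background."""
    mask = np.zeros((height, width), dtype=np.uint8)
    quad = np.array(corners, dtype=np.int32).reshape(-1, 1, 2)  # (x, y) order
    cv2.fillPoly(mask, [quad], color=1)
    return mask

# Example: a billboard anchored on a wall in a 1024x2048 Cityscapes frame.
mask = corners_to_mask([(600, 300), (900, 280), (910, 500), (605, 520)],
                       height=1024, width=2048)
```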

2 The tool is available via http://www.robots.ox.ac.uk/~vgg/software/via/via.html.

2.2 Dataset characteristics

The candidate spaces in the images are annotated with varying sizes. We define the fraction of image area covered by a candidate space as the advert coverage value. We ensure that the CASE dataset has varying advert coverage values across its images. Figure 3 describes the distribution of the coverage of the candidate spaces.

[Figure 3: histogram; x-axis: advert coverage value [in %], 0–50; y-axis: percentage of occurrences, 0–40.]

Figure 3: Statistical distribution of the percentage of occurrences of the advertisement spaces w.r.t. the advert coverage values.
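Given a binary candidate mask, the advert coverage value is simply the fraction of non-zero pixels, expressed in percent. A short sketch (ours, not the authors' code) makes the definition concrete:

```python
import numpy as np

def advert_coverage(mask: np.ndarray) -> float:
    """Percentage of the frame covered by the candidate space."""
    return 100.0 * float(np.count_nonzero(mask)) / mask.size

# e.g. advert_coverage(mask) might return ~2.9 for a small roadside billboard
```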

3 Benchmarking Experiments

In addition to the proposed dataset of candidate spaces, we also provide benchmarking experiments of several semantic segmentation algorithms on the CASE dataset. With the success of deep neural networks, several convolutional neural networks have been proposed that provide impressive results on various semantic segmentation tasks. We benchmark the performance of the fully convolutional network (FCN) [5], the pyramid scene parsing network (PSPNet) [8], and U-Net [7].

The FCN network by Long et al. employs only locally connected layers, without any dense layers in its architecture. The layers in FCN include convolution, pooling, and upsampling layers. It produces dense segmentation masks for input images of any dimension. Zhao et al. [8] introduced a pyramid pooling module that can better grasp the context of the input image. The PSPNet produced impressive results on the ImageNet scene parsing challenge [9]. The advantage of the pyramid pooling module is that it captures information at varying scales, across different sub-regions of the input image. Recently, Ronneberger et al. introduced a convolutional neural network called U-Net that provided impressive results in biomedical image segmentation. The U-Net network has a symmetric structure, and consists of three primary parts: the contracting path, the bottleneck, and the expanding path [7]. It provides skip connections between the contracting and expanding paths, which supply local detail to complement the global context during the upsampling process. These networks are popularly used in the area of semantic segmentation for efficient generation of segmentation masks. We use them as benchmarking models for our proposed dataset. Currently, there are no networks that are specifically designed for identifying candidate spaces in input images.
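To make the skip-connection idea concrete, here is a deliberately tiny PyTorch sketch of a U-Net-style encoder/decoder. It illustrates the architectural pattern only; it is not the benchmarked implementation, and all channel counts and layer sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # 16 decoder channels + 16 skipped encoder channels -> per-class logits
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)                      # contracting path
        b = self.bottleneck(self.down(e))    # bottleneck at half resolution
        d = self.up(b)                       # expanding path
        return self.head(torch.cat([d, e], dim=1))  # skip connection

logits = TinyUNet()(torch.randn(1, 3, 128, 256))  # shape: (1, 2, 128, 256)
```

The concatenation in the last line is the skip connection described above: decoder features carrying global context are fused with encoder features carrying local detail before the final per-pixel classification.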

3.1 Subjective Evaluation

Figure 4 shows a few sample results of the benchmarking algorithms on the CASE dataset. We visualize the results obtained from these algorithms prior to the final softmax layer. This is useful, as it provides a probabilistic viewpoint on the possibility of a candidate space in an image. Most of these approaches generate many false positives in the generated binary maps. This is understandable, because there is no unique position for an advertisement in an input image; the position of the embedded advertisement is completely dependent on the subjective judgement of the annotator. It is interesting to observe from Fig. 4(d) and (e) that PSPNet and U-Net identify the buildings and the sides of the roads as possible placements for new adverts. The FCN network, on the other hand, is conservative in nature: it produces significantly fewer false positives, and does not propose a particular area of the image as a candidate space for advertisement integration. However, we observe from the visual results that PSPNet and U-Net can successfully learn the semantics of the input scenes and provide reasonable maps of candidate spaces.
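The kind of pre-softmax visualization described above can be reproduced with a few lines of matplotlib. In this sketch a random tensor stands in for real network output, and the assumption that channel 1 is the candidate-space class is ours, not the paper's.

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for real network output of shape (batch, classes, H, W).
logits = torch.randn(1, 2, 128, 256)

# Render the raw (pre-softmax) activation of the assumed candidate-space channel.
act = logits[0, 1].detach().numpy()
plt.imshow(act, cmap="viridis")
plt.colorbar(label="pre-softmax activation")
plt.title("Candidate-space map before the final softmax")
plt.show()
```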

3.2 Objective Evaluation

In addition to the subjective evaluation of the benchmarking algorithms, we also provide an objective evaluation of the different approaches. We report the average values of the following metrics: pixel accuracy, mean accuracy, mean intersection over union, and frequency weighted intersection over union. These metrics are variations of classification accuracy and intersection over union (IOU). Let us assume that the total number of pixels belonging to class $i$ and predicted to belong to class $j$ is $n_{ij}$. The total number of classes in this particular task of semantic segmentation is $n_{cl}$.

\[
\text{Pixel Accuracy} = \frac{\sum_i n_{ii}}{\sum_i t_i} \tag{1}
\]

\[
\text{Mean Accuracy} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i} \tag{2}
\]

\[
\text{Mean Intersection Over Union} = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \tag{3}
\]

\[
\text{Frequency Weighted Intersection Over Union} = \frac{1}{\sum_k t_k} \sum_i \frac{t_i\, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \tag{4}
\]

where $t_i = \sum_{j=1}^{n_{cl}} n_{ij}$ is the total number of pixels in class $i$.
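All four metrics follow directly from a per-class confusion matrix. The following self-contained numpy sketch (ours, not the authors' evaluation code) transcribes Eqs. (1)–(4); the example confusion matrix is fabricated for illustration.

```python
import numpy as np

def segmentation_metrics(n: np.ndarray):
    """Compute Eqs. (1)-(4) from confusion matrix n, where n[i, j]
    counts pixels of true class i predicted as class j."""
    t = n.sum(axis=1)                  # t_i = sum_j n_ij, pixels in class i
    n_ii = np.diag(n)                  # correctly classified pixels per class
    denom = t + n.sum(axis=0) - n_ii   # t_i + sum_j n_ji - n_ii
    pixel_acc = n_ii.sum() / t.sum()             # Eq. (1)
    mean_acc = (n_ii / t).mean()                 # Eq. (2)
    mean_iou = (n_ii / denom).mean()             # Eq. (3)
    fw_iou = (t * n_ii / denom).sum() / t.sum()  # Eq. (4)
    return pixel_acc, mean_acc, mean_iou, fw_iou

# Two-class example (background vs. candidate space):
conf_mat = np.array([[9000, 200],
                     [ 300, 500]], dtype=float)
print(segmentation_metrics(conf_mat))
```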

Table 1 summarizes the results of the various benchmarking algorithms on the CASE dataset.

           Pixel Accuracy   Mean Accuracy   Mean IOU   Frequency Weighted IOU
FCN        0.978*           0.509           0.498*     0.959*
PSPNet     0.545            0.625           0.284      0.529
U-Net      0.619            0.727*          0.327      0.601

Table 1: Benchmarking of the CASE dataset with various deep-learning based segmentation algorithms. The best performance for each metric is marked with an asterisk.

Amongst the benchmarking networks, we observe that the FCN network has the best evaluation scores. The FCN network can learn the semantics of the annotated scenes, and provides a good understanding of what constitutes a candidate space in an image. The results obtained from PSPNet and U-Net could be further improved by training these networks on a larger dataset of diverse image types, and by employing batch normalization across multiple graphics processing units (GPUs). These results provide a detailed benchmark of popular segmentation algorithms on the candidate space dataset. The results could be improved further by designing a bespoke neural network specifically for the task of identifying candidate spaces in outdoor scenes.

4 Conclusion and Future Work

In this paper, we propose CASE: a large-scale dataset of outdoor scenes with manually annotated binary maps of candidate spaces. This dataset is the first of its kind, and offers video editors an efficient and systematic manner of creating augmented video content with new advertisements. In the future, we intend to relax the restriction to outdoor scenes, and further extend this dataset by including images from indoor scenes and entertainment television shows.

Acknowledgement

The authors would like to acknowledge the contribution of the various paid volunteers, who assisted in the creation of the high-quality candidate maps in this dataset.


(a) Input image (b) Ground truth (c) FCN result (d) PSPNet result (e) U-Net result

Figure 4: Subjective evaluation of various segmentation algorithms on the CASE dataset. The results in (c–e) are visualized prior to the final softmax layer of the individual networks. The probabilistic maps of FCN are inverted, in order to illustrate the detection of the candidate space class (instead of the non-candidate space class).

References

[1] Chalawsky, M.: Ad skip feature for characterizing advertisement effectiveness (June 18, 2013), US Patent 8,468,056

[2] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)

[3] Hossari, M., Dev, S., Kelleher, J.D.: TEST: A terminology extraction system for technology related terms. arXiv preprint arXiv:1812.09541 (2018)

[4] Hossari, M., Dev, S., Nicholson, M., McCabe, K., Nautiyal, A., Conran, C., Tang, J., Xu, W., Pitié, F.: ADNet: A deep network for detecting adverts. arXiv preprint arXiv:1811.04115 (2018)

[5] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440 (2015)

[6] Nautiyal, A., McCabe, K., Hossari, M., Dev, S., Nicholson, M., Conran, C., McKibben, D., Tang, J., Wei, X., Pitié, F.: An advert creation system for next-gen publicity. In: Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) (2018)

[7] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

[8] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 2881–2890 (2017)

[9] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442 (2016)