
FROM PATTERN RECOGNITION TO DEEP LEARNING: METHODOLOGIES AND APPLICATIONS AT IMAGELAB

R. Cucchiara, C. Grana, R. Vezzani, S. Calderara, G. Serra
ImageLab - University of Modena and Reggio Emilia

Abstract

In this paper we describe the recent experiences at ImageLab with architectural and algorithmic studies, mainly devoted to surveillance, cultural experience, automotive and video big data analysis.

1. Introduction

In May 1976, Prof. K.S. Fu, together with the Standing Conference Committee of the young International Joint Conference on Pattern Recognition, proposed the constitution of the International Association for Pattern Recognition. For forty years this non-profit association has worked, joining countries all over the world, "to promote research in pattern recognition, computer vision and image processing in a broad sense"¹. Research in these topics was, at that time, a visionary theoretical arena with enormous potential in applications, but strongly limited by the lack of adequate computing capability, and often by limited knowledge of perceptual theory and the application contexts.

Much water has flowed under the London Bridge since those days, and now theories and technologies related to image, video and data analysis are becoming the most popular and important trends of this decade. All the major scientific societies (ACM, IEEE, etc.) include programs and activities on these topics. Since 2011, the Computer Vision Foundation has been working to promote computer vision and related conferences, such as CVPR and ICCV, which attract thousands of attendees each year. In May 2016, Prof. Gerard Medioni, Director of Research at Amazon, closed the IWCV workshop promoted by the CVF by calling the present day the "golden era of Computer Vision".
Indeed, this is true in the market as well. According to Tractica², worldwide computer vision revenue is expected to grow from 5 billion dollars in 2014 to 13 billion dollars in 2017, covering several application fields: from sport and entertainment, to security and surveillance, automotive, robotics and machine vision, and medical and consumer markets.

¹ www.iapr.org
² www.tractica.com


In this exciting context, where the market doubles every two years, research is also evolving very rapidly. Pattern recognition, for instance, is moving toward limiting human intervention as far as possible, automating not only the recognition of patterns but also the definition of pattern description techniques, with the support of machine learning and, more recently, of deep learning. Nevertheless, the research methodologies defined by pattern recognition remain unchanged.

This is what we are doing at ImageLab - University of Modena and Reggio Emilia. Since 1998, ImageLab has been producing new research results in computer science for pattern recognition methodologies, and engineering them in different application fields. In this document we give some hints about our latest research advancements, both in methodologies, in line with the state of the art, and in solutions for some classical and emerging application fields, namely surveillance, cultural experience, automotive and video big data analysis.

2. From Pattern Recognition to Deep Learning

The classical pattern recognition pipeline is well known: human experience, coded by knowledge hardwired into the problem or acquired from data, guides the selection of image/signal processing techniques, the definition and extraction of features, and the design and tuning of classifiers. We have proposed advances in all the steps of this pipeline.
We developed a sensing floor architecture [Vezzani et al., 2015] able to generate pressure images, in which each pixel value corresponds to the pressure exerted by objects or people located on the floor. Floor images are then fed to PR pipelines to detect, track and classify people's behaviors (see Fig. 1.c).
For low-level processing, we proposed a new paradigm for 8-connectivity labeling, which employs a general approach to improve neighborhood exploration and minimize the number of memory accesses [Grana et al., 2010].
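For reference, the sketch below shows a minimal, unoptimized two-pass 8-connectivity labeling with union-find. It is only an illustrative baseline: the block-based decision-tree algorithm of [Grana et al., 2010] accelerates precisely this neighbor-scanning and equivalence-merging step.

```python
import numpy as np

def label_8connected(binary):
    """Minimal two-pass 8-connectivity connected components labeling.

    Illustrative baseline only: the optimized algorithm of
    [Grana et al., 2010] processes 2x2 blocks and uses decision trees
    to skip redundant neighbor tests and memory accesses.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    parent = [0]                              # union-find forest; index 0 unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    next_label = 1
    # First pass: provisional labels from already-visited neighbors.
    for y in range(h):
        for x in range(w):
            if not binary[y, x]:
                continue
            neigh = [labels[y - 1, x + dx]
                     for dx in (-1, 0, 1)
                     if y - 1 >= 0 and 0 <= x + dx < w and labels[y - 1, x + dx]]
            if x > 0 and labels[y, x - 1]:
                neigh.append(labels[y, x - 1])
            if not neigh:
                parent.append(next_label)     # new component
                labels[y, x] = next_label
                next_label += 1
            else:
                m = min(neigh)
                labels[y, x] = m
                for n in neigh:
                    union(m, n)               # record equivalences
    # Second pass: flatten equivalences (labels need not be consecutive).
    for y in range(h):
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels
```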

Figure 1. Examples of (a) the structural-learning-based tracking system and (b) group detection; (c) the sensing floor prototype installed at the MUST museum in Lecce.


In [Serra et al., 2015] we proposed a new data representation, Gaussians of Local Descriptors (GoLD features), which models the SIFT descriptor distribution as a multivariate Gaussian and transforms the mean/covariance couple into a high-dimensional vector. GoLD features can be exploited by linear classifiers, opening the way to efficient, large-scale image annotation.
We also explored structural learning frameworks for people analysis. We exploited a Structural SVM for metric learning in a supervised clustering setting, applied to the discovery of small groups of pedestrians in crowds. The features were selected following years of psychological studies on crowds, and are combined through an online Structural SVM optimization technique (the Frank-Wolfe algorithm) that optimizes an innovative loss function, the Group MITRE loss, which accounts for the peculiarities of the group formation process. The approach has proved effective both for people tracking and for group detection in crowded environments [Solera et al., 2016; Solera et al., 2015], as can also be observed in Fig. 1.
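To illustrate the idea behind GoLD, the following sketch fits a Gaussian to a set of local descriptors and flattens it into one vector. It is a simplified stand-in (plain mean concatenated with the upper triangle of the covariance), not the exact embedding of [Serra et al., 2015]:

```python
import numpy as np

def gold_like_embedding(descriptors):
    """Embed a set of local descriptors (n x d), e.g. SIFT, as one vector.

    Simplified stand-in for GoLD [Serra et al., 2015]: estimate a
    multivariate Gaussian over the descriptors and concatenate its mean
    with the upper triangle of the covariance. The actual GoLD mapping
    embeds the Gaussian more carefully, but likewise yields a
    fixed-length vector usable with linear classifiers.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    mu = descriptors.mean(axis=0)
    # Regularize so the covariance stays well conditioned with few samples.
    sigma = np.cov(descriptors, rowvar=False) + 1e-6 * np.eye(descriptors.shape[1])
    iu = np.triu_indices(sigma.shape[0])
    return np.concatenate([mu, sigma[iu]])

# For 128-d SIFT descriptors this yields 128 + 128*129/2 = 8384 dimensions.
```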

The advent of deep learning, with convolutional and recurrent neural networks, changed the pattern recognition perspective. In a sense, deep learning aims at substituting the human contextual knowledge in the feature definition and classification pipeline with a single paradigm: features are no longer defined a priori, but learned adaptively from the data. The human contextual experience shifts from the definition of the features to the reliable definition and annotation of very large datasets, and from parameter setting to the choice of the best network architecture. As an example, we present the estimation of visual saliency from images and videos through deep learning. Predicting where humans look is an important area of research in computer vision and neuroscience: visual saliency determines how much each pixel of a scene attracts the observer's attention.

Visual saliency in images

Traditionally, the detection of salient areas in images has been done through the extraction of hand-crafted features. Thanks to the recent advances in deep learning, however, considerable progress has been made in the field of saliency prediction. We propose a multilevel approach to predict saliency in images using Convolutional Neural Networks. Our saliency model, shown in Fig. 2, is composed of three main parts. The first component is a Fully Convolutional Network with 13 layers, which takes the input image and produces low- and high-level feature maps. The second component is an Encoding Network, which learns how to weight the features extracted at different levels by the previous layers and produces saliency-specific features. Finally, a prior is learned and applied to produce the final predicted saliency map: instead of using pre-defined priors, we let the network learn its own custom prior.

Figure 2. Overview of the proposed model for saliency prediction in images: the input image goes through the fully convolutional network to produce multilevel feature maps, a 3x3 convolution yields saliency feature maps, and a 1x1 convolution followed by bilinear upsampling and the learned prior produces the saliency map.

The proposed approach achieves state-of-the-art results on the SALICON dataset [Jiang et al., 2015], which is currently the largest public dataset for saliency prediction. Fig. 3 presents qualitative results on two sample images.
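A compact sketch of an architecture with this three-part structure is given below. It is an illustrative stand-in under assumed layer sizes: a toy two-block backbone replaces the actual 13-layer FCN, and a single learnable prior map replaces the model's learned prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelSaliency(nn.Module):
    """Illustrative multilevel saliency network (not the exact model).

    Mirrors the three-part structure described in the text: a fully
    convolutional backbone producing low- and high-level feature maps,
    an encoding network that reweights them into saliency features, and
    a learned prior applied to the upsampled prediction.
    """
    def __init__(self):
        super().__init__()
        # Toy fully convolutional backbone (stand-in for the 13-layer FCN).
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))                       # low-level features, 1/2 res
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))                       # high-level features, 1/4 res
        # Encoding network: weight multilevel features into saliency features.
        self.encode = nn.Conv2d(64 + 128, 64, 3, padding=1)
        self.to_map = nn.Conv2d(64, 1, 1)          # 1x1 conv -> 1-channel map
        # Learned prior: a small map, upsampled and added to the prediction.
        self.prior = nn.Parameter(torch.zeros(1, 1, 8, 8))

    def forward(self, x):
        low = self.block1(x)                       # (B, 64, H/2, W/2)
        high = self.block2(low)                    # (B, 128, H/4, W/4)
        high_up = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                                align_corners=False)
        feats = F.relu(self.encode(torch.cat([low, high_up], dim=1)))
        sal = self.to_map(feats)
        sal = F.interpolate(sal, size=x.shape[-2:], mode='bilinear',
                            align_corners=False)   # back to input resolution
        prior = F.interpolate(self.prior, size=x.shape[-2:], mode='bilinear',
                              align_corners=False)
        return torch.sigmoid(sal + prior)          # per-pixel saliency in [0,1]

# Usage: MultiLevelSaliency()(torch.rand(1, 3, 240, 320)) -> (1, 1, 240, 320)
```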

Visual saliency in videos

Visual saliency in static images can be approached with either a bottom-up or a top-down strategy. The former refers to data-driven saliency, as when a salient event pops out in the image; top-down saliency, on the contrary, is task-driven, and refers to the object characteristics that are relevant to the ongoing task. Detecting saliency in videos is a harder task, because motion affects the attention mechanism driving the human gaze. Our proposal relies on a 3D convolutional network that jointly considers the temporal dimension and bottom-up features. The model takes as input a fixed-size video clip of 16 consecutive frames and outputs the saliency map for the middle frame of the input clip.

Figure 3. Qualitative results on validation images from the SALICON dataset (left to right: image, ground truth, our method).


Figure 4. Left: ground truth fixations. Right: predicted saliency on a 16-frame clip.

From a high-level perspective, our model is composed of two parts: the first part of the network acts as an encoder of the input video clip, while the second decodes this feature representation into a 2D saliency map. The whole network is trainable end-to-end with the usual back-propagation algorithm.
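The sketch below gives a minimal version of such an encoder-decoder; channel sizes and layer counts are illustrative assumptions, not the actual architecture:

```python
import torch
import torch.nn as nn

class ClipSaliency3D(nn.Module):
    """Illustrative 3D-convolutional video saliency net (not the exact model).

    Encoder: 3D convolutions shrink the 16-frame clip over space and time.
    Decoder: once the temporal axis is collapsed, 2D deconvolutions grow
    the features back into a saliency map for the clip's middle frame.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),    # (T,H,W): 16 -> 8 frames, halve space
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),    # 8 -> 4 frames
            nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((4, 1, 1)),    # collapse time: 4 -> 1
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, clip):
        # clip: (B, 3, 16, H, W) with H, W divisible by 4
        feats = self.encoder(clip)          # (B, 64, 1, H/4, W/4)
        feats = feats.squeeze(2)            # drop the collapsed temporal axis
        return torch.sigmoid(self.decoder(feats))  # (B, 1, H, W)

# Usage: ClipSaliency3D()(torch.rand(1, 3, 16, 112, 112)) -> (1, 1, 112, 112)
```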

The model has been trained on public video saliency datasets [Mathe and Sminchisescu, 2015]. Moreover, we created a novel video saliency dataset for the specific task of driving cars. The Dr(eye)ve dataset [Palazzi et al., 2016], the first of its kind, consists of 74 clips of driving experience in different environmental conditions. Fixations are acquired using an eye tracker and projected onto a camera placed on the car roof. Saliency maps were computed using our video saliency architecture; visual examples are depicted in Fig. 4.

References

Grana, C., Borghesani, D., and Cucchiara, R. (2010). Optimized block-based connected components labeling with decision trees. IEEE Trans. Image Process., 19(6):1596-1609.

Jiang, M., Huang, S., Duan, J., and Zhao, Q. (2015). SALICON: Saliency in context. In Proc. of CVPR 2015, pages 1072-1080. IEEE.

Mathe, S. and Sminchisescu, C. (2015). Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37.

Palazzi, A., Solera, F., Calderara, S., Alletto, S., and Cucchiara, R. (2016). Dr(eye)ve: a dataset for attention-based tasks with applications to autonomous and assisted driving. To appear in the CVVT workshop, CVPR 2016.

Serra, G., Grana, C., Manfredi, M., and Cucchiara, R. (2015). GOLD: Gaussians of local descriptors for image representation. Computer Vision and Image Understanding, 134:22-32.

Solera, F., Calderara, S., and Cucchiara, R. (2015). Learning to divide and conquer for online multi-target tracking. In IEEE Int'l Conf. on Computer Vision (ICCV 2015), Chile, pages 4373-4381.

Solera, F., Calderara, S., and Cucchiara, R. (2016). Socially constrained structural learning for groups detection in crowd. IEEE Trans. Pattern Anal. Mach. Intell., 38(5):995-1008.

Vezzani, R., Lombardi, M., Pieracci, A., Santinelli, P., and Cucchiara, R. (2015). A general-purpose sensing floor architecture for human-environment interaction. ACM Trans. Interact. Intell. Syst., 5(2):10:1-10:26.