
DeCoRe

Deep Convolutional and Recurrent networks for image, speech, and text

Action-team proposal, LabEx PERSYVAL, section Advanced Data Mining

March 30, 2016

Contents

1 Synopsis

2 Methodology
   2.1 Participating research groups
       2.1.1 THOTH team, INRIA/LJK
       2.1.2 GETALP team, UGA/CNRS/LIG
       2.1.3 MRIM team, UGA/CNRS/LIG
       2.1.4 AGPIG team, UGA/CNRS/GIPSA-LAB
       2.1.5 AMA team, UGA/CNRS/LIG
   2.2 Challenges and research directions
       2.2.1 Object recognition and localization
       2.2.2 Speech recognition
       2.2.3 Distributed representations for texts and sequences
       2.2.4 Image caption generation
       2.2.5 Selecting and evolving model structures
       2.2.6 Higher-order potentials for dense prediction tasks

3 Expected results

4 Detailed research plan for PhD scholarships and PostDoc
   4.1 PhD Thesis 1: encoder/decoder approaches for multilingual image captioning
   4.2 PhD Thesis 2: incremental learning for visual recognition
   4.3 PostDoc: representation learning for sequences

5 Positioning and aligned actions
   5.1 Positioning in LabEx Persyval
   5.2 Aligned actions outside LabEx Persyval

6 Requested resources

A CV of principal investigators
   A.1 Laurent Besacier
   A.2 Denis Pellerin
   A.3 Georges Quénot
   A.4 Jakob Verbeek


1 Synopsis

Scientific context. Recently, deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have yielded breakthroughs in different areas [29], including object recognition, machine translation, and speech recognition. One of the key distinguishing properties of these approaches, across different application domains, is that they are end-to-end trainable. That is, whereas conventional methods typically rely on a signal pre-processing stage in which features are extracted, such as MFCC [3] for speech or SIFT [33] for images, in deep end-to-end trainable systems each processing layer, from the raw input signal upwards, involves trainable parameters which allow the system to learn the most appropriate features.

DeCoRe gathers experts from LJK, GIPSA-LAB and LIG in computer vision, machine learning, speech, natural language processing, and information retrieval, to foster collaborative interdisciplinary research in this rapidly evolving area, which is likely to underpin advances in these fields for the next decade. We believe that the DeCoRe project is a remarkable opportunity to bring together research groups of the Grenoble area with a critical mass on deep learning. It is also the chance to foster exciting research spanning different fields, such as computer vision and natural language processing.

Challenges and research directions. Within the broader scope of DeCoRe, funding and effort will be focused on several specific areas, which include:

• Object recognition and localization. While neural networks have long been used in image/object recognition, first in character recognition [6] and in face detection [10], they were only recently shown to be effective for general object recognition [26]. This was due to advances in effective training algorithms [38], the availability of very powerful parallel GPU hardware, and the availability of huge quantities of cleanly annotated data [5]. Open challenges that will be addressed in DeCoRe include efficiently detecting and localizing very large sets of categories, weakly supervised learning for object localization and semantic segmentation, as well as developing structured models that capture co-occurrence and spatial relation patterns to improve object localization. These themes will be studied for applications in both images and videos.

• Speech recognition. Neural networks have been used as feature extractors in HMM-based speech recognition systems [2, 18]. Recently, neural networks started to replace larger parts of the speech processing chain previously dominated by HMMs [16]. There is also an increasing number of studies addressing speech processing tasks (notably speech recognition) with CNN-based systems that take only spectrograms as input [9, 41]. The objectives of DeCoRe in this area are (i) to propose and benchmark end-to-end neural speech recognition pipelines, (ii) to better understand the information captured by CNNs or RNNs in acoustic speech modelling (as recently done for CNN-based image recognition [57]), and (iii) to investigate the potential of multi-task learning for deep neural network (DNN) based speech recognition (e.g., exploiting multi-genre training data to train a single system dedicated to several tasks or several languages).

• Distributed representations for text. There has been growing interest in distributed representations for text, largely due to [36], who proposed simple neural network architectures that can be trained on huge amounts of text (on the order of 100 billion words). A number of contributions have extended this work to phrases [37], text sequences [24, 28], and bilingual distributed representations [35]. These representations, also called word embeddings, can capture similarities between words or phrases at different levels (morphological, semantic). Bilingual word embeddings (a common representation for two languages) open avenues for new tasks, for instance cross-lingual image captioning (train in English, caption in French).

• Image caption generation. Recently, RNNs [7, 19] have proven effective at producing natural language descriptions of images [21, 54]. Although these results are impressive, there are a number of challenges in this area that will be addressed in DeCoRe. These include the scalability required to use such models for natural-language-based image search, and generalization to words that were not seen in the training data. Another challenge is to develop methods that associate words in the caption with image regions [23], with the goal of improving generalization by exploiting visual scene compositionality. A final challenge is to infer basic spatial relations among objects from the image and to report these in the generated descriptions ("A man on a bike" vs. "A man on the left of a large bike").

Caption generation will play a central role, integrating image understanding and language generation models.

Positioning. DeCoRe fits excellently within Persyval's research action Advanced Data Mining (ADM), and directly addresses one of its three main challenges: "Mining multi-modal data". The understanding of speech, visual content, and text is among the core topics of modern data mining. None of the existing Persyval-funded actions has a direct overlap with DeCoRe.


Figure 1: Schematic comparison of conventional hand-crafted feature approaches and the deep-learning, end-to-end trainable approach. Figure credit: Yann LeCun.

2 Methodology

Deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have yielded breakthroughs in several areas [29], including object recognition [26], machine translation [52], and speech recognition [16]. These approaches are end-to-end trainable: feature pre-processing (MFCC [3], SIFT [33]) and mid-level feature extraction are replaced by neural machinery in which each processing layer is trainable. This allows the system to learn appropriate hierarchical features from raw data (signal, image, spectrogram).
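To make the contrast with hand-crafted pipelines concrete, the following minimal sketch (ours, not from the proposal; PyTorch and all sizes are illustrative assumptions) shows a network in which every layer from the raw pixels upwards has trainable parameters, so gradients from the task loss reach the lowest-level "feature extractor".

```python
import torch
import torch.nn as nn

# Minimal end-to-end trainable CNN: the conv filters play the role that
# hand-crafted SIFT/MFCC-style features play in conventional pipelines.
class EndToEndCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(          # learned, not hand-crafted, features
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                       # x: raw pixels, (batch, 3, H, W)
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = EndToEndCNN()
x = torch.randn(4, 3, 64, 64)                   # dummy raw images
loss = nn.functional.cross_entropy(model(x), torch.randint(0, 10, (4,)))
loss.backward()                                 # gradients reach the first conv layer
```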

The second key property is the "deep" layered hierarchical structure of these models. While 2-layer perceptrons have long been known to be universal function approximators [55], they may require an arbitrarily large number of units in the single hidden layer. The power of "deep" layered architectures lies in their efficiency, in terms of the number of parameters, to specify highly complex patterns [39]. Intuitively, this efficiency is a result of compositionality: each layer of the network extracts non-linear features which are defined in terms of the features of the previous layer. In this manner, after several layers, complex object-level patterns can be detected as a constellation of parts, which are detected as a constellation of sub-parts, etc. [13]. Visualization of the activations of neurons in convolutional networks confirms this intuition [57]. Earlier state-of-the-art computer vision models were mostly based on hand-crafted single-level representations (e.g. color histograms [53]), unsupervised two-level representations (bag-of-words [50], Fisher vectors [43]), or three-level representations [8]. Deeper models only became a viable alternative to such shallow networks once the right regularization methods [51], large datasets [4], and massively parallel GPU compute hardware were all in place.

Taken together, the exceptional results obtained with deep end-to-end trainable systems underline the importance of learning the "feature" or "representation", rather than just the "classifier" as in the previously dominant approach. See Figure 1 for a schematic illustration of how the deep learning approach compares to the conventional approach based on hand-crafted features with trainable linear classifiers. Interestingly, it has also recently been found that the activations in deep CNN models correlate better with activations in the inferior temporal (IT) cortex of primates than traditional feature-based approaches do [22].

DeCoRe brings together a critical mass of experts in information retrieval, computer vision, machine learning, natural language processing and speech recognition from five research groups hosted in Grenoble's three computer science and applied mathematics laboratories. The main objective of DeCoRe is to foster collaborations in the Grenoble research community in the rapidly evolving area of deep learning, which is likely to underpin advances in the considered application areas for the next decade. The collaboration involves cross-institute research and the training of PhD students and MSc interns, but also the organization of reading groups, workshops, and the teaching of MSc-level courses.

2.1 Participating research groups

In this section we describe the five research groups that host most participating researchers. For each we list the participating research staff, the research directions, and the principal investigator.


The SigmaPhy team at the GIPSA-LAB (http://www.gipsa-lab.fr/sigmaphy/accueil-sigmaphy) is also part of the network of teams in DeCoRe working on deep learning, but is not part of the core organizing and fund-requesting teams. SigmaPhy studies image processing and wave physics for natural environment characterization and surveillance. This includes underwater acoustics (active and passive observation, localization in complex environments), optical and radar remote sensing, and transient signal imagery (seismic imagery, ultrasonic signals, fluorescence signals). In November 2015, M. Malfante started a PhD thesis supervised by J. Mars and M. Dalla Mura on deep learning for recognition problems in submarine acoustic signals.

2.1.1 THOTH team, INRIA/LJK

• Website: http://lear.inrialpes.fr

• Participants: Jakob Verbeek (coordinator, CR), Cordelia Schmid (DR), Julien Mairal (CR), Karteek Alahari (CR).

• Team description: THOTH (formerly known as LEAR, renamed in March 2016) is focused on computer vision and machine learning. Its main long-term objective is to learn structured visual recognition models from little or no manual supervision. Research focuses on the design of deep convolutional and recurrent neural network architectures, in particular those that can be used as a general-purpose visual recognition engine suitable to support many different tasks (recognition of objects, faces, and actions, localization of objects and parts, pose estimation, textual image description, etc.). A second research axis focuses specifically on learning such models from as little supervision as possible. The third research direction is large-scale machine learning, needed to deploy such models on large datasets with little or no supervision.

• Principal investigator: J. Verbeek currently supervises two PhD students. One is co-supervised with C. Couprie from Facebook AI Research (FAIR), on the topic of deep learning for weakly supervised semantic video segmentation. The other is funded by a national ANR project on metric learning and CNN models for face recognition in unconstrained conditions, including non-cooperative and non-visible-spectrum images. He also supervises a PostDoc and an MSc intern on RNN models for image captioning. He is involved in a national ANR grant application which federates six research centers across France around the topic of low-power embedded applications of deep learning. J. Verbeek teaches the course Advanced Learning Models on (deep) neural networks in the Industrial and Applied Mathematics MSc program at the Univ. of Grenoble.

2.1.2 GETALP team UGA/CNRS/LIG

• Website: http://getalp.imag.fr

• Participants: Laurent Besacier (Prof., co-organizer), Benjamin Lecouteux (MC), Christophe Servan (PostDoc).

• Team description: The GETALP (Study Group for Machine Translation and Automated Processing of Languages and Speech) was born in 2007 when LIG was created. Born from the virtuous union of researchers in spoken and written language processing, GETALP is a multidisciplinary group (computer scientists, linguists, phoneticians, translators and signal processing specialists) whose objective is to address all theoretical, methodological and practical aspects of multilingual communication and multilingual (written or spoken) information processing, with a focus on speech recognition and machine translation. GETALP's methodology relies on continuous interplay between data collection, fundamental research, system development, applications and experimental evaluation.

• Principal investigator: L. Besacier became interested in deep learning approaches for spoken language processing three years ago, and has supervised a PhD student on automatic speech recognition for under-resourced languages using deep neural networks (Sarah Samson Juan, PhD defended in 2015). He currently supervises or co-supervises several PhDs on topics related to DeCoRe: deep and active learning for multimedia (Mateusz Budnik, with MRIM), recurrent neural networks for cross-lingual annotation propagation (Othman Zenaki, with CEA/LIST) and cross-language plagiarism detection using word embeddings (Jeremy Ferrero, with Compilatio S.A.). He currently supervises three MSc interns: one on Long Short-Term Memory (LSTM) networks for speech recognition, one on DNN compression for speech transcription, and one on neural machine translation.


2.1.3 MRIM team UGA/CNRS/LIG

• Website: http://lig-mrim.imag.fr

• Participants: Georges Quénot (DR, co-organizer), Jean-Pierre Chevallet (MC), and Philippe Mulhem (CR).

• Team description: The research carried out in MRIM targets the information retrieval and mobile computing domains. While the studies done in information retrieval are dedicated to satisfying users' information needs from a huge corpus of documents, those conducted in mobile computing are dedicated to satisfying mobile users' needs in terms of services taken from a corpus of services and then composed together. In both domains, users express their needs through queries, and the system returns relevant documents or personalised services, i.e., documents/services that match the user's query.

• Principal investigator: Georges Quénot has worked for over 15 years on video content indexing and retrieval. He has been a co-organizer of TRECVid since its beginning in 2001. He started using deep learning in this context three years ago and is co-supervising a PhD student (Mateusz Budnik, with the GETALP group) and a Master student (Anuvabh Dutt, with the AGPIG group) on this subject. He obtained excellent results at the TRECVid semantic indexing task (ranking between second and fourth) using this approach. He also successfully applied the same method to still images, currently ranking first at the VOC 2012 object classification task (comp1, post-campaign).

2.1.4 AGPIG team UGA/CNRS/GIPSA-LAB

• Website: http://www.gipsa-lab.grenoble-inp.fr/agpig

• Participants: Denis Pellerin (Prof., co-organizer), Michèle Rombaut (Prof.).

• Team description: GIPSA-lab (Laboratoire Grenoble Images Parole Signal Automatique) is a research unit between CNRS, Grenoble-INP and University Grenoble Alpes. The Architecture Geometry Perception Image Gesture (AGPIG) team of GIPSA-lab has long experience in image/video analysis and indexing. Its research interests include image/video classification, human action recognition, facial analysis, and audiovisual scene analysis for robot companions. It has expertise in visual attention modelling, data fusion with transferable belief models, dictionary learning, as well as joint architecture/algorithm exploration.

• Principal investigator: Denis Pellerin started working on deep learning networks for image classification two years ago. With Georges Quénot, he co-supervised one Master student (Efrain-Leonardo Gutierrez-Gomez in 2015) and is co-supervising another (Anuvabh Dutt in 2016) on this subject. His research interests include (i) video analysis and indexing: image and video classification, human action recognition, video summarization, active vision for robots; and (ii) visual perception and modeling: visual salience, attention models, visual substitution.

2.1.5 AMA team UGA/CNRS/LIG

• Website: http://ama.liglab.fr

• Participants: Eric Gaussier (Prof.), Ahlame Douzal (MC).

• Team description: The research of the AMA team fits within the general framework of data science, with a strong focus on data analysis, machine learning and information modeling. Within this framework, the AMA team is interested in developing new theoretical tools, algorithms and systems for analyzing and making decisions on complex data. The research of the team is organized along three main, complementary axes: data analysis and learning theory, learning and perception systems, and modeling social systems.

• Principal investigator: Eric Gaussier started working on deep learning for information access two years ago. He has been particularly interested in obtaining collection-independent representations that can be used for transfer learning. More recently, in collaboration with Ahlame Douzal, he has become interested in deep learning representations for time series, with applications to prediction and classification. This topic is the focus of the ANR project LOCUST (with LIP6, UPMC), which started in January 2016.


2.2 Challenges and research directions

Within the broader scope of DeCoRe, effort will be focused on several more specific topics, presented in the following sections. Some of these topics are oriented towards a specific application domain, others towards scientific challenges that cut across all the considered application domains.

2.2.1 Object recognition and localization

While neural networks have long been used in visual object recognition, first in character recognition [6] and in face detection [10], they were only recently shown to be effective for general object recognition [26]. This was due to advances in effective training algorithms [38], the availability of very powerful parallel GPU hardware, and the availability of huge quantities of cleanly annotated data [5]. Since then, many improvements have been introduced, including the use of very deep (19 layers) [49] and even ultra deep (152 layers) [17] architectures, and the localization of objects using CNNs [12, 14, 40, 46, 48]. In order to avoid complete re-training of large networks, incremental methods have recently been proposed for the dynamic inclusion of new categories [56].

The main objectives of DeCoRe in this area are the development of new methods for (i) efficiently detecting and localizing very large sets of categories, (ii) weakly supervised learning for object localization and semantic segmentation, (iii) structured models that capture co-occurrence and spatial relation patterns to improve object localization, and (iv) building models for dynamically evolving sets of categories using incremental learning.

Object recognition and localization is the main topic of one of the funded PhD scholarships, further described in Section 4.2.

2.2.2 Speech recognition

Neural networks have been used as feature extractors in HMM-based speech recognition systems [2, 18]. Recently, neural networks started to replace larger parts of the speech processing chain previously dominated by HMMs [16]. There is also an increasing number of studies addressing speech processing tasks (notably speech recognition) with CNN-based systems that take only spectrograms as input [9, 41]. Lately, recurrent neural networks (RNNs) have also been introduced for speech recognition because of their sequence modelling capabilities. RNNs allow the model to store temporal contextual information directly, without explicitly defining the size of the temporal context (e.g. the time-convolution filter size in CNNs). Among the several implementations of RNNs, Long Short-Term Memory (LSTM) [19] networks have the capability to memorize sequences with long-range temporal dependencies and are starting to be used for end-to-end speech recognition.

The main objectives of DeCoRe in this area are: (i) propose and benchmark an efficient end-to-end speech recognition pipeline for multiple languages including English and French; (ii) better understand the information captured by CNNs or RNNs in acoustic speech modelling (as recently done for CNN-based image recognition [57]); (iii) propose architectures which combine front-end deep CNN models (acting as trainable feature extractors) with LSTMs (modeling the context of the acoustic signal sequence); (iv) explore data augmentation techniques for speech recognition: data augmentation consists in increasing the quantity of training data and has been widely used in image processing, see e.g. [42], but hardly ever in speech processing; and (v) exploit the ability of deep neural networks to benefit from transfer learning (transferring knowledge between tasks), which has been widely studied in the neural network literature. For instance, it is particularly useful to transfer knowledge from one language to another for cross-lingual speech modeling and the rapid development of systems for new target languages. Encoder-decoder approaches [52] lend themselves extremely well to such an approach [34].
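As a rough illustration of objective (iii), the following hedged sketch combines a small CNN front-end over spectrograms with an LSTM over the resulting frame sequence; the 80-mel input, layer sizes, and per-frame phone outputs are illustrative assumptions, not the pipeline the project will actually build.

```python
import torch
import torch.nn as nn

# CNN front-end acting as a trainable feature extractor on spectrograms,
# followed by an LSTM modeling the temporal context of the acoustic sequence.
class CnnLstmAcousticModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_phones=40):
        super().__init__()
        self.frontend = nn.Sequential(           # convolution over (freq, time)
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                # pool frequency, keep time resolution
        )
        self.lstm = nn.LSTM(32 * (n_mels // 2), hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)   # per-frame phone posteriors

    def forward(self, spec):                     # spec: (batch, 1, n_mels, frames)
        f = self.frontend(spec)                  # (batch, 32, n_mels//2, frames)
        b, c, m, t = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, t, c * m)
        h, _ = self.lstm(seq)
        return self.out(h)                       # (batch, frames, n_phones)

logits = CnnLstmAcousticModel()(torch.randn(2, 1, 80, 120))
print(logits.shape)                              # torch.Size([2, 120, 40])
```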

This research topic is studied in the GETALP group through several MSc internships and will be strengthened by collaborations within DeCoRe.

2.2.3 Distributed representations for texts and sequences

There has been growing interest in distributed representations for text, largely due to [36], who propose simple neural network architectures which can be trained on huge amounts of text (on the order of 100 billion words). A number of contributions have extended this work to phrases [37], text sequences [28], and bilingual distributed representations [35]. These representations, also called word embeddings, can capture similarities between words or phrases at different levels (morphological, semantic). Bilingual word embeddings (a common representation for two languages) open avenues for new tasks such as cross-lingual image captioning (train in English, caption in French) and neural machine translation [34].
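For concreteness, here is a toy sketch of the skip-gram-with-negative-sampling objective in the spirit of [36]; the vocabulary size, dimensionality, and random "data" are placeholder assumptions.

```python
import torch
import torch.nn as nn

V, D = 5000, 100                                 # vocabulary size, embedding dimension
in_emb, out_emb = nn.Embedding(V, D), nn.Embedding(V, D)
opt = torch.optim.SGD(list(in_emb.parameters()) + list(out_emb.parameters()), lr=0.05)

center = torch.randint(0, V, (64,))              # center words (toy batch)
context = torch.randint(0, V, (64,))             # observed context words (positives)
negative = torch.randint(0, V, (64, 5))          # 5 sampled negatives per pair

# Score pairs by dot products, push positives up and negatives down.
pos_score = (in_emb(center) * out_emb(context)).sum(-1)
neg_score = torch.bmm(out_emb(negative), in_emb(center).unsqueeze(-1)).squeeze(-1)
loss = -(torch.nn.functional.logsigmoid(pos_score).mean()
         + torch.nn.functional.logsigmoid(-neg_score).mean())
loss.backward()
opt.step()                                       # rows of in_emb become word embeddings
```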

Beyond text, sequences of objects, such as time series, can also be embedded into representations that abstract away from the representation problems raised by multi-scale, multi-variate and multi-modal sequences. Deep learning here offers an integrated solution for sequences that can be used in a variety of contexts.

Bilingual word embedding is part of one funded PhD scholarship subject, further described in Section 4.1. Sequence embedding will also be studied by the requested PostDoc, co-supervised by AMA, GETALP and THOTH.

2.2.4 Image caption generation

Recently, RNNs [7, 19] have proven effective at producing natural language descriptions of images [21, 54]. Although these results are impressive, there are a number of open challenges in this area. These include the scalability required to use such models for natural-language-based image search, and generalization to words that were not seen in the training data. Another challenge is to develop methods that associate words in the caption with image regions; to date only very few works exist along these lines [21, 23]. The goal is to improve generalization by exploiting visual scene compositionality. Moreover, region-based visual modeling will also be key to inferring spatial relationships between objects, and to visual "grounding", so that if multiple objects of the same category exist in a scene, the model is able to distinguish them and to associate properties with the individual instances.

Caption generation will play a central role in DeCoRe since it brings together image understanding models and sequential language generation models. One of the two funded PhD scholarships will specifically address this research area. More details are given in Section 4.1.

2.2.5 Selecting and evolving model structures

One of the main problems in applying deep neural networks is the choice of architecture. The space of architectures is large and discrete: a specific network is defined by the number of layers, the number of nodes per layer, the type of non-linearity (sigmoid, rectifiers, maxout [15]), filter sizes for CNNs, the type of pooling operations, the ordering of pooling and convolutional layers, etc. Naively testing different architectures one by one is a hopelessly intractable approach, and more systematic approaches are needed, for example using sparsity-inducing regularizers over the weight space [27], or hierarchical non-parametric approaches to learn the structure of probabilistic graphical models [1].

The design of efficient model selection approaches, for example based on (structured) regularization, is an important research topic today regardless of the application domain. Moreover, adapting and expanding the network architecture over time (as more training data becomes available, or simply as more data has been seen by the model during training) will be important for future large-scale learning scenarios where training the model will not be a matter of hours or days, but rather weeks, months, or longer. Such scenarios are particularly important in the context of learning from very large, minimally supervised datasets. Network adaptation will require methods to assess to what extent the current network capacity has been saturated by the training data, so as to determine whether the network needs to be expanded.
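As a toy example of structure selection through sparsity (loosely in the spirit of [27], not their exact method), a group-lasso penalty on each hidden unit's incoming weights shrinks whole units towards zero, so that the effective layer width is selected during training; the data, penalty strength and threshold are assumptions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(64, 128)
x, y = torch.randn(32, 64), torch.randn(32, 128)   # toy regression data
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    # One L2 norm per hidden unit (row of the weight matrix); summing the
    # norms is a group-lasso penalty that drives whole units towards zero.
    group_norms = (layer.weight.pow(2).sum(dim=1) + 1e-12).sqrt()
    loss = nn.functional.mse_loss(layer(x), y) + 0.3 * group_norms.sum()
    loss.backward()
    opt.step()

pruned = (layer.weight.norm(dim=1) < 0.05).sum().item()
print(pruned, "of 128 units shrank toward zero (candidates for pruning)")
```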

This research topic will be studied within the context of two submitted ANR projects by THOTH and MRIM.

2.2.6 Higher-order potentials for dense prediction tasks

Many tasks in computer vision require dense predictions at the pixel level. For example, in semantic segmentation the goal is to predict the semantic category label of each pixel (e.g. pedestrian, car, building, road, sign, bicycle, tree, sky, etc.). Other dense prediction tasks include optical flow estimation, depth estimation, image de-noising, super-resolution, colorization, deblurring, etc. These dense prediction tasks are typically solved using (conditional) Markov random fields [11], which include unary data terms for each pixel, and pairwise terms to ensure spatial regularity of the output predictions. Deep networks have been used for such tasks [32], and to define data-dependent unary and pairwise terms [30]. Moreover, it has recently been shown that variational mean-field inference [20] in Markov random fields can be expressed as a special recurrent neural network [47, 58]. This allows the unary and pairwise potentials to be trained in a way that is coherent with the MRF structure, and optimal w.r.t. the approximate inference method used for prediction.
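The following toy sketch illustrates the mean-field-as-recurrence idea behind [47, 58] on a simple 4-neighbour smoothness model; the wrap-around neighbourhood, the single uniform pairwise weight, and the step count are simplifying assumptions, not the models of those papers.

```python
import torch

# Mean-field inference in a pairwise CRF, unrolled as a fixed number of
# recurrent steps over per-pixel label distributions q.
def mean_field(unary, w=1.0, steps=5):
    # unary: (labels, H, W) scores; q: per-pixel label distributions
    q = torch.softmax(unary, dim=0)
    for _ in range(steps):                        # each step = one "RNN" iteration
        msg = (torch.roll(q, 1, 1) + torch.roll(q, -1, 1)
               + torch.roll(q, 1, 2) + torch.roll(q, -1, 2))  # neighbour beliefs
        # torch.roll wraps around the borders; a toroidal grid for brevity.
        q = torch.softmax(unary + w * msg, dim=0) # combine unary and pairwise terms
    return q

q = mean_field(torch.randn(3, 8, 8))              # 3 labels on an 8x8 grid
print(q.argmax(0))                                # per-pixel labelling
```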

While higher-order potentials (which model interactions of more than two prediction variables at a time) have proven effective for dense prediction tasks in the past [25], efficient inference is only possible for a very small and specific class of higher-order potentials. An open question we will study in DeCoRe is how more general higher-order potentials can be formulated using deep convolutional networks over label fields, in a way that permits efficient approximate inference, for example building upon the recurrent convolutional model of Pinheiro and Collobert [44].

This research topic is studied in particular in the context of the PhD thesis between THOTH and Facebook AI Research.


3 Expected results

The objective of DeCoRe is to generate the following outcomes.

• Scientific knowledge: disseminated mainly in the form of scientific conference and journal papers, preferably in open-access venues.

• Transfer: particular research results may give rise to technology that can be protected or transferred to industry. Locally, Xerox Research Centre Europe (Meylan), STMicroelectronics (Grenoble), and NVIDIA (Grenoble) are all active in deep learning for computer vision, and could therefore be logical partners for transfer.

• Infrastructure know-how: exchanges on the most effective and cost-efficient hardware setups to train deep neural networks. This also includes exchanges on multi-GPU and multi-machine implementations. The contact between INRIA and an NVIDIA researcher working on computer vision and deep learning in Grenoble is extremely useful in this respect.

• Software: we will contribute our research results in the form of code to open-source tools that are essential in this fast-evolving area:

– Caffe: convolutional architecture for fast feature embedding. See http://caffe.berkeleyvision.org

– Theano: general-purpose (deep) neural network library, particularly suitable for recurrent networks. See http://deeplearning.net/software/theano

– Kaldi: Open-source toolkit for automatic speech recognition http://kaldi.sourceforge.net

– MultiVec (partially developed by LIG in collaboration with the LIFL lab): a multilingual and multilevel representation learning toolkit for NLP. See https://github.com/eske/multivec

• Training: funding and supervision of 2 PhD students and 6 MSc students, and structuring MSc teaching on deep learning in Grenoble.

• Interaction: invited researchers, organization of workshops, seminars, and cross-institute reading groups.

4 Detailed research plan for PhD scholarships and PostDoc

4.1 PhD Thesis 1: encoder/decoder approaches for multilingual image captioning

• Supervisors: L. Besacier and J. Verbeek

• Localization: 50% between GETALP and THOTH teams

• Topic: The focus of this PhD will be on recurrent encoder-decoder models and their application to several modalities (image, speech, text). Such models have been found effective for machine translation [52], and lend themselves well to image captioning [23]. The idea is to encode the input (image or sentence) into a continuous semantic space. The encoder can be a recurrent LSTM [19] network for a sentence, or a CNN model for an image. The decoder takes the input encoding and generates a sequential output of variable length (e.g. a sequence of words) in a step-by-step manner. See Figure 2 for several examples of images with automatically generated captions.

As a key application, we will consider multilingual image captioning, i.e. the generation of image descriptions in a target language, given training data which includes a collection of images and their descriptions in a different source language. The Multimodal Machine Translation Challenge provides excellent benchmark data for this problem, see http://www.statmt.org/wmt16/multimodal-task.html. A minimal sketch of the encoder/decoder setup is given at the end of this section.

• Focus areas:

– Text encoder architectures: since the input sentence is given at once (and not generated), there are many possibilities for the architecture of the input encoder. For example, bidirectional RNNs may be used [21] instead of uni-directional models. We will evaluate existing sequence encoding models for image captioning, and propose novel ones based on the results.


Figure 2: Example images with natural language descriptions automatically generated with an RNN model with LSTM units; the COCO dataset [31] was used to train the model, and the examples come from the test set. Generated captions: "A cat sitting on top of a suitcase." / "A group of people riding skis down a snow covered slope." / "A close up of a plate of food on a table."

– Learning from weak supervision: in current research, image captioning models are trained from supervised training data where images are annotated by hand with multiple very descriptive sentences, sometimes also localized in the image [45]. While this is fine for initial research, it will not scale to real applications, where large and diverse training datasets are needed. Annotating such datasets is too costly, and hence weakly supervised learning is needed. We will develop latent variable models to infer object locations from image-sentence pairs, and learn models from internet data such as stock-photography websites which host many images with natural language descriptions, see e.g. http://www.shutterstock.com. We will also consider the use of aligned multilingual text corpora to pre-train text encoder-decoder models, which can be combined with image encoder models. In particular, we expect larger pure-text corpora to considerably improve the text generation (decoder) quality.

– Region-based image representation: a distributed region-based image representation is promising for at least three reasons: to improve generalization (combining a limited number of object categories in many different scenes), to enable relative geometrical statements (a is on the left of b), and to enable grounding of properties and attributes to individual object instances (there may be a tiny white horse and a large black one in the scene, and a good description will not mix properties of different objects even if they belong to the same category). Region-based encoder-decoder models for images, however, have hardly been proposed in the literature [21, 23]. We will develop new region-based image representations for this purpose, based on convolutional and recurrent network structures.

– Data augmentation: increasing the quantity of training data has been widely used in image processing, see e.g. [42]. For cross-lingual image captioning, several captions per image (instead of one) can easily be obtained using automatic paraphrasing (for a mono-lingual captioning task) or machine translation (for a cross-lingual captioning task). We will explore data augmentation scenarios for image captioning that operate jointly at the image level (image transformations) and the text level (paraphrasing).
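To make the encoder/decoder setup of this thesis topic concrete (as announced above), here is a minimal, hedged sketch: a stand-in encoder replaces a pretrained CNN, all sizes are illustrative assumptions, and greedy step-by-step decoding is used.

```python
import torch
import torch.nn as nn

# CNN encoder maps the image to a vector that initialises an LSTM decoder,
# which then emits one word per step.
class Captioner(nn.Module):
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for a pretrained CNN
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def generate(self, image, bos=0, max_len=15):
        h = self.encoder(image)                  # image encoding -> initial state
        c = torch.zeros_like(h)
        word = torch.full((image.size(0),), bos, dtype=torch.long)
        caption = []
        for _ in range(max_len):                 # greedy, step-by-step decoding
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(-1)        # most likely next word
            caption.append(word)
        return torch.stack(caption, dim=1)       # (batch, max_len) word indices

print(Captioner().generate(torch.randn(1, 3, 64, 64)))
```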

4.2 PhD Thesis 2: incremental learning for visual recognition

• Supervisors: Georges Quénot and Denis Pellerin.

• Localization: 50% between MRIM and AGPIG teams

• Topic: This PhD will focus on the detection of visual categories in still images and videos. It will especially study the problem of the dynamic adaptation of CNN models to newly available training data, to newly needed target categories, and/or to new or specific application domains (e.g. medical, satellite or life-log data). Effective architectures are now very deep (19 layers) [49] or even ultra deep (152 layers) [17] and need very long training times: up to several weeks, even using very powerful multi-GPU hardware. It is neither possible nor efficient to retrain a complete model for a particular set of new categories, or for applying already-trained categories to different domains. Incremental learning [56] is a way to adapt already-trained networks to such needs at a low marginal cost. Also, various forms of weakly supervised learning and active learning can be used in conjunction to further improve system performance. Localization of target categories [40] is also very important: first, knowing where objects are located in images helps build better models, especially in a semi-supervised setting; second, in the context of DeCoRe, it will be essential for providing elements for the generation of detailed textual descriptions.


• Focus areas:

– Incremental learning and evolving network architectures: new methods will be studied for building networks that operate in a "continuous learning" mode to permanently improve themselves. Improvements will be possible through the continuous inclusion of new target concepts (possibly including the full ImageNet set and even beyond), and through the adaptation of already-trained concepts to new target domains (e.g. satellite images or life-logging content). Incremental learning methods will be considered, as well as network architecture evolution (a minimal sketch of one such ingredient is given after this list).

– Active learning and weakly supervised learning: various forms of these approaches, as well as of semi-supervised learning, have proven very effective and efficient for content-based indexing of images and videos, both at the image or shot level and at the region or even pixel level. They also fit very well with incremental learning. The goal here will be to integrate them efficiently in order to extract as much information as possible from all available annotated, non-annotated, and weakly annotated data. This will also involve classification using hierarchical sets of categories, and knowledge transfer between categories and between application domains. Data augmentation will also be considered, specifically in the context of active learning.

– Salience: salience is a very important prior in object detection. It can be considered from two perspectives, using either user gaze information or the localization of main categories. In both cases, salience can be learned using deep networks and later used to improve object detection and localization. We will explore how salience extraction and use can be efficiently combined with incremental and active learning.
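As announced in the first focus area, here is a minimal sketch of one simple head-expansion ingredient for incremental learning; this is our illustrative assumption, not the method of [56]: the trained backbone is kept frozen and the output layer grows when new categories arrive, copying over the already-trained class weights.

```python
import torch
import torch.nn as nn

def add_classes(head: nn.Linear, n_new: int) -> nn.Linear:
    """Grow a classifier head by n_new outputs, preserving trained classes."""
    new_head = nn.Linear(head.in_features, head.out_features + n_new)
    with torch.no_grad():
        new_head.weight[: head.out_features] = head.weight   # keep old classes
        new_head.bias[: head.out_features] = head.bias
    return new_head

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False                      # only the head is (re)trained
head = add_classes(nn.Linear(128, 10), n_new=5)  # 10 -> 15 categories
logits = head(backbone(torch.randn(2, 3, 32, 32)))
print(logits.shape)                              # torch.Size([2, 15])
```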

4.3 PostDoc: representation learning for sequences

• Supervisors: Laurent Besacier, Eric Gaussier and Jakob Verbeek

• Localization: 30% between AMA, GETALP and THOTH teams

• Topic: Encoder/decoder architectures such as the ones envisaged in Section 4.1 capture local and global dependencies, as well as ordering information. Such architectures are well suited to addressing several generic problems pertaining to sequence data (such as prediction, classification and clustering), and the goal of this PostDoc will be to extend current encoder/decoder architectures to time series. In particular, we will (1) design a method to transform general time series into input vectors for encoder/decoder architectures, and (2) adapt the decoding module to output multi-modal, multi-variate time series.

• Focus areas:

– Advanced encoder models: machine learning techniques for prediction, classification and clustering usually operate on vectors; it is thus important to find fixed-size representations of the examples considered. For standard time series, such representations can be obtained using RNN-based encoder models that assume a single input sequence sampled at a constant rate, without any missing values. The problem is however more complex for the multi-scale, multi-modal and multi-variate time series we plan to study, inasmuch as (a) the sampling time of a given variable varies over time, and (b) several values are missing, for example due to the unreliability of the associated sensors. We plan to investigate encoder models for such complex time series, in particular by making the recurrent updates dependent on the observation intervals (a minimal sketch follows this list).

– Complex multi-variate decoders: complex time series also require specific outputs, in which one can have several ordered sequences (instead of just one ordered sequence, as in the case of text). We will study the extension of standard decoding architectures to deal with several ordered sequences, possibly sampled at different frequencies.
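As a hedged sketch of interval-dependent recurrent updates (our assumption of one plausible form, a learned exponential state decay parameterized by the time gap before each observation):

```python
import torch
import torch.nn as nn

class TimeAwareGRU(nn.Module):
    """GRU encoder whose hidden state decays with the gap between samples."""
    def __init__(self, n_vars=4, hidden=32):
        super().__init__()
        self.cell = nn.GRUCell(n_vars, hidden)
        self.decay = nn.Linear(1, hidden)        # learned per-unit decay rate

    def forward(self, values, deltas):           # values: (T, n_vars), deltas: (T, 1)
        h = torch.zeros(1, self.cell.hidden_size)
        for x, dt in zip(values, deltas):
            gamma = torch.exp(-torch.relu(self.decay(dt.view(1, 1))))
            h = self.cell(x.view(1, -1), h * gamma)   # decay state, then update
        return h                                 # fixed-size sequence representation

enc = TimeAwareGRU()
h = enc(torch.randn(6, 4), torch.rand(6, 1))     # 6 irregularly spaced samples
print(h.shape)                                   # torch.Size([1, 32])
```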

5 Positioning and aligned actions

5.1 Positioning in LabEx Persyval

DeCoRe fits excellently within Persyval's research action Advanced Data Mining (ADM), and directly addresses one of its three main challenges: "Mining multi-modal data". The understanding of speech, visual content, and text is among the core topics of modern data mining.


Although no existing Persyval-funded actions have a direct overlap with DeCoRe, we list related ones for completeness. The exploratory project Phon&Stat deals with speech data, but its goal is to use statistical data analysis models and tools for experimental phonology and phonetics. The project-team Khronos focuses on theoretical analysis and statistical modeling of time-series data with non-i.i.d. data models. The project-team Persyvact2 aims at applying data science methods to medical data, specifically high-dimensional and large-scale data. None of these projects has a strong overlap with DeCoRe.

5.2 Aligned actions outside LabEx Persyval

The main objective of DeCoRe is to strengthen competences and collaborations in the Grenoble research community in the area of deep learning. The collaboration involves cross-institute research and the training of PhD students and MSc interns, but also the organization of reading groups, workshops, and the teaching of MSc-level courses. While DeCoRe is an important vehicle towards this goal (financing two full PhDs and a number of other expenses, see Section 6), alignment with other actions helps to ensure a bigger impact by building a critical mass of involved non-permanent research staff.

Several related actions undertaken by the principal investigators of DeCoRe are, or will be, running in parallel. These include one PhD thesis at THOTH (J. Verbeek), funded by a Cifre grant with Facebook AI Research, Paris (started in January 2016), on weakly supervised semantic video segmentation with deep hybrid CNN and RNN models, and the ANR project LOCUST at AMA (with LIP6-UPMC, started in January 2016), which studies deep learning representations for time series with applications to prediction and classification. Furthermore, two ANR projects are in submission (selected for the final evaluation phase): one by THOTH, and another by MRIM and AGPIG jointly. These projects each fund an additional PhD student: one on model selection and one on incremental learning with deep convolutional models, respectively.

6 Requested resources

Table 1 gives an overview of the requested financial resources. The large majority (> 80%) of the requested funds will be spent on human resources: two full PhD scholarships, 6 months of PostDoc salary, and six MSc internships. The topics of the PhD scholarships and PostDoc are detailed in Section 4.

Learning deep convolutional and recurrent networks poses a formidable computational challenge. For large-scale experimentation on hard real-world problems and benchmarks, the use of GPU hardware is mandatory to run experiments in a tractable amount of time. An ambitious research program on this topic must therefore be aligned with a suitable hardware platform to have a chance to succeed.

INRIA-Grenoble has recently entered the NVIDIA GPU research center program (coordinator J. Verbeek), which enables DeCoRe to use the latest hardware and to benefit from NVIDIA technical support thanks to the hosting of an NVIDIA researcher. Currently, THOTH has a cluster of 30 GPU boards (mostly TitanX class). LIG has also recently acquired several machines with GPUs, shared between the GETALP and MRIM research groups. To ensure a sufficient hardware platform for the proposed research, we reserve a part of the budget (11%) to acquire four servers that can host two GPUs each. In parallel, we have submitted a request to join the Facebook AI Research hardware donation program; if accepted, this is a supplementary path to ensure sufficient computational resources. Our goal is to integrate the GPU compute resources into a mutually accessible cluster structure that is available at least to all partners in DeCoRe, e.g. Grenoble's CIMENT high-performance compute center (https://ciment.ujf-grenoble.fr).

The remaining budget will be spent on travel (8%): conference attendance, visiting researchers, and invited speakers. We will acquire external funding for workshop organization and other dissemination activities.

Expense                        Cost      Quantity   Budget
Full PhD scholarships          100 kE    2          200 kE
MSc internships                4 kE      6          24 kE
PostDoc months (*)             4 kE      6          24 kE
Travel (conferences, etc.)     1.5 kE    16         24 kE
GPUs (NVIDIA TitanX)           1 kE      8          8 kE
Servers (Dell R730)            6 kE      4          24 kE
Total                                               304 kE

Table 1: Breakdown of the overall requested budget. (*) The 24 kE for 6 PostDoc months are conditioned on the availability of additional funding beyond the 280 kE specified in the call.


References

[1] R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.
[2] H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer, 1994.
[3] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.
[4] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
[6] H. Drucker and Y. LeCun. Improving generalization performance in character recognition. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pages 198–207, 1991.
[7] J. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[8] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
[9] S. Ganapathy, K. Han, S. Thomas, M. Omar, M. Van Segbroeck, and S. Narayanan. Robust language identification using convolutional neural network features. In INTERSPEECH, 2014.
[10] C. Garcia and M. Delakis. Convolutional face finder: a neural architecture for fast and robust face detection. PAMI, 26(11):1408–1423, 2004.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. PAMI, 6(6):712–741, 1984.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
[14] R. Girshick. Fast R-CNN. In ICCV, 2015.
[15] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[16] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[18] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[20] M. Jordan, Z. Ghahramani, T. Jaakola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[22] S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):1–29, 2014.
[23] R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015. To appear.
[24] R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. In NIPS, 2015.
[25] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[26] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[27] P. Kulkarni, J. Zepeda, F. Jurie, P. Perez, and L. Chevallier. Learning the structure of deep architectures using L1 regularization. In BMVC, 2015.
[28] Q. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv:1405.4053 [cs], 2014.
[29] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[30] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint.
[31] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[33] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[34] M.-T. Luong, Q. Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence learning. In ICLR, 2016.
[35] T. Luong, H. Pham, and C. Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, 2015.
[36] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs], 2013.
[37] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[38] G. Montavon, G. Orr, and K.-R. Müller. Neural Networks: Tricks of the Trade. LNCS 7700, Springer, 2012.
[39] G. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
[40] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
[41] D. Palaz, R. Collobert, et al. Analysis of CNN-based speech recognition system using raw speech as input. In INTERSPEECH, 2015.
[42] M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid. Transformation pursuit for image classification. In CVPR, 2014.
[43] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[44] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[45] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[47] A. Schwing and R. Urtasun. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015.
[48] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[50] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV, 2003.
[51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
[52] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[53] M. Swain and D. Ballard. Color indexing. IJCV, 1991.
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: a neural image caption generator. In CVPR, 2015.
[55] A. Webb. An approach to non-linear principal components analysis using radially symmetric kernel functions. Statistics and Computing, 6:159–168, 1996.
[56] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In ACM Multimedia, pages 177–186, 2014.
[57] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[58] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.

14

Page 15: DeCoRe - Inriathoth.inrialpes.fr/decore/decore-proposal.pdf · Caption generation will play a central role, integrating image understanding and language generation models. Positioning

A CV of principal investigators


CURRICULUM VITAE
Laurent BESACIER

Laurent Besacier
Married, 3 children
Professor (1st class) at Univ. Grenoble Alpes (UGA), HDR
Laboratory of Informatics of Grenoble (LIG), leader of the GETALP group
Director of the MSTII (Math-Info) Doctoral School
[email protected]

1. Short bio

Prof. Laurent Besacier defended his PhD thesis (Univ. Avignon, France) in Computer Science in 1998 on "A parallel model for automatic speaker recognition". He then spent one and a half years at the Institute of Microengineering (EPFL, Neuchatel site, Switzerland) as an associate researcher working on multimodal person authentication (M2VTS European project). Since 1999 he has been an associate professor (full professor since 2009) in Computer Science at Univ. Grenoble Alpes (formerly at U. Joseph Fourier). From September 2005 to October 2006, he was an invited scientist at IBM Watson Research Center (NY, USA) working on speech-to-speech translation. His research interests mainly concern multilingual speech recognition and machine translation. Laurent Besacier has published 200 papers in conferences and journals on speech and language processing. He has supervised or co-supervised 20 PhDs and 30 Masters. He has been involved in several national and international projects as well as several evaluation campaigns. Since October 2012, Laurent Besacier has been a junior member of the "Institut Universitaire de France" with a project entitled "From under-resourced languages processing to machine translation: an ecological approach".

2. Diploma

• HDR (Ability to supervise research), specializing in Computer Science, University Joseph Fourier (January 2007). Thesis title: Rich transcription in a multilingual and multimodal world.

• PhD in Computer Science (1998), Université d'Avignon. Thesis title: A parallel model for speaker recognition, under the direction of Jean-François Bonastre and Henri Meloni.

• Master degree at INPG (1995), specialty Signal-Image-Speech.

• Engineer from the school of Chemistry, Physics, and Electronics of Lyon (CPE, 1995), option electronics and information processing.

3. Scientific Activity

3.1 Prizes / Honors / Highlights

• Winner (best system) of the NIST 2002 evaluation of speaker segmentation systems (meeting task)
• Winner (best system) of the DARPA/TRANSTAC 2006 evaluation of Arabic-English spoken translation (done during my stay at IBM Watson Research Center)
• Best Paper Award in 2007 for: D. Istrate, E. Castelli, M. Vacher, L. Besacier, J.-F. Serignat (2007). Information extraction from sound for medical telemonitoring. IMIA Yearbook 2007, 21:72-72; IEEE Trans. Inf. Technol. Biomed., 10(2):264-274, January 2006.
• Star Challenge 2008 finalist (content-based search in video documents) – top 5 among 50 participants
• Chair of the conference JEP-TALN-RECITAL 2012 (300-350 participants)
• Keynote speaker at the IALP 2012 conference (International Conference on Asian Language Processing)
• Junior member of the "Institut Universitaire de France" (awarded in 2012)
• My paper "Automatic speech recognition for under-resourced languages: A survey", published in the Speech Communication Journal (Elsevier), was among the top 3 most downloaded papers of 2014, as assessed by http://top25.sciencedirect.com/subject/computer-science/7/journal/speech-communication/01676393/archive/59/

3.2 Scientific Committees and proofreading of articles

• Editorial committee of the TAL journal (Traitement Automatique des Langues) since 2011

• Reviewing for international journals: IEEE Transactions on Acoustics, Speech and Language Processing (IEEE ASL); Computer Speech and Language Journal; Speech Communication Journal; IEEE Transactions on Speech and Audio Processing; IEEE Signal Processing Letters; IEEE Transactions on Signal Processing; IEEE Transactions on Multimedia; IEEE Transactions on Information Forensics and Security; Pattern Recognition Letters; Machine Translation Journal; Language Resources and Evaluation Journal (LRE)

• Reviewing for national journals: Traitement du Signal; Acta Acustica; Revue I3; Traitement Automatique des Langues (TAL)

• International conference committees (non-exhaustive list): Interspeech (every year since 2005); IEEE ICASSP (every year since 2007); IEEE ASRU (Technical Review Committee, since 2009); EUSIPCO (2006-2011); Speaker Odyssey, Workshop on Speaker Identification and Verification (since 2004); International Workshop on Spoken Language Translation (since 2008); EAMT; NAACL-HLT 2012; Workshop on South and Southeast Asian Natural Languages Processing (WSSANLP); COLING 2008 and 2012; ACL 2013; SpeD (since 2004)

3.3 Expert Assessment

• Expert for project proposals to ACI (2005), ANR (2006-2016), Microsoft Research Fellowship in 2009 (Microsoft Research PhD Scholarship), ANR-JST (Japan-France) in 2010.
• Expert for OSEO-Anvar (2008) and for the European Community (ERC Starting Grant 3rd Call, 2010).
• Selection committee for research grants of the Rhone-Alpes region, 2011-2014.
• Participation in the working group defining the scope of the future research call in the Rhône-Alpes region and board member of the action, November 2011.
• Regular member of ANR (National Research Agency) committees.

3.4 Projects

Participation in or coordination of 3 European projects, 10 French ANR projects, DGA projects, and several bilateral projects with foreign countries (Singapore, Colombia, Brazil, Germany). Industrial collaborations via CIFRE PhDs or projects (ST Microelectronics, Lingua&Machina, Voxygen, Compilatio).

3.5 International collaborations

• Institute for Infocomm Research (Singapore): Franco-Singaporean project (Merlion) on multilingual speech recognition with Prof. Haizhou Li. Respective visits and exchanges of students and/or postdocs, 2009-2011.

• IBM Watson Research Center (NY, United States): collaboration with the spoken language translation group of Y. Gao (visiting scholar for 13 months in 2005/06; co-signatures of articles at IEEE ICASSP 2007, Interspeech 2007, IEEE/ACL SLT 2006, HLT 2006).

• Interactive Systems Lab (ISL) at CMU (United States) and Karlsruhe Institute of Technology (KIT, Germany): with T. Schultz on multilingual speech recognition (including co-authorship of a paper at IEEE ICASSP 2006); with S. Stucker on the unsupervised discovery of words from phonetic streams (paper at Interspeech 2009).

• European Commission - Joint Research Centre (JRC): with B. Pouliquen on automatic transliteration of named entities in a highly multilingual context (2008).

• Laboratory MICA, Hanoi (Vietnam): co-supervision of PhD students and joint work on Vietnamese language processing with the international laboratory MICA (INPG/CNRS/HPI).

• Laboratory ITC (Cambodia): co-supervision and joint work on Khmer language processing.

• Polytechnic Institute of Bucharest (Human-Computer Dialogue Group): scientific exchanges with Prof. Corneliu Burileanu, co-supervision of master students and PhD students.

• Universiti Sains Malaysia (Malaysia): hosting and supervision of two doctoral students on speech recognition (since 2005).

• University of Addis Ababa (Ethiopia): supervision of a PhD on machine translation of Amharic, hosting post-doctoral researchers from Ethiopia (since 2010).

• University of Cauca (Colombia): co-supervision of a PhD student and project around the revitalization of an endangered language of southwestern Colombia (since 2011).

• UFRGS and UFSCar (Brazil): CNRS-FAP (France-Brazil) project on the analysis and integration of MultiWord Expressions (MWEs) in speech and translation (2014-2016).

• ITU and Ozyegin Univ. (Turkey): joint work and joint papers in the framework of the CAMOMILE project (ERA-NET) on collaborative annotation of multi-modal, multi-lingual and multimedia documents.

4. Organization of Scientific Events

• Chair of the conference JEP-TALN-RECITAL 2012 (300-350 people).
• Responsible for the monthly keynotes of my lab (LIG), 2010-2014 (some guests: Moshe Vardi, Sacha Krakowiak, P. Flajolet, G. Dowek, A. Colmerauer, A. Pentland, S. Abiteboul, W. Zadrozny, J. Sifakis, H. Hermanns, J. Hellerstein, etc.; see http://www.liglab.fr/spip.php?article884).
• Member of the organizing committee of Interspeech 2013 in Lyon (1500 persons; Satellite Workshops Coordinator).


• Co-organizer of special sessions at Interspeech 2011 (Speech technology for under-resourced languages) and Interspeech 2016 (Sub-Saharan African languages: from speech fundamentals to applications).
• Invited editor for a special issue of the Speech Communication journal ("Speech technology for under-resourced languages"), 2014.
• Chairman and organizer of the first two and of the fifth International Workshop SLTU (Spoken Language Technologies for Under-resourced Languages): Hanoi, Vietnam, May 2008; Penang, Malaysia, May 2010; and Yogyakarta, Indonesia, 2016.
• Organizer of the AFCP seminar on spoken language processing for under-resourced languages, June 2007.
• Organizer of a special session on biometrics at the ISPA 2005 conference.

5. Publications

A complete list of my most recent publications can be found at https://cv.archives-ouvertes.fr/laurent-besacier and https://www.researchgate.net/profile/Laurent_Besacier

5 most significant (and recent) publications

• Laurent Besacier, Etienne Barnard, Alexey Karpov, Tanja Schultz. Automatic speech recognition for under-resourced languages: A survey. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages: 85-100, January 2014. (Impact factor 1.28, estimated in 2012.)

• Martha Tachbelie, Solomon Teferra Abate, Laurent Besacier. Using different acoustic, lexical and language modeling units for ASR of an under-resourced language – Amharic. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages: 181-194, January 2014. (Impact factor 1.28, estimated in 2012.)

• Horia Cucu, Andi Buzo, Laurent Besacier, Corneliu Burileanu. SMT-Based ASR Domain Adaptation Methods for Under-Resourced Languages: Application to Romanian. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages: 195-212, January 2014. (Impact factor 1.28, estimated in 2012.)

• Johann Poignant, Laurent Besacier, Georges Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE Transactions on Audio, Speech and Language Processing, 23(1), pp. 57-68, 2015.

• Ngoc-Quang Luong, Laurent Besacier, Benjamin Lecouteux. Towards Accurate Predictors of Word Quality for Machine Translation: Lessons Learned on French-English and English-Spanish Systems. Data and Knowledge Engineering, Elsevier, 2015, pp. 11.


CURRICULUM VITAE
Denis PELLERIN

Professor (1st class) at University Grenoble Alpes (UGA), HDR
Grenoble Images Speech Signal Automatic laboratory (GIPSA-lab), UMR 5216
[email protected]
Tel. 04 76 57 43 69

1. Short biography

Denis Pellerin is professor at the University Grenoble Alpes (UGA). He received the engineering degree in electrical engineering in 1984 and the Ph.D. degree in 1988 from the Institut National des Sciences Appliquées (INSA-Lyon), France. Since 1989 he has been an assistant professor (full professor since 2006) in signal and image processing at Univ. Grenoble Alpes (he was formerly at Univ. Joseph Fourier Grenoble). He is with the AGPIG team (Architecture, Geometry, Perception, Images, Gestures) at GIPSA-lab (Grenoble Images Speech Signal Automatic laboratory). His research interests include i) video analysis and indexing: image and video classification, human action recognition, video summarization, active vision for robots; and ii) visual perception and modelling: visual saliency, audio saliency, attention models, visual substitution.

2. Education

• HDR (Ability to supervise research) in Signal and Image Processing, Univ. Joseph Fourier Grenoble, France, 2001

• Ph.D. in Electronic Systems, Institut National des Sciences Appliquées, Lyon, France, 1988

• Engineer in Electrical Engineering (Honours), Institut National des Sciences Appliquées, Lyon, France, 1984

3. Scientific Activity

Reviewer for international journals and conferences

• IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Image Processing, Computer Vision and Image Understanding, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

• ACM International Conference on Multimedia Retrieval (ICMR), workshop Content-Based Multimedia Indexing (CBMI), European Signal Processing Conference (EUSIPCO).

Main responsibilities

• 2011-2015: Member of the doctoral school EEATS (Electronics, Electrotechnics, Automatic, Signal Processing) of Grenoble.

• 2011-2015: Member of the research committee for the UFR IM2AG of UJF.

• Since 2007: Responsible for the research group "Perception and Analysis of Videos and Images" (seven researchers) of the AGPIG team at GIPSA-lab.

• Since 2006: Responsible for the organization of the 5th school year (M.Sc) in the Industrial Computing and Instrumentation Department (30 students and 2 options) of the engineering school Polytech'Grenoble.

• 2003-2008: Assistant director (team of four persons) of the Master degree in "Signal Image Speech Telecommunication" (40 students).

Main projects and collaborations

• 2013-2015: Co-responsible of the exploratory project "Attentive" supported by the LabEx Persyval-lab. Collaboration with O. Aycard and C. Garbay of the Laboratory of Informatics of Grenoble (LIG) and M. Rombaut (GIPSA-lab). Development of a mobile robotics platform intended to participate in the surveillance of a set of people in situations of fragility.

• 2010-2013: Regional project "Plateforme de calcul parallèle pour des modèles de vision artificielle bio-inspirée" (parallel computing platform for bio-inspired artificial vision models). Collaboration with D. Houzet (GIPSA-lab) and A. Trémeau of the Laboratory Hubert Curien (LaHC, Univ. Jean Monnet, Saint Etienne).

• 2007-2012: National project IRIM (Content-based Multimedia Information Retrieval) with the GDR ISIS (research association in Information, Signal, Image, viSion); participation in the annual international TRECVID video retrieval evaluation challenges.

• 2007-2010: Regional project "LIMA" (Leisure and IMAges), participation in the task "video analysis and indexation".

• 2006-2009: Responsible for the project with the INA (National Audiovisual Institute) on image classification (PhD of H. Goeau, co-supervised with O. Buisson).

• 2003-2007: European Network of Excellence SIMILAR for the study of multimodal interfaces efficiently answering to vision, gesture and voice. Collaboration with the Computer Science Department, University of Crete, Greece (C. Panagiotakis and G. Tziritas) on the task of human action recognition.

• 2001-2003: Regional project "ACTIV II" (Colour, Image processing and Vision), participation in the task "video indexation".

• 1998-2001: European project "Art-live" (ARchitecture and authoring Tool prototype for Living Images and new Video Experiments), participation in the task "moving people detection".

Recent supervision of PhD students

• Since 2013: Q. Labourey, "Développement d'un robot attentionné pour la surveillance de personnes en situation de fragilité" (development of an attentive robot for monitoring people in situations of fragility; co-supervised with O. Aycard).

• Since 2013: S. Chan Wai Tim, "Classification d'images et de vidéos par apprentissage de dictionnaire" (image and video classification by dictionary learning; co-supervised with M. Rombaut).

• 2013: G. Song, Effect of sound in videos on gaze: Contribution to audio-visual saliency modeling.

• 2013: A. Rahman, Face perception in videos: Contributions to a visual saliency model and its implementation on GPUs (co-supervised with D. Houzet).

• 2010: S. Marat, "Modèles de saillance visuelle par fusion d'informations sur la luminance, le mouvement et les visages pour la prédiction de mouvements oculaires lors de l'exploration de vidéos" (visual saliency models fusing luminance, motion, and face information to predict eye movements during video exploration; co-supervised with N. Guyader).

4. Publications

A complete list of my publications can be found at:

http://www.gipsa-lab.fr/~denis.pellerin/publications_en.html

Six most significant (and recent) publications:

[1] Budnik M., Gutierrez-Gomez E.-L., Safadi B., Pellerin D., Quénot G., Learned features versus engineered features for multimedia indexing, Multimedia Tools and Applications, Springer Verlag, to appear.

[2] Labourey Q., Aycard O., Pellerin D., Rombaut M., Garbay C., An evidential filter for indoor navigation of a mobile robot in dynamic environment, International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2016), Eindhoven, The Netherlands, June 2016.

[3] Chan Wai Tim S., Rombaut M., Pellerin D., Rejection-based classification for action recognition using a spatio-temporal dictionary, European Signal Processing Conference (EUSIPCO'2015), Nice, France, August 2015.

[4] Stoll C., Palluel-Germain R., Fristot V., Pellerin D., Alleysson D., Graff C., Navigating from a depth image converted into sound, Applied Bionics and Biomechanics, volume 2015, article ID 543492, 2015.

[5] Rahman A., Pellerin D., Houzet D., Influence of number, location and size of faces on gaze in video, Journal of Eye Movement Research, 7(2):5, 1-11, 2014.

[6] Marat S., Rahman A., Pellerin D., Guyader N., Houzet D., Improving visual saliency by adding "face feature map" and "center bias", Cognitive Computation, 5(1):63-75, 2013.


Curriculum Vitae Georges QUÉNOT

Last name: QUÉNOT. First name: Georges. Born: May 14, 1960. Married, 2 children.
Employment: Senior researcher (CNRS) at Laboratoire d'Informatique de Grenoble.
Professional address: Laboratoire d'Informatique de Grenoble – CNRS UMR 5217, Bâtiment B, 41 rue des mathématiques, B.P. 53, 38041 Grenoble Cedex 9.
Direct tel.: +33 (0)4 76 63 58 55. Fax: +33 (0)4 76 63 56 86.
Mail: [email protected]
Webpage: http://lig-membres.imag.fr/quenot/

1) BIOGRAPHY

Education:

1983: Engineer from École Polytechnique, Palaiseau.

1988: Ph.D. in Computer Science, University of Orsay – Paris XI.

1998: HDR in Computer Science, University of Orsay – Paris XI.

Research interests:

Multimedia information indexing and retrieval;

Concept indexing in image and video documents;

Machine learning.

Current functions:

Leader of the Multimedia Information Indexing and Retrieval group (MRIM) of the Laboratoire d'Informatique de Grenoble (LIG);

Responsible for the group's activities on video indexing and retrieval.

Student-researcher advising:

10 former Ph.D. students and currently 1 Ph.D. student.

Teaching:

About 60 hours per year at M1/M2 level (M2R MOSIG, RICM, M2PGI) on multimedia information indexing and retrieval.

Participation in research projects:

International projects:
o ICT ASIA project MoSAIC (2006-2008): Mobile Search and Annotation using Images in Context.
o ICT ASIA project ShootMyMind (2015-2016): Automatic Generation of Videos from Scenarios.
o CHIST-ERA Camomile (2012-2016): Collaborative Annotation of multi-MOdal, multI-Lingual and multi-mEdia documents.

European project:
o STREP PENG (2004-2006): PErsonalised News content programminG.

National French projects:
o TechnoVision ARGOS (2004-2006): evaluation campaign for video content monitoring tools.
o ANR AVEIR (2006-2009): automatic annotation and extraction of visual concepts for image search.
o OSEO-AII Quaero (2007-2013): search and recognition of digital content.
o ANR Contint VideoSense (2010-2013): automatic video tagging by high-level concepts.
o ANR Repere QCompere (2012-2014): Quaero Consortium for Multimodal Person Recognition.
o FUI Guimuteic (2015-2018): Guide Multimédia de Tête, Informatif et Connecté (informative and connected head-worn multimedia guide).

Local project:
o APIMS (2009-2010): Apprentissage Parallèle pour l'Indexation Multimédia Sémantique (parallel learning for semantic multimedia indexing).

Professional activities:

• PC member or reviewer for many international conferences and journals, including: Proceedings of the IEEE, ACM Transactions on Multimedia Computing Communications and Applications, IEEE Transactions on Multimedia, Information Processing and Management, IEEE Transactions on Pattern Analysis and Machine Intelligence, Multimedia Tools and Applications, and Signal Processing: Image Communication.

• Organization of the first École d'Automne en Recherche d'Information et Application (EARIA'06).

• Organization of Content-Based Multimedia Indexing (CBMI) 2014.

• Expert for project proposals and evaluation: Technovision / ANR / Digiteo.

• Organization of the TRECVid semantic indexing (SIN) benchmark since 2010.

• Responsible for the IRIM (Indexation et Recherche d'Information Multimédia) action of the GDR ISIS since 2008.

• Member of associate professor recruitment committees (Bordeaux, Cergy-Pontoise).

Highlights:

• Star Challenge 2008 finalist (content-based search in video documents) – top 5 among 50 participants.

• Currently first at VOC 2012 Object Classification (comp1, post-campaign).

2) MOST SIGNIFICANT PUBLICATIONS

• George Awad, Cees G. M. Snoek, Alan F. Smeaton, Georges Quénot. TRECVid Semantic Indexing of Video: A 6-Year Retrospective. ITE Transactions on Media Technology and Applications, to appear.

• Mateusz Budnik, Efrain-Leonardo Gutierrez-Gomez, Bahjat Safadi, Denis Pellerin, Georges Quénot. Learned features versus engineered features for multimedia indexing. Multimedia Tools and Applications, Springer Verlag, to appear.

• Johann Poignant, Guillaume Fortier, Laurent Besacier, Georges Quénot. Naming multi-modal clusters to identify persons in TV broadcast. Multimedia Tools and Applications, Springer Verlag, pp. 1-25, 2015.

• Johann Poignant, Laurent Besacier, Georges Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE Transactions on Audio, Speech and Language Processing, 23(1):57-68, 2015.

• Bahjat Safadi, Nadia Derbas, Georges Quénot. Descriptor Optimization for Multimedia Indexing and Retrieval. Multimedia Tools and Applications, 74(4):1267-1290, 2015.

• Abdelkader Hamadi, Philippe Mulhem, Georges Quénot. Extended conceptual feedback for semantic multimedia indexing. Multimedia Tools and Applications, 23(1):57-68, 2015.

• Bogdan Ionescu, Jenny Benois-Pineau, Tomas Piatrik, Georges Quénot. Fusion in Computer Vision: Understanding Complex Visual Content. Springer International Publishing, 272 p., 2014.

• S. Tiberius Strat, A. Benoit, Hervé Bredin, Georges Quénot, P. Lambert. Hierarchical Late Fusion for Concept Detection in Videos. In: Fusion in Computer Vision: Understanding Complex Visual Content, Springer International Publishing, pp. 53-78, 2014.

• Bahjat Safadi, Georges Quénot. Active learning with multiple classifiers for multimedia indexing. Multimedia Tools and Applications, 66(2):403-417, 2012.

• Émilie Dumont, Georges Quénot. Automatic Story Segmentation for TV News Video using Multiple Modalities. International Journal of Digital Multimedia Broadcasting, 2012:1-11, 2012. Article ID 732514.

• Georges Quénot, Tien-Ping Tan, Viet-Bac Le, Stéphane Ayache, Laurent Besacier, Philippe Mulhem. Content-based search in multilingual audiovisual documents using the International Phonetic Alphabet. Multimedia Tools and Applications (Impact-F 1.01), 48(1):123-140, 2010.

• Stéphane Ayache, Georges Quénot. Image and Video Indexing using Networks of Operators. EURASIP Journal on Image and Video Processing, vol. 2007, Article ID 56928, 13 pages, 2007.

• Stéphane Ayache, Georges Quénot. Evaluation of active learning strategies for video indexing. Signal Processing: Image Communication, 22(7-8):692-704, August-September 2007.

• Philippe Joly, Jenny Benois-Pineau, Ewa Kijak, Georges Quénot. The ARGOS campaign: Evaluation of Video Analysis Tools. Signal Processing: Image Communication, 22(7-8):705-717, August-September 2007.

• Stéphane Ayache, Georges Quénot. Video Corpus Annotation using Active Learning. 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, March 30 - April 3, 2008.

• Stéphane Ayache, Georges Quénot, Jérôme Gensel, Shin'ichi Satoh. Using Topic Concepts for Semantic Video Shots Classification. International Conference on Image and Video Retrieval (CIVR'06), Tempe, AZ, USA, July 13-15, 2006.


INRIA Rhone-Alpes, LEAR team
655 Avenue de l'Europe, 38330 Montbonnot, France
Tel. +33 4 76 61 52 33, Fax +33 4 76 61 54 54
Email: [email protected]
Web: http://lear.inrialpes.fr/~verbeek

Citizenship: Dutch. Date of birth: December 21, 1975.

Curriculum Vitae – Jakob Verbeek

Academic Background

2004 • Doctorate in Computer Science (best thesis award), Informatics Institute, University of Amsterdam. Advisors: Prof. Dr. Ir. F. Groen, Dr. Ir. B. Krose, and Dr. N. Vlassis. Thesis: Mixture models for clustering and dimension reduction.

2000 • Master of Science in Logic (with honours), Institute for Language, Logic, and Computation, University of Amsterdam. Advisor: Prof. Dr. M. van Lambalgen. Thesis: An information theoretic approach to finding word groups for text classification.

1998 • Master of Science in Artificial Intelligence (with honours), Dutch National Research Institute for Mathematics and Computer Science & University of Amsterdam. Advisors: Prof. Dr. P. Vitanyi, Dr. P. Grunwald, and Dr. R. de Wolf. Thesis: Overfitting using the minimum description length principle.

Awards

2011 • Outstanding Reviewer Award, IEEE Conference on Computer Vision and Pattern Recognition.
2009 • Outstanding Reviewer Award, IEEE Conference on Computer Vision and Pattern Recognition.
2006 • Biannual E.S. Gelsema Award of the Dutch Society for Pattern Recognition and Image Processing for best PhD thesis and associated international journal publications.
2000 • Regional winner of the yearly best MSc thesis award of the Dutch Society for Computer Science.

Employment

since 2007 • Researcher (CR1), LEAR project, INRIA Rhone-Alpes, Grenoble.
2005-2007 • Postdoc, LEAR project, INRIA Rhone-Alpes, Grenoble.
2004-2005 • Postdoc, Intelligent Autonomous Systems group, Informatics Institute, University of Amsterdam.

Professional Activities

Participation in Research Projects

2013-2016 • Physionomie: Physiognomic Recognition for Forensic Investigation, funded by the French national research agency (ANR).
2011-2015 • AXES: Access to Audiovisual Archives, European integrated project, 7th Framework Programme.
2010-2013 • Quaero Consortium for Multimodal Person Recognition, funded by the French national research agency (ANR).
2009-2012 • Modeling multi-media documents for cross-media access, funded by Xerox Research Centre Europe (XRCE) and the French national research agency (ANR).
2008-2010 • Interactive Image Search, funded by the French national research agency (ANR).
2006-2009 • Cognitive-Level Annotation using Latent Statistical Structure (CLASS), funded by the European Union Sixth Framework Programme.
2000-2005 • Tools for Non-linear Data Analysis, funded by the Dutch Technology Foundation (STW).

Teaching

2015 • Lecturer in MSc course Kernel Methods for Statistical Learning, Ecole Nationale Superieure d'Informatique et de Mathematiques Appliquees (ENSIMAG), Grenoble, France.
2008-2015 • Lecturer in MSc course Machine Learning and Category Representation, Ecole Nationale Superieure d'Informatique et de Mathematiques Appliquees (ENSIMAG), Grenoble, France.
2003-2005 • Lecturer in MSc course Machine learning: pattern recognition, University of Amsterdam, The Netherlands.
2003-2005 • Lecturer in graduate course Advanced issues in neurocomputing, Advanced School for Imaging and Computing, The Netherlands.


Professional Activities (continued)

1997-2000 • Teaching assistant in MSc Artificial Intelligence courses, University of Amsterdam, The Netherlands.

Supervision of MSc and PhD Students

2015 • Jerome Lesaint, MSc, Image and video captioning.
since 2013 • Shreyas Saxena, PhD, Recognizing people in the wild.
2013 • Shreyas Saxena, MSc, Metric learning for face verification.
2011-2015 • Dan Oneata, PhD, Large-scale machine learning for video analysis.
2010-2014 • Gokberk Cinbis, PhD, Fisher kernel based models for image classification and object localization, awarded AFRIF best thesis award 2014.
2009-2012 • Thomas Mensink, PhD, Modeling multi-media documents for cross-media access, awarded AFRIF best thesis award 2012.
2008-2011 • Josip Krapac, PhD, Image search using combined text and image content.
2006-2010 • Matthieu Guillaumin, PhD, Learning models for visual recognition from weak supervision.
2009 • Gaspard Jankowiak, intern, Decision tree quantization of image patches for image categorization.
2007-2008 • Thomas Mensink, intern, Finding people in captioned news images.
2005 • Markus Heukelom, MSc, Face detection and pose estimation using part-based models.
2003 • Jan Nunnink, MSc, Large scale mixture modelling using a greedy expectation-maximisation algorithm.
2003 • Noah Laith, MSc, A fast greedy k-means algorithm.

Associate Editor
since 2014 • International Journal of Computer Vision.
since 2011 • Image and Vision Computing Journal.

Area Chair for International Conferences
• IEEE Conference on Computer Vision and Pattern Recognition: 2015.
• European Conference on Computer Vision: 2012, 2014.
• British Machine Vision Conference: 2012, 2013, 2014.

Programme Committee Member for Conferences, including

• IEEE International Conference on Computer Vision: 2009, 2011, 2013, 2015.
• European Conference on Computer Vision: 2008, 2010.
• IEEE Conference on Computer Vision and Pattern Recognition: 2006–2014, 2016.
• Neural Information Processing Systems: 2006–2010, 2012–2013.
• Reconnaissance des Formes et l'Intelligence Artificielle: 2016.

Reviewer for International Journals, including

since 2008 • International Journal of Computer Vision.
since 2005 • IEEE Transactions on Neural Networks.
since 2004 • IEEE Transactions on Pattern Analysis and Machine Intelligence.

Reviewer of research grant proposals, including

2015 • Postdoctoral fellowship grant, Research Foundation Flanders (FWO).
2014 • Collaborative Research grant, Indo-French Centre for the Promotion of Advanced Research (IFCPAR).
2010 • VENI grant, Netherlands Organisation for Scientific Research (NWO).

Miscellaneous

Research Visits
2011 • Visiting researcher, Statistical Machine Learning group, NICTA, Canberra, Australia, May 2011.
2003 • Machine Learning group, University of Toronto, Prof. Sam Roweis, Canada, May–September 2003.

Summer Schools & Workshops

2015 • DGA workshop on Big Data in Multimedia Information Processing, invited speaker, Paris, France, October 22.
• Physionomie workshop at the European Academy of Forensic Science conference, co-organizer and speaker, Prague, Czech Republic, September 9.


Miscellaneous (continued)

• StatLearn workshop, invited speaker, April 13, 2015, Grenoble, France.
2014 • 3rd Croatian Computer Vision Workshop, Center of Excellence for Computer Vision, invited speaker, September 16, 2014, Zagreb, Croatia.
2011 • 2nd IST Workshop on Computer Vision and Machine Learning, Institute of Science and Technology, invited presentation, October 7, Vienna, Austria.
• Workshop on 3D and 2D Face Analysis and Recognition, Ecole Centrale de Lyon / Lyon University, invited presentation, January 28.
2010 • NIPS Workshop on Machine Learning for Next Generation Computer Vision Challenges, co-organizer, December 10, Whistler BC, Canada.
• ECCV Workshop on Face Detection: Where are we, and what next?, invited presentation, September 10, Hersonissos, Greece.
• INRIA Visual Recognition and Machine Learning Summer School, 1h lecture, July 26–30, Grenoble, France.
2009 • Workshop "Statistiques pour le traitement de l'image", Universite Paris 1 Pantheon-Sorbonne, invited speaker, January 23.
2008 • International Workshop on Object Recognition, poster presentation, May 16–18, 2008, Moltrasio, Italy.

Seminars
2015 • Societe Francaise de Statistique, Institut Henri Poincare, Paris, France, Object detection with incomplete supervision, October 23.
• Center for Machine Perception, Czech Technical University, Prague, Czech Republic, Object detection with incomplete supervision, September 8.
• Dept. of Information Engineering and Computer Science, University of Trento, Italy, Object detection with incomplete supervision, March 16.
• Computer Vision Center, Barcelona, Spain, Object detection with incomplete supervision, February 13.
2013 • Intelligent Systems Laboratory Amsterdam, University of Amsterdam, The Netherlands, Segmentation Driven Object Detection with Fisher Vectors, October 15.
• Media Integration and Communication Center at the University of Florence, Italy, Segmentation Driven Object Detection with Fisher Vectors, September 24.
• DGA workshop on Multimedia Information Processing (TIM 2013), Paris, France, Face verification "in the wild", July 2.
2012 • Computer Vision and Machine Learning group, Institute of Science and Technology, Vienna, Austria, Image categorization using Fisher kernels of non-iid image models, June 11.
• Computer Vision Center, Barcelona, Spain, Image categorization using Fisher kernels of non-iid image models, June 4.
• TEXMEX Team, INRIA, Rennes, France, Image categorization using Fisher kernels of non-iid image models, April 20.
2011 • Statistical Machine Learning group, NICTA, Canberra, Australia, Modelling spatial layout for image classification, May 26.
• Canon Information Systems Research Australia, Sydney, Australia, Learning structured prediction models for interactive image labeling, May 20.
2010 • Laboratoire TIMC-IMAG, Learning: Models and Algorithms team, Grenoble, Metric learning approaches for image annotation and face verification, October 7.
• University of Oxford, Visual Geometry Group, Oxford, TagProp: a discriminatively trained nearest neighbor model for image auto-annotation, February 1.
2009 • Laboratoire Jean Kuntzmann, Grenoble, Machine learning for semantic image interpretation, June 11.
• University of Amsterdam, Intelligent Systems Laboratory, Discriminative learning of nearest-neighbor models for image auto-annotation, April 28.
• Universite de Caen, Laboratoire GREYC, Improving People Search Using Query Expansions, February 5.
2008 • Computer Vision Center, Autonomous University of Barcelona, Improving People Search Using Query Expansions, September 26.
• Computer Vision Lab, Max Planck Institute for Biological Cybernetics, Scene Segmentation with CRFs Learned from Partially Labeled Images, July 31.
• Textual and Visual Pattern Analysis team, Xerox Research Centre Europe, Scene Segmentation with CRFs Learned from Partially Labeled Images, April 24.
2006 • Parole group, LORIA Nancy, Unsupervised learning of low-dimensional structure in high-dimensional data.
• Content Analysis group, Xerox Research Centre Europe, Manifold learning: unsupervised, correspondences, and semi-supervised.
2005 • Learning and Recognition in Vision group, INRIA Rhone-Alpes, Manifold learning & image segmentation.
• Computer Engineering Group, Bielefeld University, Manifold learning with local linear models and Gaussian fields.


Miscellaneous (continued)

2004 • Algorithms and Complexity group, Dutch Center for Mathematics and Computer Science, Semi-supervised dimension reduction through smoothing on graphs.

2003 • Machine Learning team, Radboud University Nijmegen, Spectral methods for dimension reduction and non-linear CCA.

2002 • Information and Language Processing Systems group, University of Amsterdam, A generative model for the Self-Organizing Map.

Selected Publications

In peer reviewed international journals

2015 • G. Cinbis, J. Verbeek, C. Schmid. Approximate Fisher kernels of non-iid image models for image categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear, 2015.
• H. Wang, D. Oneata, J. Verbeek, C. Schmid. A robust and efficient video representation for action recognition. International Journal of Computer Vision, to appear, 2015.
• M. Douze, J. Revaud, J. Verbeek, H. Jegou, C. Schmid. Circulant temporal encoding for video retrieval and temporal alignment. International Journal of Computer Vision, to appear, 2015.
2013 • J. Sanchez, F. Perronnin, T. Mensink, J. Verbeek. Image classification with the Fisher vector: theory and practice. International Journal of Computer Vision 105(3), pp. 222–245, 2013.
• T. Mensink, J. Verbeek, F. Perronnin, G. Csurka. Distance-based image classification: generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), pp. 2624–2637, 2013.
• T. Mensink, J. Verbeek, G. Csurka. Tree-structured CRF models for interactive image labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(2), pp. 476–489, 2013.
2012 • M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. Face recognition from caption-based supervision. International Journal of Computer Vision 96(1), pp. 64–82, January 2012.
2010 • H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), pp. 2–11, January 2010.
• D. Larlus, J. Verbeek, F. Jurie. Category level object segmentation by combining bag-of-words models with Dirichlet processes and random fields. International Journal of Computer Vision 88(2), pp. 238–253, June 2010.
2009 • J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing 18(7), pp. 1512–1523, July 2009.
2006 • J. Verbeek, J. Nunnink, and N. Vlassis. Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery 13(3), pp. 291–307, November 2006.
• J. Verbeek and N. Vlassis. Gaussian fields for semi-supervised regression and correspondence learning. Pattern Recognition 39(10), pp. 1864–1875, October 2006.
• J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(8), pp. 1236–1250, August 2006.
2005 • J. Porta, J. Verbeek, B. Krose. Active appearance-based robot localization using stereo vision. Autonomous Robots 18(1), pp. 59–80, January 2005.
• J. Verbeek, N. Vlassis, and B. Krose. Self-organizing mixture models. Neurocomputing 63, pp. 99–123, January 2005.
2003 • J. Verbeek, N. Vlassis, and B. Krose. Efficient greedy learning of Gaussian mixture models. Neural Computation 15(2), pp. 469–485, February 2003.
• A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Pattern Recognition 36(2), pp. 451–461, February 2003.
2002 • J. Verbeek, N. Vlassis, and B. Krose. A k-segments algorithm for finding principal curves. Pattern Recognition Letters 23(8), pp. 1009–1017, June 2002.

In peer reviewed international conferences

2014 • D. Oneata, J. Revaud, J. Verbeek, C. Schmid. Spatio-Temporal Object Detection Proposals. Proceedings European Conference on Computer Vision, September 2014.
• G. Cinbis, J. Verbeek, C. Schmid. Multi-fold MIL Training for Weakly Supervised Object Localization. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
• D. Oneata, J. Verbeek, C. Schmid. Efficient Action Localization with Approximately Normalized Fisher Vectors. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
2013 • G. Cinbis, J. Verbeek, C. Schmid. Segmentation Driven Object Detection with Fisher Vectors. Proceedings IEEE International Conference on Computer Vision, December 2013.
• D. Oneata, J. Verbeek, C. Schmid. Action and Event Recognition with Fisher Vectors on a Compact Feature Set. Proceedings IEEE International Conference on Computer Vision, December 2013.


Selected Publications (continued)

2012 • T. Mensink, J. Verbeek, F. Perronnin, G. Csurka. Metric learning for large scale image classification: generalizing to new classes at near-zero cost. Proceedings European Conference on Computer Vision, October 2012. (oral)
• G. Cinbis, J. Verbeek, C. Schmid. Image categorization using Fisher kernels of non-iid image models. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2012.
2011 • J. Krapac, J. Verbeek, F. Jurie. Modeling spatial layout with Fisher vectors for image categorization. Proceedings IEEE International Conference on Computer Vision, November 2011.
• G. Cinbis, J. Verbeek, C. Schmid. Unsupervised metric learning for face identification in TV video. Proceedings IEEE International Conference on Computer Vision, November 2011.
• J. Krapac, J. Verbeek, F. Jurie. Learning tree-structured descriptor quantizers for image categorization. Proceedings British Machine Vision Conference, September 2011.
• T. Mensink, J. Verbeek, G. Csurka. Learning structured prediction models for interactive image labeling. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2011.
2010 • M. Guillaumin, J. Verbeek, C. Schmid. Multiple instance metric learning from automatically labeled bags of faces. Proceedings European Conference on Computer Vision, September 2010.
• M. Guillaumin, J. Verbeek, C. Schmid. Multimodal semi-supervised learning for image classification. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2010. (oral)
• J. Krapac, M. Allan, J. Verbeek, F. Jurie. Improving web image search results using query-relative classifiers. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2010.
• T. Mensink, J. Verbeek, G. Csurka. Trans Media Relevance Feedback for Image Autoannotation. Proceedings British Machine Vision Conference, September 2010.
• T. Mensink, J. Verbeek, H. Kappen. EP for efficient stochastic control with obstacles. Proceedings European Conference on Artificial Intelligence, August 2010. (oral)
• J. Verbeek, M. Guillaumin, T. Mensink, C. Schmid. Image Annotation with TagProp on the MIRFLICKR set. Proceedings ACM International Conference on Multimedia Information Retrieval, March 2010. (invited paper)
2009 • M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. Proceedings IEEE International Conference on Computer Vision, September 2009. (oral)
• M. Guillaumin, J. Verbeek, C. Schmid. Is that you? Metric learning approaches for face identification. Proceedings IEEE International Conference on Computer Vision, September 2009.
• M. Allan, J. Verbeek. Ranking user-annotated images for multiple query terms. Proceedings British Machine Vision Conference, September 2009.
2008 • M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. Automatic face naming with caption-based supervision. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.
• T. Mensink and J. Verbeek. Improving people search using query expansions: How friends help to find people. Proceedings European Conference on Computer Vision, pp. 86–99, October 2008. (oral)
• J. Verbeek and B. Triggs. Scene segmentation with CRFs learned from partially labeled images. Advances in Neural Information Processing Systems 20, pp. 1553–1560, January 2008. (oral)
• H. Cevikalp, J. Verbeek, F. Jurie, and A. Klaser. Semi-supervised dimensionality reduction using pairwise equivalence constraints. Proceedings International Conference on Computer Vision Theory and Applications, pp. 489–496, January 2008.
2007 • J. van de Weijer, C. Schmid, and J. Verbeek. Learning color names from real-world images. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2007.
• J. Verbeek and B. Triggs. Region classification with Markov field aspect models. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2007.
• J. van de Weijer, C. Schmid, and J. Verbeek. Using high-level visual information for color constancy. Proceedings IEEE International Conference on Computer Vision, pp. 1–8, October 2007.
2006 • Z. Zivkovic and J. Verbeek. Transformation invariant component analysis for binary images. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 254–259, June 2006.
2004 • J. Verbeek, S. Roweis, and N. Vlassis. Non-linear CCA and PCA by alignment of local models. Advances in Neural Information Processing Systems 16, pp. 297–304, January 2004. (oral)
2003 • J. Porta, J. Verbeek, and B. Krose. Enhancing appearance-based robot localization using non-dense disparity maps. Proceedings International Conference on Intelligent Robots and Systems, pp. 980–985, October 2003.
• J. Verbeek, N. Vlassis, and B. Krose. Self-organization by optimizing free-energy. Proceedings 11th European Symposium on Artificial Neural Networks, pp. 125–130, April 2003.
2002 • J. Verbeek, N. Vlassis, and B. Krose. Coordinating principal component analyzers. Proceedings International Conference on Artificial Neural Networks, pp. 914–919, August 2002. (oral)
• J. Verbeek, N. Vlassis, and B. Krose. Fast nonlinear dimensionality reduction with topology preserving networks. Proceedings 10th European Symposium on Artificial Neural Networks, pp. 193–198, April 2002. (oral)


Selected Publications (continued)

2001 • J. Verbeek, N. Vlassis, and B. Krose. A soft k-segments algorithm for principal curves. Proceedings International Conference on Artificial Neural Networks, pp. 450–456, August 2001.

Book chapters

2013 • T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Large scale metric learning for distance-based image classification on open ended data sets. In: G. Farinella, S. Battiato, and R. Cipolla (eds.), Advances in Computer Vision and Pattern Recognition, Springer, 2013.

2012 • R. Benavente, J. van de Weijer, M. Vanrell, C. Schmid, R. Baldrich, J. Verbeek, and D. Larlus. Color Names. In: T. Gevers, A. Gijsenij, J. van de Weijer, and J. Geusebroek (eds.), Color in Computer Vision, Wiley, 2012.

Workshops and regional conferences

2015 • S. Saxena and J. Verbeek. Coordinated Local Metric Learning. ICCV ChaLearn Looking at People workshop, December 2015.
• V. Zadrija, J. Krapac, J. Verbeek, and S. Segvic. Patch-level Spatial Layout for Classification and Weakly Supervised Localization. German Conference on Pattern Recognition, October 2015.
2014 • M. Douze, D. Oneata, M. Paulin, C. Leray, N. Chesneau, D. Potapov, J. Verbeek, K. Alahari, Z. Harchaoui, L. Lamel, J.-L. Gauvain, C. Schmidt, and C. Schmid. The INRIA-LIM-VocR and AXES submissions to Trecvid 2014 Multimedia Event Detection. TRECVID Workshop, November 2014.
2013 • R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. Mcguiness, N. O'Connor, D. Oneata, O. Parkhi, D. Potapov, J. Revaud, C. Schmid, J.-L. Schwenninger, D. Scott, T. Tuytelaars, J. Verbeek, H. Wang, and A. Zisserman. The AXES submissions at TrecVid 2013. TRECVID Workshop, November 2013.
• H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Roy, C. Barras, S. Rosset, A. Sarkar, Q. Yang, H. Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quenot, H. Ekenel, and R. Stiefelhagen. QCompere @ REPERE 2013. Workshop on Speech, Language and Audio for Multimedia, August 2013.
2012 • D. Oneata, M. Douze, J. Revaud, J. Schwenninger, D. Potapov, H. Wang, Z. Harchaoui, J. Verbeek, C. Schmid, R. Aly, K. Mcguiness, S. Chen, N. O'Connor, K. Chatfield, O. Parkhi, R. Arandjelovic, A. Zisserman, F. Basura, and T. Tuytelaars. AXES at TRECVid 2012: KIS, INS, and MED. TRECVID Workshop, November 2012.
• H. Bredin, J. Poignant, M. Tapaswi, G. Fortier, V. Bac Le, T. Napoleon, H. Gao, C. Barras, S. Rosset, L. Besacier, J. Verbeek, G. Quenot, F. Jurie, H. Kemal Ekenel. Fusion of speech, faces and text for person identification in TV broadcast. ECCV Workshop on Information Fusion in Computer Vision for Concept Recognition, October 2012.
2011 • T. Mensink, J. Verbeek, and T. Caetano. Learning to Rank and Quadratic Assignment. NIPS Workshop on Discrete Optimization in Machine Learning, December 2011.
2010 • T. Mensink, G. Csurka, F. Perronnin, J. Sanchez, and J. Verbeek. LEAR and XRCE's participation to the Visual Concept Detection Task - ImageCLEF 2010. Working Notes for the CLEF 2010 Workshop, September 2010.
• M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Apprentissage de distance pour l'annotation d'images par plus proches voisins (metric learning for nearest-neighbor image annotation). Reconnaissance des Formes et Intelligence Artificielle, January 2010.
2009 • M. Douze, M. Guillaumin, T. Mensink, C. Schmid, and J. Verbeek. INRIA-LEAR's participation to ImageCLEF 2009. Working Notes for the CLEF 2009 Workshop, September 2009.
2004 • J. Nunnink, J. Verbeek, and N. Vlassis. Accelerated greedy mixture learning. Proceedings Annual Machine Learning Conference of Belgium and the Netherlands, pp. 80–86, January 2004.
2003 • J. Verbeek, N. Vlassis, and J. Nunnink. A variational EM algorithm for large-scale mixture modeling. Proceedings Conference of the Advanced School for Computing and Imaging, pp. 136–143, June 2003.
• J. Verbeek, N. Vlassis, and B. Krose. Non-linear feature extraction by the coordination of mixture models. Proceedings Conference of the Advanced School for Computing and Imaging, pp. 287–293, June 2003.
2002 • J. Verbeek, N. Vlassis, and B. Krose. Locally linear generative topographic mapping. Proceedings Annual Machine Learning Conference of Belgium and the Netherlands, pp. 79–86, December 2002.
2001 • J. Verbeek, N. Vlassis, and B. Krose. Efficient greedy learning of Gaussian mixtures. Proceedings 13th Belgian-Dutch Conference on Artificial Intelligence, pp. 251–258, October 2001.
• J. Verbeek, N. Vlassis, and B. Krose. Greedy Gaussian mixture learning for texture segmentation. ICANN Workshop on Kernel and Subspace Methods for Computer Vision, pp. 37–46, August 2001. (oral)
2000 • J. Verbeek. Supervised feature extraction for text categorization. Proceedings Annual Machine Learning Conference of Belgium and the Netherlands, December 2000.
1999 • J. Verbeek. Using a sample-dependent coding scheme for two-part MDL. Proceedings Machine Learning & Applications (ACAI '99), July 1999.


Selected Publications (continued)

Patents
2012 • T. Mensink, J. Verbeek, G. Csurka, and F. Perronnin. Metric Learning for Nearest Class Mean Classifiers. United States Patent Application 20140029839, publication date: 01/30/2014, filing date: 07/30/2012, XEROX Corporation.

2011 • T. Mensink, J. Verbeek, and G. Csurka. Learning Structured Prediction Models for Interactive Image Labeling. United States Patent Application 20120269436, publication date: 25/10/2012, filing date: 20/04/2011, XEROX Corporation.

2010 • T. Mensink, J. Verbeek, and G. Csurka. Retrieval systems and methods employing probabilistic cross-media relevance feedback. United States Patent Application 20120054130, publication date: 01/03/2012, filing date: 31/08/2010, XEROX Corporation.

Technical Reports

2013 • J. Sanchez, F. Perronnin, T. Mensink, J. Verbeek. Image classification with the Fisher vector: theory and practice. Technical Report RR-8209, INRIA.

2012 • T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Large scale metric learning for distance-based image classification. Technical Report RR-8077, INRIA.

2011 • O. Yakhnenko, J. Verbeek, and C. Schmid. Region-based image classification with a latent SVM model. Technical Report RR-7665, INRIA, 2011.
• J. Krapac, J. Verbeek, F. Jurie. Spatial Fisher vectors for image categorization. Technical Report RR-7680, INRIA, 2011.
• T. Mensink, J. Verbeek, and G. Csurka. Weighted transmedia relevance feedback for image retrieval and auto-annotation. Technical Report RT-0415, INRIA, 2011.

2010 • M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. Technical Report RT-392, INRIA, 2010.

2008 • D. Larlus, J. Verbeek, and F. Jurie. Category level object segmentation by combining bag-of-words models and Markov random fields. Technical Report RR-6668, INRIA, 2008.

2005 • J. Verbeek and N. Vlassis. Semi-supervised learning with Gaussian fields. Technical Report IAS-UVA-05-01, University of Amsterdam, 2005.
• J. Verbeek. Rodent behavior annotation from video. Technical Report IAS-UVA-05-02, University of Amsterdam, 2005.

2004 • J. Verbeek and N. Vlassis. Gaussian mixture learning from noisy data. Technical Report IAS-UVA-04-01, University of Amsterdam, 2004.

2002 • J. Verbeek, N. Vlassis, and B. Krose. The generative self-organizing map: a probabilistic generalization of Kohonen's SOM. Technical Report IAS-UVA-02-03, University of Amsterdam, 2002.
• J. Verbeek, N. Vlassis, and B. Krose. Procrustes analysis to coordinate mixtures of probabilistic principal component analyzers. Technical Report IAS-UVA-02-01, University of Amsterdam, 2002.

2001 • A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Technical Report IAS-UVA-01-02, University of Amsterdam, 2001.
• J. Verbeek, N. Vlassis, and B. Krose. Efficient greedy learning of Gaussian mixtures. Technical Report IAS-UVA-01-10, University of Amsterdam, 2001.

2000 • J. Verbeek, N. Vlassis, and B. Krose. A k-segments algorithm for finding principal curves. Technical Report IAS-UVA-00-11, University of Amsterdam, 2000.