
SEVENTH FRAMEWORK PROGRAMME

”Ideas” Specific programme

European Research Council

Grant agreement for: Starting Grant

Annex I - Description of Work

Project acronym: SMAC
Proposal full title: Statistical machine learning for complex biological data
Grant agreement no.: 280032
Duration: 60 months

Date of preparation of Annex I (latest version): November 25, 2011

Principal investigator: Jean-Philippe Vert
Host institution: ARMINES


Summary

This interdisciplinary project aims to develop new statistical and machine learning approaches to analyze high-dimensional, structured and heterogeneous biological data. We focus on cases where a relatively small number of samples are characterized by huge quantities of quantitative features, a common situation in large-scale genomic projects, but a particularly challenging one for statistical inference. In order to overcome the curse of dimensionality, we propose to exploit the particular structures of the data and to encode prior biological knowledge in a unified, mathematically sound and computationally efficient framework. These methodological developments, both theoretical and practical, will be guided by and applied to the inference of predictive models and the detection of predictive factors for prognosis and drug response prediction in cancer.

Contents

1a. Principal investigator
    Scientific leadership potential
    Curriculum vitae
    Early achievements track record

1b. Extended synopsis
    Context
    Objectives
    Methodology
    Potential impacts and challenges

2 Scientific proposal
    2.1 State-of-the-art and objectives
    2.2 Methodology
    2.3 Resources
    2.4 Ethical issues


1a. Principal investigator: Jean-Philippe Vert

Scientific leadership potential

Early scientific contributions

My most visible contributions have been both methodological and applied, at the interface of statistical machine learning and computational biology.

On the methodological side, I have mostly worked within the framework of kernel methods, a family of methods which has attracted much attention in statistics and machine learning over the last 10 years. In short, kernel methods make it possible to extend many multivariate linear methods of statistics to virtually any type of data by constructing particular measures of similarity, called kernels, for the data being analyzed. I have proposed kernels for various data such as protein sequences [1,9,11,17], gene phylogenetic profiles [3], 2D and 3D structures of small molecules or proteins [4,6,14,20,24], and time series [19], among others. These kernels have yielded state-of-the-art methods for problems such as protein remote homology detection [1], subcellular localization prediction [11], and quantitative structure-activity relationship (QSAR) modeling for small molecules [4,6,14,24]. I have also made more theoretical contributions related to the problem of designing kernels, such as showing connections with information theory [17,21], and have answered open questions about the convergence of the one-class SVM algorithm, a widely used kernel method for novelty detection [13]. More recently, I have developed novel regularization approaches to cast clustering, structured feature selection, multitask learning, operator inference and change-point detection as convex optimization problems [18,22].

On the application side, I have proposed new ideas and algorithms for several important problems. The first is the problem of inferring biological networks from heterogeneous genomic data. Unlike mainstream approaches in this fast-moving domain, I have advocated formalizing this task as a supervised inference problem, and proposed several specific algorithms with very promising empirical results [2,10,12]. Second, I proposed to treat the problem of inferring interactions between proteins and small molecules in chemogenomics as a multitask learning problem [18,23]. Third, I have proposed several formulations to integrate knowledge of gene networks or DNA structure into the analysis of gene expression data [5,15,22]. I have also developed a popular algorithm for siRNA design [16].

Recognition and diffusion

My scientific contributions have been published in 65 peer-reviewed papers in mainstream journals and conferences in machine learning and bioinformatics (36 in international journals, 22 in the proceedings of international conferences and 7 as contributed book chapters), with an overall number of 1958 citations and an h-index of 25¹. In most of these publications I contributed as first or senior author, and none of them were co-authored by my PhD supervisor. Roughly half of my publications have appeared in bioinformatics, computational biology and computational chemistry journals, while the other half were published in machine learning journals and conferences. I am on the editorial boards of the two flagship journals in machine learning (JMLR and MLJ), and of one of the leading journals in bioinformatics (BMC Bioinformatics).

I am regularly invited to present my work at international conferences in different domains, ranging from mathematics and machine learning to bioinformatics, chemoinformatics and natural language processing. In the last 10 years I have given more than 165 lectures in seminars, workshops and conferences, half of them outside France². I have developed and maintained several long-term international collaborations, in particular with colleagues in Japan (Kyoto University, Institute of Statistical Mathematics) and in the USA (UC Berkeley and University of Washington), with whom I have regular exchanges and month-long visits. I have supervised 11 PhD students and secured 18 competitive grants in the last 7 years, including an NIH R33 grant, three European projects, a grant from the France-Berkeley fund and several grants from the Japanese Society for the Promotion of Science. I created my own research group in 2003, and am now adjunct director of a 65-person joint laboratory between Institut Curie and Mines ParisTech in Paris.

My scientific contributions were recognized in 2004 by the Simon Regnier prize of the Francophone Society for Classification and in 2006 by the bronze medal of the National Center for Scientific Research (CNRS).

¹ Citation numbers obtained from Google Scholar. The details of these numbers are available at http://cbio.ensmp.fr/~jvert/publi.
² The detailed list of my lectures is available at http://cbio.ensmp.fr/~jvert/talks.


Curriculum vitae

Education

2004 Research habilitation (HDR), Paris 6 University, Paris, France.
Dissertation: “Kernel methods in computational biology”.

2001 PhD in Mathematics, Paris 6 University and Ecole normale superieure de Paris, Paris, France.
Dissertation: “Statistical methods for natural language modeling”. Advisor: Olivier Catoni. Obtained with the highest honors.

1998 Master of Public Administration (M.P.A.), Corps des Mines, Paris, France.

1997 M.S. in Mathematics, Paris 6 University, Paris, France.

1995 B.S. in Mathematics, Ecole Polytechnique, Palaiseau, France. Ranked 8th / 400.

Professional experience

2008-present Adjunct director, “Cancer Computational Genomics, Bioinformatics, Biostatistics and Epidemiology” laboratory (65 members), and leader of the “Machine learning for cancer informatics” team (12 members). Mines ParisTech / Institut Curie / INSERM, Paris, France

2006-present Senior researcher and director, Centre for Computational Biology, Ecole des Mines de Paris, Fontainebleau, France

2002 - 2005 Junior researcher and group leader, Ecole des Mines de Paris, Fontainebleau, France

2001 - 2002 Associate researcher (post-doc), Bioinformatics Center, Kyoto University, Kyoto, Japan

1999 - 2000 Scientific consultant (natural language processing and statistics), Sudimage S.A., Cachan, France

1996 - 1997 Research scientist (statistics), Elf Atochem North America, Philadelphia, USA

1995 - 1996 Consultant (strategy), Matra Automobile S.A., Romorantin, France

1994 Summer intern, Department of Mathematics, Kyoto University, Kyoto, Japan

1994 Summer intern, Hamamatsu Photonics, Hamamatsu, Japan

1992 - 1993 Military Service as officer and platoon leader, Monthlery, France

PhD students (with percentage of supervision)

- Pierre Chiche (2010-present): Methods to call genomic variations from next generation sequencing data (100%).
- Toby Dylan Hocking (2009-present): Sparse structure methods for bioinformatics and computer vision (50%).
- Anne-Claire Haury (2009-present): Selection of discriminant pathways for cancer prognosis (100%).
- Fantine Mordelet (2007-present): Supervised inference of biological networks (100%).
- Mikhail Zaslavskiy (2006-2010): Graph matching for machine learning (50%).
- Laurent Jacob (2006-2009): Multitask learning in bio-, chemo- and immuno-informatics (100%).
- Martial Hue (2004-present): Semi-supervised learning and classification of protein structures (100%).
- Franck Rapaport (2004-2008): Integration of gene networks and microarray data for cancer research (50%).
- Joannes Vermorel (2004-2006): Large-scale learning algorithms (100%).
- Pierre Mahe (2003-2006): Kernel methods in virtual screening (100%).
- Marco Cuturi (2002-2005): Learning from structured objects with semigroup kernels (100%).

Professional activities

Editorial board of international journals: the Journal of Machine Learning Research (JMLR, since 2009), the Machine Learning journal (MLJ, since 2010), BMC Bioinformatics (since 2010).

Reviewer for journals: Annals of Applied Statistics, Annals of Statistics, Applied and Computational Harmonic Analysis, Artificial Intelligence in Medicine, Bioinformatics, BMC Bioinformatics, BMC Genomics, Briefings in Bioinformatics, Combinatorial Chemistry and High Throughput Screening, Discrete Applied Mathematics, EURASIP Journal on Advances in Signal Processing, EURASIP Journal on Bioinformatics and Systems Biology, IEEE Transactions on Information Theory, IEEE/ACM Transactions on Computational Biology and Bioinformatics, International Journal of Computer Vision, International Journal of Data Mining and Bioinformatics, International Journal of Knowledge Discovery in Bioinformatics, Journal of Applied Statistics, Journal of Bioinformatics and Computational Biology, Journal of Biomedical Informatics, Journal of Computer Science and Technology, Journal of Machine Learning Research, Machine Learning, Neurocomputing, Neuroinformatics, Nucleic Acids Research, Pattern Recognition, PLoS Computational Biology, Proteins.

Program committee membership of international conferences: AAAI 2010; ACML 2010; AISTATS 2009; APBC 2008-2010; BIBE 2009; BIOINFORMATICS 2010; BIRD 2007, 2008; CIBB 2007-2009; COLT 2003-2009; ECCB 2005-2010; ECML 2006-2010; ESANN 2010; GIW 2004-2009; ICML 2004-2010; IPG 2009-2010; ISBRA 2009-2010; ISMB 2004-2010; JOBIM 2009-2010; KRBIO 2005; MLG 2007-2009; MLCB 2008-2010; MLSB 2007-2010; NIPS 2003-2010; PMSB 2006; RxDM 2009; SMPGD 2009.

Reviewer for funding agencies: Austrian Science Fund (FWF), Dutch National Science Foundation (NWO), French National Research Agency (ANR), Israel Science Foundation (ISF), Swiss National Science Foundation (SNSF).

Workshop organization: RECOMB satellite workshop “Kernel Methods in Computational Biology”, Berlin, Germany (2003); NIPS workshop on “Machine Learning in Computational Biology”, Whistler, Canada (2005, 2009, 2010); Cancer bioinformatics workshop, Cambridge, UK (2010).

Other: Member of the European Network of Excellence PASCAL and PASCAL2 (since 2003); Chairman of the Japan-France Frontier of Science (JFFoS) conference, Tokyo, Japan (2011); Scientific Committee member, Paris Ile-de-France Canceropole (2010).

Selected research funding (as PI or co-PI)

There is, and will be, no funding overlap between the requested ERC grant and any other source of funding for the same activities and costs foreseen in this project.

2010-2015 European Commission FP7-NMP-2009-LRAGE-3. NADINE: Nanosystems for early diagnosis of neurodegenerative diseases (own funding: 144K euros).

2009-2013 French National Research Agency ANR-09-BLAN-0051-04. CLARA: Clustering in high dimension, algorithms and applications (68K euros).

2008-2010 JSPS-INSERM Japan-France grant. Development of algorithms and databases in cancer informatics (40K euros).

2007-2011 French Ministry of Economy, Finance and Industry DGE-07-2-90-6473. RAMIS: High-resolution imaging for the screening of anti-cancer drugs (165K euros).

2007-2010 French National Research Agency ANR-07-BLAN-0311-03. MGA: Graphical models and applications (50K euros).

2007-2009 France-Berkeley fund. Inference and learning in dynamic graphical models, with applications in speech and bio-informatics (10K USD).

2006-2009 French Ministry of Economy, Finance and Industry DGE-06-2-90-6056. BIOTYPE: Multidimensional molecular and cellular biotyping (123K euros).

2005-2008 European Commission FP6-2004-IST-NMP-2. INDIGO: Integrated highly sensitive fluorescence-based biosensor for diagnosis applications (64K euros).

2005-2007 European Commission LSH-2004-1.1.0-2. ESBIC-D: European Systems Biology Initiative for combating Complex Diseases (24K euros).

2004-2007 NIH R33HG003070-01: Detecting Relations Among Heterogeneous Datasets (215K USD).


Early achievements track record

Selected publications in journals and peer-reviewed conferences³,⁴

[1] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. 166 citations.
[2] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363–i370, 2004. 118 citations.
[3] J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 18:S276–S284, 2002. 94 citations.
[4] P. Mahe, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 552–559. ACM Press, 2004. 76 citations.
[5] F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8:35, 2007. 69 citations.
[6] P. Mahe, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45(4):939–951, 2005. 67 citations.
[7] J.-P. Vert and M. Kanehisa. Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA. Adv. Neural Inform. Process. Syst. (NIPS), 1449–1456. MIT Press, 2003. 66 citations.
[8] Y. Yamanishi, J.-P. Vert, A. Nakaya, and M. Kanehisa. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19:i323–i330, 2003. 55 citations.
[9] J.-P. Vert. Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. Proceedings of the Pacific Symposium on Biocomputing, 649–660. World Scientific, 2002. 55 citations.
[10] J.-P. Vert and Y. Yamanishi. Supervised graph inference. Adv. Neural Inform. Process. Syst. (NIPS), volume 17, 1433–1440. MIT Press, Cambridge, MA, 2005. 44 citations.
[11] A. Matsuda, J.-P. Vert, H. Saigo, N. Ueda, H. Toh, and T. Akutsu. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci., 14(11):2804–2813, 2005. 44 citations.
[12] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005. 43 citations.
[13] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVMs and related algorithms. J. Mach. Learn. Res., 7:817–854, 2006. 43 citations.
[14] P. Mahe, L. Ralaivola, V. Stoven, and J.-P. Vert. The pharmacophore kernel for virtual screening with support vector machines. J. Chem. Inf. Model., 46(5):2003–2014, 2006. 42 citations.
[15] J.-P. Vert and M. Kanehisa. Extracting active pathways from gene expression data. Bioinformatics, 19:ii238–ii244, 2003. 38 citations.
[16] J.-P. Vert, N. Foveau, C. Lajaunie, and Y. Vandenbrouck. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinformatics, 7:520, 2006. 33 citations.
[17] M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn. Res., 6:1169–1198, 2005. 31 citations.
[18] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: operator estimation with spectral regularization. J. Mach. Learn. Res., 10:803–826, 2009. 31 citations.
[19] M. Cuturi, J.-P. Vert, T. Birkenes, and T. Matsui. A kernel for time series based on global alignment. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2:413–416, 2007. 27 citations.
[20] J. Qiu, M. Hue, A. Ben-Hur, J.-P. Vert, and W. S. Noble. A structural alignment kernel for protein structures. Bioinformatics, 23(9):1090–1098, 2007. 26 citations.
[21] M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Networks, 18(4):1111–1123, 2005. 26 citations.
[22] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. Proceedings of the Twenty-Sixth International Conference on Machine Learning (ICML), 433–440. ACM Press, 2009. 25 citations.
[23] L. Jacob and J.-P. Vert. Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics, 24(3):358–366, 2008. 25 citations.
[24] P. Mahe and J.-P. Vert. Graph kernels based on tree patterns for molecules. Machine Learning, 75(1):3–35, 2009. 25 citations.

³ My PhD supervisor co-authored none of my publications.
⁴ Complete list and detailed citation statistics available at http://cbio.ensmp.fr/~jvert/publi.


Research monographs

[22] B. Scholkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, Massachusetts, 2004. 267 citations.

Invited presentations to international workshops, conferences, summer schools⁵

50th workshop on optimization, machine learning and bioinformatics, Erice, Italy, 2010.
“From phenotypes to pathways” ESF exploratory workshop, Cambridge, UK, 2010.
10th annual International Workshop on Bioinformatics and Systems Biology (IBSB 2010), Kyoto, Japan, 2010.
Statistical Genomics in Biomedical Research workshop, Banff, Canada, 2010.
2nd Strasbourg Summer School on Chemoinformatics, Obernai, France, 2010.
1st spring school on machine learning (EPAT 2010), Cap Hornu, France, 2010.
NIPS 2009 Workshop: understanding multiple kernel learning methods, Whistler, Canada, 2009.
The third school on Analysis of Patterns, Cagliari, Italy, 2009.
27th European Meeting of Statisticians (EMS 2009), Toulouse, France, 2009.
“Advances in the Theory of Control, Signals and Systems” workshop, Lausanne, Switzerland, 2009.
6th International Workshop on Computational Systems Biology (WCSB 2009), Aarhus, Denmark, 2009.
“Statistical advances in Genome-scale Data Analysis” workshop, Ascona, Switzerland, 2009.
5th Medchem Europe conference, Berlin, Germany, 2009.
3rd Japanese-French Frontiers of Science (JFFoS) Symposium, Tokyo, Japan, 2009.
2nd Canada-France congress of mathematics, Montreal, Canada, 2008.
10th European Symposium on Statistical Methods for the Food Industry, Louvain-la-Neuve, Belgium, 2008.
The International Workshop on Data Mining and Statistical Science (DMSS 2007), Tokyo, Japan, 2007.
VI Colloquium Chemiometricum Mediterraneum, Saint-Maximin, France, 2007.
French Conference on Bioinformatics (JOBIM 2007), Marseille, France, 2007.
6th Workshop on Graph-based Representations in Pattern Recognition (GbR 2007), Alicante, Spain, 2007.
Machine Learning in Systems Biology Conference (MLSB 2007), Evry, France, 2007.
International conference on Embeddings of Graphs and Groups into Hilbert spaces, Lausanne, Switzerland, 2007.
Current Challenges in Kernel Methods workshop (CCKM’06), Brussels, Belgium, 2006.
International Conference on Grammatical Inference (ICGI 2006), Tokyo, Japan, 2006.
Course at the Statistical Mathematics and Application Workshop, Luminy, France, 2006.
Course at the Machine Learning Summer School, Taipei, Taiwan, 2006.
Second Conference “Mathematical foundations of learning theory”, Paris, France, 2006.
Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics, Ghent, Belgium, 2006.
“Kernel methods and structured domain” workshop, Whistler, Canada, 2005.
50th NIBB conference on “Structure and Dynamics of Complex Biological Networks”, Okazaki, Japan, 2005.
Symposium on perspectives in computational and theoretical biology, Shanghai, P.R. China, 2004.
“Complex stochastic systems in biology and medicine” workshop, Munich, Germany, 2004.
“Advanced microarray data analysis” course, Center for Biological Sequence Analysis, Elsinore, Denmark, 2004.
The learning workshop, Snowbird, Utah, USA, 2004.
Machine Learning in Bioinformatics conference, Brussels, Belgium, 2003.
AIM “Geometric models of biological phenomena” workshop, Palo Alto, CA, USA, 2003.
Workshop “Mathematical aspects of molecular biology: Towards new constructions”, Nara, Japan, 2003.
Workshop “Statistical Learning in Classification and Model Selection”, Eindhoven, The Netherlands, 2003.

Prizes and awards

• 2006 Bronze medal of the National Center for Scientific Research (CNRS)

• 2004 Simon Regnier prize of the Francophone Society for Classification

⁵ Complete list available at http://cbio.ensmp.fr/~jvert/talks.


1b. Extended synopsis

Context

Cancer is a major cause of morbidity and mortality in the world. In 2007, an estimated 12 million new cancer cases and 7.6 million cancer deaths were reported worldwide, and cancer will affect even more people in the years ahead with the aging of populations if our ability to prevent, diagnose and treat cancer does not improve. A shift towards personalized cancer medicine, i.e., from a one-size-fits-all approach to more tailored therapies based on a tumor's genomic makeup, holds promise to lower the social and economic burden of cancer. This in turn requires the development and validation of accurate predictive models and biomarkers to characterize each tumour and predict which therapeutic choice is the best option on a case-by-case basis. The sequencing of the human genome in 2003, and recent technological advances, have brought new hope of making this happen. With the continuing development of high-throughput genomic technologies, including the emergence of next generation sequencing (NGS) technologies, it is now feasible to contemplate comprehensive surveys of human cancer genomes and epigenomes, transcriptomes, proteomes and response profiles to various perturbations, thus paving the way to the rational elucidation of biological mechanisms and the identification of predictive factors. Several large-scale initiatives have been launched recently, such as The Cancer Genome Atlas (TCGA) [1], to investigate the molecular characterization of large cohorts of human tumors with various high-throughput technologies, with the hope of cataloguing and discovering major cancer-causing alterations through multi-dimensional analysis across many samples.

A hallmark of these new technologies is their tendency to generate huge amounts of unbiased data to characterise individual tumours and patients. It is then tempting to rely on a posteriori statistical analysis to identify interesting correlations in the data, construct biomarkers and infer predictive models. An obvious benefit of this data-driven research paradigm is that it allows us to investigate directions that could never have been predicted by traditional molecular biology, and it has undoubtedly already led to many significant discoveries, such as the identification of novel biomarkers. On the other hand, many frustrating results have also been reported. For example, prognostic molecular signatures estimated by state-of-the-art feature selection methods on gene expression datasets lack stability and have been shown to be very sensitive to the choice of samples used to select the genes, making the biological interpretation of these signatures challenging, and the discovery of novel drug targets even more difficult. Overall, the picture that emerges from the accumulation of biological data is overwhelmingly complex, raising doubts about the capacity of purely data-driven approaches to elucidate the full complexity of molecular mechanisms in living cells.

Statistics and machine learning obviously play a central and increasing role in data-driven biology. Both disciplines have witnessed impressive progress in the last two decades, both in theory and in practice, making it possible to attack complex data processing scenarios with efficient algorithms and sound theoretical bases. Statistical learning theory has increased our understanding of high-dimensional and nonlinear learning problems, and large-scale learning algorithms able to learn nonlinear functions and work with structured data are now mature technologies in many application domains such as speech recognition, automatic translation, computer vision or marketing, to name just a few. However, in spite of many successes in bioinformatics as well, it is fair to say that state-of-the-art machine learning approaches have not yet been able to significantly advance the field of predictive modeling from high-dimensional and complex genomic data. For example, modern feature selection methods specifically developed for the high-dimensional setting are no better than traditional univariate statistical tests in terms of stability, accuracy, and ability to discover biological knowledge from gene expression data.

Why is that so? I believe that a fundamental issue in current data-driven biology is the widening gap between, on the one hand, the quantity and complexity of the data that we collect on each sample and, on the other hand, the limited number of samples available. While we can observe the molecular portrait of cancer tumours at an exquisite level of detail, and measure billions of molecular parameters on each tumour, we will never be able to observe billions of tumours. In statistical terms, if we assume that we observe n samples each characterized by p descriptors, we are in a setting where n is doomed to remain small (in the hundreds or thousands), while p dominates n and increases with each technological advance (currently in the billions). This is in sharp contrast with other application domains of statistical machine learning, such as image processing or speech recognition, where there is nowadays virtually no limit to the number of samples (speech recordings or images) that can be gathered to train machine learning models. With a limited number of samples, the statistical power and learning capacity of general-purpose machine learning procedures is drastically limited, and offers little hope of elucidating complex biological phenomena, such as robust modeling of tumor prognosis from genomic or epigenomic data. The "small n, large p" phenomenon has triggered much research in statistics recently, and both theory and algorithms have been proposed, e.g., to consistently select features in this setting [2, 3]. However, the strong hypotheses underlying these methods, such as limited correlations between features, are usually not met in practice. I therefore believe that we have reached the limits of current statistical and machine learning approaches, such as multiple testing strategies or high-dimensional statistical inference, and that novel data analysis procedures, which go beyond blind data-driven methods, are needed to exploit the wealth of quantitative biological data available now and in the future in cancer genomics.

Objectives

The data produced in high-throughput genomics are not only high-dimensional ("large p"); they also often have an explicit or implicit underlying structure. For example, data measured along the chromosomes, such as DNA copy number variations, methylation profiles or sequenced read counts, have an explicit linear structure. Measures on genes and proteins (such as expression levels) give a hint of the activity of underlying biological processes which involve the collective action of several genes and proteins, and the implicit structure of such interactions may be inferred from, e.g., known gene and protein networks. In other words, instead of considering a set of p measurements on a tumour simply as a vector in R^p, it is often possible to see it as a vector in R^Z, where Z is a structured space of p features. For example, Z could be a finite chain to represent ordered features along a chromosome, a graph or hypergraph to represent features such as genes with known pairwise or multiway interactions, or more generally it could have a geometric or algebraic structure. In addition, for many inference tasks we may have prior knowledge in terms of structure, e.g., we may believe that a diagnostic signature should involve a limited number of biological pathways or protein complexes, or have a characteristic pattern on the DNA. Interestingly, exploiting the structure of Z in the learning process may be a way to reduce the complexity of the learning task, adapt it to the amount of available samples, and thus improve the accuracy of the inferred model. Moreover, by using prior knowledge to impose a structure on Z or on the model we wish to infer, one could constrain the learning problem enough to let it infer robust structural patterns among plausible alternatives, leading to accurate and interpretable models.

I therefore believe that the only way to move forward and infer complex phenomena from a limited amount of structured and high-dimensional data is to depart from generic methods and develop specific approaches adapted to each problem and dataset. The objective of this project is to propose and investigate new methods to infer predictive models and discover complex predictive patterns from limited amounts of complex biological data, by exploiting the structure of the data in the learning process. My vision is that, instead of opposing data-driven and hypothesis-based approaches as two antagonistic paradigms in high-throughput biology, we must integrate at the heart of data-driven strategies a large amount of prior knowledge about the structure of the data and about the model we wish to infer, as is routinely done in hypothesis-driven approaches. Only such prior knowledge can decrease the complexity of the learning task and adjust it to the amount of data available. In other words, I wish to develop a data-driven, hypothesis-constrained paradigm that exploits inherent structures in the data and combines the strength of state-of-the-art machine learning methods with the wealth of prior biological knowledge available, in a unified, mathematically sound and computationally efficient framework. These developments will be guided by and targeted to applications in cancer diagnosis, prognosis and drug response prediction.

Methodology

The methodological backbone of this project is in the field of statistics and machine learning for high-dimensional, structured and heterogeneous data. My objective is to process data of the form x ∈ R^Z, where the set of features Z is finite but has particular structures that reflect biological constraints. I will follow the popular idea of rephrasing statistical inference as a constrained optimization problem, where a data-dependent empirical risk functional is minimized over a restricted functional space. While most existing methods in this framework use generic functional spaces, such as balls for the ℓ1 or ℓ2 norm in a Hilbert space, I will investigate novel penalties and norms that explicitly take into account the structure of Z and the prior knowledge we have about the problems. The general questions I will address concern the design of novel penalties for specific data and problems, the study of their theoretical properties, their efficient implementation, and their validation on real data in cancer genomics. The project is divided into four methodological work packages, corresponding to increasingly complex structures for Z, and a fifth work package dedicated to the application of these methods to real cancer data.
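To fix ideas, the generic form of this penalized estimation problem can be written as follows (an illustrative formulation added here for clarity; the notation is mine and not taken verbatim from the original text):

    \hat{f} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr) \;+\; \lambda \, \Omega(f)

where (x_1, y_1), . . . , (x_n, y_n) are the n training samples, F is the space of candidate models, ℓ is a loss function, λ > 0 a regularization parameter and Ω the penalty. Generic choices such as Ω(f) = ||w||_1 or ||w||_2^2 for a linear model f(x) = w^T x ignore the structure of Z; the work packages below replace Ω by structure-aware penalties (total variation along the genome in WP1, graph- and hypergraph-based penalties in WP2, and so on).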


WP1: learning with linearly ordered features. Many genomic data can be seen as numeric profiles over chromosomes, i.e., as elements of X = R^Z, where Z = [1, p] is a succession of p positions. This is the case, for example, of DNA copy number variation profiles, or of profiles of read counts mapped to the genome after a sequencing experiment. Prior knowledge tells us that, depending on the data, we can expect important biological patterns with spatial structure, such as predictive peaks with various shapes or the presence of sudden change-points.

In order to capture piecewise-constant predictive patterns and to detect change-points in the data, I will first focus on the total variation norm [4] and the fused Lasso penalty [5], which have been shown to be very useful for denoising, regression and classification of CNV profiles [6]*. I will study in particular how trustworthy the estimated change-points and piecewise-constant patterns are, in a context where classical consistency results do not hold because of the strong correlations between successive features [7]. I will also investigate generalizations to multiple profiles [8]*, corrections for boundary effects, and propose fast implementations with proximal first-order accelerated gradient methods [9] to scale the methods to profile lengths in the billions.
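As a minimal computational sketch of this first direction (added here for illustration only; the synthetic profile, the regularization level and the detection threshold are arbitrary choices of mine), the total-variation / fused-lasso signal approximator can be recast as an ordinary Lasso on step-function features, whose nonzero coefficients directly give candidate change-points:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic copy-number-like profile: piecewise-constant signal plus noise.
n = 300
signal = np.concatenate([np.zeros(100), 1.5 * np.ones(80), -1.0 * np.ones(120)])
y = signal + 0.3 * rng.standard_normal(n)

# Each column of X is a step function starting after position j, so the fitted
# coefficients are the jumps of a piecewise-constant fit and the L1 penalty on
# them equals the total variation of that fit (fused-lasso signal approximation).
X = np.tril(np.ones((n, n)), k=-1)[:, :-1]
model = Lasso(alpha=0.05, fit_intercept=True, max_iter=50000)
model.fit(X, y)

fitted = model.intercept_ + X @ model.coef_              # piecewise-constant estimate
change_points = np.flatnonzero(np.abs(model.coef_) > 1e-6) + 1
print("candidate change-points:", change_points)

The same estimate can be computed far more efficiently with dedicated proximal or dynamic-programming solvers, which is precisely the implementation issue raised above.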

A second important family of signals in genomic profiles are localized patterns, such as peaks in ChIP-chip and ChIP-seq experiments, or localized patterns with particular shapes in RNA-seq experiments. I will investigate how to encode the knowledge of these particular patterns into novel penalties, using for example sparse expansions over adequate dictionaries, or positive definite kernels between profiles to account for possible variations in the shape or location of signals.
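One simple way to encode such shape priors, sketched below with a hypothetical dictionary of Gaussian peak atoms (the atoms, widths and regularization level are illustrative assumptions, not a prescribed design), is to expand an observed profile sparsely over a dictionary of admissible peak shapes:

import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(1)
L = 200                                   # profile length
grid = np.arange(L)

# Hypothetical dictionary of unit-norm Gaussian "peak" atoms at several positions/widths.
atoms = np.array([np.exp(-0.5 * ((grid - p) / w) ** 2)
                  for p in range(0, L, 10) for w in (3.0, 8.0)])
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

# A noisy profile containing two peaks of different widths.
profile = (2.0 * np.exp(-0.5 * ((grid - 50) / 3.0) ** 2)
           + 1.0 * np.exp(-0.5 * ((grid - 140) / 8.0) ** 2)
           + 0.1 * rng.standard_normal(L))

# Sparse (L1-penalized) expansion of the profile over the peak dictionary.
coder = SparseCoder(dictionary=atoms, transform_algorithm="lasso_lars", transform_alpha=0.05)
codes = coder.transform(profile.reshape(1, -1))
print("active peak atoms:", np.flatnonzero(codes[0]))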

WP2: learning with features in graphs or hypergraphs. A lot of prior information in genomics is encoded in large graphs or hypergraphs, which account for example for known protein complexes, functional groups, metabolic or signaling pathways, or more general genetic interactions. When we observe quantitative measurements of genes or proteins in a cell, it is then natural to consider the features (genes) as vertices of a graph or hypergraph Z (a gene network), and to see data such as gene expression values as a function over the graph or hypergraph. In this work package I will investigate how structural properties of Z can be exploited in the learning algorithm through a graph- or hypergraph-dependent penalty. Note that this question is very different from the popular problem of learning over a graph or hypergraph [10] (i.e., when X = Z), or from the problem of learning when each data point is itself a graph [11]*.

I recently proposed two families of penalties to translate prior knowledge about Z into a convex penalty. One family encodes the idea that we wish to detect patterns which are smooth on the graph [12], the other that the selected features should be connected on the graph, without constraining the weights of the selected features [13]. Based on my experience with these first investigations, I propose to study the properties of these penalties in terms of accuracy and consistency, to propose variants, and to investigate efficient implementations. An important question I wish to address is to understand the influence of the graph structure, such as the presence of hubs or particular properties of the graph spectrum, on the penalty and the subsequent learning algorithm. For example, I observed that the fused Lasso suffers from boundary effects when the graph is a simple chain, and noted that hubs have a strong influence on the graph-dependent penalty presented above. Another direction I will explore is the possibility of deriving from the graph or hypergraph Z a multi-scale representation of vectors in R^Z, and of explicitly defining penalties from this multi-scale representation. A well-designed coarse-to-fine hierarchy in R^Z could let the data select by themselves the correct degree of granularity of the function that can be inferred, and provide useful biological interpretability at any scale.
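The smoothness-inducing family can be illustrated in a few lines (a toy sketch under my own assumptions: a small chain-shaped gene network and a closed-form ridge-type solver, not the actual implementation behind [12]): penalizing w^T L w, where L is the graph Laplacian, shrinks differences between the weights of connected genes.

import numpy as np

def laplacian(adjacency):
    # Unnormalized graph Laplacian L = D - A of a feature (gene) network.
    return np.diag(adjacency.sum(axis=1)) - adjacency

def network_ridge(X, y, adjacency, lam=1.0, eps=1e-6):
    # Closed form of min_w ||y - Xw||^2 + lam * w' L w (+ eps ||w||^2 for stability);
    # the graph penalty favors weight vectors that vary smoothly over the network.
    L = laplacian(adjacency)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * L + eps * np.eye(p), X.T @ y)

# Toy example: 5 genes forming a chain, expression matrix X, phenotype y.
rng = np.random.default_rng(0)
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
X = rng.standard_normal((30, 5))
y = X @ np.array([1.0, 1.0, 0.8, 0.0, 0.0]) + 0.1 * rng.standard_normal(30)
print(network_ridge(X, y, A, lam=5.0))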

WP3: learning with continuous priors on features. A more general setting than those addressed in WP1 and WP2 is the case where we assume that Z is a discrete set embedded in a continuous space with, for example, a Euclidean or Hilbert geometry. A typical example would be to start from large collections of available data on genes or proteins, such as their sequence or structure, their expression level in thousands of publicly available expression arrays, or their co-occurrences in the literature, and to use positive definite kernels on these data to map the genes as points in a Hilbert space [14]. In other words, Z = {z_1, . . . , z_p} ⊂ H would be a finite set of points in a Hilbert space H, which would integrate our current knowledge about the genes. The idea of defining positive definite kernels from various sources of information about the same objects has been popularized as a convenient way to integrate heterogeneous data over genes or proteins [15]*. A significant amount of work has already been devoted to the construction of such representations through the engineering of various positive definite kernels, and to the use of such representations to learn over Z, e.g., for functional or structural classification of proteins [16]. Here my goal is different: I propose to use the structure of H to learn over R^Z, and not over Z directly.

I will consider different strategies to exploit the continuous nature of the structure on the features and to encode prior knowledge in the hypothesis space. A first idea would be to discretize the space and represent Z as a graph, in order to borrow the methods developed in WP2. This would parallel the popular approach in bioinformatics which consists of representing different types of information about genes and proteins as networks, and performing data integration at the level of networks [17]. A second and more elegant idea is to work directly with the continuous space, and extend the mathematical operations on graphs (such as operations on the spectrum of the graph Laplacian) to the continuous space. This would make it possible to directly exploit the manifold or geometric structures underlying the features.
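A minimal sketch of the first, discretization-based strategy (my illustration; the Gaussian kernel over random gene descriptors and the neighborhood size are placeholders): map the features into a Hilbert space via a kernel, build a nearest-neighbor graph from the kernel values, and feed the resulting graph to the penalties of WP2.

import numpy as np

def knn_graph_from_kernel(K, k=5):
    # Symmetric k-nearest-neighbor adjacency between features,
    # using the kernel values K[i, j] as similarities.
    p = K.shape[0]
    A = np.zeros((p, p))
    for i in range(p):
        neighbors = np.argsort(K[i])[::-1]          # most similar first
        neighbors = neighbors[neighbors != i][:k]   # drop self, keep k best
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)                       # symmetrize

# Toy kernel over 20 genes, e.g. derived from sequence or expression similarity.
rng = np.random.default_rng(0)
Z = rng.standard_normal((20, 8))                                     # gene descriptors
K = np.exp(-0.5 * np.square(Z[:, None] - Z[None, :]).sum(axis=-1))   # Gaussian kernel
A = knn_graph_from_kernel(K, k=3)
print(int(A.sum() / 2), "edges in the feature graph")   # ready for a WP2-style penalty w' (D - A) w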

WP4: integrating heterogeneous data. In practice we are often confronted with the problem of building predictive models from a combination of heterogeneous data. Typically, one would like to develop prognosis models for cancer from a combination of genomic, epigenomic and transcriptomic data. While work packages 1, 2 and 3 focus on individual sources of data, WP4 focuses on methods to learn jointly from several heterogeneous sources of data.

Our general optimization framework offers interesting possibilities for data integration. On the one hand, we can try to estimate a function f from multiple data sources that decomposes as a sum of functions for each data source, and design a penalty for f that integrates penalties for each data source. For particular choices of penalties, we recover various flavors of multiple kernel learning [16, 18], which has attracted a lot of attention recently. I will extend this framework to the case of more general penalty functions, as studied in WP1-WP3. On the other hand, we can directly try to infer a function over the product space formed by the different data sources, which does not necessarily decompose as a sum of functions for each source. This is motivated by the need to consider interactions between sources of data, e.g., to make predictions that depend on the presence of a particular mutation in the DNA and of another particular pattern in the expression levels of a group of genes. I will investigate the extension of recent ideas such as spectral penalties on matrices and operators, which were proposed in the context of collaborative filtering and multi-task learning [19, 20, 21], to the problem of data integration.
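The additive route can be sketched as follows (synthetic data and fixed kernel weights chosen by me for illustration; genuine multiple kernel learning would also optimize the weights): each data source contributes its own kernel over the tumours, and a convex combination of these kernels is handed to a standard kernel machine.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 2, size=n)                       # e.g. good vs. poor prognosis

# Two hypothetical data sources measured on the same n tumours.
X_expr = rng.standard_normal((n, 500)) + 0.5 * y[:, None]   # expression-like, informative
X_cnv = rng.standard_normal((n, 200))                        # copy-number-like, pure noise here

def linear_kernel(X):
    return X @ X.T

# Fixed convex combination of per-source kernels (the simplest instance of the
# "sum of functions per source" idea).
eta = np.array([0.7, 0.3])
K = eta[0] * linear_kernel(X_expr) + eta[1] * linear_kernel(X_cnv)

clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))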

WP5: application in cancer classification and prognosis. Although most methods investigated in WP1-WP4 are relevant to many applications, I will focus the development and practical application of these methods on a particular problem: the analysis of data measured on cancer patients and tumours, and in particular the development of predictive models for the presence or subtype of a tumour (diagnosis), for the evolution of the tumour and the risk of metastasis (prognosis), and for the response to candidate therapies (theragnosis). Although these applications differ in their goals, they can all be formulated similarly as learning a model from genomic data measured on a few biological samples. I will work both on public data, obtained from dedicated repositories on the web, and on non-public data, through collaborations within Institut Curie on breast cancer and neuroblastoma. I will focus particularly on data generated by The Cancer Genome Atlas (TCGA) project, which are freely and publicly available and provide a multitude of characterizations and clinical information for hundreds of tumours. I will benefit from the infrastructure and support of the core bioinformatics team at Institut Curie for the choice, collection and normalization of the high-throughput cancer data to be analyzed in this project.

Potential impacts and challenges

From a theoretical and methodological point of view, the questions addressed here challenge the limits of current statistical and machine learning methods when confronted with complex data and complex inference tasks on small training sets. Although much progress has been made recently in high-dimensional learning and feature selection, the practical problems I wish to attack are beyond the reach of current methods. I propose a change of paradigm, where the structure of the data is encoded in the learning algorithm itself to compensate for the limited amount of training data. I believe this could trigger many new developments in both theory and algorithms in statistics and machine learning. From the point of view of applications, the detection of predictive factors and the construction of predictive models for cancer prognosis and drug response prediction is one of the grand challenges in current genomic research. Any improvement in these applications could improve the quality of life and therapeutic management of the millions of new cancer cases diagnosed each year, through personalized treatments adapted to each tumor, and may lead to the discovery of new therapeutic targets.

In order to meet these goals, the main challenges that we will need to address are the following.

1. Propose specific penalties for the integration of prior knowledge on various data structures into state-of-the-art machine learning methods. This is the methodological heart of the project, which requires solid expertise in machine learning and functional analysis, as well as creativity to imagine novel penalties. My prior experience in designing kernels and convex penalties for a variety of settings will be directly applicable here.


2. Reconcile model accuracy and interpretability. We wish not only to develop accurate predictive models, but also to extract biological information from them, such as important biological pathways or novel drug targets. The focus on predictive structural patterns will force us to depart from the fast-moving field of consistency results for sparse estimation and develop specific methods. We have preliminary results showing that specific analyses taking the structure into account can lead to consistent structural estimation in settings where classical consistency results do not hold [13, 8]*.

3. Develop efficient algorithms and implementations. Although this project targets statistical problems where the number of training samples is relatively low (typically a few hundred or a few thousand tumours), this does not mean that the quantity of data will be small, since each sample may be characterized by millions or billions of features. The general framework considered in this project naturally leads to convex optimization problems, which can be solved efficiently by general-purpose solvers when the dimension of the problem does not exceed a few hundred or a few thousand variables [22]. With more features, however, we need to develop specific implementations that take into account the particular structure of the optimization problem.
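To make the last point concrete, here is a generic first-order sketch (not the project's actual solver; the problem sizes and the regularization level are arbitrary): proximal gradient methods such as ISTA only touch the data through matrix-vector products and a cheap proximal step, so they remain usable when the number of features rules out forming or factorizing X^T X.

import numpy as np

def ista_lasso(X, y, lam, n_iter=300):
    # Proximal gradient (ISTA) for min_w 0.5*||y - Xw||^2 + lam*||w||_1.
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))        # gradient step on the smooth part
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding prox
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))              # n samples << p features
w_true = np.zeros(5000)
w_true[:10] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(100)
w_hat = ista_lasso(X, y, lam=5.0)
print("nonzero coefficients recovered:", int((np.abs(w_hat) > 1e-3).sum()))

Accelerated variants (FISTA) and penalty-specific proximal operators, as mentioned in WP1, follow the same template.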

4. Integrate specific biological knowledge and impact cancer research. I believe that the most challenging aspect of this project, which is also its main strength, is its interdisciplinary nature. On the one hand, we wish to integrate domain-specific knowledge into the learning algorithms, which requires input from experts in biology and cancer genomics. On the other hand, we want our methods to impact cancer research, which means collaborating with biologists and medical doctors to guide our work towards useful findings. I am convinced that my own experience at the interface of statistics, machine learning and biology over the last 10 years, as well as the rich research environment of my laboratory with close connections to experts in cancer research and machine learning, will create adequate conditions to overcome this challenge.


2 Scientific proposal

2.1 State-of-the-art and objectives

Context

The need for personalized cancer medicine. Cancer is a major cause of morbidity and mortality in the world. In 2007, an estimated 12 million new cancer cases and 7.6 million cancer deaths were reported worldwide, and cancer will affect even more people in the years ahead with the aging of populations if our ability to prevent, diagnose and treat cancer does not improve. In spite of important advances in cancer research in the last decades, many aggressive forms of cancer remain without treatment, and treated tumours often become resistant to their therapeutic agents. A major challenge and opportunity in treating cancer is that every tumor and every cancer is different from a genomic point of view, which explains to a large extent the variability in risk, tumor prognosis, and response to therapies. A shift towards personalized cancer medicine, i.e., from a one-size-fits-all approach to more tailored therapies based on a tumor's genomic makeup, therefore holds promise to lower the social and economic burden of cancer. This in turn requires the development and validation of accurate predictive models and biomarkers to characterize each tumour and predict which therapeutic choice is the best option on a case-by-case basis.

New technology-driven perspectives in cancer biology. Linking chromosomal aberrations and cancer is far from new: oncogenes and tumour suppressor genes have long been known to be frequently amplified or deleted, leading to DNA copy imbalances. However, the resolution at which cancer genes could be identified has long been limited by technical constraints. The sequencing of the human genome in 2003, and recent technological advances, have radically changed the game. We can now systematically quantify DNA copy number variations (CNV) and polymorphisms along the genome with array comparative genomic hybridization (aCGH) or single nucleotide polymorphism (SNP) arrays, or even detect fine mutations and chromosomal rearrangements with next generation sequencing (NGS) technologies. In addition, we can measure the concentrations of messenger or micro RNAs and proteins for thousands of genes simultaneously with chips or sequencing technologies [23], monitor various epigenetic events along the genome such as patterns of DNA methylation and chromatin structure [24], and capture protein-DNA binding events with ChIP-chip [25] or ChIP-seq [26]. Overall, with the continuing development of high-throughput genomic technologies, and with the latest revolution brought about by the emergence of next generation sequencing (NGS) technologies, it is now feasible to contemplate comprehensive surveys of human cancer genomes and epigenomes, transcriptomes, proteomes and response profiles to various perturbations, thus paving the way to a rational understanding of new mechanisms and weaknesses of the disease. The analysis of gene expression alone has already contributed to the better use of available drugs by identifying cancer subtypes with different prognoses and responses to therapies [27, 28, 29], and several molecular prognosis signatures based on gene expression data are already in clinical trials [30, 31, 32].

Several initiatives have been launched recently to investigate the molecular characterization of large cohorts of human tumors with various high-throughput technologies, with the hope of cataloguing and discovering major cancer-causing alterations through multi-dimensional analysis. The most visible coordinated projects include The Cancer Genome Atlas (TCGA), which plans to process thousands of samples of more than 20 tumor types over the next five years [1], and the International Cancer Genome Consortium, which coordinates an international effort to analyze thousands of samples from 50 different tumor types and subtypes. Many other projects in large hospitals and cancer research centers focus on specific tumor subtypes and follow a similar data-driven approach, in the hope of identifying predictive biomarkers and better understanding the biology of tumour development and evolution.

Challenges in high-throughput data-driven biology. A hallmark of these new technologies is their tendency to generate huge amounts of unbiased data to characterise individual tumours and patients. It is then tempting to rely on a posteriori statistical analysis to identify interesting correlations in the data, construct biomarkers and infer predictive models. An obvious benefit of this data-driven research paradigm, with large and unbiased genomic surveys, is that it allows us to investigate directions that could never have been predicted by traditional molecular biology, and it has undoubtedly already led to many significant discoveries, such as the identification of novel biomarkers [33]. On the other hand, many frustrating results have also been reported. For example, prognostic molecular signatures estimated by state-of-the-art feature selection methods on gene expression datasets lack stability and have been shown to be very sensitive to the choice of samples used to select the genes [34, 35]. This makes the biological interpretation of these signatures challenging, and the discovery of novel drug targets even more difficult. Furthermore, the molecular signatures derived from gene expression data have not yet revolutionized existing models based on classical clinical data in terms of accuracy. Another example is that large-scale projects of genomic data collection, such as genome-wide association studies (GWAS) or the TCGA project, have highlighted many complex genomic variants with some predictive power for disease susceptibility or cancer characterization, but have so far largely failed to drastically increase our understanding of these diseases [36]. Overall, once the "easy" and already known strong correlations have been detected from genomic surveys, the picture that emerges from the accumulation of biological data is overwhelmingly complex, raising doubts about the capacity of purely data-driven approaches to elucidate the full complexity of molecular mechanisms in living cells, and calling for more hypothesis-driven research [37].

Modern statistical and machine learning methods. Arguably, a major bottleneck in current and future research is not so much the difficulty of generating data as the difficulty of analysing and exploiting them. Statistics and machine learning obviously play a central and increasing role in data-driven biology when it comes to processing large amounts of heterogeneous data and inferring predictive models and biomarkers from them. Both disciplines have witnessed impressive progress in the last two decades, both in theory and in practice, making it possible to attack complex data processing scenarios with efficient algorithms and sound theoretical bases. Statistical learning theory has increased our understanding of high-dimensional and nonlinear learning problems, and large-scale learning algorithms able to learn nonlinear functions and work with structured data, such as support vector machines, graphical models or deep belief networks, are now mature technologies in many application domains where the abundance of data allows the automated extraction of useful information and the construction of accurate predictive models [38, 14, 39]. The advent of the internet and the availability of large volumes of digital data and powerful computers, in particular, have triggered the development of state-of-the-art machine learning approaches for a variety of applications such as speech recognition, automatic translation, computer vision or marketing, to name just a few. The application of these modern statistical and machine learning methods to challenging problems in high-throughput data-driven biology seems natural.

The small n large p issue. However, in spite of many successes in bioinformatics such as automated gene finding in DNA or functional and structural annotation of proteins [40][41]*, it is fair to say that state-of-the-art machine learning approaches have not yet been able to significantly move forward the field of predictive modeling from high-dimensional and complex genomic data. For example, modern feature selection methods specifically developed for the high-dimensional setting are not better than traditional univariate statistical tests in terms of stability, accuracy, and ability to discover biological knowledge from gene expression data [42]*. Why is that so? I believe that a fundamental issue in current data-driven biology is the widening gap between, on the one hand, the quantity and complexity of the data that we collect on each sample and, on the other hand, the limited number of samples available. While we can observe in exquisite detail the molecular portrait of cancer tumours, and measure billions of molecular parameters on each tumour, we will never be able to observe billions of tumours. In statistical terms, if we assume that we observe n samples each characterized by p descriptors, we are in a setting where n is doomed to remain small (in the hundreds or thousands), while p dominates n and increases with each technological advance (currently in the billions). This is in sharp contrast with other application domains of statistical machine learning, such as image processing or speech recognition, where there is nowadays virtually no limit on the number of samples (speech recordings or images) that can be gathered to train machine learning models. With a limited number of samples, the statistical power and learning capacity of general-purpose machine learning procedures are drastically limited, and offer little hope to elucidate complex biological phenomena, such as interactions between SNPs in GWAS, or robust modeling of tumor prognosis from genomic or epigenomic data. The ”small n large p” phenomenon has triggered much research in statistics recently, and both theory and algorithms have been proposed, e.g., to consistently select features in this setting [2, 3]. However, the strong hypotheses underlying these methods, such as limited correlations between features, are usually not met in practice. I therefore believe that we have reached the limits of current statistical and machine learning approaches, such as multiple testing strategies or high-dimensional statistical inference, and that novel data analysis procedures, which go beyond blind data-driven methods, are needed to better exploit the wealth of quantitative biological data available now and in the future.
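To make this concrete, the following toy sketch (illustrative Python/NumPy code with synthetic data, not part of the proposal's methodology) shows how, when p greatly exceeds n, an unregularized linear model can fit a purely random training set perfectly while predicting no better than chance on new samples.

    # Toy illustration: with p >> n, an unregularized linear model can fit
    # any training set perfectly, yet generalizes poorly. All names and
    # dimensions here are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 5000                      # few samples, many features
    X_train = rng.normal(size=(n, p))
    y_train = rng.normal(size=n)         # pure noise: nothing to learn

    # minimum-norm least-squares fit (one of many interpolating solutions)
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    X_test = rng.normal(size=(1000, p))
    y_test = rng.normal(size=1000)

    train_mse = np.mean((X_train @ beta - y_train) ** 2)
    test_mse = np.mean((X_test @ beta - y_test) ** 2)
    print(f"train MSE = {train_mse:.2e}")   # essentially zero: perfect fit
    print(f"test  MSE = {test_mse:.2f}")    # no better than predicting 0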


Objectives

Exploiting prior knowledge and data structure. The data produced in high-throughput genomics are not only high-dimensional (”large p”), they also often have an explicit or implicit underlying structure. For example, data measured along the chromosomes, such as CNV, methylation profiles or sequenced read counts, have an obvious explicit linear structure. Measures on genes and proteins give a hint about the activity of underlying biological processes which involve the collective action of several genes and proteins, and the structure of such interactions may be inferred from, e.g., known gene and protein networks. In other words, instead of considering a set of p measurements on a tumour as simply a vector in R^p, it is often possible to see it as a vector in R^Z, where Z is a structured space of p features. For example, Z could be a finite chain to represent ordered features along a chromosome, a graph or hypergraph to represent features such as genes with known pairwise or multiway interactions, or more generally have a geometric or algebraic structure. In addition, for many inference tasks we may have prior knowledge in terms of structure, e.g., we may believe that a diagnostic signature should involve a limited number of biological pathways or protein complexes, or have a characteristic pattern on the DNA. Interestingly, exploiting the structure of Z in the learning process may therefore be a way to reduce the complexity of the learning task and adapt it to the amount of available samples. This could improve the quality of the inferred model in terms of accuracy. Moreover, by using prior knowledge to impose a structure on Z or on the model we wish to infer, one could constrain the learning problem enough to let it infer a robust model among plausible alternatives, leading to accurate and interpretable models.

While the mainstream of statistics and machine learning research focuses on generic inference methods for p-dimensional data, I therefore believe that the only way to move forward and infer complex phenomena from a limited amount of structured and high-dimensional data is to depart from generic methods and develop specific approaches adapted to each problem and dataset. The objective of this project is to propose and investigate new methods to infer predictive models and discover complex predictive patterns from limited amounts of complex biological data, by exploiting the structure of the data in the learning process. My vision is that, instead of confronting data-driven and hypothesis-based approaches as two antagonistic paradigms in high-throughput biology, we must integrate at the heart of data-driven strategies a lot of prior knowledge about the structure of the data and about the model we wish to infer, as routinely done in hypothesis-driven approaches. Only such prior knowledge can decrease the complexity of the learning task and adjust it to the amount of data available. In other words, I wish to develop a data-driven, hypothesis-constrained paradigm that exploits inherent structures in the data and combines the strength of state-of-the-art machine learning methods with the wealth of prior biological knowledge available, in a unified, mathematically sound and computationally efficient framework. The main application I will pursue to guide methodological developments is the inference of predictive models and the detection of predictive factors for prognosis and drug response in cancer.

Impact

The potential impacts of this project are numerous and important, both in methodology and in applications. From a theoretical and methodological point of view, the questions addressed in statistics and machine learning challenge the limits of current methods when confronted with complex data and complex inference tasks on small training sets. Although much progress has been made recently in high-dimensional learning and feature selection, the practical problems I wish to attack are beyond the reach of current methods. The approach I propose to investigate could trigger novel theoretical results which explicitly take into account the nature of the data, as opposed to considering them as generic high-dimensional vectors, and lead to novel algorithms to process complex data.

From the point of view of applications, the detection of predictive factors and the construction of predictive models for cancer prognosis and drug response prediction is one of the grand challenges in current genomic research. Any improvement in these applications could improve the quality of life and therapeutic management of the millions of patients diagnosed with cancer each year, through personalized treatments adapted to each tumor, and may lead to the discovery of new therapeutic targets.

Challenges

To reach these goals, we will have to address the following challenges.

1. Propose specific penalties for the integration of prior knowledge on various data structures into state-of-the-art machine learning methods. The methodological heart of this project is to move beyond the majority of existing methods in statistical machine learning, which typically assume that data are represented as (potentially high-dimensional) vectors and then apply an ingenious algorithm to these vectors. Here I wish to consider data which are vectors with a particular structure and prior knowledge on the underlying features. I will attack this challenge in the context of regularized empirical risk minimization procedures, and investigate in particular the use of novel regularizers integrating prior knowledge. This will require solid expertise in machine learning and functional analysis, as well as creativity to imagine novel penalties. My prior experience in designing kernels and convex penalties for a variety of settings will be directly applicable here.

2. Reconcile model accuracy and interpretability. In many applications of machine learning, we experience a frustrating gap between model interpretability and accuracy. Typically, ”black-box” models such as deep belief networks or nonlinear SVMs trained on large training sets often outperform more interpretable models such as sparse linear models. When only limited training sets of complex data are available, one can expect prior knowledge to ”force” the inference of interpretable models which explain the observed training data, leading to good predictive accuracy. Clarifying how robust the inferred model is, and how much we can trust it, will be an important challenge of this project, since we wish not only to develop accurate predictive models, but also to extract biological information from them, such as important biological pathways or novel drug targets. Paralleling the important corpus of recent statistical research on the consistent estimation of sparse models, I wish to develop approaches that can provably estimate more complex patterns than usual sparsity. I have preliminary results showing that a specific analysis taking the structure into account can lead to consistent structural estimation in settings where classical consistency results do not hold [13, 8]*.

3. Develop efficient algorithms and implementations. Although this project targets statistical problems where the number of training samples is relatively low (typically a few hundred or thousand tumours), this does not mean that the quantity of data will be small. On the contrary, each sample may be characterized by millions or billions of features (typically, sequencing or microarray data), and we therefore need implementations of the methods that are very efficient with respect to the number of features. The general framework considered in this project naturally leads to convex optimization problems, which can be solved efficiently by general-purpose solvers when the dimension of the problem does not exceed a few hundred or thousand dimensions [22]. With more features, however, we need to develop specific implementations that take into account the particular structure of the optimization problem.

4. Integrate specific biological knowledge and impact cancer research. I believe that the main challenging aspect of this project, which is also its main strength, is its interdisciplinary nature. From my past and current experience in statistics, machine learning and computational biology, I am convinced that bioinformatics has reached a point where standard data analysis tools fail to exploit the full information hidden within large biological data sets, and that methods specifically dedicated to a given problem and type of data often outperform standard methods. The development of such methods requires both a clear understanding of and strong expertise in modern statistics and machine learning, as well as a deep involvement in the biological aspects of the problem being considered. This in turn necessitates collaboration with experts from different fields, both to develop new concepts and to test newly developed methods on real and relevant data. I am convinced that my own experience at the interface of statistics, machine learning and biology over the last 10 years, as well as the rich research environment of my laboratory with close connections to experts in cancer research and machine learning, will create the adequate conditions to overcome this challenge.

2.2 Methodology

The methodological backbone of this project is in the field of statistics and machine learning for high-dimensional, structured and heterogeneous data. My objective is to process data of the form x ∈ R^Z, where the set of features Z is finite but has particular structures that reflect biological constraints. I will follow the popular idea of rephrasing statistical inference as a constrained optimization problem, where a data-dependent empirical risk functional is minimized over a restricted functional space. While most existing methods in this framework use generic functional spaces, such as balls for the ℓ1 or ℓ2 norm in a Hilbert space, I will investigate novel penalties and norms that explicitly take into account the structure of Z and the prior knowledge we have about the problems. The general questions I will address concern the design of novel penalties for specific data and problems, the study of their theoretical properties, their efficient implementation, and their validation on real data in cancer genomics. The project is divided into four methodological work packages, corresponding to increasingly complex structures for Z, and a fifth work package dedicated to the application of these methods to real cancer data. We begin below with a presentation of the general framework for statistical inference that we will use, before describing each work package in more detail.

General framework: regularized estimators and convex optimization

Let us consider the situation where we wish to infer a function f : X → R, where X represents a set of possible data or patterns, such as gene expression vectors or DNA copy number profiles of a tumor sample. In our case, we will typically consider X = R^Z, where Z is a finite structured set. f represents a property we wish to assign to each pattern. Typically, in binary classification, the sign of f assigns each pattern to one of two categories, such as metastatic vs non-metastatic when the data represent tumours. To estimate f we assume that we have a set of n labeled patterns (x_i, y_i), i = 1, . . . , n, where x_i ∈ X and y_i ∈ R for all i. This set of labeled examples is called the training set.

A general approach to estimate (or infer) the function f from the training set is to first define a set of candidate functions F among which to look for a solution, and a criterion R : F → R which quantifies how well a candidate function f ∈ F fits the training set. The risk R(f) is typically an empirical average of a loss function ℓ(y, f(x)), i.e., R(f) = (1/n) ∑_{i=1}^{n} ℓ(y_i, f(x_i)). The choice of the loss function ℓ depends on the data and the problem; we just assume that it is convex, to ensure that R is a convex function of f. Minimizing R(f) over F means looking for the function in F that best fits the training set. This approach is often referred to as empirical risk minimisation (ERM) [43]. Depending on the choice of the loss function, the ERM principle can be applied to solve various problems such as classification, regression, clustering and density estimation [38]. Although quite appealing at first sight, ERM is prone to overfitting if F is too big, i.e., some functions may perfectly fit the training data but have no predictive power on unseen data. This danger is particularly important when we work with a limited number of training samples in high dimension, since in that case even simple linear functions can perfectly fit any training set. In order to make the inference process consistent, a useful approach is to restrict the functions over which the risk is minimised, e.g., by defining a priori a penalty function Ω : F → R and considering the constrained optimisation problem:

$$\min_{f \in \mathcal{F}} R(f) \quad \text{such that} \quad \Omega(f) \le \gamma \,. \qquad (1)$$

In this equation, γ is a parameter that controls the complexity of the class of functions over which the risk is minimised, namely, F_γ = {f ∈ F : Ω(f) ≤ γ}. Intuitively, when γ is small, F_γ is also small and ERM is consistent, i.e., the function selected by minimising the empirical risk over F_γ will usually be near the best function in F_γ in terms of prediction accuracy on unseen data (we say that the procedure has a small estimation error or, by abuse of notation, a small variance). On the other hand, it might be that F_γ is too limited to contain any good function at all, so even the best function in F_γ may have poor accuracy (we say that the procedure has a large approximation error, or bias). γ is therefore the parameter that controls the bias/variance trade-off, once the risk function and penalty function are chosen, and is typically optimised by cross-validation or by some alternative heuristic. Different choices of functional space F, empirical risk functional R, and penalty functional Ω correspond to a number of state-of-the-art machine learning algorithms. Let us highlight in particular kernel methods [14] and shrinkage estimators for linear functions [44], which motivate many of the developments in this project and are reviewed below.

Importantly, in order to obtain good estimators with a limited number of training examples in high-dimensional or structured spaces, we intuitively need to design a function class F and a penalty function Ω such that F_γ is ”small”, yet contains good functions with a small risk. Indeed, such a choice would simultaneously ensure a small bias and a small variance. This highlights the importance of problem- and data-specific design of function spaces and penalty functions, into which we can translate our prior beliefs about which functions are likely to be good. This clarifies why we believe there is much room for new developments in machine learning for specific high-dimensional problems, since problem-specific penalties are required. We now review two related families of penalties which have triggered a lot of research in the last decade.
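As an illustration of this general framework, the following minimal sketch (illustrative Python/NumPy code, with hypothetical names and parameter values) solves the penalized counterpart of problem (1), min R(β) + λΩ(β), for the logistic loss and a squared Euclidean penalty, using plain gradient descent; it is a didactic stand-in, not the project's actual methodology.

    # Regularized empirical risk minimisation in penalized form:
    # minimise (1/n) sum_i log(1 + exp(-y_i x_i.beta)) + lam * ||beta||_2^2
    import numpy as np

    def fit_penalized_logistic(X, y, lam=0.1, lr=0.1, n_iter=500):
        """X: (n, p) features, y: labels in {-1, +1}. Illustrative solver."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            margins = y * (X @ beta)
            # gradient of the empirical logistic risk R(beta)
            grad_risk = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
            # gradient of the penalty Omega(beta) = ||beta||_2^2
            grad_pen = 2.0 * lam * beta
            beta -= lr * (grad_risk + grad_pen)
        return beta

    # small synthetic check
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    true_beta = np.concatenate([np.ones(5), np.zeros(15)])
    y = np.sign(X @ true_beta + 0.1 * rng.normal(size=100))
    beta_hat = fit_penalized_logistic(X, y)
    print("training accuracy:", np.mean(np.sign(X @ beta_hat) == y))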

Learning with positive definite kernels

A positive definite (p.d.) kernel on a space X is a symmetric real-valued function K : X × X → R such that, for any l ∈ N and any set of l data points x_1, . . . , x_l ∈ X, the symmetric l × l Gram matrix G defined by G_ij = K(x_i, x_j) has no negative eigenvalue. It is well known that any p.d. kernel implicitly defines an embedding Φ : X → H of X into a Hilbert space H through the equality K(x, x′) = 〈Φ(x), Φ(x′)〉_H. Moreover, any p.d. kernel K defines a Hilbert space H_K of functions f : X → R, called the reproducing kernel Hilbert space (RKHS) [45, 46, 47]. Although such functions have been known for a long time in functional and harmonic analysis, they have become extremely popular over the last 15 years in machine learning with the development of so-called kernel methods, in particular the SVM for supervised classification and regression [38, 14]. The idea in kernel methods is to use the RKHS as a functional space for inference, and the norm in the RKHS as penalty in the general formalism (1), i.e., to take F = H_K and Ω(f) = ‖f‖_K. From a practical point of view, a key property of kernel approaches, often referred to as the kernel trick, is that the optimisation problem (1) can be solved efficiently through finite-dimensional optimisation procedures, for any positive definite kernel, even though the RKHS may be of infinite dimension. This property is particularly interesting when the number of samples n is limited and the number of features p is large, since the complexity of the learning methods is typically quadratic or cubic in n once the kernel function has been evaluated on the training set.

Learning with positive definite kernels, in particular with the SVM algorithm for supervised classification, has become extremely popular in many applied domains, in particular computational biology [41]. A salient feature of learning in an RKHS is that, by designing a specific p.d. kernel on particular data, it is possible to learn functions on virtually any type of data and to include prior knowledge about the data and the problem to be solved in the kernel itself. In the past we have contributed significantly to the design of new kernels for biological and chemical data, including kernels for protein sequences [48, 49]*, phylogenetic profiles of genes [50]*, promoter regions of genes [51]*, 3D structures of proteins [52]* and 2D and 3D structures of small molecules [53, 54, 11]*. Besides allowing the manipulation of complex and structured objects in a clean theoretical framework, the choice of kernel has a substantial influence on the accuracy of the final model, as many authors have noticed, e.g., [49]*. This illustrates the importance of choosing an adequate functional space and penalty in (1), in particular when limited training data are available. Moreover, the computational complexity of the kernel evaluation plays a central role in the computational complexity of the learning step, and particular care must often be paid to the computational cost when a kernel is designed for complex objects.
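The computational point above can be illustrated by kernel ridge regression, arguably the simplest kernel method: once the n × n Gram matrix is computed, training reduces to an n-dimensional linear system, regardless of the input dimension p. The sketch below (illustrative code with a Gaussian RBF kernel and arbitrary parameter choices) is only meant to make this scaling visible; it is not the project's software.

    # Kernel ridge regression: training cost is governed by n, not p.
    import numpy as np

    def rbf_kernel(A, B, sigma):
        # squared Euclidean distances computed without forming (m, k, p) arrays
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

    def kernel_ridge_fit(X, y, lam, sigma):
        n = X.shape[0]
        K = rbf_kernel(X, X, sigma)
        # O(n^3) solve, independent of the number of features p
        return np.linalg.solve(K + lam * n * np.eye(n), y)

    def kernel_ridge_predict(X_train, alpha, X_new, sigma):
        return rbf_kernel(X_new, X_train, sigma) @ alpha

    # tiny example: n = 80 samples with p = 10000 features
    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 10000))
    y = X[:, 0] + 0.1 * rng.normal(size=80)
    sigma = np.sqrt(X.shape[1])        # rough bandwidth heuristic for this toy case
    alpha = kernel_ridge_fit(X, y, lam=1e-2, sigma=sigma)
    print(kernel_ridge_predict(X, alpha, X[:5], sigma), y[:5])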

Learning with sparsity-inducing norms

A second popular approach in regularized statistical estimation is to use penalties that lead to sparse linear models, i.e., that automatically perform feature selection jointly with the inference of the model. More precisely, we consider the case where data are p-dimensional vectors, i.e., X = R^p, and we wish to infer a linear function of the form f_β(x) = β^⊤x for some β ∈ R^p. Shrinkage estimators optimise (1) with a constraint on a norm of β [44]. For example, when the Euclidean norm of β is used, i.e., Ω_2(β) = ‖β‖_2^2 = ∑_{i=1}^{p} β_i^2, we recover the classical ridge regression or linear support vector machine, depending on the risk function. Alternatively, taking the ℓ1 norm Ω_1(β) = ‖β‖_1 = ∑_{i=1}^{p} |β_i| leads to Lasso regression [55], which produces sparse solutions in the sense that many of the coefficients of the optimal β are 0. The ℓ1 penalty therefore provides an elegant approach to learn a sparse classifier, which is of interest if one wishes to select a few predictors for the prediction. An interesting extension of the Lasso is the ℓ1/ℓ2 norm or group Lasso penalty, defined as follows when the features are clustered into distinct groups G = {g_1, . . . , g_k} [56]:

$$\Omega_{group}(\beta) = \sum_{g \in \mathcal{G}} \sqrt{\sum_{i \in g} \beta_i^2} = \sum_{g \in \mathcal{G}} \| \beta_g \|_2 \,. \qquad (2)$$

This penalty is a sum of the ℓ2 norms of the vector β restricted to the different groups. It has the effect of selecting sparse classifiers at the group level, i.e., of selecting groups of predictors. It has recently been extended to situations where groups are organized into a hierarchy [57, 58], where groups overlap [13]*[59], or where they are extracted from a graph structure [13]*.
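The practical appeal of these penalties comes largely from their simple proximal operators, which are the building blocks of modern first-order solvers. The sketch below (illustrative code; the groups are hypothetical) shows the soft-thresholding operator behind the Lasso and the block soft-thresholding operator behind the non-overlapping group Lasso.

    # Proximal operators of the l1 and (non-overlapping) group Lasso penalties.
    import numpy as np

    def prox_l1(v, t):
        """prox of t * ||.||_1 : elementwise soft-thresholding."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def prox_group_l2(v, t, groups):
        """prox of t * sum_g ||v_g||_2 for a partition of the coordinates."""
        out = np.zeros_like(v)
        for g in groups:
            norm = np.linalg.norm(v[g])
            if norm > t:
                out[g] = (1.0 - t / norm) * v[g]   # whole group shrunk or zeroed
        return out

    v = np.array([3.0, -0.5, 0.2, 0.3, -0.4, 0.1])
    groups = [[0, 1, 2], [3, 4, 5]]               # hypothetical feature groups
    print(prox_l1(v, 1.0))                         # sparsity at the coordinate level
    print(prox_group_l2(v, 1.0, groups))           # second group is zeroed as a block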

WP1: learning with linearly ordered features

Many genomic data can be seen as numeric profiles over chromosomes, i.e., can be considered as elements of X = R^Z, where Z = [1, p] is a succession of p positions. Examples of such data include CNV information captured by aCGH or SNP arrays (in which case p is the number of positions probed on the DNA), methylation, histone modification or protein-DNA binding sites captured by ChIP-chip or ChIP-seq technology, or more generally profiles of read counts mapped to the genome after a sequencing experiment. Prior knowledge tells us that, depending on the data, we can expect important biological patterns to have a spatial structure, e.g., we may assume the existence of predictive peaks with various shapes in the data, or the presence of sudden change-points.


Figure 1: This picture from [6]* shows a genomic signature to discriminate between metastatic and non-metastatic melanoma that was estimated from CGH profiles with an SVM (left) and a fused SVM (right) using the fused penalty (3). Each point on the x-axis is a position on the genome for which we have a measure of the DNA copy number. The different chromosomes are separated by vertical red lines. The fused SVM signature is more appealing when it comes to biological interpretation, having for example correctly identified the characteristic loss of 8p and gain of 8q in metastatic tumors.

For data in which we wish to find piecewise constant patterns, I will start from an old and interesting idea for inferring piecewise constant signals, namely the total variation norm [4], which was recently combined with the Lasso penalty to infer piecewise constant and sparse signals via the fused Lasso penalty:

$$\Omega_{TV}(\beta) = \sum_{i=1}^{p-1} |\beta_{i+1} - \beta_i| \,, \qquad \Omega_{fused}(\beta) = \sum_{i=1}^{p} |\beta_i| + \Omega_{TV}(\beta) \,. \qquad (3)$$

The latter has been shown to be very useful for denoising, regression and classification of CNV profiles [5, 60][6]*, and is a promising strategy to estimate interpretable and accurate predictive models (Figure 1). On the other hand, it raises several questions that I will investigate. From the theoretical point of view, it is important to understand how accurate the inferred pattern can be, in a context where classical consistency results do not hold because of the strong correlations between successive features [7]. I recently proved consistency results for a multidimensional extension of the fused Lasso but highlighted that it suffers from boundary effects which need to be corrected [8]*. A possibility for that may be to balance the influence of each position in (3) by a position-dependent weight. From the practical point of view, solving the general problem (1) with a fused Lasso penalty (3) raises computational issues when p is in the millions or billions, since we are confronted with a convex optimization problem in p dimensions. Recently, several researchers have found efficient specific implementations to approximate a signal by a piecewise constant profile in the least-squares sense with a fused Lasso penalty, in O(pk) [7] or O(p ln p) [61], where k is the final number of change points. I have preliminary evidence that it is possible to improve these results further, and that a specific dichotomic search algorithm can solve (1) in O(p ln k). I will investigate the possibility of extending this result to general convex risk functions R(β) with proximal first-order accelerated gradient methods [9]. Finally, I will investigate an interesting multidimensional extension of the total variation and fused Lasso penalties for the case where several profiles must be learned jointly, the group fused Lasso [8]*:

$$\Omega_{TV}(\beta) = \sum_{i=1}^{p-1} \| \beta_{i+1} - \beta_i \|_2 \,, \qquad (4)$$

where each β_i in (4) is itself a multidimensional vector. This penalty can be relevant when, e.g., shared change-points must be caught in the genomic profiles of different tumours, or in different profiles for a single tumour (e.g., the two channels of a SNP array). The use of such penalties raises novel computational challenges, but we showed in [8]* that linear complexities in p and n can be expected.
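For concreteness, the following sketch (illustrative code on a synthetic copy-number-like profile) simply evaluates the penalties (3) and (4): a piecewise constant profile pays only for its jumps, whereas a noisy profile pays everywhere, which is what makes these penalties attractive priors for change-point structures.

    # Evaluating the fused Lasso (3) and group fused (4) penalties on toy profiles.
    import numpy as np

    def omega_tv(beta):
        """Total variation in (3): sum_i |beta_{i+1} - beta_i|."""
        return np.abs(np.diff(beta)).sum()

    def omega_fused(beta):
        """Fused Lasso penalty (3): ||beta||_1 + TV(beta)."""
        return np.abs(beta).sum() + omega_tv(beta)

    def omega_group_tv(B):
        """Group fused penalty (4) for B of shape (p, m): sum_i ||B_{i+1} - B_i||_2."""
        return np.linalg.norm(np.diff(B, axis=0), axis=1).sum()

    rng = np.random.default_rng(0)
    piecewise = np.concatenate([np.zeros(40), np.ones(20), np.zeros(40)])  # one gained segment
    noisy = piecewise + 0.3 * rng.normal(size=piecewise.size)

    print("TV of piecewise-constant profile:", omega_tv(piecewise))        # 2.0 (two jumps)
    print("TV of noisy profile:", round(omega_tv(noisy), 1))
    print("fused penalty of piecewise profile:", omega_fused(piecewise))

    # two profiles sharing the same change-points, as in the group fused Lasso
    B = np.stack([piecewise, 2 * piecewise], axis=1)                       # shape (100, 2)
    print("group fused TV (4):", round(omega_group_tv(B), 3))              # sqrt(1+4) per jump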


A second important family of signals in genomic profiles are localized patterns, such as peaks in ChIP-chip and ChIP-seq experiments, or localized patterns with particular shapes in RNA-seq experiments. I will investigate how to encode the knowledge of these particular patterns into novel penalties, using for example sparse expansions over adequate dictionaries, or positive definite kernels between profiles to account for possible variations in the shape or location of signals.
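One possible instantiation of the dictionary idea, sketched below with purely illustrative choices (Gaussian bump atoms, arbitrary widths and regularization), is to approximate a profile as a sparse combination of localized peak shapes, estimated by ℓ1-penalized least squares solved with a basic proximal gradient (ISTA) loop.

    # Sparse expansion of a profile over a dictionary of peak-shaped atoms.
    import numpy as np

    p = 200
    positions = np.arange(p)

    def bump(center, width):
        atom = np.exp(-0.5 * ((positions - center) / width) ** 2)
        return atom / np.linalg.norm(atom)

    # dictionary of unit-norm peak shapes at every 5th position, two widths
    D = np.stack([bump(c, w) for c in range(0, p, 5) for w in (3.0, 8.0)], axis=1)

    # synthetic profile: two peaks plus noise
    rng = np.random.default_rng(0)
    signal = 3 * bump(60, 3.0) + 2 * bump(140, 8.0) + 0.05 * rng.normal(size=p)

    def ista(D, y, lam=0.1, n_iter=500):
        """Minimise 0.5*||y - D a||^2 + lam*||a||_1 by proximal gradient (ISTA)."""
        L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
        a = np.zeros(D.shape[1])
        for _ in range(n_iter):
            grad = D.T @ (D @ a - y)
            z = a - grad / L
            a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        return a

    a = ista(D, signal)
    print("number of selected atoms:", np.count_nonzero(a))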

WP2: learning with features in graphs or hypergraphs

A lot of prior information in genomics is encoded in large graphs or hypergraphs, which account for example for known protein complexes, functional groups, metabolic or signaling pathways, or more general genetic interactions. When we observe quantitative measurements on genes or proteins in a cell, it is then natural to consider the features (genes) as vertices of a graph or hypergraph Z (gene network), and to see some data, such as gene expression values, as a function over the graph or hypergraph. In this work package I will investigate how structural properties of Z can be exploited in the learning algorithm through a graph- or hypergraph-dependent prior Ω_Z(β) for β ∈ X = R^Z. We note that this question is very different from the problem of learning over a graph or hypergraph [10] (i.e., when X = Z), or from the problem of learning when each data point is itself a graph [11]*.

I recently proposed two families of penalties to translate prior knowledge about Z into a convex penalty. First, if Z is a graph and if we believe that the vector β ∈ R^Z should be smooth or piecewise constant on the graph, the following penalties are useful [12, 62]*:

$$\Omega_{spectral}(\beta) = \sum_{(i,j) \in \mathcal{Z}} (\beta_i - \beta_j)^2 \,, \qquad \Omega_{TV-graph}(\beta) = \sum_{(i,j) \in \mathcal{Z}} |\beta_i - \beta_j| \,. \qquad (5)$$

The first penalty in (5) penalizes high frequencies in the graph Fourier spectrum of β, and can be generalized to various spectral penalties that work in the Fourier domain. Interestingly, it can also be expressed as a norm in an RKHS, and can be implemented efficiently using the kernel trick [12]*. The second one is a generalization of the total variation norm to the graph, leading to piecewise constant values over connected components. Although more demanding computationally, recent results suggest that fast proximal optimization methods may be developed for this penalty [61]. Both approaches are motivated by situations where, e.g., we assume that the observed activities of genes or proteins are driven by underlying activities of pathways or protein complexes which form connected components of the graph Z (Figure 2). Both penalties can easily be combined with a Lasso penalty to additionally select connected components of Z, which may correspond to interesting pathways or protein complexes. Second, we proposed convex penalties which translate the prior hypothesis that we should try to select features which tend to form connected components in the hypergraph Z, without constraining the weights themselves [13]*:

$$\Omega_{group}(\beta) = \sum_{h \in \mathcal{Z}} \| \beta_h \|_2 \,, \qquad \Omega_{overlap}(\beta) = \sup_{\alpha \in \mathbb{R}^{\mathcal{Z}} : \, \forall h \in \mathcal{Z}, \, \| \alpha_h \| \le 1} \alpha^\top \beta \,. \qquad (6)$$

While the first penalty is a direct extension of the group Lasso penalty and tends to set the features in a same hyperedge to 0 together, the second one has the interesting property that the final support of β tends to be a union of hyperedges. Geometrically, the difference between the two penalties can be seen in the positions of the singularities of the balls they generate, since the final β after optimization of (1) will tend to lie on a singularity (Figure 3).

Based on my experience with these first investigations, I propose to study the properties of these penalties in terms of accuracy and consistency, to propose variants, and to investigate efficient implementations with, e.g., proximal first-order methods, which seem particularly adapted here. An important question I wish to address is to understand the influence of the graph structure, such as the presence of hubs or particular properties of the graph spectrum, on the penalty and the subsequent learning algorithm. For example, I observed that the fused Lasso suffers from boundary effects when the graph is a simple chain, and noted that hubs have a strong influence on the graph-dependent penalty presented above (unpublished data). Another direction I will explore is the possibility of deriving from the graph or hypergraph Z a multi-scale representation of vectors in R^Z, and of explicitly defining penalties from this multi-scale representation. A well-designed coarse-to-fine hierarchy in R^Z could in principle let the data select by themselves the correct degree of granularity of the function that can be inferred, and provide useful biological interpretability at any scale. Finally, I will study the impact of ”errors” in the graph, e.g., wrong or missing interactions between genes, and focus on methods that are robust with respect to such errors.
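As a small numerical illustration of the penalties (5) and (6), the sketch below evaluates them on a toy gene network; the graph, hyperedges and weight vector are hypothetical, and the code also checks that the spectral penalty coincides with the quadratic form defined by the graph Laplacian.

    # Graph and hypergraph penalties on a toy network.
    import numpy as np

    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]           # a small chain-like network
    hyperedges = [[0, 1, 2], [2, 3, 4]]                # hypothetical gene sets / complexes
    p = 5
    beta = np.array([1.0, 1.0, 1.0, 0.0, 0.0])         # constant on a connected component

    def omega_spectral(beta, edges):
        """First penalty in (5): sum over edges of (beta_i - beta_j)^2."""
        return sum((beta[i] - beta[j]) ** 2 for i, j in edges)

    def omega_tv_graph(beta, edges):
        """Second penalty in (5): total variation on the graph."""
        return sum(abs(beta[i] - beta[j]) for i, j in edges)

    def omega_group(beta, hyperedges):
        """First penalty in (6): sum of l2 norms over hyperedges."""
        return sum(np.linalg.norm(beta[h]) for h in hyperedges)

    # the same quadratic penalty via the graph Laplacian L = D - A
    A = np.zeros((p, p))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    print(omega_spectral(beta, edges), beta @ L @ beta)    # identical values
    print(omega_tv_graph(beta, edges), omega_group(beta, hyperedges))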


Figure 2: This picture, from [12]*, shows a transcriptomic signature to discriminate between yeast populations subject to low and high doses of irradiation. Each gene in the KEGG metabolic network is colored according to its weight in the signature. The signature was estimated with an SVM using a classical linear kernel (left picture) or a network-dependent kernel (right picture) implementing the spectral penalty (5). The signature obtained from the network-dependent kernel directly captures discriminant pathways.


Figure 3: This picture from [13]* shows the balls for Ω_group(β) (left) and Ω_overlap(β) (right) as defined in (6) for the hypergraph Z = {{1,2},{2,3}}, where β_2 is represented as the vertical coordinate. The singularities in each ball correspond to the prior hypothesis we put on the vector β to be inferred.

WP3: learning with continuous priors on features

A more general setting than those addressed in WP1 and WP2 is the case where we assume that Z is a discrete set embedded in a continuous space with, for example, a Euclidean or Hilbert geometry. A typical example would be to start from large collections of available data on genes or proteins, such as their sequence or structure, their expression level in thousands of publicly available expression arrays, or their co-occurrences in the literature, and to use positive definite kernels on these data to map the genes to points in an RKHS. In other words, Z = {z_1, . . . , z_p} ⊂ H would be a finite set of points in an RKHS H which integrates our current knowledge about the genes. The idea of defining positive definite kernels from various types of information about the same data has been popularized as a convenient way to integrate heterogeneous data on genes or proteins [63][15]*. A significant amount of work has been devoted to the construction of such representations through the engineering of various positive definite kernels (see above for a snapshot of my own contributions), and to the use of such representations to learn over Z, e.g., for the functional or structural classification of proteins [16]. Here my goal is different: I propose to use the structure of H to learn over R^Z and not over Z, i.e., to use the prior hypothesis on the features themselves, and not on the samples.

I will consider different strategies to exploit the continuous nature of the structure on the features and encode prior knowledge in the hypothesis space. A first idea would be to discretize the space and represent Z as a graph, and to borrow the methods developed in WP2. This would parallel the popular approach in bioinformatics which consists in representing different types of information about genes and proteins as networks, and performing data integration and data analysis at the level of networks [17]. A second and more elegant idea is to work directly with the continuous space, and to extend the mathematical operations on graphs (such as operations on the spectrum of the graph Laplacian) directly to the continuous space. This could make it possible to directly exploit the manifold or geometric structures underlying the features. Theoretical consistency results and practical implementations may, however, be more challenging to obtain in the continuous case.
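A minimal sketch of the first, discretization-based strategy is given below: genes are compared through a kernel computed on side information (here a stand-in matrix of expression profiles across public arrays), and a k-nearest-neighbour graph is built from these similarities so that the WP2 penalties can be reused. All data and parameter choices (kernel bandwidth, number of neighbours) are illustrative assumptions.

    # Building a feature graph on genes from a feature-level kernel.
    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_arrays = 50, 200
    profiles = rng.normal(size=(n_genes, n_arrays))    # stand-in for public expression data

    # feature-level Gaussian kernel between genes
    sq = (profiles ** 2).sum(1)[:, None] + (profiles ** 2).sum(1)[None, :] \
         - 2 * profiles @ profiles.T
    K = np.exp(-np.maximum(sq, 0.0) / (2 * n_arrays))

    # k-nearest-neighbour graph on the genes (symmetrised)
    k = 5
    A = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        sims = K[i].copy()
        sims[i] = -np.inf                              # exclude self-loops
        neighbours = np.argsort(sims)[-k:]
        A[i, neighbours] = 1.0
    A = np.maximum(A, A.T)                             # undirected graph

    laplacian = np.diag(A.sum(axis=1)) - A             # ready for the spectral penalty of WP2
    print("edges in the feature graph:", int(A.sum() / 2))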

WP4: integrating heterogeneous data

In practice we are often confronted with the problem of building predictive models from a combination of heterogeneous data. Typically, one would like to infer prognosis models for cancer from a combination of genomic, epigenomic and transcriptomic data. While work packages 1, 2 and 3 focus on individual sources of data, WP4 focuses on methods to learn jointly from several heterogeneous sources of data. The general framework (1) offers interesting possibilities to perform this data integration. Indeed, let us suppose that we have m sources of data. For each source i = 1, . . . , m, we discussed in WP1-WP3 various ways to design specific functional spaces F_i and penalty functions Ω_i such that a predictor f_i solely based on this source of data is estimated by solving the problem:

$$\min_{f_i \in \mathcal{F}_i} R(f_i) \quad \text{such that} \quad \Omega_i(f_i) \le \gamma_i \,. \qquad (7)$$

A first extension of this approach to jointly use the m data sources is to write the predictor we wish to estimate as a sum of predictors based on individual data, i.e., f = f_1 + . . . + f_m, and to jointly estimate the different components through the following optimisation problem:

$$\min_{(f_1, \ldots, f_m) \in \mathcal{F}_1 \times \ldots \times \mathcal{F}_m} R\left( \sum_{i=1}^{m} f_i \right) \quad \text{such that} \quad \Omega_i(f_i) \le \gamma_i \,, \quad i = 1, \ldots, m \,. \qquad (8)$$

When the penalties Ω_i are norms in RKHSs, we recover with (8) various flavors of multiple kernel learning [16, 18], which has attracted a lot of attention recently. I will extend this framework to the case of more general penalty functions, as studied in WP1-WP3. A second extension is to jointly estimate a function f = f_1 + . . . + f_m which integrates the m data sources by designing a problem- and data-specific joint penalty Ω_joint(f_1, . . . , f_m), and solving the problem:

$$\min_{(f_1, \ldots, f_m) \in \mathcal{F}_1 \times \ldots \times \mathcal{F}_m} R\left( \sum_{i=1}^{m} f_i \right) \quad \text{such that} \quad \Omega_{joint}(f_1, \ldots, f_m) \le \gamma \,. \qquad (9)$$

While (9) covers in particular the case (8) when we take Ω_joint(f_1, . . . , f_m) = ∑_{i=1}^{m} Ω_i(f_i), I will also study more general penalties Ω_joint which introduce correlations between the f_i's. For example, to train predictors of tumour class from expression data and CGH profiles, we may want to select genes jointly from their expression and copy number variations, in which case we could for example define a joint penalty as a group Lasso which groups together variables across datasets.
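As a concrete illustration of such a cross-dataset penalty (with purely hypothetical coefficient values), the sketch below computes a group Lasso penalty that groups, for each gene, its expression weight with its copy-number weight; using it as Ω_joint in (9) would tend to select or discard a gene jointly in both data sources.

    # A joint group Lasso penalty across two data sources (illustrative).
    import numpy as np

    beta_expr = np.array([1.2, 0.0, 0.3, 0.0, -0.8, 0.0])   # weights on expression features
    beta_cnv  = np.array([0.5, 0.0, 0.0, 0.0,  0.4, 0.1])   # weights on copy-number features

    def omega_joint(beta_expr, beta_cnv):
        """Sum over genes of ||(beta_expr_g, beta_cnv_g)||_2."""
        return np.sqrt(beta_expr ** 2 + beta_cnv ** 2).sum()

    print("joint group penalty:", round(omega_joint(beta_expr, beta_cnv), 3))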

Finally, a third approach we will investigate is the inference of functions f : Z_1 × . . . × Z_m → R which do not decompose as a sum f = f_1 + . . . + f_m. This is motivated by the need to consider interactions between sources of data, e.g., to make predictions that depend on the presence of a particular pattern in the CGH profile and of another particular pattern in the expression levels of a group of genes. Mathematically, this raises the question of learning a function over a product space Z = Z_1 × . . . × Z_m. Interestingly, much research has been devoted recently to the study of specific penalties for matrices, which correspond to a particular case of learning over a product space, albeit a very simple one. In particular, spectral penalties such as the non-convex rank or the convex trace norm of matrices have been proposed as interesting penalties to capture interactions between two spaces [19, 20]. With colleagues I recently generalized these approaches and proposed a general framework to learn operators over product spaces, when each Z_i has a Hilbert structure, with applications in collaborative filtering, multitask learning and network reconstruction [21, 64]*. In this project, I propose to investigate the extension of these concepts to more than two spaces, endowed with Hilbert or non-Hilbert norms, in order to provide new methods for data integration.

WP5: application in cancer classification and prognosis

Although most methods investigated in the methodological work packages WP1-WP4 are relevant to many applications, in computational biology or other fields, I will focus the development and practical application of these methods on a particular problem: the analysis of data measured on cancer patients and tumours, and in particular the development of predictive models for the presence or subtype of a tumour (diagnosis), for the evolution of the tumour and the risk of metastasis (prognosis), and for the response to candidate therapies (theragnosis). Although the three applications differ in their goals, they can all be formulated similarly as learning a model from genomic data measured on a limited number of biological samples. I will work both on public data, obtained from dedicated repositories on the web, and on non-public data, through collaborations within Institut Curie. I will benefit from the infrastructure and support of the core bioinformatics team at Institut Curie for the choice, collection and normalization of the high-throughput cancer data to be analyzed in this project.


More precisely, I will first focus on public gene expression datasets that can easily be collected from repositories such as EBI's ArrayExpress or NCBI's Gene Expression Omnibus, which offer the possibility to collect thousands of expression profiles for a particular problem such as breast cancer prognosis. We already have in the lab the expertise and a pipeline dedicated to the collection and normalization of expression, CGH, clinical and proteomic data [65]. Our first targeted application will be to predict the risk of metastasis in early-stage breast cancer, an application with tremendous potential impact and lots of publicly available data [66][42]*.

Second, I will work on the public data generated by the TCGA project. Various datasets for three types of cancer (brain, lung and ovarian) were produced in the first phase of the project and are already publicly available, including clinical and histopathological data, gene expression, chromosomal copy number, loss of heterozygosity, methylation patterns, miRNA expression and DNA sequences. More data on other cancers should become available within the next 5 years. These data represent an excellent showcase for the methods developed in this project, and will allow the fast diffusion of the successful new methods.

Third, and in parallel, I will work directly with biologists and doctors within Institut Curie, where my lab is located, on focused projects with original data. Institut Curie is a reference center in Europe for breast and pediatric cancer, and I have already started local collaborations with several groups, including Fabien Reyal (surgeon), Marc Bollet (radiotherapist) and Anne Vincent-Salomon (pathologist) on breast cancer classification and prognosis applications from expression, copy number and genotype data, and with Isabelle Janoueix-Lerosey (biologist) and Gudrun Schleiermacher (paediatrician) on neuroblastoma prognosis from expression and copy number information. Several new projects are planned within Institut Curie in the coming years, involving in particular the analysis of tumours with sequencing technologies, and will provide ample opportunities to attack concrete applications and benefit from expert knowledge.

In order to have the maximum impact on cancer research and on potential users, the methods developed in this project will by default be packaged in the R statistical language and distributed freely.

2.3 Resources

The core scientific team for this project will include 7 people, for a total involvement of 213 person-months. I will devote 75% of my time to this project, covering scientific leadership, project management, and student/post-doc supervision. Three PhD students will be recruited (starting in years 1, 2 and 3), to work respectively on WP1, WP2 and WP3. Three post-docs will be recruited, to work more specifically on collaborations with biologists, on the analysis of TCGA data (WP5), and on the integration of heterogeneous data (WP4). In addition, one European Project Manager will take care of administrative and financial matters (5 months). The Gantt chart below summarizes the organization of the activity in the different but interdependent work packages. WP1, 2 and 3 correspond to similar problems of increasing complexity and will be attacked successively. WP4 will be attacked in the second half of the project, since it will be based in part on the results of the first three work packages. WP5 is a transversal activity which will be active throughout the project.

The budget necessary to support this project is detailed for each year in the table below. In addition to personnel costs, I plan to support 2 or 3 publications per year (4 Keuros/year), travel to international conferences and visits for project members (15 Keuros/year), and invitations of students and researchers (4 Keuros/year). No external collaborator will be paid by this grant. I plan to buy computing resources (servers for scientific computation: 25 Keuros at the start of the project), which will be used exclusively for the realization of this project; their cost will be determined accordingly, following ARMINES's usual accounting principles and practices. Finally, 15 Keuros are requested for audit expenses through subcontracting (1% of the total). The project will benefit from the infrastructures of Ecole Nationale Superieure des Mines de Paris and ARMINES.


ERC Starting Grants - Budget tables to be inserted in section c) ”Resources”, heading iii. ”Budget”

iii. Budget - Table 1 (project duration: 60 months [1])

Direct Costs (euros), by reporting period:

Cost Category                Months 1-18   Months 19-36   Months 37-54   Months 55-60       Total
Personnel
  P.I. [2]                       118 501        118 501        118 501         39 500     395 005
  Senior Staff                         -              -              -              -           -
  Post-docs                       83 357         83 357         83 357         27 786     277 855
  Students                        88 392        176 783        110 489         22 098     397 762
  Other                            7 065          7 065          7 065          2 355      23 549
  Total Personnel                297 314        385 706        319 412         91 739   1 094 170
Other Direct Costs
  Equipment                       12 500         12 500              -              -      25 000
  Travel (project members)        22 500         22 500         22 500          7 500      75 000
  Travel (invitations)             6 000          6 000          6 000          2 000      20 000
  Publications                     6 000          6 000          6 000          2 000      20 000
  Total Other Direct Costs        47 000         47 000         34 500         11 500     140 000
Total Direct Costs               344 314        432 706        353 912        103 239   1 234 170
Indirect Costs (overheads,
  20% of Direct Costs)            68 863         86 541         70 782         20 648     246 834
Subcontracting Costs
  (no overheads)                       -          7 500          7 500              -      15 000
Total Requested Grant            413 177        526 747        432 194        123 886   1 496 004

[1] Adapt to actual project duration. [2] Please take into account the percentage of your dedicated working time (minimum 50%) to run the ERC funded activity when calculating the salary.

Percentage of working time the PI dedicates to the project over the period of the Grant: 75%

iii. Budget - Table 2

Key intermediate goal            Estimated % of total    Expected to be
(as defined in section 2.ii)        requested grant      completed on month
WP1                                       20%                    36
WP2                                       20%                    48
WP3                                       20%                    60
WP4                                       20%                    60
WP5                                       20%                    60
Total                                    100%


2.4 Ethical issues

Research on Human Embryo/Foetus: NO
  Does the proposed research involve human Embryos?
  Does the proposed research involve human Foetal Tissues/Cells?
  Does the proposed research involve human Embryonic Stem Cells (hESCs)?
  Does the proposed research on human Embryonic Stem Cells involve cells in culture?
  Does the proposed research on Human Embryonic Stem Cells involve the derivation of cells from Embryos?
  I CONFIRM THAT NONE OF THE ABOVE ISSUES APPLY TO MY PROPOSAL: YES

Research on Humans: YES
  Does the proposed research involve children? NO
  Does the proposed research involve patients? NO
  Does the proposed research involve persons not able to give consent? NO
  Does the proposed research involve adult healthy volunteers? NO
  Does the proposed research involve Human genetic material? NO
  Does the proposed research involve Human biological samples? NO
  Does the proposed research involve Human data collection? YES (pages 4-8)

Privacy: YES
  Does the proposed research involve processing of genetic information or personal data (e.g. health, sexual lifestyle, ethnicity, political opinion, religious or philosophical conviction)? YES (pages 4-8)
  Does the proposed research involve tracking the location or observation of people? NO

Research on Animals: NO
  Does the proposed research involve research on animals?
  Are those animals transgenic small laboratory animals?
  Are those animals transgenic farm animals?
  Are those animals non-human primates?
  Are those animals cloned farm animals?
  I CONFIRM THAT NONE OF THE ABOVE ISSUES APPLY TO MY PROPOSAL: YES

Research Involving Developing Countries: NO
  Does the proposed research involve the use of local resources (genetic, animal, plant, etc)?
  Is the proposed research of benefit to local communities (e.g. capacity building, access to healthcare, education, etc)?
  I CONFIRM THAT NONE OF THE ABOVE ISSUES APPLY TO MY PROPOSAL: YES

Dual Use: NO
  Research having direct military use
  Research having the potential for terrorist abuse
  I CONFIRM THAT NONE OF THE ABOVE ISSUES APPLY TO MY PROPOSAL: YES


References

[1] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061–1068, Oct 2008.

[2] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat.,35(6):2313–2351, 2007.

[3] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat.,37(4):1705–1732, 2009.

[4] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D,60:259–268, 1992.

[5] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108, 2005.

[6] F. Rapaport, E. Barillot, and J.-P. Vert. Classification of arrayCGH data using fused SVM. Bioinformatics,24(13):i375–i382, Jul 2008.

[7] Z. Harchaoui and C. Levy-Leduc. Catching change-points with lasso. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Adv. Neural. Inform. Process Syst., volume 20, pages 617–624. MIT Press, Cambridge, MA, 2008.

[8] J.-P. Vert and K. Bleakley. Fast detection of multiple change-points shared by many signals using group LARS. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Adv. Neural. Inform. Process Syst., volume 22, pages 2343–2352, 2010.

[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.

[10] O. Chapelle, B. Scholkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[11] P. Mahe and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3–35, 2009.

[12] F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8:35, 2007.

[13] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440, New York, NY, USA, 2009. ACM.

[14] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.

[15] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363–i370, 2004.

[16] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004.

[17] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, A. Emili, M. Snyder, J.F. Greenblatt, and M. Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449–453, 2003.

[18] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491–2521,2008.

Page 28: SEVENTH FRAMEWORK PROGRAMME ”Ideas” Specific programme ...members.cbio.mines-paristech.fr › ~jvert › svn › erc › ... · Curriculum vitae Education 2004 Research habilitation

Jean-Philippe VERT - SMAC - 280032 - Nov. 25, 2011 28

[19] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329–1336, Cambridge, MA, 2005. MIT Press.

[20] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In COLT, pages 545–560, 2005.

[21] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: operator estimationwith spectral regularization. J. Mach. Learn. Res., 10:803–826, 2009.

[22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[23] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 5(7):621–628, Jul 2008.

[24] P. A. Jones and S. B. Baylin. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet., 3(6):415–428, Jun 2002.

[25] B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett,E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young. Genome-wide location and function of DNAbinding proteins. Science, 290(5500):2306–2309, Dec 2000.

[26] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interac-tions. Science, 316(5830):1497–1502, Jun 2007.

[27] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran,X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock,W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever,J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identifiedby gene expression profiling. Nature, 403(6769):503–511, Feb 2000.

[28] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno,M. Gillette, M. Loda, G. Weber, E. J. Mark, E. S. Lander, W. Wong, B. E. Johnson, T. R. Golub, D. A. Sug-arbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling revealsdistinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, 98(24):13790–13795, Nov 2001.

[29] T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S.Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. Eystein Lønning, and A. L. Børresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications.Proc. Natl. Acad. Sci. USA, 98(19):10869–10874, Sep 2001.

[30] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy,M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, andS. H. Friend. Gene expression profiling predicts clinical outcome of breast cancers. Nature, 415(6871):530–536,Jan 2002.

[31] M. J. van de Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J. Schreiber, J. L.Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde,H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as apredictor of survival in breast cancer. N. Engl. J. Med., 347(25):1999–2009, Dec 2002.

[32] B. Haibe-Kains, C. Desmedt, F. Piette, M. Buyse, F. Cardoso, L. van ’t Veer, M. Piccart, G. Bontempi, and C. Sotiriou. Comparison of prognostic gene expression signatures for breast cancer. BMC Genomics, 9:394, 2008.

[33] T. Golub. Counterpoint: Data first. Nature, 464(7289):679, Apr 2010.

[34] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, Jan 2005.


[35] S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458):488–492, 2005.

[36] D. B. Goldstein. Common genetic variation and human traits. N. Engl. J. Med., 360(17):1696–1698, Apr 2009.

[37] R. Weinberg. Point: Hypotheses first. Nature, 464(7289):678, Apr 2010.

[38] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[39] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.

[40] P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, 2001.

[41] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, Massachusetts, 2004.

[42] A.-C. Haury and J.-P. Vert. On the stability and interpretability of prognosis signatures in breast cancer. In Proceedings of the Fourth International Workshop on Machine Learning in Systems Biology (MLSB10), 2010. To appear.

[43] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer, 1996.

[44] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[45] N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950.

[46] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.

[47] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer,2003.

[48] M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Netw., 18(4):1111–1123, 2005.

[49] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.

[50] J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 18:S276–S284, 2002.

[51] J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages 1401–1408, Cambridge, MA, 2006. MIT Press.

[52] J. Qiu, J. Hue, A. Ben-Hur, J.-P. Vert, and W. S. Noble. A structural alignment kernel for protein structures. Bioinformatics, 23(9):1090–1098, May 2007.

[53] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45(4):939–951, 2005.

[54] P. Mahé, L. Ralaivola, V. Stoven, and J.-P. Vert. The pharmacophore kernel for virtual screening with support vector machines. J. Chem. Inf. Model., 46(5):2003–2014, 2006.

[55] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58(1):267–288, 1996.

[56] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B, 68(1):49–67, 2006.

[57] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat., 37(6A):3468–3497, 2009.


[58] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Adv. Neural. Inform. Process Syst., volume 21, 2009.

[59] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical Report 0904.3523, arXiv, 2009.

[60] R. Tibshirani and P. Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18–29, Jan 2008.

[61] H. Hoefling. A path algorithm for the Fused Lasso Signal Approximator. Technical Report 0910.0526v1, arXiv, Oct. 2009.

[62] F. Rapaport. Introduction de la connaissance a priori dans l'étude des puces à ADN. PhD thesis, Université Pierre et Marie Curie - Paris 6, 2008.

[63] P. Pavlidis, J. Weston, J. Cai, and W. S. Noble. Learning gene functional classifications from multiple data types. J. Comput. Biol., 9(2):401–411, 2002.

[64] M. Hue and J.-P. Vert. On learning with kernels for unordered pairs. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 463–470. Omnipress, 2010.

[65] A. Elfilali, S. Lair, C. Verbeke, P. La Rosa, F. Radvanyi, and E. Barillot. ITTACA: a new database for integrated tumor transcriptome array and clinical data analysis. Nucleic Acids Res., 34(Database issue):D613–D616, Jan 2006.

[66] P. Wirapati, C. Sotiriou, S. Kunkel, P. Farmer, S. Pradervand, B. Haibe-Kains, C. Desmedt, M. Ignatiadis, T. Sengstag, F. Schütz, D. R. Goldstein, M. Piccart, and M. Delorenzi. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res., 10(4):R65, 2008.