CSI5180. Machine Learning for Bioinformatics Applications
Course overview
by
Marcel Turcotte
Version November 6, 2019
Preamble
Course overview
Machine Learning for Bioinformatics Applications is about the analysis of complex biological data using modern machine learning methods. No prior machine learning knowledge is assumed. However, a basic understanding of probability and statistics is needed, as well as calculus and linear algebra. I also expect that you can write programs in Python. Now, what about biology? Biology is important, as bioinformatics strives to solve “real-world” problems. There will be at least two lectures introducing essential concepts of the molecular biology of the cell. Inevitably, we will revisit these concepts each time a new problem is introduced. At the very least, I expect a desire to learn more about biology.
General objective: Summarize the learning objectives and the expectations for this course
Learning objectives
Clarify the proposition
Summarize what bioinformatics is about
Give an overview of the instructor’s background
Discuss the syllabus
Articulate the expectations
Reading:
Chunming Xu and Scott A Jackson. Machine learning and complex biological data. Genome Biol, 20(1):76, Apr 2019.
Plan
1. Preamble
2. Proposition
3. About the course
4. About me
5. What is Bioinformatics?
6. Syllabus
7. What is Machine Learning?
8. Prologue
Proposition
AI detects mutations behind autism
“Using artificial intelligence, a Princeton University-led team has decoded the functional impact of such mutations in people with autism.”
Zhou et al. Nat Genet, 51(6):973–980, June 2019.
https://bit.ly/2QtnmxS
Image: Autism Daily Newscast
Olga Troyanskaya/Princeton at AI NY 2019
https://oreilly.com/go/ainy19
AI detects mutations behind autism
“We address the challenge of detecting the contribution of noncoding mutations to disease with a deep-learning-based framework that predicts the specific regulatory effects and the deleterious impact of genetic variants.”
“Our predictive genomics framework illuminates the role of noncoding mutations in ASD [autism spectrum disorder] and prioritizes mutations with high impact for further study, and is broadly applicable to complex human diseases.”
Zhou et al. Nat Genet, 51(6):973–980, June 2019.
“Together, the HMP1 and HMP2 phases have produced a total of 42 terabytes of multi-omic data.”
Integrative HMP (iHMP) Research Network Consortium. The Integrative Human Microbiome Project. Nature 569, 641–648 (2019).
Improving fitness and health
“MyExome, a new DNA test designed by Toronto entrepreneur Zaid Shahatit, claims to be able to provide a little insight into our personal quirks by testing 57 different genes that could determine our ability to metabolize certain things, sleep patterns and physical performance.”
Can a DNA test improve your fitness and health? by Christine Sismondo, The Star. July 31, 2019.
Image: myexome.com
“A Brief History of Tomorrow”
Yuval Noah Harari argues that artificial intelligence and genetic engineering will play a central role in shaping the future of society.
Image: Amazon.ca
About the course
What this course is not
Although the following are of paramount importance, this is not what this course is about:
Computational Learning Theory:
Probably approximately correct learning (PAC learning), proposed by Leslie Valiant;
VC theory, proposed by Vladimir Vapnik and Alexey Chervonenkis;
Bayesian inference, influenced by Judea Pearl;
Algorithmic learning theory, from E. Mark Gold;
Online machine learning, from Nick Littlestone.
Compression bounds and learnability in general.
What this course is
Practical applications of machine learning to biological sequence data, gene expression, genomics and proteomics.
Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.
Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
What I would like the course to be. . .
In future editions of this course:
Extensive set of examples
Practical Machine Learning Applications in Bioinformatics (textbook)
Hackathon, hackfest, codefest, and (friendly) competitive challenges
Participation in international competitions: https://dream.recomb2019.org
Activity in the bioGARAGE
Guest lectures
Cellular Molecular Biology Problems
Predicting protein stability changes upon mutation, intrinsically disordered protein regions
Protein secondary and tertiary structure prediction
Prediction of anti-hypertensive peptides
Genome assembly, gene prediction, genome annotation
Identifying DNA landmark sites: methylation, splice sites, promoters, protein binding sites, etc.
Prediction and prioritization of gene functional annotations
Clustering and classification of non-coding RNA genes
Cancer subtype classification
Toxicity, carcinogenicity, structure-activity relationships
Predicting disease associations, identifying robust prognostic gene signatures
Sub-cellular localization
Machine Learning Concepts
Feature Engineering, Data Imputation, Dimensionality Reduction
Unsupervised Learning
Linear and Logistic Regression
Decision Trees, Random Forests and eXtreme Gradient Boosting, Ensemble
Hidden Markov Models
Kernel Methods, Support Vector Machines
Deep Learning: Fundamentals, Embeddings, Architectures
Concept and Rule-based Learning
Graphs
Semi-supervised Learning
Automated Scientific Discovery
Learning objectives
Encode and clean biological data for machine learning applications
Apply modern machine learning methods to solve bioinformatics problems
Find optimal values for the hyperparameters of a given machine learning algorithm and data set
Use a sound methodology for your machine learning projects
Critically review scientific publications in this field
Locate and critically evaluate scientific information
Present scientific content to a small technical audience
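As one illustration of the first objective (encoding biological data), here is a minimal sketch of one-hot encoding a DNA sequence in plain Python. The function name `one_hot_encode`, the column order A, C, G, T, and the choice to map ambiguous symbols such as N to an all-zero row are assumptions made for this example; they are not prescribed by the course.

```python
# Column order of the encoding (an assumption for this sketch).
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return one row of 0/1 values per nucleotide.

    Lowercase input is accepted. Symbols outside ALPHABET (for example
    the ambiguity code N) become an all-zero row.
    """
    rows = []
    for symbol in sequence.upper():
        rows.append([1 if symbol == letter else 0 for letter in ALPHABET])
    return rows


# Each nucleotide becomes a vector with a single 1 in its own column.
for row in one_hot_encode("acgn"):
    print(row)
```

The resulting list of rows can be passed to most numerical libraries as a matrix with one row per sequence position and one column per nucleotide.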
About me
Professional experience
1989, Honours project, implementation of a graphical user interface for a protein folding/unfolding system
1989–95, Université de Montréal, graduate studies under the direction of Guy Lapalme (IRO) and Robert Cedergren (Biochemistry), work on methods for building nucleic acids’ 3-D structures
1995–97, University of Florida, work with Steven A. Benner (Chemistry) on evolutionary-based approaches to predict protein secondary structure
1997–00, Imperial Cancer Research Fund (London/UK), work with Michael J.E. Sternberg and Stephen H. Muggleton (York) on the application of Inductive Logic Programming to automatically discover protein folding rules
2000–, University of Ottawa, work on nucleic acid secondary structure determination, motif inference and pattern matching
Learning protein structure principles (1/3)
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. In C.D. Page, editor, Proc. of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pages 53–64, Berlin, 1998. Springer-Verlag.
M. J. E. Sternberg, P. A. Bates, L. A. Kelley, R. M. MacCallum, A. Müller, S. Muggleton, and M. Turcotte. Exploiting protein structure in the post-genome era. In Intelligent Systems for Molecular Biology 1999, 1999. Oral Presentation.
Learning protein structure principles (2/3)
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. Learning protein structure principles. In The 17th Machine Intelligence Workshop, Suffolk, UK, July 19-21 2000. Oral Presentation.
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. Generating protein three-dimensional folds signatures using inductive logic programming. In 2000 Convention of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour, Birmingham, UK, April 17-20 2000. Oral Presentation.
Learning protein structure principles (3/3)
Marcel Turcotte, Stephen H. Muggleton, and Michael J. E. Sternberg. Automated discovery of structural signatures of protein fold and function. Journal of Molecular Biology, 306(3):591–605, February 2001.
Marcel Turcotte, Stephen H. Muggleton, and Michael J. E. Sternberg. Generating protein three-dimensional fold signatures using inductive logic programming. Computers & Chemistry, 26(1):57–64, December 2001.
Marcel Turcotte, Stephen H. Muggleton, and Michael J. E. Sternberg. The effect of relational background knowledge on learning of protein three-dimensional fold signatures. Machine Learning, 43(1-2):81-95, 2001.
Annotation concept synthesis
Mikhail Jiline, Stan Matwin, and Marcel Turcotte. Annotation Concept Synthesis and Enrichment Analysis. Canadian AI 2010: Advances in Artificial Intelligence, 304–308, 2010.
Mikhail Jiline, Stan Matwin, and Marcel Turcotte. Annotation Concept Synthesis and Enrichment Analysis: a Logic-Based Approach to the Interpretation of High-Throughput Experiments. Bioinformatics (Oxford, England), 27(17):2391-2398, September 2011.
Learning relationships between motifs
Oksana Korol and Marcel Turcotte. Learning relationships between over-represented motifs in a set of DNA sequences. 2012 IEEE Symposium on Computational Intelligence and Computational Biology, CIBCB 2012, 2012.
Frequent Subgraph Mining (FSM)
Alexander R. Gawronski and Marcel Turcotte. RiboFSM: Frequent subgraph mining for the discovery of RNA structures and interactions. BMC Bioinformatics, 15(S2), 2014.
Smart controls
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. WACS: Improving peak calling by optimally weighting controls. In Great Lakes Bioinformatics Conference, GLBIO 2019, May 19–22, 2019.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. WACS: Improving Peak Calling by Optimally Weighting Controls. biorxiv.org.
What is Bioinformatics?
Beginnings
“Computers and specialized software have become an essential part of the biologist’s toolkit. Either for routine DNA or protein sequence analysis or to parse meaningful information in massive gigabyte-sized biological data sets, virtually all modern research projects in biology require, to some extent, the use of computers. (...) the very beginnings of bioinformatics occurred more than 50 years ago, when desktop computers were still a hypothesis and DNA could not yet be sequenced.”
Gauthier, J., Vincent, A. T., Charette, S. J. & Derome, N. A brief history of bioinformatics. Brief Bioinform 79, 137 (2018).
A. Isaev
“Broadly speaking, bioinformatics can be defined as a collection of mathematical, statistical and computational methods for analyzing biological sequences, that is, DNA, RNA and amino acid (protein) sequences.”
In Introduction to Mathematical Methods in Bioinformatics, A. Isaev, Springer, p. i, 2006.
Lacroix and Critchlow
“Bioinformatics is the design and development of computer-based technology that supports life sciences. Using this definition bioinformatics tools and systems perform a diverse range of functions including: data collection, data mining, data analysis, data management, data integration, simulation, statistics, and visualization. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics.”
In Bioinformatics: Managing Scientific Data, Zoé Lacroix and T. Critchlow Editors, Morgan Kaufmann, p. 3, 2003.
Jones N.C. and Pevzner P. A.
“Biologists that reduce bioinformatics to simply the application of computers in biology sometimes fail to recognize the rich intellectual content of bioinformatics. Bioinformatics has become a part of modern biology and often dictates new fashions, enables new approaches, and drives further biological developments.”
In An Introduction to Bioinformatics Algorithms, Jones N.C. and Pevzner P. A., MIT Press, p. 77, 2004.
J.J. Ramsden
“In bioinformatics, so much is to be done, the raw material to hand is already so vast and vastly increasing, and the problems to be solved are so important (perhaps the most important of any science at present) we may be entering an era comparable to the great flowering of quantum mechanics in the first three decades of the twentieth century (...)”
In Bioinformatics: An introduction, J.J. Ramsden, Kluwer, p. xiii, 2004.
SIB - Swiss Institute of Bioinformatics
https://youtu.be/182AzhLiwxc
What it’s not!
Leonard Adleman (Science, December 1994) solved a particular instance of the Hamiltonian Path problem using DNA molecules!
⇒ A Hamiltonian path visits every node of a graph exactly once.
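To make the definition concrete, here is a minimal brute-force sketch in Python that searches for a Hamiltonian path in a small directed graph. The function name `find_hamiltonian_path` and the toy graph are hypothetical and chosen for illustration; this is not Adleman's instance, and the exhaustive search over all node orderings is only one simple way to state the problem in code.

```python
from itertools import permutations


def find_hamiltonian_path(nodes, edges):
    """Return a path that visits every node exactly once, or None.

    Brute force: try every ordering of the nodes and keep the first one
    whose consecutive nodes are all joined by a directed edge.
    """
    edge_set = set(edges)
    for candidate in permutations(nodes):
        if all((a, b) in edge_set for a, b in zip(candidate, candidate[1:])):
            return list(candidate)
    return None


# Hypothetical toy graph with directed edges (not Adleman's instance).
NODES = ["A", "B", "C", "D"]
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("B", "D")]

print(find_hamiltonian_path(NODES, EDGES))
```

With n nodes the loop may examine up to n! orderings, which is why the problem quickly becomes impractical for exhaustive search on a conventional computer.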
What it’s not! (continued)
DNA computing is the theoretical study of the use of DNA molecules to solve challenging problems or as a new architecture (what class of problems can be solved, what are the properties, limits, etc.).
What it’s not! (continued)
Biotechnology and biomedical engineering apply engineering approaches to problems dealing with biological systems.
Examples of biomedical engineering include developing biomedical devices for human implantation, drug delivery systems, simulation of organs and micro-fluids, medical imaging, and many more.
Bioinformatics courses on campus
http://www.bioinformatics.uottawa.ca
CSI 5126. Algorithms in bioinformatics
BNF5106 Bioinformatics (syllabus: www.bioinformatics.uottawa.ca/stephane/bnf5106.syllabus.pdf)
BCH5101 Analysis of -omics data
Collaborative programs in bioinformatics
Starting from January 2008, Carleton University and the University of Ottawa offer a Collaborative Program leading to an MSc degree with Specialization in Bioinformatics or an MSc of Computer Science degree with Specialization in Bioinformatics.
A proposal for a Ph.D. program is under review.
Most cited publications in Science
Van Noorden, R., Maher, B. & Nuzzo, R. The top 100 papers. Nature 514:550–553, 2014.
Wren, J. D. Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades. Bioinformatics 32(17):2686-91, September 2016.
Syllabus
Course information
Web sites
https://www.eecs.uottawa.ca/~turcotte/teaching/csi-5180/
https://piazza.com/uottawa.ca/fall2019/csi5180
https://uottawa.brightspace.com

Schedule
Lectures: Tuesday, 13:00 to 14:30, and Thursday, 11:30 to 13:00, MNT 103
Office hours: Tuesday from 14:30 to 16:00 at STE 5-106
Official schedule: www.uottawa.ca/course-timetable

Evaluation
30%: assignments (3)
10%: presentation (1)
20%: project (1)
40%: examinations (2)
Deadlines
Assignments
A1 - October 10, 2019, 18:00
A2 - October 31, 2019, 18:00
A3 - November 21, 2019, 18:00

Presentation
Schedule will be published on September 19, 2019
Presentations between October 1, 2019 and December 3, 2019

Project
Outline - October 1, 2019
Report - December 3, 2019

Examinations
Midterm - October 24, 2019
Final - December 5 to 18, 2019
What is Machine Learning?
“Let’s start by telling the truth: machines don’t learn. (...) just like artificial intelligence is not intelligence, machine learning is not learning.”
The Hundred-Page Machine Learning Book, Andriy Burkov, 2019
1959
Samuel, A. L. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3, 211–229 (1959).
“A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.”
“Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.”
1951
Stephen H. Muggleton. Logic and learning: Turing’s legacy. In K. Furukawa, D. Michie, and S.H. Muggleton, editors, Machine Intelligence 13, pages 37-56. Oxford University Press, 1994.
“Inspired by a radio talk given by Turing in 1951, Christopher Strachey went on to implement the world’s first machine learning program.”
Mitchell
Tom M Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
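Mitchell's definition can be illustrated with a small sketch in plain Python. Everything here is a hypothetical toy setup chosen for this example: the task T is to decide whether an integer is at least 50, the experience E is a list of labelled examples, and the performance measure P is the number of test points classified correctly. The learner places its threshold midway between the largest negative and the smallest positive example seen so far.

```python
def true_label(x):
    """Hidden concept the learner must recover: 1 when x >= 50."""
    return 1 if x >= 50 else 0


def learn_threshold(examples):
    """Place the decision threshold midway between the two classes seen so far."""
    largest_negative = max(x for x, label in examples if label == 0)
    smallest_positive = min(x for x, label in examples if label == 1)
    return (largest_negative + smallest_positive) / 2


def count_correct(threshold, test_points):
    """Performance measure P: number of test points classified correctly."""
    return sum(
        1 for x in test_points
        if (1 if x >= threshold else 0) == true_label(x)
    )


# Experience E: labelled examples, revealed to the learner a few at a time.
EXPERIENCE = [(10, 0), (60, 1), (30, 0), (80, 1), (48, 0), (52, 1)]
# Task T is evaluated on every integer from 0 to 99.
TEST_POINTS = range(100)


def learning_curve(sizes=(2, 4, 6)):
    """Score the learner after seeing the first n examples, for each n."""
    return [
        count_correct(learn_threshold(EXPERIENCE[:n]), TEST_POINTS)
        for n in sizes
    ]


print(learning_curve())  # [85, 95, 100]
```

On this particular data the score rises as more examples are seen, which is the sense in which the program "learns". The improvement is a property of this hand-picked example order, not a guarantee for every data set or learner.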
Machine learning in computational biology
Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining 10, 35 (2017).
“A machine learning algorithm is a computational method based upon statistics, implemented in software, able to discover hidden non-obvious patterns in a dataset, and moreover to make reliable statistical predictions about similar new data.”
“The ability [of machine learning] to automatically identify patterns in data [...] is particularly important when the expert knowledge is incomplete or inaccurate, when the amount of available data is too large to be handled manually, or when there are exceptions to the general cases.”
Prologue
Summary
A practical application of machine learning to biological data
Python programming skills and a love of biology are both expected
Atul Butte/Stanford at TEDMED 2012
https://youtu.be/dtNMA46YgX4
Atul Butte/Stanford at TEDx 2017
https://youtu.be/dtNMA46YgX4
Next module
Essential Cell Biology (two lectures)
References
Chunming Xu and Scott A Jackson. Machine learning and complex biological data. Genome Biol, 20(1):76, Apr 2019.
Jian Zhou, Christopher Y Park, Chandra L Theesfeld, Aaron K Wong, Yuan Yuan, Claudia Scheckel, John J Fak, Julien Funk, Kevin Yao, Yoko Tajima, Alan Packer, Robert B Darnell, and Olga G Troyanskaya. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet, 51(6):973–980, Jun 2019.
Tom M Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa