9/30/2004 TCSS588A Isabelle Bichindaritz 1
Introduction to Bioinformatics
Introduction to Class
• Syllabus
• Schedule
• Web site: http://courses.washington.edu/tcss588
• Assignments:
  – An application to genetics
  – An application to proteomics
  – …
• Project
  – Project teams (proposal due next week)
Introduction to Class
1. Biological foundations.
2. Machine learning algorithms and applications to biology/life sciences.
3. Neural networks.
4. Hidden Markov Models.
5. Graphical models.
6. Case Based Reasoning.
7. Phylogenetic trees induction.
8. Microarrays and gene expression.
9. Image understanding and mining.
10. Biometrics.
Introduction to Class
Day  Date   Subject                                               Pre-reading
R    9/30   Introduction to Bioinformatics and the Life Sciences  Chapter 1
T    10/5   Probabilistic Framework                               Chapter 2
R    10/7   Probabilistic Inference                               Chapter 3
T    10/12  Machine Learning Algorithms (Part I)                  4.1-4.4
R    10/14  Machine Learning Algorithms (Part II)                 4.5-4.8
T    10/19  Neural Networks Theory                                Chapter 5
R    10/21  Neural Networks Applications                          Chapter 6
T    10/26  Hidden Markov Models Theory                           Chapter 7
R    10/28  Hidden Markov Models Applications                     Chapter 8
T    11/2   Graphical Models (Part I)                             9.1-9.4
R    11/4   Graphical Models (Part II)                            9.5-9.6
T    11/9   Case Based Reasoning                                  Handout
R    11/11  Veterans Day Holiday
T    11/16  Future Trends Discussion / MIDTERM
R    11/18  Phylogenetic Trees Induction                          Chapter 10
T    11/23  Microarrays and Gene Expression                       Chapter 12
R    11/25  Thanksgiving Holiday
T    11/30  Image Understanding                                   Handout
R    12/2   Image Mining                                          Handout
T    12/7   Biometrics                                            Handout
R    12/9   Future Perspectives Discussion / FINAL
R    12/16  FINAL PROJECT PRESENTATIONS in CP 106, 5:00P–7:15P
Course Learning Objectives
• Understand biological concepts and the associated set of problems.
• Understand the scientific framework for bioinformatics in statistics, complexity, and information theory.
• Understand machine learning methods for bioinformatics.
• Understand innovative algorithms and methods for bioinformatics.
• Program using available bioinformatics tools.
• Gain familiarity with statistical learning, concept learning, hidden Markov models, case based reasoning, neural networks, knowledge-based systems and ontologies, genetic algorithms, stochastic grammars and linguistics, grid computing, and the semantic Web.
• Design and develop new computer systems for bioinformatics.
Outline
• Informatics / Medical Informatics / Bioinformatics / Computational Biology
• Project examples
  – Care Partner
  – Telemakus
  – Phylsyst
  – Human Genome Project
• Introduction to biology
Informatics / Medical Informatics
• Informatics is “The science of rational and computerized processing of information as it supports human knowledge and communication in scientific, technical, economical, and social domains.”
• Often associated with health care and medical research applications: medical informatics.
• Interdisciplinary field involving medicine, biology, computer science, mathematics, information science, and statistics.
Medical Informatics
• Computer Applications in Health Care
  1. communication and telematics
  2. storage and retrieval
  3. processing and automation
  4. diagnosis and decision making
  5. therapy and control
  6. research and development
  (increasing level of complexity from 1 to 6)
Bioinformatics
• Bioinformatics is the discipline that develops technologies for supporting information management in fields like biology.
• Target domains: biology, medicine, pharmacology, agriculture …
• Interdisciplinary field.
• Main tasks: analyze biological sequence data, genome content, and genome arrangement; predict the function and structure of macromolecules.
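As a small illustration of what "analyzing biological sequence data" can look like in practice, the sketch below computes the GC content and the reverse complement of a short DNA string. The function names and the example sequence are invented for illustration and do not come from the course materials.

```python
# Minimal sketch of two basic DNA sequence operations.
# Function names and the sample sequence are hypothetical.

# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    dna = "ATGCGC"
    print(gc_content(dna))
    print(reverse_complement(dna))
```

Real analyses would normally rely on an established toolkit and handle ambiguous bases such as N, which this sketch does not.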
Computational Biology
• Computational biology provides algorithms for bioinformatics.
• Target applications:
  – Genomics: DNA, genes
  – Proteomics: proteins
  – Phylogenetics: evolutionary classifications
Care Partner System Description
• A decision support system for stem cell post-transplant care:
  – comprehensive knowledge-base (scientific literature, monographs, clinical guidelines, clinical pathways, clinical cases)
  – available on the WWW
  – learns from experience
Knowledge-Base
N            LTFU CDSS  SNOMED v. 3.4
Diseases     1109       35,834
Functions    452        19,221
Labs         1152       30,723
Procedures   547        20,105
Medications  2684       14,846
Sites        460        5,875
Knowledge-Base
N              CDSS
Terms          739,439
Relations      51

N              CDSS
Patient cases  4904
Telemakus
• Goal of the Telemakus System:
  – to enhance the knowledge discovery process by developing retrieval, visual, and interaction tools to mine and map research findings from the research literature.
• Objective of the research:
  – to create, test, and validate an infrastructure to permit the automation of the creation and maintenance of a searchable database that generates knowledge maps via query tools and concept mapping algorithms.
  – to apply natural language processing models and information analysis methods to ultimately speed up the scientific discovery process.
Phylsyst
• Example – Phylsyst built cladogram
clado1
Level 1 01-10 Doublon split on characters: 8 12 27
  Level 1 values: 8(0) 12(1) 27(1)
    Level 2 01-10 Doublon split on characters: 18 29 25
      Level 2 values: 18(0) 29(1) 25(1)
        Taxon Diphylleia
      Level 2 values: 18(1) 29(0) 25(1)
        Level 3 01-10 Doublon split on characters: 14 17
          Level 3 values: 14(0) 17(1)
            Taxon: Dysosma
  Level 1 values: 8(1) 12(0) 27(0)
    Level 2 01-10 Doublon split on characters: 16 29 30 19
      Level 2 values: 16(0) 29(1) 30(0) 19(0)
        Level 3 00-11 Doublon split on characters: 1 7 33 25 23 13 11
          Level 3 values: 1(0) 7(0) 33(0) 25(0) 23(0) 13(0) 11(0) 10(0)
            Level 4 Agglom. Split
              Taxon: Berberis
              Taxon: Mahonia
          Level 3 values: 1(1) 7(1) 33(1) 25(1) 23(1) 13(1) 11(1) 10(1)
            Taxon: Ranzania
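One way to hold a cladogram like this in a program is as a nested tree whose leaves are taxa. The sketch below encodes only the branches visible in the printout, with nesting inferred from the level numbers (an interpretation, since the slide does not show the layout explicitly). The `Node` class and `taxa` function are illustrative assumptions, not Phylsyst's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A cladogram node: a branch labelled with its character values,
    or a leaf (no children) labelled with a taxon name."""
    label: str
    children: List["Node"] = field(default_factory=list)


def taxa(node: Node) -> List[str]:
    """Collect the taxon names under node, depth-first, left to right."""
    if not node.children:
        return [node.label]
    found: List[str] = []
    for child in node.children:
        found.extend(taxa(child))
    return found


# Branch structure as read from the printout (nesting is an assumption).
clado1 = Node("clado1", [
    Node("8(0) 12(1) 27(1)", [
        Node("18(0) 29(1) 25(1)", [Node("Diphylleia")]),
        Node("18(1) 29(0) 25(1)", [
            Node("14(0) 17(1)", [Node("Dysosma")]),
        ]),
    ]),
    Node("8(1) 12(0) 27(0)", [
        Node("16(0) 29(1) 30(0) 19(0)", [
            Node("1(0) 7(0) 33(0) 25(0) 23(0) 13(0) 11(0) 10(0)", [
                Node("Agglom. split", [Node("Berberis"), Node("Mahonia")]),
            ]),
            Node("1(1) 7(1) 33(1) 25(1) 23(1) 13(1) 11(1) 10(1)",
                 [Node("Ranzania")]),
        ]),
    ]),
])

if __name__ == "__main__":
    print(taxa(clado1))
```

Keeping the character values on the branches means each taxon can be traced back to the splits that separate it from its neighbours.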
Human Genome Project
• Goals of the Human Genome Project:
  – identify all of the approximately 30,000 genes in human DNA,
  – determine the sequences of the 3 billion chemical base pairs that make up human DNA,
  – store this information in databases,
  – improve tools for data analysis,
  – transfer related technologies to the private sector, and
  – address the ethical, legal, and social issues (ELSI) that may arise from the project.
• Completed in 2003
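To give a sense of scale for storing 3 billion base pairs, here is a back-of-the-envelope calculation. It assumes each base (A, C, G, or T) is packed into 2 bits, which is an illustrative assumption and not how the project's databases store sequences. Real formats also carry annotations and quality data, so treat the result as a rough lower bound.

```python
# Rough storage estimate for a genome, assuming 2 bits per base.
# The packing scheme and function name are assumptions for illustration.

BASE_PAIRS = 3_000_000_000  # figure quoted on the slide
BITS_PER_BASE = 2           # four possible bases fit in 2 bits


def packed_size_bytes(n_bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store n_bases, rounding up to a whole byte."""
    return (n_bases * bits_per_base + 7) // 8


if __name__ == "__main__":
    size = packed_size_bytes(BASE_PAIRS)
    print(f"{size / 1e6:.0f} MB")
```

At one byte per base, as in a plain text file, the same sequence would take about four times as much space.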
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
The Human Genome Project
• The Human Genome Project
The Visible Human Project
• Image understanding – the Visible Human Project