A proposed undergraduate bioinformatics curriculum for computer scientists
Crossing the Interdisciplinary Boundaries
Drs. Travis Doom (CS), Michael Raymer (CS), Dan Krane (Bio), and Oscar Garcia (CS)
This work supported by NSF grant #EIA-0122582
T. Doom, M. Raymer, D. Krane, O. Garcia 2
Overview
• What is bioinformatics?
– The genome as an information source
– Bioinformatics problems
• How do people learn bioinformatics?
• A bioinformatics curriculum
Genomic information: from genes to proteins
• TATAAGCTGACTGTCACTGA
[Figure: a DNA sequence (4 bases: A, G, C, T) read one codon at a time and translated into a protein (20 amino acids; structural or enzyme). Structure shown: 3apr.pdb]
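The codon-by-codon reading described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a coding-strand DNA string; the names `CODON_TABLE` and `translate` are hypothetical and not part of the presentation.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, stop_at_stop=True):
    """Translate a coding-strand DNA string into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)


# The sequence shown on the slide: six complete codons plus two leftover bases.
print(translate("TATAAGCTGACTGTCACTGA"))
```

Each of the 64 possible codons maps to one of 20 amino acids or to a stop signal, which is why a lookup table is enough for this step.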
Bioinformatics Problems
• Sequence alignment
– Given a gene, search a database for similar genes
• Protein folding
GCTATAATGCGTGT*CCA*CGCAGC*A*AATGC*TGTACCATCGCA
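One way to make "search a database for similar genes" concrete is to score the query against every entry with a global alignment and rank the results. The sketch below uses the Needleman-Wunsch dynamic programming recurrence with assumed scoring values (match +1, mismatch -1, gap -2); the function names and the toy database are hypothetical, and production tools use faster heuristics than this exhaustive scan.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row."""
    # Row 0 of the DP table: aligning the empty prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def search(query, database):
    """Rank database entries (name -> sequence) by score, best first."""
    scored = [
        (global_alignment_score(query, seq), name)
        for name, seq in database.items()
    ]
    return sorted(scored, reverse=True)


# Toy database with made-up entries, for illustration only.
toy_db = {"gene_a": "GATTACA", "gene_b": "GATTCA", "gene_c": "CCCCCCC"}
for score, name in search("GATTACA", toy_db):
    print(name, score)
```

Only two rows of the table are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.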
Bioinformatics Problems
• Complementarity
– Shape
– Chemical
– Electrostatic
Drug Lead Screening/Docking
The Role of Computation
• Target Identification: Pattern Recognition, Data Mining, Dynamic Programming
– Finding proteins that are likely to be related to disease & determining their active sites
– Finding genes that code for these proteins
• Finding drug leads: Databases, Parallel Systems, Graph Theory, etc.
– Database screening
– Docking
• Refining leads: Knowledge-Based & Expert Systems, AI, Pattern Recognition, Graph Theory
– Toxicology & delivery
Growth of biological databases
GenBank base pair growth, in millions of base pairs (Source: GenBank)
Year: 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
Millions: 1 2 3 5 10 16 24 35 49 72 101 157 217 385 652 1,160 2,009 3,841
3D structures growth (Source: http://www.rcsb.org/pdb/holdings.html)
The role of computation
ACGTCCGGCCTTATACGCTAATAAGCGCTCTATGTCTATACGCGCGATGCCGTACGAGACGCTAATAAGCGCTCTATGTCTATACGCGCGATGCCGTACGAG…
Please find me the genes on this chromosome associated with type II diabetes…
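A first, very rough step toward "finding the genes" in raw sequence is to scan for open reading frames (ORFs). The sketch below is a naive illustration under stated assumptions: it looks only at the forward strand, treats ATG as the sole start codon, and ignores introns, so it is closer to prokaryotic gene recognition than to anything that could link a gene to a disease. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs of ORFs on the forward strand.

    An ORF here runs from an ATG through the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Made-up toy sequence with a single ORF in the third reading frame.
toy = "AAATGGCCTTTTAAGG"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

Real gene finders add statistical models of coding sequence on top of a scan like this, which is where the pattern recognition and data mining topics named earlier come in.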
Bioinformatics Problems
• Site Recognition
– Active site
– Other binding sites
• Data Integration
– Indexing, retrieval
– Formatting
A growing research industry
[Figure: average quarterly financing, cash inflow ($M) by year, for 96, 97, 98, 99, and Qt1 00. Bar labels as extracted: 3,500; 2,007; 1,222; 1,354; 10,896. Pie charts split financing into Bioinformatics Related (B) and Other Biotech (O): <5% B versus >95% O, and 37% versus 63%, with the annotation "3x".]
Source: Ernst & Young 13th & 14th Annual Reports, Biospace.
Overview
• What is bioinformatics?
• How do people learn bioinformatics?
– Learning a bilingual discipline
– Why undergraduates?
• A bioinformatics curriculum
Bioinformatics in the US
• The demand is growing
– The National Institute of General Medical Sciences (NIGMS) has issued a report showing a critical need for researchers from other disciplines who can perform the kind of modeling and data analysis that biological researchers require.
• Graduate programs are flourishing
– Approximately 20 US universities started graduate programs in Bioinformatics last year.
– New graduate programs are being proposed at many universities across the nation.
The Problem
• Bioinformatics is interdisciplinary
– Students must possess a strong grasp of computer science fundamentals
– Students must possess a strong grasp of biochemistry to recognize and appreciate the results
– Learning to speak the languages of both fields is essential
– Learning to “think” as a bioinformatician requires training in both the scientific method and solid engineering design methodology
• We believe this can (and must) be done at the undergraduate level
The Problem
• To pursue a career or graduate study in bioinformatics, a CS student must be familiar with:
– “Classical” CS: introductory programming, data structures, formal and comparative languages (complexity and optimization algorithms), probability and statistics
– “Contemporary” CS: AI algorithms (search, optimization, list processing, pattern recognition, etc.), databases (storage, transmission, and processing of large data sets), modeling and simulation
– Biology: genetics, molecular bio, cellular bio, gene expression, replication, recombination, repair, and the experimental tools of molecular biology (~2.5 years)
– Chemistry: inorganic and organic chemistry (~2 years)
Overview
• What is bioinformatics?
• How do people learn bioinformatics?
• How are we facilitating learning in bioinformatics at Wright State University?
– NSF CISE Educational Innovation Award
– Towards an accredited undergraduate program in bioinformatics
NSF Educational Innovation
• The NSF’s Directorate for Computer and Information Science and Engineering has awarded WSU an Educational Innovation grant.
– Crossing the interdisciplinary barrier: An integrated undergraduate program in bioinformatics
– Three-year plan: Fall 2001 to Summer 2004.
– Goal: An interdisciplinary baccalaureate bioinformatics program in Computer Science at WSU to serve as a national model of excellence
The Big Picture
• Graduate programs accept students with bachelor’s degrees in either CS or Biology
– The majority of the first year of graduate study is generally consumed with remedial coursework in the other discipline
• Undergraduate programs must incorporate:
– More specific (and shorter) biology and chemistry sequences
– More focused computer science foundation
– Redesignation of traditional “core” CS with contemporary areas of IT knowledge
Goal: Integrating research
• Integrating research into the undergraduate curriculum
– Academic collaborations
– Industry collaborations for research and internship
• Why is bioinformatics a rich field for integration?
– Apply the tools to new data
– Develop new tools
Goal: Minimal New Resources
• Bio/CS 2xx – Introduction to Bioinformatics
– Tools-oriented approach to bioinformatics emphasizing data structure in DNA, string representation in Perl, data searches, pairwise alignment, substitution patterns, protein structure prediction and modeling, proteomics, and the use of web-based bioinformatics tools
• Bio/CS 4xx – Algorithms for Bioinformatics
– Theory-oriented approach to the application of contemporary algorithms to bioinformatics. Graph theory, complexity theory, dynamic programming and optimization techniques are introduced in the context of application toward solving specific computational problems in molecular genetics
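As a flavor of the string-oriented exercises such an introductory course might start with, the sketch below treats DNA as a plain string and computes two common quantities. It is written in Python for illustration, although the course description names Perl; the function names are hypothetical and are not taken from the course materials.

```python
# Translation table mapping each base to its complement (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


sequence = "TATAAGCTGACTGTCACTGA"
print(reverse_complement(sequence))
print(gc_content(sequence))
```

Representing a sequence as an ordinary string keeps the focus on the biology; the algorithms course can later replace it with structures better suited to large data sets.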
Goal: Strong CS BS program
• This degree program should produce a different but strong CS BS student:
– We use the CAC guidelines as a rule for “core” CS
– Other components developed in close collaboration with Biology and an industry panel
• CAC guidelines include:
– Algorithms, data structures, software design, programming languages (variety), computer org. & arch., discrete math, calculus, statistics, lab science, and development of oral, written, and social/ethical skills
Towards a CAC accredited program
Courses Removed
3xx-04 Digital Sys. Design
4xx-04 Concurrent Software
4xx-04 Formal Languages
4xx-04 Software Engineering
xxx-20 CS Electives package
1xx-16 Physics sequence
xxx-04 Science elective
xxx-24 Concentration reqs. (MTH/SCI/ENG)
80 QH removed
Courses Added
2xx-04 Intro. Bioinformatics
4xx-04 Artificial Intelligence
4xx-04 Algorithms for Bioinf.
4xx-04 Databases
xxx-08 Focused CS electives
1xx-15 Inorganic Chemistry
2xx-18 Organic Chemistry
xxx-29 Biology sequence
82 QH added
Towards a CAC accredited program
• 195 Total Quarter Credit Hours
– 42 General Education (as per CS)
– 66 Computer Science / Engineering (vs. 82)
• Includes AI, Databases, two new bioinformatics courses; excludes Digital System Design, Formal Languages, Software Eng., Concurrent Software
– 29 Biology (~two year sequence) (vs. 24 Concentration)
– 33 Chemistry (two year sequence) (vs. 19 MTH/Sci)
– 25 Mathematics (as per CS)
• Approved Winter 2002
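The credit-hour breakdown on this slide can be checked with a short calculation. The figures below are copied from the slide; the variable names and the derived share are illustrative only.

```python
# Quarter credit hours per category, as listed on the slide.
PROGRAM_HOURS = {
    "General Education": 42,
    "Computer Science / Engineering": 66,
    "Biology": 29,
    "Chemistry": 33,
    "Mathematics": 25,
}

total = sum(PROGRAM_HOURS.values())
cs_share = PROGRAM_HOURS["Computer Science / Engineering"] / total

print(total)               # should match the 195 total stated on the slide
print(f"{cs_share:.1%}")   # share of the degree spent on CS / Engineering
```

The five categories add up to the stated 195 quarter credit hours, with roughly a third of the degree remaining in Computer Science / Engineering.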
An undergraduate textbook
Fundamental Concepts in Bioinformatics
I. Molecular Biology and Biological Chemistry
II. Data searches and pairwise alignments
III. Substitution patterns
IV. Distance-based methods of phylogenetics
V. Character-Based approaches to phylogenetics
VI. Gene recognition: Prokaryotic Genomes
VII. Gene Recognition: Eukaryotic Genomes
VIII. Protein folding
IX. Proteomics
Appendix 1: A gentle introduction to programming & data structures
Appendix 2: Enzyme kinetics
Appendix 3: Sample programs in Perl and worksets
Questions?
http://birg.cs.wright.edu
Simplified Diagram of Modern IT & CS
[Diagram contrasting a Classical View with a Modern IT View. Labeled areas: Logic Databases, Machine Reasoning, Data Warehousing, Web Programming, WWW, Data Mining, Video on Demand, Parallelism, Human-Computer Interaction, Searching]
Three Possible Views of Bioinformatics
[Diagrams relating Computer Science and Biology, with the bases A, C, T, G shown]
• Is it Genomics in CS?
• Or is it CS in Biology?
• Or is it an independent discipline? This argues for the formation of interdisciplinary centers broader than either the bio or the informatics disciplines.
See: “Impact of Emerging Technologies on the Biological Sciences” at http://www.nsf.gov/bio/pubs/stctech/stcmain.html
Sister program in Biology
• 200 credit hour program in Biological Sciences
– 42 General Education
– 63 Biology (~four year sequence)
• Includes two new bioinformatics courses
– 28 Computer Science (~three year sequence)
– 33 Chemistry (two year sequence)
– 34 Mathematics and Physics
• Close collaboration with the department of computer science and an industrial panel
Bioinformatics Overview
• Genomics
– Emphasis on genetics, chemical and physical aspects of flow of genetic information from DNA to proteins, gene expression, replication, recombination, and repair
– Databases, Data Mining, Neural Networks, Pattern Recognition, etc.
• Proteomics
– Study of how genes make proteins. Emphasis on the structure and properties of proteins and ligands
– Molecular modeling, Pattern Recognition, Data Mining, etc.
Molecular Evolution
XLRHODOP    1           ggtagaacagcttcagttgggatcacaggcttcta 35
                         ||||||||||||||||||||||||||||||||||
XL23808  1171 tgggtcatactgtagaacagcttcagttgggatcacaggcttcta 1215
XLRHODOP   36 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 80
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1216 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 1260
XLRHODOP   81 acaagaaaggactttatagagctgctaccatgaacggaacagaag 125
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1261 acaagaaaggactttatagagctgctaccatgaacggaacagaag 1305
XLRHODOP  126 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 170
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1306 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 1350
Drug discovery life cycle
[Timeline figure, 0 to 16 years, with these stages:]
• Discovery (2 to 10 Years)
• Preclinical Testing (Lab and Animal Testing)
• Phase I (20-30 Healthy Volunteers used to check for safety and dosage)
• Phase II (100-300 Patient Volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 Patient Volunteers used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
$600-700 Million, 7-15 Years
Benefits of bioinformatics
• Every major pharmaceutical company now employs bioinformatics techniques to improve drug design (among other business aspects)
• Increased understanding of evolution at the genetic/molecular level (phylogenetics)
• Our best glimpse yet at the molecular mechanisms that regulate life at a cellular level and possibilities for simulating some aspects with a computer (basic science)