
Bioinformatics

Timothy Ketcham, Union College

Graduate Seminar, 2003


Agenda

- What is Bioinformatics?
- Goals
- Molecular Biology – Genes & Proteins
- AI Techniques applied to Gene & Protein Studies
- Molecular Biology – Phylogenetic Trees
- CS Techniques applied to Tree Estimation
- Databases
- Tools
- Results
- Discussion


What is Bioinformatics?

- Entire field of Computational Biology?
- Computational Molecular Biology?
- Application of Computer Science to Genome Analysis?


What is Bioinformatics?

Definition: …conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale. In short, Bioinformatics is an information management system for molecular biology…

Goals

- Organizing existing biological data.

- Developing tools and techniques to mine the data.

- Using the data and tools for knowledge discovery.

Genetics

- Genome
- Chromosomes
- Genes
- Nucleotides
- Base Pairs

- Key Point: The sequence of nucleotides in a gene determines its functions, and changes in the sequence can lead to major changes in those functions.
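The key point can be illustrated with a minimal sketch. The gene fragment below is hypothetical and the codon table is deliberately partial (only the codons this example needs); neither comes from the slides. The sketch shows how a single nucleotide substitution changes one amino acid in the translated protein.

```python
# Partial codon table: only the codons needed for this example (assumption).
CODON_TABLE = {
    "GTG": "V", "CAT": "H", "CTG": "L", "ACT": "T",
    "CCT": "P", "GAG": "E",
}


def translate(dna: str) -> str:
    """Translate a coding sequence three nucleotides at a time."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "GTGCATCTGACTCCTGAGGAG"             # hypothetical gene fragment
mutated = original[:16] + "T" + original[17:]  # one A -> T substitution

print(translate(original))  # VHLTPEE
print(translate(mutated))   # VHLTPVE
```

A single base change turns the sixth codon from GAG (glutamic acid) into GTG (valine), so the protein sequence differs at one position.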


Proteins

- Linear chains of amino acids
- Structural Components
  - Primary
  - Secondary
  - Tertiary
  - Quaternary

- Key Point: The four structural components, along with the chemical properties of the amino acids, determine the function of the protein.


Artificial Intelligence

- Components
  - Performance element
  - Learning element
  - Critic

- Training
- Testing
- Operation


Decision Trees

[Figure: decision tree. The root node tests Attribute 1 (Condition 1, Condition 2, Condition 3); each branch leads to a node testing Attribute 2 (Condition 1, Condition 2), whose branches end in Result 1 or Result 2.]
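As a minimal sketch of how such a tree is evaluated, the diagram can be written as nested dictionaries and walked from the root to a leaf. The attribute, condition, and result names mirror the slide's placeholders; the mapping of each condition to a specific result is an assumption made for illustration.

```python
def attribute_2_subtree():
    """Subtree that tests attribute_2 and ends in one of two results (assumed mapping)."""
    return {
        "attribute": "attribute_2",
        "branches": {"condition_1": "result_1", "condition_2": "result_2"},
    }


TREE = {
    "attribute": "attribute_1",
    "branches": {
        "condition_1": attribute_2_subtree(),
        "condition_2": attribute_2_subtree(),
        "condition_3": attribute_2_subtree(),
    },
}


def classify(node, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node


example = {"attribute_1": "condition_3", "attribute_2": "condition_2"}
print(classify(TREE, example))  # result_2
```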


Decision Trees

[Figure: a single decision node testing Attribute 2, with branches Condition 1 and Condition 2 and leaves Result 1 and Result 2.]


Neural Networks

[Figure: neural network with inputs Attribute 1 through Attribute 5, intermediate nodes labeled "fact", and a single output labeled "Decision".]


Belief Networks

[Figure: belief network connecting Attribute 1, Attribute 2 and Attribute 3 to Result 1, Result 2 and Result 3, with edges labeled by probabilities (p = 0.3, 0.7, 0.3, 0.3, 0.4, 0.5, 0.5).]


Hidden Markov Models

[Figure: hidden Markov model running from Start to End through Match, Insert and Delete states.]


Phylogenetic Trees

- Used to map evolutionary relationships

- Traditionally done at the organism level

- Mapping at molecular level can help evaluate the relationships and/or evolution of genetic structures, proteins or organisms


Tree Estimation

- Number of trees (T) for a given number of taxa (n):

$$T = \prod_{i=2}^{n} (2i - 3)$$

- T increases very rapidly (on the order of 10^8 trees for 11 taxa)

- Need efficient search methods
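To make the growth rate concrete, here is a minimal sketch that evaluates the tree count. It assumes the count refers to rooted, bifurcating trees, for which the number of topologies is the product of (2i - 3) for i from 2 to n; that reading is an assumption, chosen because it matches the orders of magnitude quoted in this talk. The function name is illustrative.

```python
def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating tree topologies for n_taxa labeled taxa."""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    total = 1
    for i in range(2, n_taxa + 1):
        total *= 2 * i - 3  # taxon i can be attached to any of 2i - 3 edges
    return total


for n in (4, 11, 20):
    print(n, count_rooted_trees(n))
```

For 11 taxa this gives 654,729,075 trees, and for 20 taxa roughly 8.2 x 10^21, which is why exhaustive enumeration stops being practical so quickly.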


Exhaustive Search

- Brute Force Method

- Algorithm
  - Create all possible trees
  - Evaluate against optimality criteria
  - Select best tree

- Only used up to 11 taxa
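The brute-force steps can be sketched as follows. The slides do not name an optimality criterion, so this sketch assumes maximum parsimony (scored with Fitch's algorithm) on rooted trees written as nested tuples; the four-taxon alignment and all function names are hypothetical.

```python
def insert_everywhere(tree, taxon):
    """Yield every tree obtained by attaching taxon to one edge of tree."""
    yield (tree, taxon)  # attach on the edge above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def all_trees(taxa):
    """Create all rooted bifurcating trees by adding one taxon at a time."""
    trees = [taxa[0]]
    for taxon in taxa[1:]:
        trees = [new for tree in trees for new in insert_everywhere(tree, taxon)]
    return trees


def fitch(tree, states):
    """Return (possible states, number of changes) for one alignment column."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony(tree, alignment):
    """Total number of changes over all columns (lower is better)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"a": "AAC", "b": "AAC", "c": "GGT", "d": "GGT"}  # hypothetical data
trees = all_trees(sorted(alignment))
best_tree = min(trees, key=lambda tree: parsimony(tree, alignment))
print(len(trees), parsimony(best_tree, alignment))  # 15 3
```

Every one of the 15 candidate trees is scored, which is exactly the cost that limits this approach to small numbers of taxa.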


Branch and Bound

- Effectively used for problems involving fewer than 20 taxa (approximately 10^22 trees)

- Algorithm
  - Establish minimally acceptable criteria
  - Evaluate all n-taxa trees, discard ones not meeting criteria
  - Evaluate n+1-taxa trees using the remaining n-taxa trees as bases
  - Repeat until all taxa have been evaluated
  - Select optimal remaining tree
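One common way to realize these steps is a depth-first variant, sketched below under stated assumptions: the optimality criterion is parsimony (Fitch's algorithm), the "minimally acceptable criteria" is the score of the best complete tree found so far, and a partial tree is discarded once its score reaches that bound, relying on the fact that a parsimony score cannot decrease as taxa are added. The data and all names are hypothetical, and this is not necessarily the exact procedure the slide had in mind.

```python
def insert_everywhere(tree, taxon):
    """Yield every tree obtained by attaching taxon to one edge of tree."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def fitch(tree, states):
    """Return (possible states, number of changes) for one alignment column."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony(tree, alignment):
    """Score only the taxa present in tree; works for partial trees too."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


def branch_and_bound(taxa, alignment):
    """Return (best tree, best score), pruning partial trees that hit the bound."""
    best = {"score": float("inf"), "tree": None}

    def extend(tree, remaining):
        score = parsimony(tree, alignment)
        if score >= best["score"]:
            return  # bound: no completion of this partial tree can do better
        if not remaining:
            best["score"], best["tree"] = score, tree
            return
        for bigger_tree in insert_everywhere(tree, remaining[0]):
            extend(bigger_tree, remaining[1:])  # branch: add the next taxon

    extend(taxa[0], taxa[1:])
    return best["tree"], best["score"]


alignment = {"a": "AAC", "b": "AAC", "c": "GGT", "d": "GGT"}  # hypothetical data
tree, score = branch_and_bound(sorted(alignment), alignment)
print(tree, score)
```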


Branch Swapping

- Used in most phylogenetic tree estimates

- Algorithm
  - Construct trees with n taxa
  - Discard all but optimal tree
  - Rearrange branches of optimal tree to check for more optimal arrangement
  - Best tree becomes base for n+1 taxa
  - Repeat for n+1 taxa


Divide and Conquer

- Subdivides the problem by finding optimal sub-trees and combining them into a super-tree

- Algorithm
  - Select a subset size (less than n)
  - Divide taxa into subsets
  - Find optimal trees for each subset of taxa
  - Combine optimal sub-trees into super-tree with all taxa


Problem

All the previous methods (except Exhaustive Search) may find a locally optimal tree rather than the globally optimal tree


Stochastic Methods

- Simulated Annealing Algorithm
  - Create trees for n taxa (based on other methods)
  - Evaluate against optimality criteria, select best
  - Evaluate remaining trees using other parameters (“cooling schedule”)
  - Tree retained is the one best meeting both optimality criteria and cooling schedule

- Allows retention of a less optimal tree in some cases, but may lead to a better global result
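The role of the cooling schedule can be sketched with a generic annealer. This is a minimal sketch, not a tree-search implementation: it assumes the standard Metropolis acceptance rule and a geometric cooling schedule, and it runs on a toy integer problem. For phylogenetics, `neighbor` would propose a rearranged tree and `cost` would be the tree's optimality score; all names and parameter values here are illustrative.

```python
import math
import random


def simulated_annealing(start, neighbor, cost, t_start=1.0, t_end=1e-3, alpha=0.95, seed=0):
    """Minimize cost, sometimes accepting a worse candidate while the temperature is high."""
    rng = random.Random(seed)
    current = best = start
    temperature = t_start
    while temperature > t_end:
        candidate = neighbor(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept a worse candidate with
        # probability exp(-delta / temperature), which shrinks as we cool.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if cost(current) < cost(best):
            best = current
        temperature *= alpha  # geometric cooling schedule (assumption)
    return best


def toy_cost(x):
    """Toy landscape with local minima: odd values carry a penalty."""
    return abs(x - 7) + (3 if x % 2 else 0)


def toy_neighbor(x, rng):
    """Propose a small random step."""
    return x + rng.choice((-1, 1))


result = simulated_annealing(0, toy_neighbor, toy_cost)
print(result, toy_cost(result))
```

Accepting an occasional worse candidate is what lets the search step out of a local optimum; as the temperature falls, the procedure behaves more and more like plain hill climbing.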


Stochastic Methods

- Genetic Algorithm
  - Create trees for n taxa (based on other methods)
  - Select a population of trees to proceed to next generation
  - Allow trees to mutate or cross over based on criteria established by designer

- Follows the Darwinian Evolution Model (Survival of the Fittest)
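The select, cross over, and mutate loop can be sketched generically. This is a minimal sketch on a toy problem (bit strings scored by their number of 1s), not on trees: for phylogenetics the individuals would be trees, the fitness a tree score, and the operators tree rearrangements. The selection rule (keep the fitter half), the operators, and all names are assumptions chosen for illustration.

```python
import random


def genetic_algorithm(population, fitness, crossover, mutate, generations=40, seed=0):
    """Evolve a population; the fitter half survives each generation."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]  # selection
        children = []
        while len(survivors) + len(children) < len(population):
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = survivors + children
    return max(population, key=fitness)


def one_point_crossover(a, b, rng):
    """Child takes the head of one parent and the tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(bits, rng):
    """Mutation: flip a single randomly chosen position."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


start_rng = random.Random(1)
initial = [tuple(start_rng.randint(0, 1) for _ in range(12)) for _ in range(8)]
best = genetic_algorithm(initial, sum, one_point_crossover, flip_one_bit)
print(best, sum(best))
```

Because the fittest individuals are carried over unchanged, the best score in the population never gets worse from one generation to the next.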


Databases

- Overwhelming amount of information available

- As of 1998, over 200 databases

- Some have well over 1,000,000 entries

- Includes sequences and metadata

- Most freely available over web


Databases

- EpoDB
  - Used for study of gene regulation of blood
  - Organized by gene, not structure
  - 10,000 entries

- GenBank
  - Operated by NIH
  - Over 18,000,000 records
  - Contains info on all publicly available DNA sequences
  - Flat file structure


Databases

- GeneCards
  - Focus on medical aspects of genetics
  - Uses metadata
  - Provides efficient navigation system to other databases

- The Genome Database
  - Official database for HGI
  - Information includes maps of gene locations, genetic structure and variations.


Databases

- PIR – International Protein Sequence Database
  - Oldest database of molecular sequence info
  - Begun in 1960s (paper based)
  - Info on protein sequences, functional and structural properties and phylogeny

- SWISS-PROT
  - Protein database (90,000 entries)
  - Links to other databases
  - Most often cited


Tools

- Search engines

- Programming languages for structured queries

- Phylogenetic Tree Analysis tools


Tools

- BLAST (Basic Local Alignment Search Tool)
  - Dominant search engine for biological sequence databases.
  - Uses an algorithm that concentrates on finding regions of high local similarity and then attempting to extend the sequence over adjacent areas.
  - Provides an estimate of the statistical significance of sequence matches.
  - Various versions
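The "find a region of high local similarity, then extend" idea can be illustrated with a heavily simplified sketch. It is not BLAST itself: it assumes exact-match seed words, ungapped extension, a simple match/mismatch score with a drop-off cutoff, and it omits the statistical significance estimate entirely. The sequences, parameter values, and function names are hypothetical.

```python
def extend(query, subject, qi, sj, step, match, mismatch, drop_off):
    """Extend in one direction; return (best extra score, length of that extension)."""
    score = best_score = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > drop_off:
            break  # score fell too far below the best seen; stop extending
        qi += step
        sj += step
    return best_score, best_length


def seed_and_extend(query, subject, word_size=3, match=1, mismatch=-1, drop_off=2):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    # Index every word of the subject by its start position.
    positions = {}
    for j in range(len(subject) - word_size + 1):
        positions.setdefault(subject[j:j + word_size], []).append(j)

    best = (0, 0, 0, 0)
    for i in range(len(query) - word_size + 1):
        for j in positions.get(query[i:i + word_size], []):  # exact seed match
            left, left_len = extend(query, subject, i - 1, j - 1, -1,
                                    match, mismatch, drop_off)
            right, right_len = extend(query, subject, i + word_size, j + word_size, 1,
                                      match, mismatch, drop_off)
            hit = (word_size * match + left + right,
                   i - left_len, j - left_len,
                   word_size + left_len + right_len)
            best = max(best, hit, key=lambda h: h[0])
    return best


print(seed_and_extend("ACGTTGCA", "GGACGTTGCATT"))  # (8, 0, 2, 8)
```

Indexing short words first means only promising positions are ever extended, which is the source of the speed advantage over aligning the query against every position of every database sequence.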


Tools

- Entrez
  - Search and retrieval system at National Center for Biotechnology Information (NCBI)
  - Searches all databases at NCBI for information on nucleotide and protein sequences, macromolecular structures and whole genomes.
  - User defined custom search strategies
  - Frequently cited


Tools

- Kleisli
  - Integrated data management system
  - Functional programming language (CPL)
  - Built in data types – user extensible
  - Extends Flat and Relational DBs to OODB
  - Works with Sybase, ORACLE, Entrez & BLAST


Tools

- PHYLIP (Phylogeny Inference Package)
  - Collection of tools for developing trees
  - Works with proteins and genes
  - Uses branch and bound & branch swapping techniques.
  - Created in 1980 (lots of citations)
  - Freely available on web (both source code & executables)


Tools

- SMART (Simple Modular Architecture Research Tool)
  - Analyzes protein sequences
  - Can identify more than 400 structural families
  - Information on phylogeny, function and structure
  - Uses Hidden Markov Models
  - Web-based


Human Genome Project

- Requires identifying and decoding 35,000 genes

- From 2,000 – 2,000,000 base pairs per gene

- First draft (~90% of base pairs) in 2001

- Recently published 4th chromosome map (87,000,000 base pairs)

- Expect to complete in April, 2003


Other Work

- HIV-1 Genome Mutation Detection

- Link between Neuregulin-1 and Schizophrenia

- MLP and Cardiomyopathy Link


Other Work

- Study to Identify Genetic & Environmental Disease Causes

- “in silico” Biology

Discussion

- What level of domain knowledge is needed for IT professionals working in Bioinformatics?

- What courses would be needed in a Bioinformatics curriculum?

- Is a Bioethics course needed for IT professionals working in the field?