Dina El-Khishin
Dina El-Khishin (Ph.D.)
Deputy Director of AGERI &
Head of the Genomics, Proteomics & Bioinformatics Research Facility
Agricultural Genetic Engineering Research Institute (AGERI)
Giza
EGYPT
Bioinformatics
Bibliotheca Alexandrina
December 2007
Assumptions
Use of internet browsers and computers
PC with Microsoft Windows
Internet connection
Background in molecular biology
What’s in the name?
Sequence Analysis
Database Homology Searching
Multiple Sequence Alignment
Homology Modeling
Docking
Protein Analysis
Proteomics
3D Modeling
Sample Registration & Tracking
Integrated Data Repositories
Common Visual Interfaces
Intellectual Property Auditing
BioInformatics
Genome Mapping
What is Bioinformatics?
• Computerized annotation of genomic and biological information and data (databases).
• Transformation and manipulation of these data (software tools).
• Computational analysis of biological data.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology
Here we will consider the use of Bioinformatics tools rather than their design and construction
Here we will consider the access and analysis of data and information items rather than their generation, storage or annotation
Typical Bioinformatics: Multi-Disciplinary
• Scientists
– Experimental Design & Interpretation
– Laboratory Protocols & Standards/Controls
• Mathematicians
– Analysis & Correlation of Data
– Validation methodologies
• Computer Scientists
– Information Storage / Controlled Vocabulary
– Data Mining
Bioinformatics
USERS
• of Information
• of Tools
• of Instrumentation
• In-Silico Modeling
INTERPRETERS
• of Information
DEVELOPERS
• of Information
• of Tools
• of Instrumentation
• of Architecture/Storage
• Algorithms
• Modeling Strategies
• Visualization
Overall Aim of Bioinformatics:
To provide biologically important predictions from annotated data and from the transformation and manipulation of these data.
Bioinformatics
• SCOPE: Acquisition, processing, storage, distribution, analysis and interpretation of biological information
• TECHNIQUES: Mathematics, computer science, biology
• OBJECTIVE: Understanding the biological significance of biological data
Bioinformatics Databases
• Nucleotide and protein sequences.
• Protein structures.
• All sorts of functional data related to genes, proteins and their regulation, interactions etc.
• Curated and non-curated databases.
Software tools (computer programs):
• Sequence analysis
• Database construction and management
• Evolutionary relations
• Structural analyses
• Pathways
• Microarray analysis
• Proteomic analysis
• Software tools integrated into databases
The Need for Bioinformatics:
• Whole Genome Analyses and Sequences.
• Experimental analyses for thousands of genes simultaneously.
• DNA Chips and Array Analyses.
– Expression Arrays.
– Comparative Analyses between Species and Strains.
• Proteomics: 'Proteome' of an Organism ... 2D gels, Mass Spec.
• Medical applications: Genetic Disease ... SNPs.
– Pharmaceutical and Biotech Industry.
• Forensic applications.
• Agricultural applications.
Main Bioinformatics Applications
• Comparison and analysis of nucleotide and protein sequences.
• Comparison and analysis of molecular structures (especially proteins).
– Results of comparisons can be used in evolutionary and phylogenetic studies.
• Data mining of genomic data (large-scale gene expression results etc.).
Main Goals
• Introduction to biological databases.
– Good knowledge of major databases.
– An overview of the large variety of minor databases.
• Learning to use tools at major database sites.
– GenBank/NCBI, ExPASy/SwissProt, PDB.
• Introduction to sequence searching and sequence alignments.
– The tools with most practical "everyday" usability.
DNA
• Sequence Submission
• Sequence Alignments (Pairwise and Multiple)
• Scoring Matrices
• Motifs and Patterns
• Genes, Exons, and Introns
• Promoters, Transcription-factor-binding Sites
• Other Regulatory Sites
RNA
• Secondary Structure
• RNA-specifying Genes, Motifs
Protein
• Sequence Alignment
• Motifs, Patterns, and Profiles
II. Sequence Databases and Their Use:
A. Primary Sequence Databases:
Nucleic Acid Databases
• NCBI (National Center for Biotechnology Information) - GenBank
• http://www.ncbi.nlm.nih.gov/
• EBI (European Bioinformatics Institute) - EMBL
• http://www.ebi.ac.uk/
• DISC - DNA Information and Stock Center, Japan
• http://www.dna.affrc.go.jp/
Protein Databases
• NCBI - GenPept
• http://www.ncbi.nlm.nih.gov/
• ExPASy - SwissProt and TrEMBL
• http://www.expasy.ch/
• EBI (European Bioinformatics Institute)
• SwissProt, TrEMBL, PIR
• http://www.ebi.ac.uk/
• DISC - DNA Information and Stock Center, Japan
• http://www.dna.affrc.go.jp/
B. Uses of Sequence Databases:
• Information Retrieval
• Analysis: "given a new DNA sequence, what's in it?"
• Finding Homologues
• Finding Genes
• Finding Motifs - DNA Binding Sites
C. Retrieve Info from Sequence Databases:
• NCBI - Entrez
– http://www.ncbi.nlm.nih.gov/Entrez/
– Types of Databases Available
– Entrez Help
– Retrieve Large Data Sets
• ExPASy - SwissProt and TrEMBL
– http://www.expasy.ch/
– SwissProt - Bairoch's well-annotated, non-redundant protein DB
– TrEMBL - Translation of EMBL DNA coding sequences
• EBI - SwissProt, TrEMBL, PIR
– http://www.ebi.ac.uk/
– SRS - Sequence Retrieval System
– Software Tools - FASTA, WU-Blast2, ClustalW
– EBI2 - second server at EBI
• DISC - DNA Information and Stock Center, Japan - DDBJ
– http://www.dna.affrc.go.jp/
– SRS - Sequence Retrieval System
– Software Tools - FASTA, BLAST, MpSrch
• PDB - Protein DataBank
– http://www.rcsb.org/pdb/
– Protein 3D Structure database
D. Sequence Analysis: finding Homologues
• Homologues - sequences descending from a common ancestor
• Comparison of Sequences using Distance Matrix approach
• DOT PLOTS - 2D graph of alignment of two sequences
• FASTA - fast database search tool of Pearson and Lipman
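The dot plot idea can be illustrated with a short Python sketch. It is only a minimal illustration, not the method used by any particular program: the function name `dot_plot` and the `window` parameter are assumptions made for this example. A '*' is printed wherever a short window of one sequence matches the other exactly, so that similar regions show up as diagonal runs.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the rows of a text dot plot.

    Row i, column j holds '*' when the window of seq_a starting at i
    is identical to the window of seq_b starting at j, otherwise '.'.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    # A repeated segment appears as two parallel diagonals.
    for line in dot_plot("GATTACAGATTACA", "GATTACA"):
        print(line)
```

A larger window suppresses chance matches at the cost of missing short similarities.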
BLAST - Basic Local Alignment Search Tool
http://www.ncbi.nlm.nih.gov/BLAST/
- BLASTN - NA query vs. NA database
- BLASTP - Protein query vs. Protein database
- BLASTX - NA query (translated) vs. Protein database
- TBLASTN - Protein query vs. NA (translated) database
- TBLASTX - NA query (translated) vs. NA (translated) database
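The "translated" searches compare sequences at the protein level after translating the nucleic acid in all six reading frames. The Python sketch below illustrates that translation step only, under the assumption of the standard genetic code; it is not BLAST's own implementation, and the names `translate` and `six_frames` are chosen for this example.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six protein strings would then be compared against the protein database, which is why translated searches are slower than BLASTN or BLASTP.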
E. Sequence Analysis: finding Genes in DNA
Methods:
• Gene Search by Signal
– Look for Signals - Promoter Sites, Splice Sites, ...
• Gene Search by Content
– Open Reading Frame
– Use of Statistical Properties of Protein Coding Regions: unequal use of amino acids; unequal numbers of codons per amino acid; available codons not equally used (Codon Usage)
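A gene search by content can be sketched in Python as follows. This is a simplified illustration under stated assumptions: only the forward strand is scanned, an ORF is taken to run from an ATG to the next in-frame stop codon, and the names `find_orfs`, `codon_usage` and `min_codons` are invented for the example. Real gene finders combine such content measures with signal information.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the ATG, end is one past the stop codon,
    and min_codons is the minimum number of codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def codon_usage(orf):
    """Count how often each codon occurs in an in-frame coding sequence."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
```

Comparing the codon counts of a candidate ORF with the known codon preferences of the organism is one way to judge whether the ORF is likely to be a real coding region.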
F. Sequence Analysis: finding Motifs
Motifs:
• Motif - a recurrent thematic element
• Structural motifs - pieces of folded 3D structure
• Sequence motifs - conserved "blocks" of sequences
DNA Motifs:
• Protein binding sites ... regulatory elements
• Relatively short
• Statistically difficult
• Cooperative binding often important
• Structural elements may be important - bends, kinks
Protein Motifs:
•Secondary structure - alpha helices, beta sheets
•Super secondary structure - 4 helix bundle, etc.
Basic Methods:
•Consensus sequence - single, best sequence
•Regular Expression - multiple characters per site
•Weight Matrix - any character per site, with score - Profile
•Hidden Markov Model
Protein Family Classifications
Prosite:
• Database of protein families and domains - at ExPASy and elsewhere
• Regular expressions (Patterns) and Profiles
• Programs
– Search Prosite for Pattern or Profile
– ScanProsite - scan a sequence against ProSite, or a pattern against SwissProt
– ProfileScan - scan a sequence against Profile Database
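Scanning a sequence against a Prosite-style pattern can be sketched with Python's regular expressions. The converter below is a minimal illustration, not the ScanProsite program: it assumes a simple subset of the pattern syntax (elements joined by '-', 'x' for any residue, '[...]' for allowed residues, '{...}' for excluded residues, '(n)' or '(n,m)' for repeats) and the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple Prosite-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")
    regex = regex.replace("x", ".")
    # Excluded residues {P} become a negated character class [^P].
    regex = regex.replace("{", "[^").replace("}", "]")
    # Repeat counts (2,4) become regex quantifiers {2,4}.
    regex = regex.replace("(", "{").replace(")", "}")
    # Sequence-end anchors.
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def find_motifs(protein, pattern):
    """Return (position, matched text) pairs, including overlapping hits."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping matches be reported.
    return [(m.start(), m.group(1))
            for m in re.finditer(f"(?=({regex}))", protein)]


if __name__ == "__main__":
    # N-glycosylation site pattern: N, not P, S or T, not P.
    print(find_motifs("AANGSAANPSA", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match very often by chance, which is one reason profiles and weight matrices are used alongside regular expressions.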
G. Multiple Sequence Analysis
Basics
• Progressive Sequence Alignment
– Pairwise alignment of most similar, then next most similar, etc.
• Steps
– Do pairwise alignment for all sequences
– Get Matrix of approximate Distances between each pair
– Create an approximate phylogenetic tree - Guide Tree
– Use this to determine order of addition of sequences to alignment
– Align: two sequences; seq. to sub-alignment; two sub-alignments
– Keep GAPS that appear early - 'Once a gap, always a gap'
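The first two steps (pairwise alignment of all sequences, then a matrix of approximate distances) can be sketched in Python. This is an illustrative sketch only: it uses a basic global alignment with a simple match/mismatch/gap score and defines distance as the fraction of differing alignment columns, which are assumptions of this example and not the scoring used by Clustal W or any other program.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences by dynamic programming; return both rows."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def pairwise_distance(a, b):
    """Fraction of alignment columns that differ (gaps count as differences)."""
    row_a, row_b = global_align(a, b)
    differing = sum(1 for x, y in zip(row_a, row_b) if x != y)
    return differing / len(row_a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances for a list of sequences."""
    return [[pairwise_distance(s, t) for t in seqs] for s in seqs]
```

The resulting matrix is the input from which a guide tree would be built to decide the order in which sequences are added to the alignment.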
•Web sites for Multiple Sequence Alignment
Clustal W:
• Weighting - different weights given to unequally sampled sequences
• Position Dependent Weights
• Position-Specific Gap Penalties (Opening vs Extension)
• Sequence Weighting
• Weights for Adding New Sequences to existing Alignment - extra weight to sequences most similar to alignment
• Clustal W Servers
Other Web Programs:
• MAP, PIMA, MSA
• Many others available
Web Databases of Multiple Sequence Alignments
•Fold Classification via Structure-Structure Protein alignments (FSSP)
•Homology derived Secondary Structure Assignments (HSSP)
•Database of Secondary Structure Assignments (DSSP)
H. Phylogenetics
Basics:
• Trees - Rooted vs Unrooted
• Rooted Tree - position of Ancestor is known
• Unrooted Tree - no Ancestral Node
• Topology - Branching Pattern of the Tree
Terminology
1, 2, 3, 4, 5: Taxa or External Nodes
X, Y, Z: Internal Nodes
R1: Root
a, b, c, d, e: External Branches
f, g: Internal Branches
h: Internal Branch ONLY IF tree is Rooted; else h is part of the Outgroup: Taxon 5 ... used to "root trees"
Methods:
• Distance Matrix methods
– UPGMA - Unweighted Pair Group Method with Arithmetic Mean: fixed 'clock', averages used to get distances
– Fitch & Margoliash - 3 branches calculated at a time
– Neighbor Joining - pairs of taxa, finding closest pair; tree with smallest sum of Branch Lengths
– Other methods also available
• Parsimony methods
– Find tree with fewest inferred mutations
– Programs: PHYLIP package; PAUP
• Maximum Likelihood methods
– Use a mathematical model of the process of evolution
– Model contains a parameter which is used to Maximize the Likelihood that observed changes took place
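As an illustration of a distance matrix method, the UPGMA idea can be sketched in Python: repeatedly join the two closest clusters and average their distances to all other clusters, weighted by cluster size. This is a minimal sketch that returns only the tree topology as nested tuples and does not compute branch lengths; the function name `upgma` and the input layout are assumptions of the example, not the PHYLIP implementation.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns the tree topology as nested tuples.
    """
    # Each cluster id maps to (subtree, number of taxa in it).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

Because UPGMA assumes a fixed 'clock', it can give a misleading topology when lineages evolve at different rates; Neighbor Joining does not make that assumption.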
Confidence - "How good is the Tree?"
•Bootstrap - resampling (with replacement) of the columns of the sequence alignment
-How robust is the tree to such resampling? Is it always the same tree?
•How much better is this "best" tree than other trees?
-Use set of "User defined" Trees ... how good is each?
-PHYLIP programs
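The resampling step behind the bootstrap can be sketched in Python. The sketch assumes the sequences are already aligned (all rows of equal length) and draws alignment columns at random with replacement; the function names and the `seed` parameter are chosen for this example. Building a tree from each replicate and counting how often each grouping recurs is left to the tree-building program.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: columns drawn at random with replacement."""
    n_cols = len(alignment[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same columns are used for every sequence, so columns stay intact.
    return ["".join(seq[c] for c in picked) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments of the same size as the input."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

A grouping that appears in most of the replicate trees is considered well supported by the data.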
III. Whole Genomes
A. Implications
TOTAL information on Heritable Properties of an Organism
What an Organism CAN do ... and CAN NOT do ...
Major step toward Understanding an Organism and toward making Biology a PREDICTIVE SCIENCE
Current: identify Genes, predict Function
Next:
Deduce Life Style of the Organism
Predict Metabolic and Genetic Pathways
Predict Adaptive Responses, Developmental Pathways
ORGANISM DATABASES
B. TIGR
The Institute for Genomic Research
•First to Sequence whole Genome of Free-living Organism
•Sequenced the first Three Eubacteria and First Two Archaea
•TIGR Database (TDB) - links to specific organisms
IV. Organisms and Other Databases
A. Need for Organism Databases
• Direct result of Genome Physical Mapping efforts
• Need for Maps, Genes, Sequences, References
• Incomplete Genome Information plus other Information
• NOW: Complete Genome Information
B. Web Organism Databases
ACeDB - A C. elegans Data Base
• Created by Durbin and Thierry-Mieg for Sulston R.Mapping Program
• Over 40 organisms represented in ACeDB databases
• Highly variable Types of Information in each
• Examples: C. elegans, yeast, fly, grains, Arabidopsis, human chromosomes
• Saccharomyces Genome Database (SGD)
• Basic database is Web enhancement over ACeDB SacchDB
• Excellent interface to yeast genome maps:
• Many resources including analysis tools
• BLAST and FASTA facilities
• SacchDB extended to include
• Genome Deletion Project
• Yeast Evolution Project
• Sacch3D - protein 3D structure information
• Worm and Mammalian Homology to Yeast
• Yeast SAGE data
•The Arabidopsis Information Resource (TAIR): Arabidopsis thaliana
•Database based on Oracle relational database system
•Much underlying information from ACeDB AatDB
•Analysis tools and Viewers, including BLAST and FASTA
•Arabidopsis Genome Initiative (AGI)
•PlantsP: Plant Phosphorylation Proteins (kinases, phosphatases)
•underlying MySQL database
•display and usage is Web based
•many other resources, links, download, etc
Berkeley Drosophila Genome Project (BDGP)
• Outgrowth of Encyclopedia of Drosophila (EofD)
• Excellent Map Viewers - largely Java applets
• Example: CytoView
• Includes FlyBase, ACeDB database of Drosophila
• Mouse Genome Informatics (MGI)
• Integrated access to mouse genetics and biology
• Mouse Genome Database (MGD)
• Mouse Gene Expression Database (GXD)
• Encyclopedia of the Mouse Genome
• links to
• Mouse Tumor Biology database
• Rat Data resource
Human Genome Resources at NCBI
•Information and links to Human Genome Project
•Human Genes
•OMIM - Online Mendelian Inheritance in Man
-McKusick catalog of human genes and disorders
-Over 10,000 entries
•LocusLink - single interface to all human locus info
•Human/Mouse Homology Relationships
•Examples of Info on Candidate Human Genes for Hypertension
VI. Problems ... Directions to Go
A. Problems:
• Sequence DBs and Others are Flat File Databases
– one piece of information at a time
• Analysis Tools are largely Single Task oriented
– from Task to Task, User must make Decisions
• Automate Basic Analytical Tasks for new DNA Sequences
– This is currently done in some facilities and in some expensive commercial packages
– Examples: Pangea, Incyte
B. Need: "smart" Analysis Packages
• Need "smart" Analysis Packages that can "learn" from DB info.
• "predict" next best options for User.
• Analysis: DNA seq --> gene --> protein --> motifs --> 3D structure.
Basic Problem with Biology becoming a "Predictive Science"
1-Large number of Different Molecules, e.g. Proteins
-Large Variety per Cell
-Variety Changes with Type of Cell in Organism
2-Often a Small number of Each Molecule
Thus: Statistical Analysis is often not Appropriate
The Potential of Bioinformatics
The new paradigm is that all the genes will be known, "resident in databases available electronically".
The starting point of a biological investigation will be theoretical.
Scientists will start with a theoretical conjecture and only then turn to experiment to follow or test the hypothesis.
Bioinformatics scientists have developed new techniques to analyze genes on an industrial scale, resulting in a new area of science known as 'Genomics'.
Genomics is revolutionizing our entire approach to science.
Gene Discovery Informatics
[Figure: gene discovery workflow diagram; only the box labels are recoverable.]
Workflow boxes: Microdissection; Create DNA Libraries; Signature Hybridization; Clustering by Signature; Expression Profiles; Differential Expression; DNA Sequencing; Gene Assignments; Functional Predictions; MicroArrays; Functional Assays; Small Molecule Drugs; Tissues & Cell Lines; In situ Hybridization
Database boxes: Clones Database; DNA Libraries Database; Annotated Sequence Database; Assays & Validation Database; Clustering Database; Tissue & Cell Lines Database; Small Molecule Database; MicroArray Database; In Situ Hybridization
Thank you