bioinformatics - what and why? the following power point presentation is designed to give some...
TRANSCRIPT
BioInformatics - What and Why?BioInformatics - What and Why?
The following power point presentation is designed to give
some background information on Bioinformatics.
This presentation is modified from information supplied by Dr. Bruno Gaeta, and with permission from eBioInformatics Pty
Ltd (c) Copywright
The need for bioinformaticists. The number of entries in data bases of gene sequences is increasing exponentially. Bioinformaticians are needed to understand and use this information.
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
GenBank growth
Genome sequencing projects, including the human genome project are producing vast amounts of information. The challenge is to use this information in a useful way
Genome sequencing projects, including the human genome project are producing vast amounts of information. The challenge is to use this information in a useful way
COMPLETE/PUBLIC
Aquifex aeolicus
Pyrococcus horikoshii
Bacillus subtilis
Treponema pallidum
Borrelia burgdorferi
Helicobacter pylori
Archaeoglobus fulgidus
Methanobacterium thermo.
Escherichia coli
Mycoplasma pneumoniae
Synechocystis sp. PCC6803
Methanococcus jannaschii
Saccharomyces cerevisiae
Mycoplasma genitalium
Haemophilus influenzae
COMPLETE/PENDING PUBLICATIONRickettsia prowazekii Pseudomonas aeruginosa
Pyrococcus abyssii
Bacillus sp. C-125
Ureaplasma urealyticum
Pyrobaculum aerophilum
ALMOST/PUBLIC
Pyrococcus furiosus
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis CSU93
Neisseria gonorrhea
Neisseria meningiditis
Streptococcus pyogenes
Terry Gaasterland, Siv Andersson, Christoph Sensenhttp://www.mcs.anl.gov/home/gaasterl/genomes.html
Publically available genomes (April 1998)
”..We must hook our individual computers into the worldwide network that gives us access to daily changes in the databases and also makes immediate our communications with each other. The programs that display and analyze the material for us must be improved - and we must learn to use them more effectively. Like the purchased kits, they will make our life easier, but also like the kits, we must understand enough of how they work to use them effectively…”
Walter Gilbert (1991) “Towards a paradigm shift in biology” Nature News and Views 349:99
Bioinformatics impacts on all aspects of biological research.
Promises of genomics and bioinformatics Promises of genomics and bioinformatics
Medicine Knowledge of protein structure facilitates drug design Understanding of genomic variation allows the tailoring
of medical treatment to the individual’s genetic make-up Genome analysis allows the targeting of genetic
diseases The effect of a disease or of a therapeutic on RNA and
protein levels can be elucidated
The same techniques can be applied to biotechnology, crop and livestock improvement, etc...
What is bioinformatics?What is bioinformatics?
Application of information technology to the storage, management and analysis of biological information
Facilitated by the use of computers
What is bioinformatics?What is bioinformatics? Sequence analysis
Geneticists/ molecular biologists analyse genome sequence information to understand disease processes
Molecular modeling Crystallographers/ biochemists design drugs using computer-aided
tools
Phylogeny/evolution Geneticists obtain information about the evolution of organisms by
looking for similarities in gene sequences
Ecology and population studies Bioinformatics is used to handle large amounts of data obtained in
population studies
Medical informatics Personalised medicine
Sequence analysis: overviewSequence analysis: overview
Nucleotide sequence file
Search databases for similar sequences
Sequence comparison
Multiple sequence analysis
Design further experimentsRestriction mappingPCR planning
Translate into protein
Search for known motifs
RNA structure prediction
non-coding
coding
Protein sequence analysis
Search for protein coding regions
Manual sequence entry
Sequence database browsing
Sequencing project management
Protein sequence file
Search databases for similar sequences
Sequence comparison
Search for known motifs
Predict secondary structure
Predict tertiary
structureCreate a multiple sequence alignment
Edit the alignment
Format the alignment for publication
Molecular phylogeny
Protein family analysis
Nucleotide sequence analysis
Sequence entry
Gene Sequencing: Automated chemcial sequencing methods allow rapid generation of large data banks of gene sequences
Database similarity searching: The BLAST program has been written to allow rapid comparison of a new gene sequence with the 100s of 1000s of gene sequences in data bases
Sequences producing significant alignments: (bits) Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae] 112 7e-26gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce... 106 5e-24gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi... 69 7e-13gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae] 30 0.66gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae] 29 1.1gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae] 29 1.5
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae] Length = 478 Score = 112 bits (278), Expect = 7e-26 Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)
Query: 2 QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50 + PWG+ RV G G GV VLDTGI T H D R + +Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233
Query: 51 PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110 P D NGHGTH AG I + + GVA + ++ +G+ESbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
Sequence comparison: Gene sequences can be aligned to see similarities between gene from different sources
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813 || || || | | ||| | |||| ||||| ||| ||| 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135 . . . . .814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863 | | | | |||||| | |||| | || | |136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172 . . . . .864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913 ||| | ||| || || ||| | ||||||||| || |||||| |173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
Restriction mapping: Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes
AceIII 1 CAGCTCnnnnnnn’nnn...AluI 2 AG’CTAlwI 1 GGATCnnnn’n_ApoI 2 r’AATT_yBanII 1 G_rGCy’CBfaI 2 C’TA_GBfiI 1 ACTGGGBsaXI 1 ACnnnnnCTCCBsgI 1 GTGCAGnnnnnnnnnnn...
BsiHKAI 1 G_wGCw’CBsp1286I 1 G_dGCh’C
BsrI 2 ACTG_Gn’BsrFI 1 r’CCGG_yCjeI 2 CCAnnnnnnGTnnnnnn...CviJI 4 rG’CyCviRI 1 TG’CADdeI 2 C’TnA_GDpnI 2 GA’TCEcoRI 1 G’AATT_CHinfI 2 G’AnT_CMaeIII 1 ’GTnAC_MnlI 1 CCTCnnnnnn_n’MseI 2 T’TA_AMspI 1 C’CG_GNdeI 1 CA’TA_TG
Sau3AI 2 ’GATC_SstI 1 G_AGCT’CTfiI 2 G’AwT_C
Tsp45I 1 ’GTsAC_Tsp509I 3 ’AATT_
TspRI 1 CAGTGnn’
50 100 150 200 250
PCR Primer Design:Oligonucleotides for use in the polymerisation chain
reaction can be designed using computer based prgrams
OPTIMAL primer length --> 20MINIMUM primer length --> 18MAXIMUM primer length --> 22 OPTIMAL primer melting temperature --> 60.000MINIMUM acceptable melting temp --> 57.000MAXIMUM acceptable melting temp --> 63.000MINIMUM acceptable primer GC% --> 20.000MAXIMUM acceptable primer GC% --> 80.000Salt concentration (mM) --> 50.000 DNA concentration (nM) --> 50.000MAX no. unknown bases (Ns) allowed --> 0 MAX acceptable self-complementarity --> 12 MAXIMUM 3' end self-complementarity --> 8 GC clamp how many 3' bases --> 0
Plot created using codon preference (GCG)
Gene discovery: Computer program can be used to recognise the protein coding regions in DNA
0 1,000 2,000 3,000 4,000
4,0003,0002,0001,0000
2.0
1.5
1.0
0.5
-0.0
2.0
1.5
1.0
0.5
-0.0
2.0
1.5
1.0
0.5
-0.0
RNA structure prediction: Structural features of RNA can be predicted
G
GA
C
A
G
G
A
G
G
A
U
ACCG
CG
G
U
C
C
UGC
CG G U C C
U CA
CUU
GGACUUAGU
A
U
CA
U
C
A
G
U
C
UGCGC
AAU
A
G
G
UA A
C
G CGU
Protein structure prediction: Particular structural features can be recognised in protein sequences
Protein structure prediction: Particular structural features can be recognised in protein sequences
KD Hydrophobicity
Surface Prob.
Flexibility
Antigenic Index
CF TurnsCF Alpha Helices
CF Beta Sheets
GOR Alpha HelicesGOR Turns
GOR Beta SheetsGlycosylation Sites
0.8
1.20.0
-1.7
1.7
10
-5.0
5.0
50 100
50 100
Protein Structure : the 3-D structure of proteins is used to understand protein function and design new drugs
Multiple sequence alignment: Sequences of proteins from different organisms can be aligned to see similarities and differences
Alignment formatted using MacBoxshade
Phylogeny inference: Analysis of sequences allows evolutionary relationships to be determined
E.coli
C.botulinum
C.cadavers
C.butyricum
B.subtilis
B.cereusPhylogenetic tree constructed using the Phylip package
Large scale bioinformatics: genome projectsLarge scale bioinformatics: genome projects
MappingIdentifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
SequencingAssembling clone sequence reads into large (eventually complete) genome sequences
Gene discoveryIdentifying coding regions in genomic DNA by database searching and other methods
Function assignmentUsing database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
Data mining
Searching for relationships and correlations in the information
Genome comparisonComparing different complete genomes to infer evolutionary history and genome rearrangements
Challenges in bioinformaticsChallenges in bioinformatics Explosion of information
Need for faster, automated analysis to process large amounts of data
Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels etc…)
Need for “smarter” software to identify interesting relationships in very large data sets
Lack of “bioinformaticians” Software needs to be easier to access, use and
understand Biologists need to learn about the software, its
limitations, and how to interpret its results