bsc4933/5936 florida state university department of biology october 2, 2003 introduction to...

27
BSC4933/5936 BSC4933/5936 Florida State University Florida State University Department of Biology Department of Biology www.bio.fsu.edu www.bio.fsu.edu October 2, 2003 October 2, 2003 Introduction to Introduction to BioInformatics BioInformatics

Upload: vanessa-randall

Post on 25-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

BSC4933/5936BSC4933/5936

Florida State UniversityFlorida State University

Department of BiologyDepartment of Biology

www.bio.fsu.eduwww.bio.fsu.edu

October 2, 2003October 2, 2003

Introduction to BioInformaticsIntroduction to BioInformatics

Page 2: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

What sort of information can be

determined from a genomic sequence?

Steven M. ThompsonSteven M. Thompson

Florida State University School of Florida State University School of Computational Science and Computational Science and

Information Technology (Information Technology (CSITCSIT))

GenomicsGenomics

Page 3: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Easy Easy —— restriction digests and associated mapping; e.g. restriction digests and associated mapping; e.g.

software like the Wisconsin Package’s (Genetics software like the Wisconsin Package’s (Genetics

Computer Group [GCG]) Map, MapSort, and MapPlot.Computer Group [GCG]) Map, MapSort, and MapPlot.

Harder —Harder — fragment assembly and genome mapping; such as fragment assembly and genome mapping; such as

packages from the University of Washington’s packages from the University of Washington’s

Genome Center (Genome Center (http://www.genome.washington.edu/

), Phred/Phrap/Consed (), Phred/Phrap/Consed (http://www.phrap.org/) and ) and

SegMap, and The Institute for Genomic Research’s (SegMap, and The Institute for Genomic Research’s (

http://www.tigr.org/) Lucy and Assembler programs.) Lucy and Assembler programs.

Very hard —Very hard — gene finding and sequence annotation. This will gene finding and sequence annotation. This will

be the bulk of today’s lecture and is a primary be the bulk of today’s lecture and is a primary

focus in current genomics research.focus in current genomics research.

EasyEasy—— forward translation to peptides.forward translation to peptides.

Hard again Hard again —— genome scale comparisons and analyses.genome scale comparisons and analyses.

Lecture Topics —Lecture Topics —

Page 4: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Nucleic Acid Characterization: Nucleic Acid Characterization: Recognizing Coding Sequences.Recognizing Coding Sequences.Three general solutions to the gene finding problem:Three general solutions to the gene finding problem:

1)1) all genes have certain regulatory signals positioned in all genes have certain regulatory signals positioned in or about them,or about them,

2)2) all genes by definition contain specific code patterns, all genes by definition contain specific code patterns,

3)3) and many genes have already been sequenced and and many genes have already been sequenced and recognized in other organisms so we can infer function recognized in other organisms so we can infer function and location by homology if our new sequence is and location by homology if our new sequence is similar enough to an existing sequence.similar enough to an existing sequence.

All of these principles can be used to help locate the All of these principles can be used to help locate the position of genes in DNA and are often known as position of genes in DNA and are often known as ““searching by signalsearching by signal,” “,” “searching by contentsearching by content,” and ,” and ““homology inferencehomology inference” respectively.” respectively.

Page 5: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

URFs and ORFs — URFs and ORFs — definitionsdefinitionsURFURF:: Unidentified Reading Frame — any Unidentified Reading Frame — any

potential string of amino acids encoded by potential string of amino acids encoded by a stretch of DNA. Any given stretch of DNA a stretch of DNA. Any given stretch of DNA has potential URFs on any combination of has potential URFs on any combination of six potential reading frames, three forward six potential reading frames, three forward and three backward.and three backward.

ORFORF:: Open Reading Frame — by definition any Open Reading Frame — by definition any continuous reading frame that starts with a continuous reading frame that starts with a start codon and stops with a stop codon. start codon and stops with a stop codon. Not usually relevant to discussions of Not usually relevant to discussions of genomic eukaryotic DNA, but very relevant genomic eukaryotic DNA, but very relevant when dealing with mRNA/cDNA or when dealing with mRNA/cDNA or prokaryotic DNA.prokaryotic DNA.

Page 6: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Start Sites:Start Sites:Prokaryote promoter ‘Pribnow Box,’ Prokaryote promoter ‘Pribnow Box,’

TTGACwx{15,21}TAtAaTTTGACwx{15,21}TAtAaT;;

Eukaryote transcription factor site database, Eukaryote transcription factor site database, TFSites.DatTFSites.Dat;;

Shine-Dalgarno site, Shine-Dalgarno site, (AGG,GAG,GGA)x{6,9}ATG(AGG,GAG,GGA)x{6,9}ATG, in , in prokaryotes;prokaryotes;

Kozak eukaryote start consensus, Kozak eukaryote start consensus, cc(A,g)ccAUGgcc(A,g)ccAUGg;;

AUGAUG start codon in about 90% of genomes, start codon in about 90% of genomes, exceptions in some prokaryotes and organelles.exceptions in some prokaryotes and organelles.

One strategy — One-Dimensional Signal Recognition.One strategy — One-Dimensional Signal Recognition.

Page 7: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.One-Dimensional Approaches, cont.One-Dimensional Approaches, cont.

End Sites:End Sites:

‘‘Nonsense’ chain terminating, stop codons Nonsense’ chain terminating, stop codons — — UAAUAA, , UAGUAG, , UGAUGA;;

Eukaryote terminator consensus, Eukaryote terminator consensus, YGTGTTYYYGTGTTYY;;

Eukaryote poly(A) adenylation signal Eukaryote poly(A) adenylation signal AAUAAAAAUAAA;;

but exceptions in some ciliated protists and but exceptions in some ciliated protists and due to eukaryote suppresser tRNAs.due to eukaryote suppresser tRNAs.

Page 8: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Another Strategy — Two-Dimensional Weight Matrix.Another Strategy — Two-Dimensional Weight Matrix.

Exon/Intron JunctionsExon/Intron Junctions.. Donor Site Acceptor SiteDonor Site Acceptor Site

ExonExonIntronIntronExonExon

AA6464GG7373GG100100TT100100AA6262AA6868GG8484TT6363 . . . 6Py . . . 6Py74-8774-87NCNC6565AA100100GG100100NN

The splice cut sites occur before a 100% GT The splice cut sites occur before a 100% GT consensus at the donor site and after a 100% consensus at the donor site and after a 100% AG consensus at the acceptor site, but a AG consensus at the acceptor site, but a simple consensus is not informative enough.simple consensus is not informative enough.

Page 9: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.Two-Dimensional Weight Matrices describe the probability at Two-Dimensional Weight Matrices describe the probability at

each base position to be either A, C, U, or G, in percentages.each base position to be either A, C, U, or G, in percentages.

The Donor Matrix.The Donor Matrix.

CONSENSUS from: Donor Splice site sequencesCONSENSUS from: Donor Splice site sequences

from Stephen Mount NAR 10(2) 459;472 figure 1 page 460from Stephen Mount NAR 10(2) 459;472 figure 1 page 460

Exon cutExon cutsitesite IntronIntron

%G 20 9 11 74 100 0 29 12 84 9 18 20%G 20 9 11 74 100 0 29 12 84 9 18 20

%A 30 40 64 9 0 0 61 67 9 16 39 24%A 30 40 64 9 0 0 61 67 9 16 39 24

%U 20 7 13 12 0 100 7 11 5 63 22 27%U 20 7 13 12 0 100 7 11 5 63 22 27

%C 30 44 11 6 0 0 2 9 2 12 20 28%C 30 44 11 6 0 0 2 9 2 12 20 28

CONSENSUS sequence to a certainty level of 75 percent.CONSENSUS sequence to a certainty level of 75 percent.

VMWKGTRRGWHHVMWKGTRRGWHH

The matrix begins four bases ahead of the splice site!The matrix begins four bases ahead of the splice site!

Page 10: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.

The Acceptor Matrix.The Acceptor Matrix.

CONSENSUS of: Acceptor.Dat. IVS Acceptor Splice Site SequencesCONSENSUS of: Acceptor.Dat. IVS Acceptor Splice Site Sequences

from Stephen Mount NAR 10(2); 459-472 figure 1 page 460from Stephen Mount NAR 10(2); 459-472 figure 1 page 460

IntronIntron cut cutsite Exonsite Exon%G 15 22 10 10 10 6 7 9 7 5 5 24 1 0 100 52 24 19%G 15 22 10 10 10 6 7 9 7 5 5 24 1 0 100 52 24 19

%A 15 10 10 15 6 15 11 19 12 3 10 25 4 100 0 22 17 20%A 15 10 10 15 6 15 11 19 12 3 10 25 4 100 0 22 17 20

%T 52 44 50 54 60 49 48 45 45 57 58 30 31 0 0 8 37 29%T 52 44 50 54 60 49 48 45 45 57 58 30 31 0 0 8 37 29

%C 18 25 30 21 24 30 34 28 36 35 27 21 64 0 0 18 22 32%C 18 25 30 21 24 30 34 28 36 35 27 21 64 0 0 18 22 32

CONSENSUS sequence to a certainty level of 75.0 percent at each CONSENSUS sequence to a certainty level of 75.0 percent at each

position:position:

BBYHYYYHYYYDYAGVBHBBYHYYYHYYYDYAGVBH

The matrix begins fifteen bases upstream of the splice site!The matrix begins fifteen bases upstream of the splice site!

Page 11: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.

The CCAAT site — occurs around 75 base pairs upstream of the start point The CCAAT site — occurs around 75 base pairs upstream of the start point

of eukaryotic transcription, may be involved in the initial binding of RNA of eukaryotic transcription, may be involved in the initial binding of RNA

polymerase II.polymerase II.

Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.

Preferred region: motif within -212 to -57. Optimized cut-off value: 87.2%.Preferred region: motif within -212 to -57. Optimized cut-off value: 87.2%.

%G 7 25 14 40 57 1 0 0 12 9 34 30%G 7 25 14 40 57 1 0 0 12 9 34 30

%A 32 18 14 58 29 0 0 100 68 10 13 66%A 32 18 14 58 29 0 0 100 68 10 13 66

%U 30 27 45 1 11 1 1 0 15 82 2 1%U 30 27 45 1 11 1 1 0 15 82 2 1

%C 31 30 27 1 3 99 99 0 5 0 51 3%C 31 30 27 1 3 99 99 0 5 0 51 3

CONSENSUS sequence to a certainty level of 68 percent at each position:CONSENSUS sequence to a certainty level of 68 percent at each position:

HBYRRCCAATSRHBYRRCCAATSR

Page 12: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.

The TATA site (aka “Hogness” box) — a conserved A-T rich The TATA site (aka “Hogness” box) — a conserved A-T rich sequence found about 25 base pairs upstream of the start point of sequence found about 25 base pairs upstream of the start point of eukaryotic transcription, may be involved in positioning RNA eukaryotic transcription, may be involved in positioning RNA polymerase II for correct initiation and binds Transcription Factor IID.polymerase II for correct initiation and binds Transcription Factor IID.

Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.

Preferred region: center between -36 and -20. Optimized cut-off value: 79%.Preferred region: center between -36 and -20. Optimized cut-off value: 79%.

%G 39 5 1 1 1 0 5 11 40 39 33 33 33 36 36%G 39 5 1 1 1 0 5 11 40 39 33 33 33 36 36

%A 16 4 90 1 91 69 93 57 40 14 21 21 21 17 20%A 16 4 90 1 91 69 93 57 40 14 21 21 21 17 20

%U 8 79 9 96 8 31 2 31 8 12 8 13 16 19 18%U 8 79 9 96 8 31 2 31 8 12 8 13 16 19 18

%C 37 12 0 3 0 0 1 1 11 35 38 33 30 28 26%C 37 12 0 3 0 0 1 1 11 35 38 33 30 28 26

CONSENSUS sequence to a certainty level of 61 percent at each position:CONSENSUS sequence to a certainty level of 61 percent at each position:

STATAWAWRSSSSSSSTATAWAWRSSSSSS

Page 13: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.

The GC box — may relate to the binding of transcription The GC box — may relate to the binding of transcription factor Sp1.factor Sp1.Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.

Preferred region: motif within -164 to +1. Optimized cut-off value: 88%.Preferred region: motif within -164 to +1. Optimized cut-off value: 88%.

%G 18 41 56 75 100 99 0 82 81 62 70 13 19 40%G 18 41 56 75 100 99 0 82 81 62 70 13 19 40

%A 37 35 18 24 0 1 20 17 0 29 8 0 7 15%A 37 35 18 24 0 1 20 17 0 29 8 0 7 15

%U 30 12 23 0 0 0 18 1 18 9 15 27 42 37%U 30 12 23 0 0 0 18 1 18 9 15 27 42 37

%C 15 11 2 0 0 0 62 0 1 0 6 61 31 9%C 15 11 2 0 0 0 62 0 1 0 6 61 31 9

CONSENSUS sequence to a certainty level of 67 percent at each position:CONSENSUS sequence to a certainty level of 67 percent at each position:

WRKGGGHGGRGBYKWRKGGGHGGRGBYK

Page 14: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.The cap signal — a structure at the 5’ end of eukaryotic mRNA introduced after The cap signal — a structure at the 5’ end of eukaryotic mRNA introduced after

transcription by linking the 5’ end of a guanine nucleotide to the terminal base of the transcription by linking the 5’ end of a guanine nucleotide to the terminal base of the

mRNA and methylating at least the additional guanine; the structure is mRNA and methylating at least the additional guanine; the structure is 7Me7MeGG5’5’pppppp5’5’Np. Np.

Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.Base freguencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.

Preferred region: center between 1 and +5. Optimized cut-off value: 81.4%.Preferred region: center between 1 and +5. Optimized cut-off value: 81.4%.

%G 23 0 0 38 0 15 24 18%G 23 0 0 38 0 15 24 18

%A 16 0 95 9 25 22 15 17%A 16 0 95 9 25 22 15 17

%U 45 0 5 26 43 24 33 33%U 45 0 5 26 43 24 33 33

%C 16 100 0 27 31 39 28 32%C 16 100 0 27 31 39 28 32

CONSENSUS sequence to a certainty level of 63 percent at each position:CONSENSUS sequence to a certainty level of 63 percent at each position:

KCABHYBYKCABHYBY

Page 15: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Signal Searching:Signal Searching:locating transcription and translation affecter sites.locating transcription and translation affecter sites.

Two-Dimensional Weight Matrices, cont.Two-Dimensional Weight Matrices, cont.

The eukaryotic terminator weight matrix.The eukaryotic terminator weight matrix.Base frequencies according to McLauchlan et al.Base frequencies according to McLauchlan et al.(1985) N.A.R. 13:1347-1368.(1985) N.A.R. 13:1347-1368.

Found in about 2/3's of all eukaryotic gene sequences.Found in about 2/3's of all eukaryotic gene sequences.

%G 19 81 9 94 14 10 11 19%G 19 81 9 94 14 10 11 19

%A 13 9 3 3 4 0 11 13%A 13 9 3 3 4 0 11 13

%U 51 9 89 3 79 61 56 47%U 51 9 89 3 79 61 56 47

%C 17 1 0 0 3 29 21 21%C 17 1 0 0 3 29 21 21

CONSENSUS sequence to a certainty level of 68 percent at each position:CONSENSUS sequence to a certainty level of 68 percent at each position:

BGTGTBYYBGTGTBYY

Page 16: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Content Approaches. Strategies for Content Approaches. Strategies for finding coding regions based on the finding coding regions based on the content of the DNA itself.content of the DNA itself.Searching by content utilizes the fact that genes necessarily have

many implicit biological constraints imposed on their genetic code. This induces certain periodicities and patterns to produce distinctly unique coding sequences; non-coding stretches do not exhibit this type of periodic compositional bias. These principles can help discriminate structural genes in two ways:

1) based on the local “non-randomness” of a stretch, and

2) based on the known codon usage of a particular life form.

The first, the non-randomness test, does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The second approach is based on the fact that different organisms use different frequencies of codons to code for particular amino acids. This does require a codon usage table built up from known translations; however, it also tells us the strand and reading frame for the gene products as opposed to the former.

Page 17: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Content Approaches, cont.Content Approaches, cont.“Non-Randomness” Techniques.“Non-Randomness” Techniques. GCG’s TestCode.GCG’s TestCode.

Relies solely on the base compositional bias of every third position base.Relies solely on the base compositional bias of every third position base.

The plot is divided into three regions: top and bottom areas predict coding and The plot is divided into three regions: top and bottom areas predict coding and noncoding regions, respectively, to a confidence level of 95%, the middle area claims noncoding regions, respectively, to a confidence level of 95%, the middle area claims no statistical significance. Diamonds and vertical bars above the graph denote no statistical significance. Diamonds and vertical bars above the graph denote potential stop and start codons respectively.potential stop and start codons respectively.

Page 18: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Content Approaches, cont. Content Approaches, cont. Codon Usage Codon Usage Techniques.Techniques. GCG’s CodonPreference.GCG’s CodonPreference.

Genomes use synonymous codons unequally sorted Genomes use synonymous codons unequally sorted phylogenetically.phylogenetically.

Each forward reading frame indicates a red codon preference curve and a blue third Each forward reading frame indicates a red codon preference curve and a blue third position GC bias curve. The horizontal lines within each plot are the average values of position GC bias curve. The horizontal lines within each plot are the average values of each attribute. Start codons are represented as vertical lines rising above each box and each attribute. Start codons are represented as vertical lines rising above each box and stop codons are shown as lines falling below the reading frame boxes. Rare codon stop codons are shown as lines falling below the reading frame boxes. Rare codon choices are shown for each frame with hash marks below each reading frame.choices are shown for each frame with hash marks below each reading frame.

Page 19: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Internet World Wide Web servers.Internet World Wide Web servers.Many servers have been established that can be a huge Many servers have been established that can be a huge help with gene finding analyses. Most of these servers help with gene finding analyses. Most of these servers combine many of the methods previously discussed but combine many of the methods previously discussed but they consolidate the information and often combine signal they consolidate the information and often combine signal and content methods with homology inference in order to and content methods with homology inference in order to ascertain exon locations. Many use powerful neural net ascertain exon locations. Many use powerful neural net or artificial intelligence approaches to assist in this or artificial intelligence approaches to assist in this difficult ‘decision’ process.difficult ‘decision’ process.

A wonderful bibliography on computational methods for A wonderful bibliography on computational methods for gene recognition has been compiled by Wentian Li (gene recognition has been compiled by Wentian Li (http://www.http://www.nslijnslij-genetics.org/gene/-genetics.org/gene/),),

and the Baylor College of Medicine’s Gene Feature and the Baylor College of Medicine’s Gene Feature Search (Search (http://http://searchlaunchersearchlauncher..bcmbcm..tmctmc..eduedu//seqseq-search/gene-search.html-search/gene-search.html) is another nice portal to ) is another nice portal to several gene finding tools.several gene finding tools.

Page 20: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

World Wide Web Internet servers, cont.World Wide Web Internet servers, cont.Five popular gene-finding services are GrailEXP, GeneId, GenScan, NetGene2, and GeneMark.

The neural net system GrailEXP (Gene recognition and analysis internet link–EXPanded http://grail.lsd.ornl.gov/grailexp/) is a gene finder, an EST alignment utility, an exon prediction program, a promoter and polyA recognizer, a CpG island locater, and a repeat masker, all combined into one package.

GeneId (http://www1.imim.es/software/geneid/index.html) is an ‘ab initio’ Artificial Intelligence system for predicting gene structure optimized in genomic Drosophila or Homo DNA.

NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/), another ‘ab initio’ program, predicts splice site likelihood using neural net techniques in human, C. elegans, and A. thaliana DNA.

GenScan (http://genes.mit.edu/GENSCAN.html) is perhaps the most ‘trusted’ server these days with vertebrate genomes.

The GeneMark (http://opal.biology.gatech.edu/GeneMark/) family of gene prediction programs is based on Hidden Markov Chain modeling techniques; originally developed in a prokaryotic context the programs have now been expanded to include eukaryotic modeling as well.

Page 21: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Homology Inference.Homology Inference.Similarity searching can be particularly powerful for inferring gene Similarity searching can be particularly powerful for inferring gene location by homology. This can often be the most informative of any location by homology. This can often be the most informative of any of the gene finding techniques, especially now that so many of the gene finding techniques, especially now that so many sequences have been collected and analyzed. Wisconsin Package sequences have been collected and analyzed. Wisconsin Package programs such as the BLAST and FastA families, Compare and programs such as the BLAST and FastA families, Compare and DotPlot, Gap and BestFit, and FrameAlign and FrameSearch can all DotPlot, Gap and BestFit, and FrameAlign and FrameSearch can all be a huge help in this process. But this too can be misleading and be a huge help in this process. But this too can be misleading and seldom gives exact start and stop positions. For example:seldom gives exact start and stop positions. For example:

805 GCCATCGCCCGGGGCCGAGGGAAGGGCCCGGCAGCTGAGGAGCCG...CT 851805 GCCATCGCCCGGGGCCGAGGGAAGGGCCCGGCAGCTGAGGAGCCG...CT 851 ||| |||||| ::: ||||||...|||||| ||||| |||||| ::: ||||||...|||||| || 46 AlaAlaAlaArgCysLysAlaAlaGluAlaAlaAlaAspGluProAlaLe 6246 AlaAlaAlaArgCysLysAlaAlaGluAlaAlaAlaAspGluProAlaLe 62 . . . . .. . . . . 852 GAGCTTGCTGGACGACATGAACCACTGCTACTCCCGCCTGCGGGAACTGG 901852 GAGCTTGCTGGACGACATGAACCACTGCTACTCCCGCCTGCGGGAACTGG 901 | ||| ||||||||| |||||||||||||||||| ||||| ||| ||||||||| |||||||||||||||||| |||| 63 uCysLeuGlnCysAspMetAsnAspCysTyrSerArgLeuArgArgLeuV 7963 uCysLeuGlnCysAspMetAsnAspCysTyrSerArgLeuArgArgLeuV 79 . . . . .. . . . . 902 TACCCGGAGTCCCGAGAGGCACTCAGCTTAGCCAGGTGGAAATCCTACAG 951902 TACCCGGAGTCCCGAGAGGCACTCAGCTTAGCCAGGTGGAAATCCTACAG 951 ||||| :::||| ... |||...|||||||||||||||||||| :::||| ... |||...||||||||||||||| 80 alProThrIleProProAsnLysLysValSerLysValGluIleLeuGln 9580 alProThrIleProProAsnLysLysValSerLysValGluIleLeuGln 95 . . .. . . 952 CGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTG 990952 CGCGTCATCGACTACATTCTCGACCTGCAGGTAGTCCTG 990 ||||||||||||||||||||||||||| |||||||||||||||||||||||||||||| ||| 96 HisValIleAspTyrIleLeuAspLeuGlnLeuAlaLeu 10896 HisValIleAspTyrIleLeuAspLeuGlnLeuAlaLeu 108

Page 22: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Summary. The combinatorial approach.Summary. The combinatorial approach.Get all your data in one place. GCG’s SeqLab is a great Get all your data in one place. GCG’s SeqLab is a great way to do this due to its advanced annotation capabilities:way to do this due to its advanced annotation capabilities:

Page 23: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

The Objective: a complete protein.The Objective: a complete protein.

Now what?Now what?

Page 24: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Beyond just finding genes: Beyond just finding genes: Genome scale analyses.Genome scale analyses.Unfortunately much ’traditional’ sequence analysis software can’t do it, but Unfortunately much ’traditional’ sequence analysis software can’t do it, but there are some very good Web resources available for these types of ‘global there are some very good Web resources available for these types of ‘global view’ analyses. Let’s run through a few examples. NCBI’s Genome pages (view’ analyses. Let’s run through a few examples. NCBI’s Genome pages (http://www.ncbi.nlm.nih.gov/) present a good starting point in North America:) present a good starting point in North America:

Page 25: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Beyond just finding genes: Beyond just finding genes: Genome scale analyses, cont.Genome scale analyses, cont.That can lead to neat places like the That can lead to neat places like the Genome Browser at the University of Genome Browser at the University of

California, Santa Cruz (California, Santa Cruz (http://genome.ucsc.edu/) and the Ensembl project at ) and the Ensembl project at

the Sanger Center for BioInformatics (the Sanger Center for BioInformatics (http://www.ensembl.org/):):

Page 26: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

Beyond just finding genes: Beyond just finding genes: Genome scale analyses, cont.Genome scale analyses, cont.And sites like the the University of Wisconsin’s And sites like the the University of Wisconsin’s E. coliE. coli Genome Genome

Project Project (http://www.genome.wisc.edu/) and The and The Institute for

Genomic Research’s (http://www.tigr.org/) MUMMER package.

Page 27: BSC4933/5936 Florida State University Department of Biology  October 2, 2003 Introduction to BioInformatics

References.References. A perplexing variety of techniques exist for the identification and A perplexing variety of techniques exist for the identification and analysis of protein coding regions in genomic DNA. Knowing which to use when and analysis of protein coding regions in genomic DNA. Knowing which to use when and how to combine their inferences will go a long way in your genomic analyses!how to combine their inferences will go a long way in your genomic analyses!

Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.

Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg.

Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli promoter sequences. Nucleic Acids Research 11, 2237-2255.

Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.

McLauchen, J., Gaffrey, D., Whitton, J. and Clements, J. (1985). The Consensus Sequences YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3’ Termini. Nucleic Acid Research 13 , 1347-1368.

Proudfoot, N.J. and Brownlee, G.G. (1976). 3’ Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.

Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.

von Heijne, G. (1987a) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, CA.

von Heijne, G. (1987b). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.