![Page 1: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/1.jpg)
Biological Motivation: Gene Finding
Anne R. Haake
Rhys Price Jones
![Page 2: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/2.jpg)
Gene Finding
Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
– How many genes are in an organism? What homologies exist?
• Gain understanding of problems in basic science
– e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• The relative emphasis placed on these goals affects the design of computational approaches to gene finding.
![Page 3: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/3.jpg)
Gene Finding by Biological Methods:
• Extract mRNA → reverse transcribe to cDNA → label cDNA → probe a DNA library with the labeled cDNA → gene found
![Page 4: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/4.jpg)
Gene Finding by Computational Methods
• Dependent on good experimental data to build reliable predictive models
• Various aspects of gene structure/function provide information used in gene finding programs
![Page 5: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/5.jpg)
Figure 12.3
![Page 6: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/6.jpg)
The Informatics View of Genes
• Genes are character strings embedded in much larger strings called the genome
• Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.
![Page 7: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/7.jpg)
Gene Finding
• Cells recognize genes from DNA sequence
– They find genes via their bioprocesses
• Not so easy for us...
![Page 8: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/8.jpg)
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...
![Page 9: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/9.jpg)
GCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...
![Page 10: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/10.jpg)
Types of Genes
• Protein coding
– most genes
• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)
![Page 11: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/11.jpg)
3 Major Categories of Information used in Gene Finding Programs
• Signals/features: sequence patterns with functional significance
– e.g. splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
• Content/composition: statistical properties of coding vs. non-coding regions
– e.g. codon bias; length of ORFs in prokaryotes; GC content
• Similarity: compare the DNA sequence to known sequences in databases
– Not only known proteins but also ESTs and cDNAs
![Page 12: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/12.jpg)
Looking for Protein Coding Genes
• Look for an ORF (begins with a start codon, ends with a stop codon, no internal stops!)
– Long (usually > 60-100 aa)
– More likely to be a gene if homologous to a “known” protein
• Look for basal signals
– Transcription, splicing, translation
• Look for regulatory signals
– Depends on organism
  • Prokaryotes vs. eukaryotes
  • Vertebrates vs. fungi
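The ORF criterion above can be sketched in a few lines of Python. This is a minimal illustration, not any particular gene finder's method: it scans only the forward strand, assumes ATG as the sole start codon unless told otherwise, and uses a hypothetical function name (`find_orfs`) and a default length cutoff of 60 aa taken from the lower end of the range on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_aa=60, starts=("ATG",)):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first start codon seen in a frame to the
    next in-frame stop codon; `end` is exclusive and includes the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3        # codons before the stop
                if n_aa >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs
```

To cover the opposite strand as well, the same function could be run on the reverse complement of the sequence.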
![Page 13: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/13.jpg)
Easier problem: Gene Finding in Bacterial Genomes
Why?
• Dense genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
– Complete genomes available for many species
![Page 14: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/14.jpg)
What do Prokaryotic Genes look like?
5’ → 3’: Promoter region (maybe), Ribosome binding site (maybe), Start codon, Open Reading Frame, Stop codon, Termination sequence (maybe)
![Page 15: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/15.jpg)
Prokaryotic Gene Expression
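[Figure: expression of a polycistronic operon. DNA: Promoter, Cistron 1, Cistron 2, …, Cistron N, Terminator. Transcription (RNA polymerase) yields one mRNA, 5’ to 3’, carrying cistrons 1 to N, with SD sites marked in the polycistronic message. Translation (ribosome, tRNAs, protein factors) yields separate polypeptides, each with its own N and C terminus.]
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt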
![Page 16: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/16.jpg)
Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a protein
• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene
![Page 17: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/17.jpg)
Open Reading Frames
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5’->3’ direction and a further three in the reverse direction on the opposite strand.
A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF)
Example: A C G T A A C T G A C T A G G T G A A T
Frame 1: ACG TAA CTG ACT AGG TGA
Frame 2: CGT AAC TGA CTA GGT GAA
Frame 3: GTA ACT GAC TAG GTG AAT
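The grouping into reading frames can be reproduced with a short Python sketch. The function names (`reverse_complement`, `codons`, `six_frames`) are illustrative choices, and incomplete trailing triplets are simply dropped.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def codons(seq, offset):
    """Split seq into complete triplets starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to lists of triplets."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = codons(seq, k)   # forward strand
        frames[f"-{k + 1}"] = codons(rc, k)    # opposite strand
    return frames

frames = six_frames("ACGTAACTGACTAGGTGAAT")
print(frames["+2"])  # ['CGT', 'AAC', 'TGA', 'CTA', 'GGT', 'GAA']
```

In this example all three forward frames contain a stop codon (TAA, TGA or TAG), so none of them is open over the whole sequence.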
![Page 18: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/18.jpg)
ORFs as gene candidates
• A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons has no stop codons is $(61/64)^n$, since 61 of the 64 codons are not stops (assuming all codons are equally likely)
• When n is 50, there is a probability of about 91% that the random sequence contains a stop codon
• When n is 100, this probability exceeds 99%
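The stop-codon probabilities can be checked numerically. The sketch below assumes, as the formula does, that all 64 codons are equally likely and that n counts codons; `p_no_stop` and `min_codons_for` are hypothetical helper names.

```python
import math

def p_no_stop(n_codons):
    """Probability that n random codons contain no stop codon."""
    return (61 / 64) ** n_codons

def min_codons_for(alpha):
    """Smallest n for which a stop-free random run has probability < alpha."""
    return math.ceil(math.log(alpha) / math.log(61 / 64))

for n in (50, 100):
    print(n, round(1 - p_no_stop(n), 3))  # chance of at least one stop
```

Under this model, `min_codons_for(0.01)` gives the ORF length beyond which a stop-free random stretch has less than a 1% chance of occurring, which is one way to motivate a minimum-length cutoff.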
![Page 19: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/19.jpg)
Codon Bias
• The genetic code is degenerate
– Equivalent triplet codons code for the same amino acid
– http://www.pangloss.com/seidel/Protocols/codon.html
• Codon usage varies
– organism to organism
– gene to gene
• Biological basis
– Avoidance of codons similar to stop codons
– Preference for codons that correspond to abundant tRNAs within the organism
![Page 20: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/20.jpg)
Codon Bias: Gene Differences

| Amino acid | Codon | GAL4 | ADH1 |
|---|---|---|---|
| Gly | GGG | 0.21 | 0 |
| Gly | GGA | 0.17 | 0 |
| Gly | GGT | 0.38 | 0.93 |
| Gly | GGC | 0.24 | 0.07 |
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
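Fractions like those in the glycine comparison can be computed from a coding sequence by counting synonymous codons. This is a minimal sketch: it assumes the input is an in-frame coding sequence, and `synonymous_fractions` and `GLY_CODONS` are illustrative names, not part of any library.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")

def synonymous_fractions(cds, codon_family=GLY_CODONS):
    """Fraction of each codon among the occurrences of its synonymous family."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codon_family)
    if total == 0:
        return {c: 0.0 for c in codon_family}  # amino acid absent from cds
    return {c: counts[c] / total for c in codon_family}

# Toy sequence: Met, Gly, Gly, Gly, Gly, stop
print(synonymous_fractions("ATGGGTGGTGGCGGTTAA"))
```

A gene finder can compare such fractions in a candidate ORF with those of known genes from the same organism.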
![Page 21: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/21.jpg)
Codon Bias: Organism Differences
• Yeast genome: Arg is specified by AGA 48% of the time (the other five equivalent codons ~10% each)
• Fruit fly genome: Arg is specified by CGC 33% of the time (the other five ~13% each)
• Complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/
![Page 22: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/22.jpg)
GC content
• GC content relative to AT content is a distinguishing feature of bacterial genomes
• Varies dramatically across species
– Serves as a means to identify bacterial species
• For various biological reasons
– Mutational bias of particular DNA polymerases
– DNA repair mechanisms
– Horizontal gene transfer (transformation, transduction, conjugation)
![Page 23: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/23.jpg)
GC Content
• GC content may be different in recently acquired genes than elsewhere
• This can lead to variations in the frequency of codon usage within coding regions
– There may be significant differences in codon bias among different genes of a single bacterium’s genome
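One simple way to look for regions whose GC content differs from the rest of a genome is a sliding window. The sketch below is illustrative only: the window size, step and tolerance are arbitrary assumptions, and `gc_content`, `windowed_gc` and `atypical_windows` are hypothetical helper names.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def windowed_gc(seq, window=100, step=50):
    """GC content of each window, as (window start, fraction) pairs."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

def atypical_windows(seq, window=100, step=50, tolerance=0.1):
    """Windows whose GC content differs from the whole-sequence value."""
    overall = gc_content(seq)
    return [(i, gc) for i, gc in windowed_gc(seq, window, step)
            if abs(gc - overall) > tolerance]
```

Windows flagged this way are only candidates for recently acquired DNA; other causes of local GC variation are not ruled out by this check.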
![Page 24: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/24.jpg)
Ribosome Binding Sites
• The RBS, also known as the Shine-Dalgarno sequence (species-dependent), should bind well with the 3’ end of the 16S rRNA (part of the ribosome)
• Usually found within 4-18 nucleotides upstream of the start codon of a true gene
![Page 25: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/25.jpg)
Shine-Dalgarno Sequence
• Is a nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
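The Shine-Dalgarno check can be sketched as a search of the region just upstream of a candidate start codon. Several things here are assumptions made for illustration: the 4-18 nucleotide distance is measured from the end of the motif to the start codon, matches are ungapped and scored by simple identity to AGGAGG, and `best_sd_match` is a hypothetical name.

```python
SD_CONSENSUS = "AGGAGG"

def best_sd_match(seq, start, motif=SD_CONSENSUS, min_gap=4, max_gap=18):
    """Best ungapped match to the motif upstream of the start codon.

    `start` is the index of the first base of the start codon. Returns
    (matching positions, motif start index, gap to start codon), or
    None if the sequence is too short upstream.
    """
    seq = seq.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        pos = start - gap - len(motif)
        if pos < 0:
            break                               # ran off the 5' end
        window = seq[pos:pos + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if best is None or score > best[0]:
            best = (score, pos, gap)
    return best
```

A high score (for example 5 or 6 of 6 positions) would raise confidence in the ORF; a real tool would more likely score complementarity to the organism's own 16S rRNA 3’ end.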
![Page 26: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/26.jpg)
Bacterial Promoter
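-35 region: T82 T84 G78 A65 C54 A45
… (16-18 bp) …
-10 region: T80 A95 T45 A60 A50 T96
… +1: (A,G)
(The number after each base is presumably the percentage of promoters with that base at that position.)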
Not so simple: remember, these are consensus sequences
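Because real promoters only approximate the consensus, a scan has to tolerate mismatches. The sketch below counts identities to the two hexamers (TTGACA and TATAAT, read off the consensus above) with a 16-18 bp spacer. This is a simplification: a position weight matrix built from the per-base frequencies would use the data more fully. `scan_promoters` and the `min_matches` threshold are assumptions.

```python
def matches(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))

def scan_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=(16, 17, 18), min_matches=9):
    """Return (pos35, pos10, spacer, score) for candidate promoter pairs."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        score35 = matches(seq[i:i + len(box35)], box35)
        for spacer in spacers:
            j = i + len(box35) + spacer          # start of the -10 box
            if j + len(box10) > len(seq):
                continue
            total = score35 + matches(seq[j:j + len(box10)], box10)
            if total >= min_matches:             # out of 12 positions
                hits.append((i, j, spacer, total))
    return hits
```

Lowering `min_matches` finds weaker promoters at the cost of more false positives.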
![Page 27: Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones](https://reader030.vdocument.in/reader030/viewer/2022032517/56649c875503460f9493efd1/html5/thumbnails/27.jpg)
Termination Sequences
• 3’-U tail
• Stem/loop
– Inverted repeat immediately preceding the run of uracils
Termination sequence
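A terminator of this kind can be searched for in DNA as a perfect inverted repeat followed directly by a run of T (U in the mRNA). The sketch below is deliberately crude and all of its parameters are assumptions: a fixed stem length, a loop of 3-8 bases, at least four T's, and no mismatches allowed in the stem. `find_terminators` is a hypothetical name.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def find_terminators(seq, stem=5, min_loop=3, max_loop=8, min_t=4):
    """Return (hairpin start, end of T run, loop length) for each hit."""
    seq = seq.upper()
    hits = []
    for run in re.finditer("T{%d,}" % min_t, seq):
        end = run.start()            # the hairpin must end right here
        right = seq[end - stem:end] if end >= stem else ""
        for loop in range(min_loop, max_loop + 1):
            left_start = end - (2 * stem + loop)
            if left_start < 0:
                break
            left = seq[left_start:left_start + stem]
            if reverse_complement(left) == right:   # arms can base-pair
                hits.append((left_start, run.end(), loop))
                break
    return hits
```

One known limitation of this sketch: if the right arm of the stem itself ends in T, those bases are absorbed into the T run and the hairpin is missed.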