Genome Annotation, Haixu Tang, School of Informatics
![Page 1: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/1.jpg)
Genome Annotation
Haixu Tang
School of Informatics
![Page 2: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/2.jpg)
Genome and genes
• Genome: an organism’s genetic material (analogy: a car encyclopedia)
• Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA (analogy: the chapters on how to make the components of a car, or how to use and drive a car).
![Page 3: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/3.jpg)
Gene Prediction: Computational Challenge
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene!
![Page 6: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/6.jpg)
• Gene: A sequence of nucleotides coding for protein
• Gene Prediction Problem: Determine the beginning and end positions of genes in a genome
Gene Prediction: Computational Challenge
![Page 7: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/7.jpg)
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
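The two steps on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a DNA string that is already the coding strand read in frame; the names `transcribe`, `translate`, and `CODON_TABLE` are made up for this example and do not come from the slides.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """DNA coding strand -> mRNA (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """mRNA -> one-letter protein string, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The slide's example was evidently chosen so that the protein spells PEPTIDE in one-letter amino acid codes.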
![Page 8: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/8.jpg)
• Codon: 3 consecutive nucleotides
• 4^3 = 64 possible codons
• Genetic code is degenerate and redundant
– Includes start and stop codons
– An amino acid may be coded by more than one codon (codon degeneracy)
Translating Nucleotides into Amino Acids
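To make the degeneracy concrete, the following sketch counts how many codons map to each amino acid. It assumes the standard genetic code (other codes exist, for example in mitochondria); the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(codons, AMINO_ACIDS))

# Number of codons per amino acid (or stop signal).
degeneracy = Counter(CODON_TABLE.values())

print(len(codons))      # 64
print(degeneracy["L"])  # 6 codons for leucine
print(degeneracy["M"])  # 1 codon for methionine (ATG, also the start codon)
print(degeneracy["*"])  # 3 stop codons
```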
![Page 9: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/9.jpg)
• In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations
• Systematically deleted nucleotides from DNA
– Single and double deletions dramatically altered protein product
– Effects of triple deletions were minor
– Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein
Codons
![Page 10: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/10.jpg)
UAA, UAG and UGA are the 3 Stop codons that (together with the Start codon AUG, written ATG in DNA) delineate Open Reading Frames
Genetic Code and Stop Codons
![Page 11: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/11.jpg)
Six Frames in a DNA Sequence
• stop codons – TAA, TAG, TGA
• start codon – ATG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
[Figure: the two strands above repeated three times each, once per reading-frame offset, giving six frames in total]
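A minimal sketch of how the six frames can be enumerated: three offsets on the given strand and three on its reverse complement. The function names and the `+1` to `-3` frame labels are conventions chosen for this example, not something the slide prescribes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into codons for all six reading frames.

    Keys "+1", "+2", "+3" are the forward frames; "-1", "-2", "-3" are the
    frames of the reverse complement. Trailing bases that do not fill a
    whole codon are dropped.
    """
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = [
                s[i:i + 3] for i in range(offset, len(s) - 2, 3)
            ]
    return frames


for label, codons in six_frames("ATGAAATAG").items():
    print(label, " ".join(codons))
```

Note that the second strand on the slide is printed as the base-by-base complement (3' to 5'), whereas `reverse_complement` returns it in the usual 5' to 3' reading direction.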
![Page 12: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/12.jpg)
Open Reading Frames (ORFs)
• Detect potential coding regions by looking at ORFs
– A genome of length n is comprised of (n/3) codons
– Stop codons break genome into segments between consecutive Stop codons
– The subsegments of these that start from the Start codon (ATG) are ORFs
• ORFs in different frames may overlap
[Figure: genomic sequence with an open reading frame running from ATG to TGA; each of the three reading frames is labeled n/3]
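The procedure described on this slide can be sketched as a scan over the three forward frames of one strand. This is a simplified illustration: `find_orfs` is a hypothetical helper, it reports the first ATG after each stop codon (so the longest ORF per stop), and it would have to be run on the reverse complement as well to cover all six frames.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    start is the index of the A in ATG, end is one past the stop codon,
    frame is 0, 1, or 2. min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Two ORFs in different frames that overlap between positions 4 and 12.
print(find_orfs("ATGCATGGGTAAATGA"))  # [(0, 12, 0), (4, 16, 1)]
```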
![Page 13: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/13.jpg)
Long vs. Short ORFs
• Long open reading frames may be a gene
– At random, we should expect one stop codon every (64/3) ≈ 21 codons
– However, genes are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds a certain threshold
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
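The 64/3 figure follows from treating codons as independent and uniformly random, so that each codon is a stop with probability 3/64. Under that same (idealized) assumption, the chance that a stretch of k codons contains no stop shrinks geometrically, which is why long ORFs are suspicious. A small sketch, with function names invented for illustration:

```python
# Assumption: all 64 codons are equally likely and independent of each other.
P_STOP = 3 / 64


def expected_codons_between_stops():
    """Mean spacing of stop codons under the uniform random model."""
    return 1 / P_STOP


def prob_no_stop(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** k


print(round(expected_codons_between_stops(), 1))  # 21.3
for k in (21, 50, 100, 300):
    print(k, prob_no_stop(k))
```

Real genomes have biased base composition, so these numbers are only a rough guide for choosing a length threshold.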
![Page 14: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/14.jpg)
Testing ORFs: Codon Usage
• Create a 64-element hash table and count the frequencies of codons in an ORF
• Amino acids typically have more than one codon, but in nature certain codons are used more often than others
• Uneven use of the codons may characterize a real gene
• This compensates for the pitfalls of the ORF length test
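A minimal sketch of the 64-element table mentioned above, using a Python dict as the hash table. It assumes the ORF is given in frame starting at position 0; the function names are illustrative.

```python
from itertools import product


def codon_counts(orf):
    """Count each of the 64 codons in an ORF read in frame from position 0."""
    orf = orf.upper()
    counts = {"".join(c): 0 for c in product("ACGT", repeat=3)}  # 64 entries
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases such as N
            counts[codon] += 1
    return counts


def codon_frequencies(orf):
    """Relative frequency of each codon among all counted codons."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in counts}
    return {codon: n / total for codon, n in counts.items()}


freqs = codon_frequencies("ATGGCTGCTGCCTAA")
print(freqs["GCT"], freqs["GCC"])  # alanine codons used unevenly: 0.4 0.2
```

Comparing such a frequency profile against the organism's known codon usage is one way to test whether an ORF looks like a real gene.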
![Page 15: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/15.jpg)
Codon Usage in Human Genome
![Page 16: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/16.jpg)
Codon Usage in Mouse Genome

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24

Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31

Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20

Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38

Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25
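The "frac" column appears to be each codon's share among the synonymous codons of its amino acid, which can be checked from the "/1000" column. The sketch below does this for Ser, Pro, Ala, and Gln; Leu is left out because only four of its six codons are listed (TTA and TTG are missing), so its fractions cannot be recomputed from the values shown. The dictionary and function names are illustrative.

```python
# Per-thousand codon counts copied from the mouse table.
PER_THOUSAND = {
    "Ser": {"TCG": 4.31, "TCA": 11.44, "TCT": 15.70,
            "TCC": 17.92, "AGT": 12.25, "AGC": 19.54},
    "Pro": {"CCG": 6.33, "CCA": 17.10, "CCT": 18.31, "CCC": 18.42},
    "Ala": {"GCG": 6.72, "GCA": 15.80, "GCT": 20.12, "GCC": 26.51},
    "Gln": {"CAG": 34.18, "CAA": 11.51},
}


def synonymous_fractions(per_thousand):
    """Share of each codon among the codons for the same amino acid."""
    total = sum(per_thousand.values())
    return {codon: round(value / total, 2)
            for codon, value in per_thousand.items()}


for amino_acid, table in PER_THOUSAND.items():
    print(amino_acid, synonymous_fractions(table))
```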
![Page 17: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/17.jpg)
Transcription in prokaryotes
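[Figure: prokaryotic gene layout from 5’ (upstream) to 3’ (downstream): promoter, transcription start site, untranslated region, start codon, coding region, stop codon, untranslated region, transcription stop site. The transcribed region runs from the transcription start site to the transcription stop site.]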
![Page 18: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/18.jpg)
Microbial gene finding
• Microbial genomes tend to be gene rich (80%-90% of the sequence is coding sequence)
• Major problem – finding genes without a known homologue.
![Page 19: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/19.jpg)
Open Reading Frame
An Open Reading Frame (ORF) is a sequence of codons that starts with a start codon, ends with a stop codon, and has no stop codons in between.
Searching for ORFs – consider all 6 possible reading frames: 3 forward and 3 reverse
Is the ORF a coding sequence?
1. Must be long enough (roughly 300 bp or more)
2. Should have average amino-acid composition specific for a given organism.
3. Should have codon usage specific for the given organism.
![Page 20: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/20.jpg)
Gene finding using codon frequency
[Figure: the codon frequencies of the input sequence are compared with the frequency in coding regions and the frequency in non-coding regions to classify it as a coding region or a non-coding region]
![Page 21: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/21.jpg)
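Example

Codon position        A    C    T    G
1                    28%  33%  18%  21%
2                    32%  16%  21%  32%
3                    33%  15%  14%  38%
frequency in genome  31%  18%  19%  31%

Assume: the bases making up a codon are independent

$$\frac{P(x \mid \text{in coding})}{P(x \mid \text{random})} = \prod_i \frac{P(A_i \text{ at } i\text{th position})}{P(A_i \text{ in the sequence})}$$

where $A_i$ is the $i$th base of $x$ and its position is counted within its codon (1, 2, or 3).

Score of AAAGAT:

$$\frac{0.28 \times 0.32 \times 0.33 \times 0.21 \times 0.32 \times 0.14}{0.31 \times 0.31 \times 0.31 \times 0.31 \times 0.31 \times 0.19}$$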
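The ratio can be evaluated directly from the table. The sketch below takes every factor from the table as printed, so A at codon position 2 contributes 0.32; the function name `coding_score` is made up for this example.

```python
# Base frequencies by codon position, from the example table.
POSITION_FREQ = {
    1: {"A": 0.28, "C": 0.33, "T": 0.18, "G": 0.21},
    2: {"A": 0.32, "C": 0.16, "T": 0.21, "G": 0.32},
    3: {"A": 0.33, "C": 0.15, "T": 0.14, "G": 0.38},
}
# Overall base frequencies in the genome, from the last row of the table.
GENOME_FREQ = {"A": 0.31, "C": 0.18, "T": 0.19, "G": 0.31}


def coding_score(seq):
    """Likelihood ratio P(seq | coding) / P(seq | random), bases independent.

    The sequence is assumed to start at codon position 1.
    """
    score = 1.0
    for i, base in enumerate(seq.upper()):
        codon_position = i % 3 + 1
        score *= POSITION_FREQ[codon_position][base] / GENOME_FREQ[base]
    return score


print(coding_score("AAAGAT"))  # about 0.51
```

A ratio below 1 means the sequence looks slightly more like background than like coding sequence under this model.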
![Page 22: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/22.jpg)
Using codon frequency to find correct reading frame
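Consider sequence x1 x2 x3 x4 x5 x6 x7 x8 x9 …
where xi is a nucleotide, and let $p_{abc}$ denote the frequency of codon abc in coding regions. Let

$$p_1 = p_{x_1 x_2 x_3}\, p_{x_4 x_5 x_6} \cdots$$
$$p_2 = p_{x_2 x_3 x_4}\, p_{x_5 x_6 x_7} \cdots$$
$$p_3 = p_{x_3 x_4 x_5}\, p_{x_6 x_7 x_8} \cdots$$

then the probability that the ith reading frame is the coding frame is:

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

Algorithm:
• Slide a window along the sequence and compute $P_i$
• Plot the results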
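A sketch of the windowed computation, under a few assumptions that are not on the slide: codon frequencies are supplied as a dict, codons missing from it get a small floor value, products are computed in log space to avoid underflow, and every frame uses the same number of codons per window so that no frame is favoured by having fewer factors. All names and the toy frequency table are hypothetical.

```python
import math


def frame_probabilities(window, codon_freq, floor=1e-6):
    """Return [P1, P2, P3]: the probability of each reading frame in a window."""
    n_codons = (len(window) - 2) // 3  # same number of codons for every frame
    log_p = []
    for offset in range(3):
        total = 0.0
        for k in range(n_codons):
            codon = window[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    top = max(log_p)
    weights = [math.exp(value - top) for value in log_p]  # stable normalisation
    norm = sum(weights)
    return [w / norm for w in weights]


def sliding_frame_probabilities(seq, codon_freq, window=30, step=3):
    """Compute frame probabilities for each window; step=3 keeps the phase."""
    seq = seq.upper()
    return [
        (start, frame_probabilities(seq[start:start + window], codon_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy example with made-up codon frequencies.
toy_freq = {"GCT": 0.5, "GAA": 0.4}
for start, probs in sliding_frame_probabilities(
        "GCTGAAGCTGAAGCTGAAGCT", toy_freq, window=12):
    print(start, [round(p, 3) for p in probs])
```

The resulting list of (window start, [P1, P2, P3]) pairs is what would be plotted along the sequence.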
![Page 23: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/23.jpg)
Eukaryotic gene finding
• On average, a vertebrate gene is about 30 kb long
• The coding region takes up about 1 kb
• Exon sizes vary from double-digit numbers of bases to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp, but both can be much longer.
![Page 24: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/24.jpg)
Exons and Introns
• In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)
• This makes computational gene prediction in eukaryotes even more difficult
• Prokaryotes don’t have introns - Genes in prokaryotes are continuous
![Page 25: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/25.jpg)
Gene Structure
![Page 26: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/26.jpg)
Gene structure in eukaryotes
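[Figure: eukaryotic gene layout from 5’ to 3’: promoter, transcription start site, untranslated region, start codon in the initial exon, further exons separated by introns, stop codon in the final exon, untranslated region, transcription stop site. Donor (GT) and acceptor (AG) sites mark the intron boundaries. The transcribed region runs from the transcription start site to the transcription stop site.]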
![Page 27: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/27.jpg)
Central Dogma and Splicing
[Figure: a gene with exon1, intron1, exon2, intron2, exon3 is transcribed; splicing removes the introns; the joined exons are translated]
exon = coding
intron = non-coding
![Page 28: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/28.jpg)
Splicing Signals
Exons are interspersed with introns; an intron typically begins with GT (donor site) and ends with AG (acceptor site)
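A small sketch of what the GT/AG signal does and does not give you: scanning for every GT...AG pair produces many candidate introns, and only one choice yields the right spliced product. Both helper functions and the toy sequence are invented for this illustration.

```python
def candidate_introns(seq, min_len=4):
    """Return (start, end) pairs where seq[start:end] begins with GT and ends with AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


def splice(seq, introns):
    """Remove the given non-overlapping (start, end) introns and join the exons."""
    exons, previous_end = [], 0
    for start, end in sorted(introns):
        exons.append(seq[previous_end:start])
        previous_end = end
    exons.append(seq[previous_end:])
    return "".join(exons)


# Toy gene: exon ATGGCC, intron GTAAGTTTCAG, exon GCTTAA.
gene = "ATGGCC" + "GTAAGTTTCAG" + "GCTTAA"
print(candidate_introns(gene))   # [(6, 11), (6, 17), (10, 17)]
print(splice(gene, [(6, 17)]))   # ATGGCCGCTTAA
```

Even this 23-base example has three GT...AG candidates, which is why the dinucleotides alone are not enough and the surrounding positions are also scored.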
![Page 29: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/29.jpg)
Splice site detection
Donor site (5’ to 3’), base frequency (%) by position

Position  -8  …  -2  -1   0   1   2  …  17
A         26  …  60   9   0   1  54  …  21
C         26  …  15   5   0   1   2  …  27
G         25  …  12  78  99   0  41  …  27
T         23  …  13   8   1  98   3  …  25
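One common way to use such a table is as a position weight matrix: score a candidate site by summing log-odds of the observed base at each position against a background. The sketch below uses only the five columns actually shown (positions -2 to 2) and leaves the elided columns out; the uniform 0.25 background and the pseudocount of 0.5 are assumptions made here, not values from the slide.

```python
import math

# Percentages from the donor-site table, for the columns that are shown.
DONOR_PCT = {
    -2: {"A": 60, "C": 15, "G": 12, "T": 13},
    -1: {"A": 9, "C": 5, "G": 78, "T": 8},
    0: {"A": 0, "C": 0, "G": 99, "T": 1},
    1: {"A": 1, "C": 1, "G": 0, "T": 98},
    2: {"A": 54, "C": 2, "G": 41, "T": 3},
}
POSITIONS = sorted(DONOR_PCT)


def donor_log_odds(site, background=0.25, pseudo=0.5):
    """Log2-odds score of a 5-base candidate covering positions -2..2.

    A pseudocount keeps zero-percent entries from giving minus infinity.
    """
    if len(site) != len(POSITIONS):
        raise ValueError("site must have exactly %d bases" % len(POSITIONS))
    score = 0.0
    for position, base in zip(POSITIONS, site.upper()):
        p = (DONOR_PCT[position][base] + pseudo) / (100 + 4 * pseudo)
        score += math.log2(p / background)
    return score


print(donor_log_odds("AGGTA"))  # consensus-like site, positive score
print(donor_log_odds("CCCCC"))  # unlikely site, negative score
```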
![Page 30: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/30.jpg)
Consensus splice sites
Donor: 7.9 bits
Acceptor: 9.4 bits
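Figures in bits like these come from information content: for DNA, a single alignment position carries 2 minus its Shannon entropy, and a site's total is the sum over its positions. The sketch below computes the per-position value only; it applies no small-sample correction and does not attempt to reproduce the 7.9 and 9.4 bit totals, since the full frequency tables are not given here. The function name is illustrative.

```python
import math


def information_bits(column):
    """Information content in bits of one DNA alignment column: 2 - entropy.

    column maps each base to a count or percentage; zero entries contribute
    nothing to the entropy.
    """
    total = sum(column.values())
    entropy = 0.0
    for count in column.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


# A nearly invariant position versus a completely uninformative one.
print(information_bits({"A": 0, "C": 0, "G": 99, "T": 1}))    # close to 2 bits
print(information_bits({"A": 25, "C": 25, "G": 25, "T": 25}))  # 0 bits
```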
![Page 31: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/31.jpg)
Promoters
• Promoters are DNA segments upstream of transcripts that initiate transcription
• Promoter attracts RNA Polymerase to the transcription start site
[Figure: 5’ Promoter 3’]
![Page 32: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/32.jpg)
• Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).
• Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.
Two Approaches to Eukaryotic Gene Prediction
![Page 33: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/33.jpg)
Ribosomal Binding Site
![Page 34: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/34.jpg)
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)
Donor and Acceptor Sites: Motif Logos
![Page 35: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/35.jpg)
Similarity-based gene finding
• Alignment of
– Genomic sequence and (assembled) EST sequences
– Genomic sequence and known (similar) protein sequences
– Two or more similar genomic sequences
![Page 36: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/36.jpg)
Expressed Sequence Tags
1. Start from a cell or tissue
2. Isolate mRNA and reverse transcribe into cDNA
3. Clone cDNA into a vector to make a cDNA library
4. Pick a clone and sequence the 5’ and 3’ ends of the cDNA insert; each read is an EST
5. Submit to dbEST
![Page 38: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/38.jpg)
Splicing Sequence Alignment
Potential splicing sites
![Page 39: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/39.jpg)
Using Similarities to Find the Exon Structure
• Human EST (mRNA) sequence is aligned to different locations in the human genome
• Find the “best” path to reveal the exon structure of human gene
[Figure: alignment grid with the EST sequence on one axis and the human genome on the other]
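The slide does not say how the "best" path is found. One simple way to formalise it, shown here purely as a sketch, is exon chaining: treat each local EST-to-genome match as a scored block and use dynamic programming to pick the highest-scoring set of blocks that appear in the same order, without overlap, in both sequences. The block format, scores, and function name are all assumptions made for this example.

```python
def best_chain(blocks):
    """Find the best-scoring collinear chain of alignment blocks.

    blocks: list of (est_start, est_end, genome_start, genome_end, score),
    with half-open coordinates. Two blocks can be chained when the first
    ends before the second begins in both the EST and the genome.
    Returns (total_score, chain).
    """
    order = sorted(range(len(blocks)), key=lambda i: blocks[i][2])
    best, previous = {}, {}
    for rank, i in enumerate(order):
        est_start, _, genome_start, _, score = blocks[i]
        best[i], previous[i] = score, None
        for j in order[:rank]:
            _, est_end_j, _, genome_end_j, _ = blocks[j]
            compatible = est_end_j <= est_start and genome_end_j <= genome_start
            if compatible and best[j] + score > best[i]:
                best[i], previous[i] = best[j] + score, j
    if not best:
        return 0, []
    i = max(best, key=best.get)
    total, chain = best[i], []
    while i is not None:  # trace back from the best final block
        chain.append(blocks[i])
        i = previous[i]
    return total, chain[::-1]


# Hypothetical blocks: three consistent exons and one conflicting match.
blocks = [
    (0, 50, 1000, 1050, 50),
    (50, 120, 3000, 3070, 70),
    (40, 100, 2000, 2060, 55),
    (120, 150, 5000, 5030, 30),
]
total, chain = best_chain(blocks)
print(total)
for block in chain:
    print(block)
```

The genome gaps between consecutive blocks in the chain are the inferred introns; a fuller treatment would also check them for donor and acceptor signals.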
![Page 40: Genome Annotation Haixu Tang School of Informatics](https://reader036.vdocument.in/reader036/viewer/2022062301/56649f4d5503460f94c6dd0b/html5/thumbnails/40.jpg)
An annotated gene in human genome