Post on 18-Aug-2020

Carsten Friis

Center for Biological Sequence Analysis

Technical University of Denmark

Genome Annotation and Gene-finding

27621 Prokaryotic Gene Discovery, Metagenomics and Pangenomics

Genome Sequencing

Annotation is the process of assigning biological meaning to segments of genomic DNA...

Outline

Some ‘trivial’ questions

− Why gene prediction?

− The problem of faster genomic sequencing

− What is a Gene?

The anatomy of a gene

Manual gene finding by you! (exercise)

Gene finder methods and performance

− NetGene2

− EasyGene

Why Look for Genes?

Genes are where the action is:

− Explain Basic Biological Functions

Protein kinases, Cyclins, etc.

− Explain Medical Conditions

Symptoms linked to certain genes

− Be Used for Treatment of Disease

− Contain commercial value

As enzymes (Lipases, Amylases, ’washing detergent’)

As drug targets (Ion channels, Receptors)

As therapeutic factors

Nobel Prizes & Genes

Research on genes and their analysis has produced several Nobel Prize winners:

− Richard J. Roberts and Phillip A. Sharp for their discoveries of split genes;

− Barbara McClintock for her discovery of mobile genetic elements;

− J. Michael Bishop and Harold E. Varmus for their discovery of the cellular origin of retroviral oncogenes;

− Francis Crick & James Watson for the DNA double helix structure.

− ….

Now... sequencing your entire genome in two months

It took more than 10 years and $4 bn to sequence the three billion letters of the human genome, and the result was a composite made from dozens of different individuals.

454 Life Sciences makes an innovative DNA sequencing machine, which proved capable of decoding Dr. Watson’s genome in 2 months at a cost of less than $1 million. A copy of his genome, recorded on a pair of DVDs, was presented to Dr. Watson in a ceremony in Houston (2007 May 31).

More than 500 organisms sequenced to date

We have the Genome Sequence... now what?

Are there still novel genes to be discovered?

– Yes!

What is the challenge?

– We don’t know how many genes there are!

– We don’t know where they are!

– We don’t know what they do!

The cure lies in high-quality automated gene finders...

What is a gene?

“Most problems have either many answers or no answer. Only a few problems have a single answer.” – Edmund C. Berkeley

Helen Pearson; Nature 441, 398-401, May 2006

What is a gene?

Genes are regions of DNA sequence which hold information required by the cell to generate proteins

Proteins are folded chains of amino acids whose shape and electro-chemical characteristics determine their function in the cell

Gene definition

A number of genes with distinct structures were discovered:

a) RNA genes, which encode RNAs rather than proteins;

b) Pseudogenes, which were considered nonfunctional copies of genes;

c) Nested genes, located inside introns of other genes;

d) Overlapping genes, where parts of two genes overlap; and

e) Assembled genes, where several sections can reassemble into other genes.

Identification of putative non-coding RNA genes in the Burkholderia cenocepacia J2315 genome

Tom Coenye, Pavel Drevinek, Eshwar Mahenthiralingam, Shiraz Ali Shah, Ryan T. Gill, Peter Vandamme and David W. Ussery

ABSTRACT Non-coding RNA (ncRNA) genes are not involved in the production of mRNA and proteins, but produce transcripts that function directly as structural or regulatory RNAs. In the present study, we evaluated the presence of ncRNA genes in the genome of Burkholderia cenocepacia J2315. We used an approach in which we combined a comparative genomics (alignment-based) approach and the use of secondary structure information for the identification of putative ncRNA genes. 213 putative ncRNA genes were identified in the B. cenocepacia J2315 genome and we could confirm upregulated expression of four of these by microarray analysis. Most of the ncRNA gene transcripts have a marked secondary structure that may allow interaction with other molecules. Several B. cenocepacia J2315 ncRNAs seem related to previously characterised ncRNAs involved in regulation of various cellular processes, while the function of many others remains unknown. The presence of a large number of ncRNA genes in this organism may help to explain its complexity, phenotypic variability and ability to survive in a remarkably wide range of environments.

Finding ncRNAs

Gene definition

The origins of “Gene"

It was coined by the Danish geneticist Wilhelm Johannsen in 1909 as a calculating unit. At that time it was only an abstract concept.

In the early 1920s, H. J. Muller predicted that genes carry genetic information and can replicate themselves as real material entities (Muller, 1922; Muller, 1947).

Gene definition

Loose definition of a gene:

“A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”

Structure, function and regulation of genes are all extremely complicated, more so than we suspected, and always beyond our imagination.

The Intron

Manual gene finding

Can U spot Spot?

Manual gene finding

DNA SequenceAAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCA
CTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG

Start codon: ATG

Stop codons: TAA, TAG, TGA

>example (950 bp)

1 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
51 GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
101 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
151 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
201 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
251 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
301 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
351 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
401 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
451 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
501 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
551 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
601 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
651 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
701 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
751 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
801 GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
851 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
901 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

Manual gene finding

Find, mark and count all ATGs

How many ATGs do you expect?

Start codon: ATG

p(ATG) = p(A) x p(T) x p(G) ~ ¼ x ¼ x ¼ = 1/64 (in 950 bp: about 14.8 ATGs expected)

Manual gene finding

ATGs in the 950 bp example: about 14.8 expected; observed = 17
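The expected and observed ATG counts can be checked with a few lines of Python. This is a minimal sketch, not part of the original exercise: the function names `count_motif` and `expected_count` are made up for illustration, only the forward strand is scanned, and the expectation assumes independent bases with probability ¼ each, as on the slide.

```python
import re

# The 950 bp example from the slide; whitespace is stripped after pasting.
EXAMPLE = "".join("""
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
""".split())


def count_motif(seq: str, motif: str = "ATG") -> int:
    """Count occurrences of motif on the given strand (lookahead allows overlaps)."""
    return len(re.findall(f"(?={re.escape(motif)})", seq))


def expected_count(length: int, motif_len: int = 3, p_base: float = 0.25) -> float:
    """Expected motif count if every base is independent with probability p_base."""
    windows = length - motif_len + 1  # number of positions where the motif can start
    return windows * p_base ** motif_len


if __name__ == "__main__":
    print("length:  ", len(EXAMPLE))
    print("observed:", count_motif(EXAMPLE))
    print("expected:", round(expected_count(len(EXAMPLE)), 1))
```

Note that ATGs spanning a line break in the printed sequence (for example at positions 99-101) are easy to miss by eye, which is one reason to count on the joined string.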

Manual gene finding

Start codon: ATG

Stop codons: TAA, TAG, TGA

Mark codons until first in-frame stop codon

Manual gene finding

ORF of 105 bps => A ‘protein’ of 35 aa
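The "mark codons until the first in-frame stop codon" step can be sketched as follows. This is an illustrative sketch only: `orf_from_first_start` is a hypothetical helper, it looks at the forward strand and at the first ATG only, and it uses just the first 200 bp of the example sequence to keep the block short.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# First 200 bp of the 950 bp example; whitespace is stripped after pasting.
PREFIX = "".join("""
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
""".split())


def orf_from_first_start(seq: str, start: str = "ATG"):
    """Return (start_index, stop_index), both 0-based, for the ORF that begins
    at the first start codon and ends at the first in-frame stop codon.
    Returns None if there is no start codon or no in-frame stop."""
    begin = seq.find(start)
    if begin == -1:
        return None
    # Step through the sequence codon by codon, staying in the frame of the start.
    for pos in range(begin, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return begin, pos
    return None


if __name__ == "__main__":
    begin, stop = orf_from_first_start(PREFIX)
    coding_bp = stop - begin  # bases from the start codon up to, not including, the stop
    print("start at position", begin + 1)   # 1-based, as in the slide numbering
    print("coding length:", coding_bp, "bp =", coding_bp // 3, "codons")
```

Starting from the first ATG (position 48), the scan reaches a TAA after 105 bp, i.e. 35 codons, which is consistent with the ORF size quoted on the slide.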

Take home messages 1/2

We have the book of life, but it is difficult to read

Amount of raw sequence is astronomical and growing

rRNA, tRNA genes, etc. are genes too

Many distinct gene structures, and far from every open reading frame is a gene
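Why an open reading frame is not automatically a gene can be illustrated with a back-of-the-envelope calculation. The sketch below assumes all 64 codons are equally likely and independent (the same simplification as the ¼ x ¼ x ¼ estimate for ATG); `prob_no_stop` is a name chosen here for illustration.

```python
def prob_no_stop(n_codons: int, n_stop: int = 3, n_total: int = 64) -> float:
    """Probability that n_codons consecutive random codons contain no stop codon,
    assuming every codon is drawn uniformly and independently."""
    return ((n_total - n_stop) / n_total) ** n_codons


if __name__ == "__main__":
    for n in (35, 100, 300):
        print(f"{n:4d} codons without a stop by chance: {prob_no_stop(n):.4f}")
```

Under these assumptions a stretch of 35 codons stays free of stop codons by chance a bit less than one time in five, so short ORFs are expected in random sequence; only much longer ORFs become unlikely to arise by chance.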

Gene Prediction

Prediction relies on integration of several gene features

Each gene feature carries a low signal

− E.g. ATG, Donor/acceptor splice sites

− Combinatorial explosion

− Some are mutually exclusive (e.g. reading frame)

Gene Prediction

Codon frequency/bias

– Organism dependent

– Hexamer statistics

Transcriptional

– Promoters/enhancers

Exon/introns

– Length distributions

– ORFs

Splicing

– Donor/acceptor sites

– Branchpoints

Translational

– Start codon (ATG) context
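Two of the features listed above, codon frequency and hexamer statistics, are simple counts over a sequence. The following is a minimal sketch with hypothetical helper names and a made-up toy sequence; real gene finders estimate such statistics from large, organism-specific training sets.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each codon, reading cds in frame from position 0.
    Trailing bases that do not fill a whole codon are ignored."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def hexamer_counts(seq: str) -> Counter:
    """Counts of all overlapping 6-mers in seq."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))


if __name__ == "__main__":
    toy_cds = "ATGGCTGCTGCATAA"  # invented toy sequence: ATG GCT GCT GCA TAA
    print(codon_frequencies(toy_cds))
    print(hexamer_counts(toy_cds).most_common(3))
```

Comparing such frequency tables between known coding and non-coding regions of one organism is what makes the signal "organism dependent".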

Gene finders of the past...

GeneMark (Borodovsky & McIninch 1993)

Ecoparse (Krogh et al 1994)

GeneMark.hmm (Lukashin & Borodovsky 1998)

Glimmer (Salzberg et al 1998, Delcher et al 1999)

Orpheus (Frishman et al 1998)

Frame-by-frame (Shmatkov et al 1999)

GeneMark.hmm/S (Besemer et al 2001)

Since then...

GENEMARK.2, Ecgene, AUGUSTUS.7, EXONHUNTER.3, DOGFISH-CE.4, Genscan, GENEZILLA.2, Acembly, TWINSCAN-MARS.4, FGENESH++.1, SAGA.4, Geneid, SGP, ACEVIEW.3, AUGUSTUS.2, SPIDA.7, AUGUSTUS.4, N-SCAN.4, N-SCAN.5, Twinscan, AUGUSTUS.1, AUGUSTUS.3, EXOGEAN.3, PAIRAGON+N-SCAN.1, PAIRAGON+N-SCAN.3, JIGSAW.1, ENSEMBL.3, DOGFISH-CE.7

Gene Finders are often organism specific

Gene Prediction

Ab initio Gene Finders

”Integrated” methods

− Predict genes in context (Hidden Markov Model based)

− ”Grammar” of genes: certain elements in specific order are required

− HMMgene www.cbs.dtu.dk/services/HMMgene/

− GenScan http://genes.mit.edu/GENSCAN.html

”Isolated” methods

− Predict individual features (Neural Network based), e.g. splice sites, coding regions

− NetGene2 www.cbs.dtu.dk/services/NetGene2/

− GRAIL http://compbio.ornl.gov/Grail-1.3/

Artificial Neural Network

[Figure labels only: pyrimidine (Pyr) inputs, weights +1 and –2, and a true/false (T/F) output; the diagram is not recoverable from this transcript]

Hidden Markov Model
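To give a feel for how an HMM labels a sequence "in context", here is a toy two-state model decoded with the Viterbi algorithm. Everything in it is an assumption made for illustration: the two states (N = non-coding, C = coding), the transition and emission probabilities, and the function name. It is not the model used by EasyGene, HMMgene or any other program mentioned here, which use far richer state structures.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)

# Invented toy parameters: states are "sticky", and the coding state is GC-rich.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}


def viterbi(seq: str) -> str:
    """Return the most probable state path for seq as a string of state labels."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state after the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    seq = "ATATATATAT" + "GCGCGCGCGC"
    print(seq)
    print(viterbi(seq))
```

Because switching state is penalised, a single G or C inside an AT-rich stretch does not flip the label; the model only changes state when the surrounding context supports it. That is the sense in which "integrated" methods predict features in context rather than one at a time.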

Gene Prediction

”Isolated” methods (e.g. NN):

HAPPYEUGENEAWASGUYFINDER

”Integrated” methods (e.g. HMM):

EUGENEFINDERWASAHAPPYGUY

EasyGene – Bacterial Gene Finder

Courtesy of T.S. Larsen & A. Krogh 2004

ORF distributions

Performance landscape – E. coli

Performance landscape, short ORFs – E. coli

Annotation remains a problem...

Courtesy of M. Skovgaard et al 2001

Annotation remains a problem...

EasyGene as of 2009

Take home messages 2/2

Genes may be predicted by computer programs

Most gene prediction programs only predict protein-coding genes

’Unusual’ genes are difficult to predict:

− Alternative/Multiple start codons

− Non-native genes

− Lowly expressed

− Introns, Alternatively Spliced

HMM-based gene prediction programs are well suited to modelling “Gene Grammar”

Prediction methods are not perfect!

No single method is always best

Take home message...

Use gene finders with caution!

...and Coffee Break!
