Carsten Friis, Center for Biological Sequence Analysis, Technical University of Denmark. Genome Annotation and Gene-finding. 27621 Prokaryotic Gene Discovery, Metagenomics and Pangenomics.


Post on 18-Aug-2020


Page 1: Genome Annotation - DTU Health Tech · Genome Annotation and Gene-finding 27621 Prokaryotic Gene Discovery, Metagenomics and Pangenomics. Genome Sequencing. Annotation is the process

Carsten Friis
Center for Biological Sequence Analysis
Technical University of Denmark

Genome Annotation and Gene-finding

27621 Prokaryotic Gene Discovery, Metagenomics and Pangenomics


Genome Sequencing


Annotation is the process of assigning biological meaning to segments of genomic DNA...


Outline

Some ‘trivial’ questions
− Why gene prediction?
− The problem of faster genomic sequencing
− What is a Gene?

The anatomy of a gene

Manual gene finding by you! (exercise)

Gene finder methods and performance
− NetGene2
− EasyGene


Why Look for Genes?

Genes are where the action is. They:

− Explain basic biological functions
    Protein kinases, cyclins, etc.
− Explain medical conditions
    Symptoms linked to certain genes
− Can be used for treatment of disease
− Contain commercial value
    As enzymes (lipases, amylases, ‘washing detergent’)
    As drug targets (ion channels, receptors)
    As therapeutic factors


Nobel Prizes & Genes

The history of gene research has given us several Nobel Prize winners:

− Richard J. Roberts and Phillip A. Sharp for their discoveries of split genes;
− Barbara McClintock for her discovery of mobile genetic elements;
− J. Michael Bishop and Harold E. Varmus for their discovery of the cellular origin of retroviral oncogenes;
− Francis Crick & James Watson for the DNA double helix structure.
− ...


Now... sequencing your entire genome in two months

It took more than 10 years and $4 bn to sequence the three billion letters of the human genome, and the result was a composite made from dozens of different individuals.

454 Life Sciences makes an innovative DNA sequencing machine, which proved capable of decoding Dr. Watson’s genome in 2 months at a cost of less than $1 million. A copy of his genome, recorded on a pair of DVDs, was presented to Dr. Watson in a ceremony in Houston (2007 May 31).

More than 500 organisms sequenced to date


We have the Genome Sequence... now what?

Are there still novel genes to be discovered?
– Yes!

What is the challenge?
– We don’t know how many genes there are!
– We don’t know where they are!
– We don’t know what they do!


The cure lies in high-quality automated gene finders...


What is a gene?

“Most problems have either many answers or no answer. Only a few problems have a single answer.”
– Edmund C. Berkeley

Helen Pearson; Nature 441, 398-401, May 2006


What is a gene?

Genes are regions of DNA sequence which hold information required by the cell to generate proteins

Proteins are folded chains of amino acids whose shape and electro-chemical characteristics determine their function in the cell


Gene definition

A number of genes with distinct structures have been discovered:

a) RNA genes, which encode RNAs rather than proteins;

b) Pseudogenes, which were considered nonfunctional copies of genes;

c) Nested genes, located inside introns of other genes;

d) Overlapping genes, where parts of two genes overlap; and

e) Assembled genes, where several sections can reassemble into other genes.


Identification of putative non-coding RNA genes in the Burkholderia cenocepacia J2315 genome

Tom Coenye, Pavel Drevinek, Eshwar Mahenthiralingam, Shiraz Ali Shah, Ryan T. Gill, Peter Vandamme and David W. Ussery

ABSTRACT Non-coding RNA (ncRNA) genes are not involved in the production of mRNA and proteins, but produce transcripts that function directly as structural or regulatory RNAs. In the present study, we evaluated the presence of ncRNA genes in the genome of Burkholderia cenocepacia J2315. We used an approach in which we combined a comparative genomics (alignment-based) approach and the use of secondary structure information for the identification of putative ncRNAs genes. 213 putative ncRNA genes were identified in the B. cenocepacia J2315 genome and we could confirm upregulated expression of four of these by microarray analysis. Most of the ncRNA gene transcripts have a marked secondary structure that may allow interaction with other molecules. Several B. cenocepacia J2315 ncRNAs seem related to previously characterised ncRNAs involved in regulation of various cellular processes, while the function of many others remains unknown. The presence of a large number of ncRNA genes in this organism may help to explain its complexity, phenotypic variability and ability to survive in a remarkably wide range of environments.


Finding ncRNAs


Gene definition

The origins of “gene”

The term was coined by the Danish geneticist Wilhelm Johannsen in 1909 as a calculating unit. At that time it was only an abstract concept.

In the early 1920s, H. J. Muller predicted that genes carry genetic information and can replicate themselves as real material entities (Muller, 1922; Muller, 1947).


Gene definition

Loose definition of a gene:

“A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”


Structure, function and regulation of genes are all extremely complicated, more so than we suspected, and always beyond our imagination.


The Intron


Manual gene finding

Can U spot Spot?


Manual gene finding


DNA SequenceAAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCA
CTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG


Outline

Some ‘trivial’ questions
− Why gene prediction?
− The problem of faster genomic sequencing
− What is a Gene?

The anatomy of a gene

Manual gene finding by you! (exercise)

Gene finder methods and performance
− NetGene2
− EasyGene


Manual gene finding

Start codon: ATG
Stop codons: TAA, TAG, TGA

>example (950 bp)

  1 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
 51 GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
101 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
151 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
201 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
251 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
301 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
351 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
401 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
451 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
501 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
551 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
601 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
651 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
701 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
751 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
801 GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
851 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
901 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

Find, mark and count all ATGs

How many ATGs do you expect?


Manual gene finding

Start codon: ATG

p(ATG) = p(A) x p(T) x p(G) ~ ¼ x ¼ x ¼ = 1/64 (in 950 bp: 950/64 = 14.8 ATGs expected)
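The expected-versus-observed comparison can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the names `EXAMPLE`, `count_motif` and `expected_count` are made up for illustration, and the expectation uses the slide's simplifying assumption of equal, independent base frequencies. Only the given strand is scanned.

```python
# The 950 bp example sequence from the exercise (whitespace is stripped below).
EXAMPLE = "".join("""
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
""".split())


def count_motif(seq, motif="ATG"):
    """Count every occurrence of motif in seq, in any reading frame."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


def expected_count(length, motif_len=3, p_base=0.25):
    """Expected motif count assuming equal, independent base frequencies.

    Uses the slide's approximation length * (1/4)^3, i.e. 950/64 for ATG.
    """
    return length * p_base ** motif_len


expected = expected_count(len(EXAMPLE))
observed = count_motif(EXAMPLE)
print(f"length={len(EXAMPLE)} expected={expected:.1f} observed={observed}")
```

Run on the example, this should reproduce the figures quoted in the deck: about 14.8 ATGs expected and 17 observed. One of the 17 spans the line break between positions 99 and 101 of the listing, which is easy to miss when counting by hand.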

Observed in the 950 bp example: 17 ATGs (corrected on the slides from a first count of 16), against 14.8 expected.


Manual gene finding

Start codon: ATG
Stop codons: TAA, TAG, TGA

Mark codons until first in-frame stop codon

ORF of 105 bp => a ‘protein’ of 35 aa
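The "mark codons until the first in-frame stop codon" step can be sketched as follows. This is an illustrative helper rather than anything from the course material: `first_orf` and `HEAD` are hypothetical names, `HEAD` holds only the first 200 bp of the example, and the reported length excludes the stop codon, which matches the slide's 105 bp / 35 aa figure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(seq, start="ATG"):
    """Read in frame from the first start codon to the first stop codon.

    Returns (start_index, coding_sequence) with a 0-based index and without
    the stop codon, or None if there is no start codon or no in-frame stop.
    """
    begin = seq.find(start)
    if begin == -1:
        return None
    for i in range(begin, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return begin, seq[begin:i]
    return None


# First 200 bp of the 950 bp example sequence.
HEAD = "".join("""
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
""".split())

begin, orf = first_orf(HEAD)
print(f"start at position {begin + 1}, ORF of {len(orf)} bp, {len(orf) // 3} codons")
```

Starting from the first ATG (position 48), the reading frame runs for 35 codons before hitting TAA, which is the 105 bp ORF shown on the slide.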


Take home messages 1/2

We have the book of life, but it is difficult to read

Amount of raw sequence is astronomical and growing

rRNA, tRNA genes, etc. are genes too

Many distinct gene structures, and far from every open reading frame is a gene
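A quick back-of-the-envelope calculation shows why short open reading frames are weak evidence on their own. The sketch below assumes random sequence with equal base frequencies, so that 3 of the 64 codons are stops; `p_orf_at_least` is a hypothetical helper name, and real genomes deviate from this model.

```python
def p_orf_at_least(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


for n in (35, 100, 300):
    print(f"{n:>3} codons without a stop: p = {p_orf_at_least(n):.2e}")
```

Under these assumptions a 35-codon stretch like the one in the exercise stays open by chance a bit under one time in five, while a 300-codon stretch is very unlikely to arise without selection.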


Outline

Some ‘trivial’ questions
− Why gene prediction?
− The problem of faster genomic sequencing
− What is a Gene?

The anatomy of a gene

Manual gene finding by you! (exercise)

Gene finder methods and performance
− NetGene2
− EasyGene


Gene Prediction

Prediction relies on integration of several gene features

Each gene feature carries a low signal
− E.g. ATG, donor/acceptor splice sites
− Combinatorial explosion
− Some are mutually exclusive (e.g. reading frame)


Gene Prediction

Codon frequency/bias
– Organism dependent
– Hexamer statistics

Transcriptional
– Promoters/enhancers

Exon/introns
– Length distributions
– ORFs

Splicing
– Donor/acceptor sites
– Branchpoints

Translational
– Start codon (ATG) context
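Two of the features listed above, codon frequency and hexamer statistics, are simple to tabulate. The following is a minimal sketch with hypothetical function names and a made-up toy sequence; real gene finders estimate such tables from many known genes of the target organism.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def hexamer_counts(seq):
    """Counts of all overlapping 6-mers (hexamers) in seq."""
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))


toy_cds = "ATGGCTGCTGCAAAATAA"                # toy example, not from the slides
for codon, freq in sorted(codon_frequencies(toy_cds).items()):
    print(codon, round(freq, 3))
print(hexamer_counts(toy_cds).most_common(3))
```

Comparing such tables between known coding and non-coding regions is what makes them usable as a (weak) coding signal.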


Gene finders of the past...

GeneMark (Borodovsky & McIninch 1993)

Ecoparse (Krogh et al 1994)

GeneMark.hmm (Lukashin & Borodovsky 1998)

Glimmer (Salzberg et al 1998, Delcher et al 1999)

Orpheus (Frishman et al 1998)

Frame-by-frame (Shmatkov et al 1999)

GeneMark.hmm/S (Besemer et al 2001)


Since then...

GENEMARK.2, Ecgene, AUGUSTUS.7, EXONHUNTER.3, DOGFISH-CE.4, Genscan, GENEZILLA.2, Acembly, TWINSCAN-MARS.4, FGENESH++.1, SAGA.4, Geneid, SGP, ACEVIEW.3, AUGUSTUS.2, SPIDA.7, AUGUSTUS.4, N-SCAN.4, N-SCAN.5, Twinscan, AUGUSTUS.1, AUGUSTUS.3, EXOGEAN.3, PAIRAGON+N-SCAN.1, PAIRAGON+N-SCAN.3, JIGSAW.1, ENSEMBL.3, DOGFISH-CE.7


Gene Finders are often organism specific


Gene Prediction

Ab initio Gene Finders

“Integrated” methods
− Predict genes in context (Hidden Markov Model based)
    “Grammar” of genes: certain elements in specific order are required
− HMMgene www.cbs.dtu.dk/services/HMMgene/
− GenScan http://genes.mit.edu/GENSCAN.html

“Isolated” methods
− Predict individual features (Neural Network based)
    E.g. splice sites, coding regions
− NetGene2 www.cbs.dtu.dk/services/NetGene2/
− GRAIL http://compbio.ornl.gov/Grail-1.3/


Artificial Neural Network

[Figure: small neural network with pyrimidine (Pyr) inputs, connection weights (+1, –2) and a true/false (T/F) output]


Hidden Markov Model


Gene Prediction

“Isolated” methods (e.g. NN):

HAPPY EUGENE A WAS GUY FINDER

“Integrated” methods (e.g. HMM):

EUGENE FINDER WAS A HAPPY GUY
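To make the "integrated" idea concrete, here is a toy hidden Markov model decoded with the Viterbi algorithm. It is a sketch only: the two states, all probabilities and the test sequence are invented for illustration and have nothing to do with the models inside EasyGene or HMMgene. The point is that each base is labelled in the context of its neighbours, because switching state carries a cost.

```python
import math

# Hypothetical two-state model: N = non-coding (AT-rich), C = coding (GC-rich).
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


seq = "ATATATATAT" + "GCGCGCGCGCGCGCGCGCGC" + "ATATATATAT"
print(seq)
print(viterbi(seq))
```

With these made-up parameters a long GC-rich stretch is labelled C as a block, whereas a short one would stay N because two state switches cost more than the few better-fitting emissions gain.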


EasyGene – Bacterial Gene Finder

Courtesy of T.S. Larsen & A. Krogh 2004


ORF distributions


Performance landscape: E. coli


Performance landscape: short ORFs – E. coli


Annotation remains a problem...

Courtesy of M. Skovgaard et al 2001


Annotation remains a problem...

EasyGene as of 2009


Take home messages 2/2

Genes may be predicted by computer programs

Most gene prediction programs only predict protein-coding genes

‘Unusual’ genes are difficult to predict:
− Alternative/multiple start codons
− Non-native genes
− Lowly expressed
− Introns, alternatively spliced

HMM-based gene prediction programs are suitable for “Gene Grammar”

Prediction methods are not perfect!

No single method is always best


Take home message...

Use gene finders with caution!


...and Coffee Break!