Transcript
Page 1: Biological Motivation Gene Finding

Biological Motivation Gene Finding

Rhys Price Jones

Anne R. Haake

Page 2: Biological Motivation Gene Finding

Gene Finding

Why do it?

• Find and annotate all the genes within the large volume of DNA sequence data
– How many genes in an organism? Homologies?

• Gain understanding of problems in basic science
– e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?

• Different emphasis in these goals has some effect on the design of computational approaches for gene finding.

Page 3: Biological Motivation Gene Finding

Gene Finding

• Dependent on good experimental data to build reliable predictive models

• Various aspects of gene structure/function provide information used in gene finding programs

Page 4: Biological Motivation Gene Finding

The Informatics View of Genes

• Genes are character strings embedded in much larger strings called the genome

• Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.

Page 5: Biological Motivation Gene Finding

Gene Finding

• Cells recognize genes from DNA sequence
– find genes via their bioprocesses

• Not so easy for us...

Page 6: Biological Motivation Gene Finding

CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...

Page 7: Biological Motivation Gene Finding

GCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...

Page 8: Biological Motivation Gene Finding

Types of Genes

• Protein coding
– most genes

• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)

Page 9: Biological Motivation Gene Finding

3 Major Categories of Information used in Gene Finding Programs

• Signals/features: a sequence pattern with functional significance, e.g. splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands

• Content/composition: statistical properties of coding vs. non-coding regions
– e.g. codon bias; length of ORFs in prokaryotes; GC content

• Similarity: compare DNA sequence to known sequences in databases
– Not only known proteins but also ESTs, cDNAs

Page 10: Biological Motivation Gene Finding

Looking for Protein Coding Genes

• Look for ORF (begins with start codon, ends with stop codon, no internal stops!)
– long (usually > 60-100 aa)
– If homologous to a “known” protein, more likely

• Look for basal signals
– Transcription, splicing, translation

• Look for regulatory signals
– Depends on organism
  • Prokaryotes vs eukaryotes
  • Vertebrates vs fungi
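The first criterion in the list above (start codon, stop codon, no internal stops, sufficient length) can be sketched as a simple forward-strand scan. This is a minimal illustration, not the method of any particular gene finder: the function name `find_orfs`, the default of ATG as the only start codon, and the 60-codon threshold are assumptions chosen for the example.

```python
# Illustrative sketch; names and defaults are assumptions, not from the slides.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, start_codons=("ATG",), min_codons=60):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the first base of the start codon, end is
    exclusive and includes the stop codon. min_codons counts the codons
    from the start codon up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i  # first start codon since the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after this stop
    return orfs


# Toy sequence (made up for illustration): ATG, three GCT codons, then TAA.
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_codons=4))
```

To cover the opposite strand as well, the same scan can be run on the reverse complement of the sequence.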

Page 11: Biological Motivation Gene Finding

Easier problem: Gene Finding in Bacterial Genomes

Why?

• Dense genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
– Complete genomes available for many

Page 12: Biological Motivation Gene Finding

What do Prokaryotic Genes look like?

[Figure: a prokaryotic gene drawn 5’ to 3’, showing the promoter region (maybe), the ribosome binding site (maybe), the start codon, the open reading frame, the stop codon, and the termination sequence (maybe).]

Page 13: Biological Motivation Gene Finding

Prokaryotic Gene Expression

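[Figure: a polycistronic operon (Promoter, Cistron 1, Cistron 2, ..., Cistron N, Terminator) is transcribed by RNA polymerase into a single mRNA (5’ to 3’). Ribosomes, tRNAs, and protein factors translate the mRNA into separate polypeptides 1, 2, ..., N, each with its own N and C terminus.]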

Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt

Page 14: Biological Motivation Gene Finding

Open Reading Frame (ORF)

• Any stretch of DNA that potentially encodes a protein

• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene

Page 15: Biological Motivation Gene Finding

Open Reading Frames

Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5’->3’ direction and a further three in the reverse direction on the opposite strand.

A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF)

A C G T A A C T G A C T A G G T G A A T

GTA ACT GAC TAG GTG AAT

CGT AAC TGA CTA GGT GAA
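The six reading frames described above can be enumerated with a few lines of Python. This is a minimal sketch; the function names (`reverse_complement`, `codons`, `reading_frames`) are illustrative and not from the slides, and the input is the 20-nucleotide example sequence shown above.

```python
# Illustrative sketch; function names are hypothetical, not from the slides.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at offset 0, 1, or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def reading_frames(seq):
    """Map (strand, offset) to its list of codons for all six frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = codons(s, offset)
    return frames


seq = "ACGTAACTGACTAGGTGAAT"  # the example sequence from the slide
for (strand, offset), triplets in reading_frames(seq).items():
    stops = [c for c in triplets if c in STOP_CODONS]
    print(strand, offset, " ".join(triplets), "stops:", stops or "none")
```

For this short sequence, each of the three forward frames contains at least one stop codon, so none of them is open over its full length.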

Page 16: Biological Motivation Gene Finding

ORFs as gene candidates

• A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)

• Most prokaryotic genes code for proteins that are 60 or more amino acids in length

• The probability that a random sequence of n codons contains no stop codon is (61/64)^n, assuming all 64 codons are equally likely

• When n is 50, there is a probability of about 91% that the random sequence contains a stop codon

• When n is 100, this probability exceeds 99%
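The stop-codon probabilities above can be checked directly. The sketch below assumes the simplest random model, in which all 64 codons are equally likely and 3 of them are stops; the function name `p_no_stop` is illustrative.

```python
# Illustrative sketch under a uniform random-codon model (an assumption).
def p_no_stop(n_codons, n_stop=3, n_total=64):
    """Probability that n_codons random codons contain no stop codon."""
    return ((n_total - n_stop) / n_total) ** n_codons


for n in (50, 100):
    p_stop = 1 - p_no_stop(n)
    print(f"n = {n}: P(at least one stop) = {p_stop:.3f}")
```

Under this model the chance of hitting a stop is roughly nine in ten at 50 codons and above 99% at 100 codons, which is why a long stop-free stretch is unlikely to arise by chance. Real genomes have biased base composition, so the true figures differ by organism.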

Page 17: Biological Motivation Gene Finding

Codon Bias

• Genetic code is degenerate
– Equivalent triplet codons code for the same amino acid

• Codon usage varies
– organism to organism
– gene to gene

• Biological basis
– Avoidance of codons similar to stop
– Preference for codons that correspond to abundant tRNAs within the organism

Page 18: Biological Motivation Gene Finding

Codon Bias Gene Differences

Amino acid   Codon   GAL4   ADH1
Gly          GGG     0.21   0
Gly          GGA     0.17   0
Gly          GGT     0.38   0.93
Gly          GGC     0.24   0.07

Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt

Page 19: Biological Motivation Gene Finding

Codon Bias Organism Differences

• Yeast genome: Arg specified by AGA 48% of the time (other five equivalent codons ~10% each)

• Fruit fly genome: Arg specified by CGC 33% of the time (other five ~13% each)

• Complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/
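Codon usage fractions like those quoted above can be computed by counting synonymous codons in an in-frame coding sequence. The following is a minimal sketch: the function name `codon_usage` and the toy input are made up for illustration, and only the glycine and arginine codon families of the standard genetic code are included.

```python
from collections import Counter

# Synonymous codon families for two amino acids (standard genetic code).
SYNONYMOUS = {
    "Gly": ["GGG", "GGA", "GGT", "GGC"],
    "Arg": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
}


def codon_usage(coding_seq, amino_acid):
    """Fraction of each synonymous codon used for amino_acid.

    coding_seq is assumed to be in frame, starting at the first base
    of a codon.
    """
    family = SYNONYMOUS[amino_acid]
    triplets = [coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3)]
    counts = Counter(c for c in triplets if c in family)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in family}


# Toy coding sequence (hypothetical): GGT GGT GGC AGA
print(codon_usage("GGTGGTGGCAGA", "Gly"))
```

Comparing such a table for a candidate ORF against the organism's known usage is one way a content-based gene finder can score coding potential.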

Page 20: Biological Motivation Gene Finding

GC content

• GC relative to AT is a distinguishing factor of bacterial genomes

• Varies dramatically across species
– Serves as a means to identify bacterial species

• For various biological reasons
– Mutational bias of particular DNA polymerases
– DNA repair mechanisms
– Horizontal gene transfer (transformation, transduction, conjugation)

Page 21: Biological Motivation Gene Finding

GC Content

• GC content may be different in recently acquired genes than elsewhere

• This can lead to variations in the frequency of codon usage within coding regions
– There may be significant differences in codon bias within different genes of a single bacterium’s genome
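GC content, and its variation along a genome, is straightforward to compute. The sketch below is illustrative: the function names `gc_content` and `windowed_gc`, and the window and step sizes in the example, are assumptions rather than anything prescribed by the slides.

```python
# Illustrative sketch; names and parameters are assumptions.
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def windowed_gc(seq, window, step):
    """GC content of each full window, as (start_index, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence (made up): an AT-rich half followed by a GC-rich half.
print(gc_content("AAAAGGGG"))
print(windowed_gc("AAAAGGGG", window=4, step=4))
```

A window whose GC fraction departs sharply from the genome-wide value is one hint of a recently acquired region, in line with the point about horizontally transferred genes.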

Page 22: Biological Motivation Gene Finding

Ribosome Binding Sites

• The RBS, also known as the Shine-Dalgarno sequence (species-dependent), should bind well with the 3’ end of the 16S rRNA (part of the ribosome)

• Usually found within 4-18 nucleotides upstream of the start codon of a true gene

Page 23: Biological Motivation Gene Finding

Shine-Dalgarno Sequence

• Is a nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.

• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.

• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
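The last point can be sketched as a scan of the region upstream of a candidate start codon. This is a deliberately simplified stand-in: it counts ungapped matches to the consensus AGGAGG, whereas real gene finders typically score the pairing with the 16S rRNA more carefully. The function name `best_sd_match` is hypothetical, and measuring the 4-18 nt distance as the gap between the end of the 6-mer and the start codon is an assumption.

```python
# Illustrative sketch; a plain match count replaces a real alignment score.
SD_CONSENSUS = "AGGAGG"


def best_sd_match(seq, start_pos, min_gap=4, max_gap=18):
    """Best ungapped consensus match upstream of the start codon at start_pos.

    gap is the number of bases between the end of the 6-mer window and
    the start codon (an assumed reading of the 4-18 nt rule).
    Returns (matches, gap, window), or None if no window fits in seq.
    """
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        begin = start_pos - gap - k
        if begin < 0:
            break  # ran off the 5' end of the sequence
        window = seq[begin:begin + k]
        matches = sum(a == b for a, b in zip(window, SD_CONSENSUS))
        if best is None or matches > best[0]:
            best = (matches, gap, window)
    return best


# Toy sequence (made up): AGGAGG sits 6 bases upstream of the ATG.
toy = "TTAGGAGGTTTTTTATGAAA"
print(best_sd_match(toy, toy.find("ATG")))
```

A high match count at a plausible distance raises the ORF's score; a threshold on the count would be a tunable choice, not something fixed by the slides.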

Page 24: Biological Motivation Gene Finding

Bacterial Promoter

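-35 region: T82 T84 G78 A65 C54 A45 (consensus TTGACA)

… (16-18 bp) …

-10 region: T80 A95 T45 A60 A50 T96 (consensus TATAAT)

… +1: (A,G)

Numbers give the percent occurrence of each consensus base at that position.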

Not so simple: remember, these are consensus sequences
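Because real promoters only approximate the consensus, a search has to tolerate mismatches. The sketch below is a naive illustration, not an established promoter-prediction method: it counts matches to the -35 and -10 consensus hexamers (TTGACA and TATAAT) separated by a 16-18 bp spacer. The function names and the unweighted score are assumptions.

```python
# Illustrative sketch; an unweighted match count is a simplifying assumption.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def matches(window, consensus):
    """Number of positions at which window agrees with consensus."""
    return sum(a == b for a, b in zip(window, consensus))


def best_promoter_hit(seq, spacers=(16, 17, 18)):
    """Return (score, pos35, spacer) with the highest combined match count.

    score ranges from 0 to 12; pos35 is the index of the first base of
    the -35 hexamer. Returns None if seq is too short for any candidate.
    """
    best = None
    for i in range(len(seq)):
        for spacer in spacers:
            j = i + 6 + spacer  # start of the -10 hexamer
            if j + 6 > len(seq):
                continue
            score = matches(seq[i:i + 6], MINUS_35) + matches(seq[j:j + 6], MINUS_10)
            if best is None or score > best[0]:
                best = (score, i, spacer)
    return best


# Toy sequence (made up): perfect hexamers with a 17 bp spacer.
toy = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GGGGGA"
print(best_promoter_hit(toy))
```

A natural refinement would weight each position by how strongly it is conserved, so that a mismatch at a 95% position costs more than one at a 45% position.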

Page 25: Biological Motivation Gene Finding

Termination Sequences

• 3’-U tail

• Stem/loop
– Inverted repeat immediately preceding the run of uracils
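The two features above (an inverted repeat followed by a run of U, which appears as T in the DNA) can be searched for with a simple pattern scan. This is a minimal sketch under strong assumptions: it requires a perfectly paired stem of fixed length, a loop of 3-8 bases, and at least four T's directly after the stem, whereas practical terminator finders allow mismatches and score hairpin stability. All names and parameter values are illustrative.

```python
# Illustrative sketch; perfect stems of fixed length are a simplification.
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(_COMP)[::-1]


def find_terminators(seq, stem=5, loops=range(3, 9), min_t=4):
    """Return (start, stem_seq, loop_len) for hairpins followed by a T run.

    A hit is a stem of `stem` bases, a loop, the reverse complement of
    the stem, and then at least `min_t` consecutive T's.
    """
    hits = []
    for i in range(len(seq) - 2 * stem):
        left = seq[i:i + stem]
        for loop in loops:
            r = i + stem + loop  # start of the right arm of the stem
            right = seq[r:r + stem]
            tail = seq[r + stem:r + stem + min_t]
            if len(right) == stem and right == revcomp(left) and tail == "T" * min_t:
                hits.append((i, left, loop))
    return hits


# Toy sequence (made up): stem GCCGC, loop AAGA, stem GCGGC, then TTTTTT.
toy = "AC" + "GCCGC" + "AAGA" + "GCGGC" + "TTTTTT" + "AC"
print(find_terminators(toy))
```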

