Gene Structure and Identification
Genes and Genomes
ORFs and more
Consensus Sequences
Gene Finding
BIO520 Bioinformatics Jim Lund
Reading: sections 1.3, 9.1-9.6
Gene
The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.
Gene-Informatics
Genes are character strings embedded in much larger strings called the genome. A gene usually encodes a protein. Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.
Genomes
• Genome seq. has only limited use by itself
– Markers, SNPs, etc.
• Functional annotation
– Identify proteins and their functions.
– And regulatory regions, etc.
• Parts list: a source for understanding all biology--and ushers in the post-genomic age of biology.
Characteristics of Protein Coding Genes
• ORF
– long (usually >100 aa)
– “known” proteins likely
• Basal signals
– Transcription, splicing, translation
• Regulatory signals
– Depend on organism
• Prokaryotes vs Eukaryotes
• Vertebrate vs fungi, e.g.
Infer Gene Structure
“Gene Model”
Promoter
• Strength
• Regulation
mRNA
• Exons
• Splicing
• Stability
• ORF=protein
Genomes
Gene Content
Human
27,148 genes × 2 kbp = 54 Mb mRNA
Introns = 300 Mb?
Regulatory regions = 300 Mb?
2,446 Mb = ?
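The arithmetic on this slide can be checked with a short sketch. The 3,100 Mb genome total is an assumption: it is not stated on the slide, but it is the figure that makes the remainder come out to 2,446 Mb. The intron and regulatory sizes are the slide's own question-marked guesses.

```python
# Back-of-envelope genome budget using the slide's numbers.
GENES = 27_148
MRNA_KBP_PER_GENE = 2      # slide's average mRNA length per gene
INTRONS_MB = 300           # slide's guess
REGULATORY_MB = 300        # slide's guess
GENOME_MB = 3_100          # ASSUMPTION: inferred total, not stated on the slide

mrna_mb = GENES * MRNA_KBP_PER_GENE / 1000          # kbp -> Mb
accounted_mb = mrna_mb + INTRONS_MB + REGULATORY_MB
unexplained_mb = GENOME_MB - accounted_mb

print(f"mRNA:          {mrna_mb:.1f} Mb")
print(f"accounted for: {accounted_mb:.1f} Mb")
print(f"unexplained:   {unexplained_mb:.1f} Mb "
      f"({unexplained_mb / GENOME_MB:.0%} of the assumed genome)")
```

Under these assumptions, mRNA, introns, and regulatory regions together cover only about a fifth of the genome, which is the point of the "= ?" on the slide.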
Complex Genome DNA
• ~10% highly repetitive (300 Mb)
– NOT GENES
• ~25% moderate repetitive (750 Mb)
– Some genes
• ~10% exons and introns (354 Mb)
• 55% = ?
– Regulatory regions
– Intergenic regions
Easy problem: Bacterial Gene Finding
• Dense Genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
• Complete Genomes
E. coli genome
• 4,415 genes
• Ave. distance between genes: 118 bp
• 318 aa, average protein length
• 57 proteins longer than 1000 aa.
• 318 shorter than 100 aa.
• 2,584 operons, 70% contain one gene.
• 1.5% repetitive DNA (mostly viral fragments).
Prokaryotic Gene Expression
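[Figure: DNA drawn as Promoter, Cistron 1, Cistron 2, … Cistron N, Terminator. Transcription (RNA polymerase) produces one mRNA, 5’ to 3’, carrying cistrons 1, 2, … N. Translation (ribosome, tRNAs, protein factors) produces a separate polypeptide from each cistron, each with its own N and C terminus.]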
Prokaryotic gene prediction
• ORFs
• Biased nucleotide distribution
– Periodicity of 3
– Codon bias (codon usage statistics)
– Often measured with the Codon Adaptation Index (CAI).
• Signal sequences
• Homology
• Other biological info: for E. coli, partial N-terminal protein sequences.
Prokaryotic signal sequences
• Ribosome binding site (RBS)/Shine-Dalgarno element
– 3-9 purines complementary to sequence at 3’ end of the 16S rRNA in the small subunit of the ribosome.
– Located: 4-7 bps 5’ of the AUG.
• Promoter
– -35 consensus site (TTGACA)
– -10 consensus site (TATAAT)
• Signal peptides
• Regulatory protein binding sites (4 to 8 bps)
ORF finding tools
• Artemis
– analyze ORFs
• Testcode (Fickett’s)
• CodonPreference
• ORF Finder (NCBI)
• BCM Search Launcher
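What these tools do at their core can be sketched in a few lines: scan all six reading frames for an ATG followed in frame by a stop codon, and keep the long ones. This is a minimal illustration, not the algorithm of any tool listed above; the function names and the toy sequence are made up for the example, and alternative start codons (GTG, TTG) are ignored.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=100):
    """Return (strand, frame, start, end, length_aa) for each ATG..stop ORF.

    start and end are 0-based, end-exclusive positions on the strand that
    was scanned; length_aa counts the Met but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, frame, start, i + 3, length_aa))
                    start = None                   # look for the next ORF
    return orfs


# Hypothetical toy sequence: Met + 5 Ala + stop, offset by two bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_aa=5))   # [('+', 2, 2, 23, 6)]
```

The default `min_aa=100` follows the rule of thumb from the earlier slide that real ORFs are usually longer than 100 aa.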
Codon Bias
• Genetic code degenerate
• Codon usage varies
– Organism to organism
– Gene to gene
• High bias correlates with high level expression
• Bias correlates with tRNA isoacceptors
• Change bias or tRNAs, change expression
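Codon bias can be made concrete by counting how often each synonymous codon is used. The sketch below tallies codons in a coding sequence and reports the share of each of the six leucine codons. The sequence is a made-up toy, and the function names are illustrative; a real analysis would use many genes and all amino acids.

```python
from collections import Counter

# The six codons that encode leucine in the standard genetic code.
LEU_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def codon_counts(cds):
    """Count in-frame codons of a coding sequence (frame 0 assumed)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # drop a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_fractions(counts, codons):
    """Fraction of each codon among the given synonymous set."""
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


# Hypothetical toy CDS: ATG CTG CTG TTA CTG CTC CTG TAA
toy_cds = "ATGCTGCTGTTACTGCTCCTGTAA"
counts = codon_counts(toy_cds)
leu = synonymous_fractions(counts, LEU_CODONS)
for codon in LEU_CODONS:
    print(codon, f"{leu[codon]:.2f}")
```

In this toy, CTG carries four of the six leucines. Comparing such fractions between an ORF and known highly expressed genes is the idea behind codon usage statistics.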
Nucleotide Bias
• Coding DNA vs non-Coding DNA
– often G+C content higher than bulk
• Empirical statistics (Fickett’s TESTCODE)
Useful:
• ORF matches “typical”
– organism, bias
• ORF obscured by STOP codons
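A very simple nucleotide-bias measure is G+C content, overall and at each of the three codon positions. The sketch below is only that; it is not Fickett's TESTCODE statistic, which combines several position and composition measures. Function names and the test string are illustrative.

```python
def gc_content(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_by_codon_position(seq, frame=0):
    """G+C fraction at codon positions 1, 2, 3 for the given reading frame."""
    seq = seq.upper()[frame:]
    return tuple(gc_content(seq[pos::3]) for pos in range(3))


# Hypothetical example: ATG ATC ATG has G or C only at the third position.
print(gc_content("ATGATCATG"))            # 3 of 9 bases are G or C
print(gc_by_codon_position("ATGATCATG"))  # (0.0, 0.0, 1.0)
```

A strong difference between the three positions is one sign of the period-3 signal that coding regions tend to show and non-coding regions tend to lack.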
We found ORFs, now what?
• Work backwards
– Locate adjacent cistrons
– Locate RBS
– Locate promoter
– Locate terminator
– Locate regulatory sites
Translation
Ribosome Binding Site, Shine-Dalgarno Site
nnAGGAGGAGGAGGnnnnnATG…
Consensus not always used, example E. coli gene:
nnAaGAGGAaGAGGnnnnATG
(Better represented as a PSSM or an HMM)
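To show why a PSSM copes with sites that deviate from the consensus, here is a minimal log-odds PSSM built from a few aligned sites and used to scan a sequence. The training sites and the scanned sequence are invented for illustration, the uniform 0.25 background and the pseudocount of 1 are assumptions, and input must contain only A, C, G, T.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds score per base per position from equal-length aligned sites."""
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, window):
    """Sum of per-position scores for a window as long as the PSSM."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, seq):
    """Return (score, position) of the best-scoring window in seq."""
    width = len(pssm)
    return max((score(pssm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned Shine-Dalgarno-like sites (not real data).
sites = ["AGGAGG", "AAGAGG", "AGGAGA", "AGGAAG"]
pssm = build_pssm(sites)

print(score(pssm, "AGGAGG"), score(pssm, "AAGAGG"), score(pssm, "TTTTTT"))
print(best_hit(pssm, "TTTTAGGAGGTTTTTATG"))
```

A site with one non-consensus base (AAGAGG) still scores well above background here, whereas an exact-match search for AGGAGG would miss it.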
Bacterial Promoter
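-35 region: T82 T84 G78 A65 C54 A45
Spacer: 16-18 bp
-10 region: T80 A95 T45 A60 A50 T96
…
+1 (transcription start): A or G
(Subscripts: percent occurrence of that base at that position.)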
Alternate sigma factors: CCCTTGAA….CCCGATNT
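The promoter layout above (a -35 hexamer, a 16-18 bp spacer, a -10 hexamer) can be turned into a naive scan. The sketch below allows a fixed number of mismatches per hexamer against the consensus sites TTGACA and TATAAT; the mismatch threshold, function names, and test sequence are assumptions for illustration, and a weighted matrix would use the per-position frequencies instead of treating all positions equally.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, max_mismatch=2, minus35="TTGACA", minus10="TATAAT",
                   spacers=(16, 17, 18)):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters.

    Each hexamer may differ from its consensus at up to max_mismatch positions.
    Positions are 0-based starts of the two hexamers.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(minus35) + 1):
        m35 = mismatches(seq[i:i + len(minus35)], minus35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + len(minus35) + gap
            if j + len(minus10) > len(seq):
                break                      # longer spacers would also run off the end
            m10 = mismatches(seq[j:j + len(minus10)], minus10)
            if m10 <= max_mismatch:
                hits.append((i, j, gap, m35 + m10))
    return hits


# Hypothetical sequence with a perfect -35, a 17 bp spacer, and a perfect -10.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGG"
print(find_promoters(toy, max_mismatch=0))   # [(2, 25, 17, 0)]
```

Loosening `max_mismatch` finds weaker candidates but raises the false-positive rate quickly, since the hexamers are short.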
Terminators
Rho-independent
• Stem/loop
– structural only
• 3’-U tail
Rho-dependent
• C-rich
• G-poor
• “loose” consensus
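The two features of a rho-independent terminator, a stem/loop followed by a U tail, suggest a simple pattern search on the DNA: an inverted repeat, a short loop, then a run of T. The sketch below requires a perfectly paired stem of fixed length, which real terminators often do not have; the stem length, loop range, tail length, names, and test sequence are all assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(seq, stem=6, loops=range(3, 9), tail="TTTT"):
    """Return (stem_start, loop_length, tail_start) for candidate terminators.

    A candidate is a stem-length word, a loop of allowed length, the reverse
    complement of the word, and then the T tail immediately afterwards.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - stem + 1):
        left = seq[i:i + stem]
        for loop in loops:
            r = i + stem + loop               # start of the right stem arm
            right = seq[r:r + stem]
            if len(right) < stem:
                break                         # ran off the end of the sequence
            tail_start = r + stem
            if (right == revcomp(left)
                    and seq[tail_start:tail_start + len(tail)] == tail):
                hits.append((i, loop, tail_start))
    return hits


# Hypothetical sequence: GC-rich stem, 4-base loop, matching arm, T run.
toy = "CC" + "GCCGCC" + "TTCG" + "GGCGGC" + "TTTTTT" + "AA"
print(find_terminators(toy))   # [(2, 4, 18)]
```

Rho-dependent terminators have only a "loose" consensus, so a fixed pattern like this does not apply to them.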
Difficulties in gene prediction
• Frame shifts
– sequencing errors
• Overlapping ORFs
– Rare (a few percent)
• Short ORFs
• Unusual genes
– bp composition
– signal sequences