Using BLAST to study gene evolution – an example. Introduction to bioinformatics, lesson 3b
Using BLAST to study gene evolution – an example.
Introduction to bioinformatics, lesson 3b.
NCBI diagram
Orthologs
Homologous sequences are orthologous if they were separated by a speciation event:
If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.
Orthologs
• Orthologs typically retain the same or similar function in the course of evolution.
• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
Paralogs
Homologous sequences are paralogous if they were separated by a gene duplication event:
If a gene in an organism is duplicated, then the two copies are paralogous.
Paralogs
• Orthologs will typically have the same or similar function.
• This is not always true for paralogs: one copy of the duplicated gene is no longer under the original selective pressure, so it is free to mutate and acquire new functions.
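Paralogs
[Diagram: a gene duplication]
Orthologs and Paralogs
[Diagram: a duplication followed by a speciation into species a and species b; copies related through the duplication are paralogs, copies related through the speciation are orthologs]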
NCBI diagram
What is conservation?
Functionally or structurally important sites are conserved:
Conserved sites = “slow evolving” sites
Variable sites = “fast evolving” sites
Functionally or structurally important sites are subject to stronger evolutionary pressure (purifying selection).
Finding conservation regions from an alignment
S1 KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2 MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG
S3 MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG
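As a minimal sketch of how conserved regions can be read off an alignment like this one, the Python snippet below lists the columns in which all three sequences carry the same residue. Treating "identical in every sequence" as the definition of a conserved column is an assumption made for illustration, and the function name `conserved_columns` is hypothetical.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence has the same residue."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [i for i, column in enumerate(zip(*alignment)) if len(set(column)) == 1]


# The three aligned sequences from the slide.
S1 = "KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN"
S2 = "MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG"
S3 = "MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG"

columns = conserved_columns([S1, S2, S3])

# Print a mask under the alignment: '*' marks a fully conserved column.
mask = "".join("*" if i in columns else " " for i in range(len(S1)))
for name, seq in (("S1", S1), ("S2", S2), ("S3", S3)):
    print(name, seq)
print("  ", mask)
```

The starred columns fall into two blocks, which is the kind of pattern one looks for when searching for conserved regions.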
From the MSA and the tree, one can determine how conserved a gene is.
Mol. Biol. Evol. (2005) 22:598-606
Protocol
Step 1 - BLAST
Search for human-mouse orthologous protein pairs
Step 1 - BLAST
• The orthologs are defined as pairs of reciprocal BLAST hits.
• Eliminate genes with more than one potential orthologous sequence.
• Select only genes for which the human protein was functionally annotated.
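The reciprocal-hit criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: hits are given as (query, subject, E-value) tuples, "hit" is read as "best hit by lowest E-value", and the gene identifiers are made up. The filter for genes with more than one potential ortholog is not reproduced, because the slides do not give its exact criterion.

```python
def best_hit(hits):
    """Map each query to the subject with the lowest E-value."""
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's best hit."""
    forward = best_hit(human_vs_mouse)
    reverse = best_hit(mouse_vs_human)
    return sorted((h, m) for h, m in forward.items() if reverse.get(m) == h)


# Hypothetical hit tables (identifiers and E-values are invented).
human_vs_mouse = [
    ("hA", "mA", 1e-80), ("hA", "mB", 1e-20),
    ("hB", "mB", 1e-50),
    ("hC", "mA", 1e-10),
]
mouse_vs_human = [
    ("mA", "hA", 1e-78),
    ("mB", "hB", 1e-52),
    ("mC", "hC", 1e-5),
]

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

Here hC is dropped: its best mouse hit is mA, but mA's best human hit is hA, so the pair is not reciprocal.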
Step 2 – Evolutionary Rates
For each orthologous pair:
• Alignment at the amino acid level.
• Measure conservation
The data set contained 6,776 human-mouse gene pairs.
Step 3 – Assignment of Temporal Categories
Using BLAST to find homologous genes in 6 different eukaryotic genomes:
• Caenorhabditis elegans
• Schizosaccharomyces pombe
• Takifugu rubripes
• Drosophila melanogaster
• Arabidopsis thaliana
• Saccharomyces cerevisiae
What is Old?
• Presence of any homolog in all the 6 genomes.
What is Presence?
• Using an E-value cutoff of 10^-4 in BLAST.
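The two definitions above can be written down as a small Python check. This is a sketch, not the authors' code: it reads the slide's definition of "old" as requiring a hit in every one of the six genomes, treats an E-value at or below the cutoff as "present", and uses invented E-values for the example gene.

```python
GENOMES = (
    "Caenorhabditis elegans",
    "Schizosaccharomyces pombe",
    "Takifugu rubripes",
    "Drosophila melanogaster",
    "Arabidopsis thaliana",
    "Saccharomyces cerevisiae",
)


def is_present(evalue, cutoff=1e-4):
    """A homolog counts as present if a hit exists with E-value <= cutoff."""
    return evalue is not None and evalue <= cutoff


def is_old(best_evalues, cutoff=1e-4):
    """'Old' here means a homolog is present in all six genomes (assumed reading)."""
    return all(is_present(best_evalues.get(genome), cutoff) for genome in GENOMES)


# Hypothetical best E-values of one human gene against each genome.
example = {
    "Caenorhabditis elegans": 1e-30,
    "Schizosaccharomyces pombe": 1e-12,
    "Takifugu rubripes": 1e-60,
    "Drosophila melanogaster": 1e-35,
    "Arabidopsis thaliana": 1e-15,
    "Saccharomyces cerevisiae": 1e-6,
}

print(is_old(example))                # default cutoff of 1e-4
print(is_old(example, cutoff=1e-10))  # stricter cutoff
```

With the stricter cutoff the weak yeast hit no longer counts, so the same gene stops being classified as old. This shows how the classification depends on the detection threshold.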
[Diagram: the temporal categories OLD, METAZOANS, DEUTEROSTOMES and TETRAPODS, shown with the genomes Caenorhabditis elegans, Drosophila melanogaster and Takifugu rubripes]
• METAZOANS - Animals whose bodies consist of many cells, as distinct from Protozoa, which are unicellular; all animals commonly recognized as animals.
• DEUTEROSTOMES - The second of the two main groups of bilaterally symmetrical animals. The name derives from 'deutero' (second) 'stome' (mouth), referring to the origin of the definitive mouth as an opening independent from the blastopore of the embryo.
• TETRAPODS - Any four-legged animals, including mammals, birds, reptiles and amphibians.
Results
Negative correlation between the “age” of genes and the rate of evolution.
[Figure: panels with the axis label CONSERVATION]
Control
• Changing the BLAST detection cutoff to a more conservative E-value of 10^-10 did not significantly affect the result.
Explanations
• Functional constraint remained constant throughout the evolutionary history of each gene, but the newer genes are less constrained than older genes.
• Functional constraints are not constant; rather, they are weak at the time of origin of a gene and become progressively more stringent with age.
Eran Elhaik, Niv Sabath, and Dan Graur
Mol. Biol. Evol. 23(1):1–3. 2006
Goal
• To show that these results are an artifact caused by our inability to detect similarity when genetic distances are large.
Simulation
The evolutionary process
[Animation: a phylogenetic tree with leaves Rat, Mouse, Cat, Dog and Fly, next to a table of replacement probabilities between amino acids (Ala, Arg, Val, …). One position starts as V in the ancestor and is evolved branch by branch according to the replacement probabilities, ending as L (Rat), L (Mouse), I (Cat), M (Dog) and V (Fly).]
The evolutionary process
Rat   L M T G S H M G N F I I
Mouse L M T G S G M A N H V I
Cat   I M T G S H I G Y A M F
Dog   M M T G S G I G L T R A
Fly   V M T G S W R G R M Y A
And repeat the process for all positions… (assumption: each position evolves independently)
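The process described here, drawing each position's fate on every branch from a table of replacement probabilities, can be sketched as follows. Everything specific in the snippet is an assumption for illustration: the three-letter alphabet (Ala, Arg, Val), the probability values, the tree topology, and the use of the same table on every branch are not taken from the slides.

```python
import random

# Hypothetical per-branch replacement probabilities (each row sums to 1).
# Only Ala (A), Arg (R) and Val (V) are modelled to keep the table small.
REPLACEMENT = {
    "A": {"A": 0.90, "R": 0.02, "V": 0.08},
    "R": {"A": 0.03, "R": 0.95, "V": 0.02},
    "V": {"A": 0.10, "R": 0.02, "V": 0.88},
}

# Assumed topology; the slides do not state the tree explicitly.
TREE = ((("Rat", "Mouse"), ("Cat", "Dog")), "Fly")


def evolve_branch(seq, rng):
    """Evolve a sequence along one branch, each position independently."""
    out = []
    for aa in seq:
        row = REPLACEMENT[aa]
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(out)


def evolve_tree(node, seq, rng):
    """Recursively evolve seq down the tree; return {leaf name: sequence}."""
    if isinstance(node, str):
        return {node: seq}
    leaves = {}
    for child in node:
        # Each child receives its own independently evolved copy.
        leaves.update(evolve_tree(child, evolve_branch(seq, rng), rng))
    return leaves


leaves = evolve_tree(TREE, "VARVARVARVAR", random.Random(0))
for name, seq in leaves.items():
    print(f"{name:5s} {seq}")
```

Because every position is sampled separately inside `evolve_branch`, the independence assumption stated on the slide is built directly into the simulation.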
Generate terminal sequences with the following phylogenetic relationships:
[Tree diagram with leaves A, B, C, D and E]
• A and B: similar to the human and mouse orthologous genes.
• C, D and E: remote homologous genes from increasingly more distant taxa.
• All the genes originated in the common ancestor of A, B, C, D and E and are, thus, of equal age.
Simulation
• They simulated genes with 101 different rates.
• Higher rate -> higher likelihood of an amino acid replacement on each branch.
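One way to make the link between rate and replacement likelihood concrete is to assume that replacements at a site follow a Poisson process, so that the probability of at least one replacement on a branch of length t is 1 - exp(-rate * t). The slides do not state this model or the actual rate values; the grid of 101 rates below is a hypothetical placeholder.

```python
import math


def p_replaced(rate, branch_length=1.0):
    """Probability of at least one replacement at a site (Poisson assumption)."""
    return 1.0 - math.exp(-rate * branch_length)


# Hypothetical grid of 101 rates from 0.00 to 1.00.
rates = [i / 100 for i in range(101)]
probs = [p_replaced(rate) for rate in rates]

for rate, prob in list(zip(rates, probs))[::25]:
    print(f"rate={rate:.2f}  P(replacement on branch)={prob:.3f}")
```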
Simulation
Use BLAST, in the same way that Alba and Castresana used it, to detect homology between gene A and genes C, D and E.
Only one difference: the group names
OLD -> SENIORS
METAZOANS -> ADULTS
DEUTEROSTOMES -> TEENAGERS
TETRAPODS -> TODDLERS
Results
Same as Alba and Castresana
But all the simulated genes are of the same “age”.
What is the problem???
We can only count genes that are identified as homologous by the protocol.
Alba and Castresana may have, thus, failed to spot the vast majority of homologs from among the fastest evolving genes.
The vast majority of the fastest evolving genes are undetectable even when the cutoffs are extremely permissive.
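The detection problem can be reproduced with a toy simulation. The sketch below is not the published simulation and does not run BLAST. It assumes a tree ((((A,B),C),D),E) with made-up branch lengths, uniform replacement among the 20 amino acids, a grid of 101 hypothetical rates, and a simple sequence-identity threshold as a stand-in for BLAST detection. The rule that gene A is labelled by the most distant taxon in which a homolog is "detected" is also an assumption.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(seq, p_replace, units, rng):
    """Evolve seq for `units` steps; per step, each site is replaced with prob. p_replace."""
    for _ in range(units):
        seq = "".join(
            rng.choice(AMINO_ACIDS.replace(aa, "")) if rng.random() < p_replace else aa
            for aa in seq
        )
    return seq


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify(rate, rng, length=200, threshold=0.3):
    """Simulate one gene of fixed age and label gene A by its most distant detected homolog."""
    root = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    e, abcd = evolve(root, rate, 4, rng), evolve(root, rate, 1, rng)
    d, abc = evolve(abcd, rate, 3, rng), evolve(abcd, rate, 1, rng)
    c, ab = evolve(abc, rate, 2, rng), evolve(abc, rate, 1, rng)
    a = evolve(ab, rate, 1, rng)  # gene B is not needed for the classification
    for target, label in ((e, "SENIORS"), (d, "ADULTS"), (c, "TEENAGERS")):
        if identity(a, target) >= threshold:  # crude stand-in for a BLAST hit
            return label
    return "TODDLERS"


def mean_rate_by_category(rates, seed=0):
    """Average evolutionary rate of the genes that land in each category."""
    rng = random.Random(seed)
    by_category = {}
    for rate in rates:
        by_category.setdefault(classify(rate, rng), []).append(rate)
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}


RATES = [i / 200 for i in range(101)]  # 101 hypothetical rates, 0 to 0.5
result = mean_rate_by_category(RATES)
for category in ("SENIORS", "ADULTS", "TEENAGERS", "TODDLERS"):
    print(category, result.get(category))
```

Every simulated gene has the same age by construction, yet the fast-evolving ones end up in the "young" categories simply because their distant homologs fall below the detection threshold.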
Conclusion
The inverse relationship between evolutionary rate and gene age is an artifact caused by our inability to detect similarity when genetic distances are large.
• Since genetic distance increases with time of divergence and rate of evolution, it is difficult to identify homologs of fast evolving genes in distantly related taxa.
• Thus, fast evolving genes may be misclassified as “new”.
So, the only conclusion that can be drawn from Alba and Castresana’s study is that
slowly evolving genes evolve slowly!!!