Gene Families and Functional Annotation
Once genes have been identified, they need to be functionally annotated
A computational first step is to group genes with other genes, some of which will hopefully have known functions
Once genes are classified, we can begin to examine whether certain genes are missing or overrepresented in the given genome - possibly reflecting the niche of the organism
As with earlier computational analyses, functional annotation based solely on in silico analyses is only a first step

Gene Families and Functional Annotation
Sequence-similarity searches are a first pass in classification
BLAST - Basic Local Alignment Search Tool
BLASTn - nucleotide query against a nucleotide database
BLASTp - protein query against a protein database
BLASTx - translates a nucleotide sequence in all six reading frames and scans these against a protein database
All report an Expectation (E) value to evaluate the significance of the match
In both eukaryotes and prokaryotes, 1/3 to 1/2 of searched genes do not match a known protein = orphan genes

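The six-frame translation that BLASTx applies to a nucleotide query can be sketched as below. This is only an illustration, assuming the standard genetic code and an unambiguous DNA string; the function names are hypothetical and are not part of BLAST.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three offsets on the forward strand and three on the reverse."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations would then be compared against the protein database; stop codons appear as `*`.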
Protein Structural Domains
Proteins are made up of combinations of distinct structural units or domains
Genes can be grouped based on the domains they contain
These groupings depend on structural similarity - sequence similarity alone may be insufficient

Gene clustering by sequence similarity
BLAST searches generally return matches from more than one protein and from more than one species
This happens if the query protein is part of a gene (protein) family or contains multiple domains found in other proteins

BLAST output can be interpreted as a match to one or more protein domains - searches of closely related species often identify genes/proteins with similar domain structure
Domains shuffle over evolutionary time and are often found in different combinations across more distant comparisons
Domains do tend to follow biologically reasonable patterns - DNA-binding domains with other DNA-binding domains, transmembrane domains with intra- and extracellular domains

Gene clustering by sequence similarity
Genes can be classified by domain content
The Enzyme Commission (EC) hierarchical classification of enzymes - each enzyme is assigned a number that reflects sub-classification of function, e.g. alcohol dehydrogenase (ADH) is EC 1.1.1.1
Other classification schemes are not as obvious - protein function is often context-specific
Pfam - protein family database that allows access to biochemical properties of predicted proteins

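The hierarchical nature of an EC number can be made concrete with a small sketch. The helper names below are hypothetical; the seven top-level class names are the standard EC classes, and the second EC number in the usage lines is there only to show a shared prefix.

```python
# Top-level EC classes (first digit of the EC number).
EC_CLASSES = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases", 4: "Lyases",
    5: "Isomerases", 6: "Ligases", 7: "Translocases",
}


def parse_ec(ec):
    """Split 'EC 1.1.1.1' (or '1.1.1.1') into its four hierarchy levels."""
    fields = ec.upper().replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four levels, got {ec!r}")
    # A '-' marks a level that has not been assigned.
    return tuple(None if f == "-" else int(f) for f in fields)


def share_levels(ec_a, ec_b, depth):
    """True if two EC numbers agree on their first `depth` levels."""
    return parse_ec(ec_a)[:depth] == parse_ec(ec_b)[:depth]


adh = parse_ec("EC1.1.1.1")
print(adh, EC_CLASSES[adh[0]])
print(share_levels("EC 1.1.1.1", "EC 1.1.1.27", depth=3))
```

Two enzymes that agree on more leading levels are classified as more similar in function, which is what makes the scheme usable for grouping.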
Gene clustering by sequence similarity
InterPro - classifies individual protein domains

Gene clustering by sequence similarity
Protein functional prediction ≠ assignment of genes to families
Protein function prediction allows general conclusions about protein function and genome content based on protein domains
Classification of gene families involves distinguishing between paralogs and orthologs

Major Classes of Protein Function
Enzymes
Signal transduction (receptors and kinases)
Nucleic acid binding (transcription factors, nucleic acid enzymes)
Structural (cytoskeletal, extracellular matrix, motor proteins)
Channel (voltage and chemically gated)
Immunoglobulins
Calcium-binding proteins
Transporters
Subclasses vary - as does the representation within each genome

Gene Clusters
Alignment searches (BLAST) identify genes with similar sequence to the query
If searches identify a single gene, or genes with a single function, then functional assignment to the query sequence is simple - but searches often identify large numbers of sequences with multiple functions
The most similar sequence is not necessarily the sequence with which the query sequence shares a function

Gene Clusters
One approach is to try to define as large a protein family as possible (including many possible functions)
PSI-BLAST can be used to identify a large set of potential protein family members
A BLAST search is conducted to create an initial protein sequence alignment - which is then used to initiate a fresh search
The process is then iterated until no further matches are identified - this reduces the degree of sequence similarity required for inclusion in the family
A “true” family of genes ought to be bounded by a significance cut-off to limit the proteins included

Gene Clusters
Clusters of orthologous genes, COGs, can be used to classify proteins
COGs are created by identifying the best hit for each gene in complete pairwise comparisons across a set of genomes

Gene Clusters
185,000 proteins from 66 microbial genomes identified 4,873 COGs - 75% of all predicted microbial proteins
50% of 110,000 proteins from fly, nematode, human, Arabidopsis, yeasts and a microsporidian form 4,852 COGs
[Figure: COG0837]

Gene Clusters
COGs include both orthologs and paralogs
In (a), HuA and HuA’ are paralogs - distinguishing which retains the ancestral function is not as simple as determining which has the most similar sequence

Gene Clusters
HuA and MmA differ in 5 amino acids, none of which affect function
HuA’ and MmA differ in 4 amino acids, but one of these changes the charge of a critical residue
Clustering based on similarity alone would lead to erroneous functional classification

Gene Phylogenies
Clustering groups genes by sequence similarity
Phylogenetic analyses ascertain how groups of similar genes are related by descent
In the HuA, MmA example, the 2 A’ genes can result from either one duplication event (orthologs) or two (paralogs)
Paralogs are less likely to share a function

Gene Phylogenies
Often gene function can be inferred from phylogenetic analysis
The first step is aligning the sequences
A gene tree is then constructed using some algorithm
Duplications and gene relatedness are then ascertained
In the example on the left, an ancient duplication splits 2 functional groups; on the right, protein 2 likely has the same function as 5 and 6

Gene Ontology
Molecular function alone may not predict/describe biological function (think crystallins)
The Gene Ontology (GO) annotates and groups genes using a multi-character approach including biological process, molecular function and/or subcellular localization
The GO project uses a defined vocabulary and a hierarchical structure to classify genes and includes links indicating the type of evidence for the classification

GO network
In this example, the gene INNER NO OUTER is at the center with the 3 separate classifications radiating out from it

Gene Ontology
The GO vocabulary includes 7000 terms describing molecular function and 5000 describing biological process; some annotations include as many as 12 levels within the term hierarchy
This is too deep for efficient computational searches - other simplified systems are also being developed to computationally screen and classify genes

Molecular Phylogenetics
• Homology = similarity due to common ancestry
• The Gpdh gene sequences from two different species are homologous sequences
• All comparisons made in molecular evolution (biology) are based on comparing homologous sequences = apples to apples
• Sequences must be aligned to allow comparison = homologous bases lined up in columns
Human  MVHLTP
Baboon MVHLTP
Cow    MLTP
Sheep  MLTP
Mouse  MVHLTP
The cow and sheep β globin proteins are 2 amino acids shorter than the other sequences, so gaps are added to align the sequences
Human  MVHLTP
Baboon ......
Cow    .--...
Sheep  .--...
Mouse  ......

Gene Trees
• Accumulation of sequence differences through time is the basis of molecular systematics, which analyzes them in order to infer evolutionary relationships
• A gene tree is a diagram of the inferred ancestral history of a group of sequences
• A gene tree is only an estimate of the true pattern of evolutionary relations
• UPGMA and Neighbor joining = simple ways to estimate a gene tree
• Bootstrapping = sampling with replacement, a common technique for assessing the reliability of a node in a gene tree
• Taxon = the source of each sequence

Rooted and Unrooted Trees
Analyses of a set of genes produce an unrooted tree
Trees can be rooted (assigned polarity) by assignment of an outgroup - a sequence that is known to be more distantly related than any within the rest of the analysis (the ingroup)
Tree branch length denotes the amount of change along that branch in some tree building methods
[Figure: 3 distinct unrooted trees]

Tree Building methods
The 3 primary methods (algorithms) for building gene trees are:
1. Parsimony - a character-based approach that surveys every possible tree topology. The most parsimonious tree is the topology that requires the minimum number of steps (changes) in a data set
Position 1 of this example - tree 1 requires 1 change, tree 2 requires 2 changes and tree 3 requires 2 changes. When the 4 positions are summed, tree 3 is found to be the best (shortest)

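Counting the changes a topology requires can be sketched with Fitch's algorithm, one standard way to score a fixed tree under parsimony. The four sequences below are hypothetical (the slide's own data are in a figure that is not reproduced here); they were chosen so that position 1 needs 1, 2 and 2 changes on the three topologies and tree 3 is shortest overall.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    `tree` is a leaf name or a pair (left, right); `states` maps leaf -> residue.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:                       # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the Fitch changes over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical 4-taxon, 4-site data set (not taken from the slides).
alignment = {"A": "GTTG", "B": "GCCA", "C": "ACCA", "D": "ATTA"}

# The 3 unrooted topologies for 4 taxa, each written with an arbitrary root.
trees = {
    "tree1": (("A", "B"), ("C", "D")),
    "tree2": (("A", "C"), ("B", "D")),
    "tree3": (("A", "D"), ("B", "C")),
}

lengths = {name: tree_length(t, alignment) for name, t in trees.items()}
print(lengths)
print("most parsimonious:", min(lengths, key=lengths.get))
```

Enumerating every topology like this is only feasible for a handful of taxa, because the number of possible trees grows very quickly with the number of sequences.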
Tree Building methods
The 3 primary methods (algorithms) for building gene trees are:
2. Maximum Likelihood - also a character-based approach; surveys every possible tree topology and assigns all topologies a maximum likelihood estimate (score) based on a model of evolution describing the probability of changes (mutation) through time. The ML tree is the one with the highest likelihood
This method can be accurate, but is computationally expensive

Tree Building methods
The 3 primary methods (algorithms) for building gene trees are:
3. Distance Methods - are not character based; instead they calculate pairwise distances across entire aligned sequences and construct distance matrices. Trees are built by grouping pairs with the shortest distances between them. These methods can also incorporate complex evolutionary models
These methods are computationally cheap and will always return an answer, but are not always accurate.
The simplest distance method, Unweighted Pair Group Method with Arithmetic Mean (UPGMA), simply counts the number of sequence changes in all pairwise comparisons and repeatedly joins the closest pair

UPGMA Trees

UPGMA Tree Construction
Pairwise differences:
      Ba    Co    Sh    Mo    Ha    Ch
Hu    2     6     9     8     9     13
Ba          7     10    7     10    13
Co                3     11    12    16
Sh                      12    9     15
Mo                            7     16
Ha                                  14
After joining Hu and Ba:
      Co    Sh    Mo    Ha    Ch
HuBa  6.5   9.5   7.5   9.5   13
Co          3     11    12    16
Sh                12    9     15
Mo                      7     16
Ha                            14
[Tree: Hu and Ba join with branch lengths 2/2 = 1.0 each; Co and Sh join with branch lengths 3/2 = 1.5 each]

UPGMA Tree Construction
After joining Co and Sh:
      CoSh  Mo    Ha    Ch
HuBa  8     7.5   9.5   13
CoSh        11.5  10.5  15.5
Mo                7     16
Ha                      14
[Tree: Mo and Ha join next with branch lengths 7/2 = 3.5 each; Hu and Ba branches are 1.0, Co and Sh branches are 1.5]

UPGMA Tree Construction
After joining Mo and Ha:
      CoSh  MoHa  Ch
HuBa  8     8.5   13
CoSh        11    15.5
MoHa              15
[Tree: HuBa and CoSh join next at height 8/2 = 4, giving internal branches of 4 - 1.0 = 3.0 (to HuBa) and 4 - 1.5 = 2.5 (to CoSh)]

UPGMA Tree Construction
After joining HuBa and CoSh:
                 MoHa   Ch
((HuBa)(CoSh))   9.75   14.25
MoHa                    15
[Tree: ((HuBa)(CoSh)) and MoHa join next at height 9.75/2 = 4.875, giving internal branches of 4.875 - 4 = 0.875 and 4.875 - 3.5 = 1.375]

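UPGMA Tree Construction
After joining ((HuBa)(CoSh)) and MoHa:
                        Ch
((HuBa)(CoSh))(MoHa)    14.625
[Tree: Ch joins last at height 14.625/2 = 7.3125; the Ch branch is 7.3125 and the internal branch is 7.3125 - 4.875 = 2.4375]
Here 14.625 is the simple mean of 14.25 and 15; weighting by cluster size (4 and 2 taxa), as UPGMA formally does, gives (4 × 14.25 + 2 × 15)/6 = 14.5
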
Final UPGMA Tree
[Tree, taxa relabeled a-g, in Newick form: ((((a:1.0,b:1.0):3.0,(c:1.5,d:1.5):2.5):0.875,(e:3.5,f:3.5):1.375):2.44,g:7.3125);]

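The construction walked through above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function and parameter names are hypothetical, and the `size_weighted` switch is an addition here. With `size_weighted=False` each new row is the plain mean of the two merged rows, which reproduces the numbers on the slides; with `size_weighted=True` rows are weighted by cluster size, which is the formal UPGMA update and differs only at the last join in this data set.

```python
def upgma(names, dist, size_weighted=True):
    """Cluster taxa by repeatedly joining the closest pair.

    dist maps frozenset({a, b}) -> distance. Returns a list of
    (cluster_a, cluster_b, node_height) in the order the joins were made.
    """
    sizes = {name: 1 for name in names}      # cluster label -> number of taxa
    d = dict(dist)
    joins = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)             # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2                 # node sits halfway between them
        merged = f"({a},{b})"
        wa, wb = (sizes[a], sizes[b]) if size_weighted else (1, 1)
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                wa * d[frozenset((a, other))] + wb * d[frozenset((b, other))]
            ) / (wa + wb)
        # Drop every distance that still refers to the two merged clusters.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        joins.append((a, b, height))
    return joins


# Upper triangle of the distance matrix from the slides.
names = ["Hu", "Ba", "Co", "Sh", "Mo", "Ha", "Ch"]
rows = [[2, 6, 9, 8, 9, 13], [7, 10, 7, 10, 13], [3, 11, 12, 16],
        [12, 9, 15], [7, 16], [14]]
dist = {frozenset((names[i], names[i + 1 + k])): value
        for i, row in enumerate(rows) for k, value in enumerate(row)}

for a, b, height in upgma(names, dist, size_weighted=False):
    print(f"join {a} + {b} at height {height}")
```

The branch length leading to each cluster is the height of the new node minus the height of the cluster being joined, e.g. 4 - 1.0 = 3.0 for the branch above Hu and Ba.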
Phylogenetic Trees
Phylogenetic trees are representations summarizing a reconstructed evolutionary history
A phylogenetic tree is a diagram that proposes a hypothesis for reconstructed evolutionary relationships between a set of objects (taxa or OTUs)
Phylogenetic trees can represent relationships between species or genes

Phylogenetic Trees
OTUs are connected by a set of lines - branches or edges
External nodes or leaves are existing OTUs or extinct objects that did not give rise to descendants
Internal nodes represent ancestral states hypothesized to have occurred during evolution
Internal nodes can represent speciation or gene duplication events
A gene tree does not necessarily coincide with a species tree
Gene duplications will cause a gene tree to differ from a species tree
[Figure: tree of human, monkey, rat, mouse, sturgeon, chicken, zebrafish, platy, lamprey and hagfish]

Resolution
Trees may be fully or only partially resolved
Every node in a fully resolved tree is bifurcating or dichotomous
Some nodes in unresolved trees are multifurcating or polytomous
[Figure: two trees of human, monkey, rat, mouse, sturgeon, chicken, zebrafish, platy, lamprey and hagfish]

Rooting
Unrooted trees establish the relationships among taxa, but not the evolutionary pathway
For 4 taxa there are 3 unrooted trees, but 15 rooted trees
[Figure: rooted tree of human, monkey, rat, mouse and chicken; unrooted and rooted trees of human, monkey, rat and mouse]
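The counts of 3 and 15 follow from the standard formulas for fully bifurcating trees with n labeled taxa: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees, where !! is the double factorial. A short sketch (function names are hypothetical):

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_taxa):
    """Number of bifurcating unrooted trees for n_taxa >= 3 labeled leaves."""
    return double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa):
    """Number of bifurcating rooted trees for n_taxa >= 2 labeled leaves."""
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

For 10 taxa there are already more than 2 million unrooted trees, which is why methods that survey every topology become expensive so quickly.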
Types of Trees
Cladograms show the genealogy of taxa, but do not include timing or divergence (branch lengths have no meaning)
[Figure: cladograms of human, monkey, rat and mouse]

Types of Trees
Additive trees show the genealogy of taxa and branch lengths represent divergence between taxa
Comparison of branch lengths gives a meaningful estimate of evolutionary divergence
[Figure: additive tree of human, monkey, rat and mouse]

Types of Trees
Ultrametric trees are similar to additive trees, but assume a constant rate of change in the characters used to build the tree - a molecular clock
Comparison of branch lengths gives a meaningful estimate of evolutionary divergence
Ultrametric trees are always rooted
[Figure: ultrametric tree of human, monkey, rat and mouse against a time axis]

Outgroups
The most accurate way to root a tree is to use an “outgroup”, a taxon or group of taxa more distantly related than any member of the “ingroup”
[Figure: tree of human, monkey, rat and mouse rooted with chicken, against a time axis]

Representing Phylogenies
Phylogenetic relationships can be represented as graphical trees, tables or parenthetical statements (Newick or New Hampshire format)
((raccoon, bear),((sea_lion, seal), ((monkey,cat), weasel)), dog);
((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08,((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);
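To show how the parenthetical notation maps onto a tree, here is a minimal recursive reader. It is a sketch that assumes well-formed input with unquoted names and no comments; the function names and the dictionary layout are choices made for this illustration, not a standard API.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts: name, length, children."""
    text = "".join(text.strip().rstrip(";").split())   # drop ';' and whitespace
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":                 # internal node: read its children
            pos += 1
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at {pos}")
        start = pos                          # optional node name
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":   # optional branch length
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaves(node):
    """List leaf names from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaves(child)]


tree = parse_newick(
    "((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08,"
    "((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);"
)
print(leaves(tree))
```

Each pair of parentheses is one internal node, and a number after a colon is the length of the branch leading to that node; branch lengths are optional, as in the first example above.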
Bootstrapping
Many tree building algorithms will give a single, fully resolved tree from any data set.
Nodes will all be equally represented even if one is supported by many characters and another by very few.
How to quantify support for any given tree? We can’t re-run evolution. We can sample many different genes and we can bootstrap our data.
Bootstrapping is sampling a data set, with replacement, to generate a new data set. We then use this new set in a phylogenetic analysis - and repeat this process hundreds or thousands of times.
We can then present bootstrap scores at each node: the % of bootstrap trees that contained that specific node

Bootstrapping
Original alignment and pairwise differences:
1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P
     1    2    3
2    3
3    1    4
4    5    2    4
[Tree: leaves drawn in the order 1, 3, 2, 4]
Resampled data set 1:
1- T K L L T P D A D G
2- T K - - T P E V D G
3- T R L L T P D A D G
4- T R - - T P E V D C
     1    2    3
2    4
3    1    5
4    6    2    5
[Tree: leaves drawn in the order 1, 3, 2, 4]
Resampled data set 2:
1- G P K D K K T P D P
2- G P K D K K T P E P
3- G P R D R R T P D P
4- C P R D R R T P E P
     1    2    3
2    1
3    3    4
4    5    4    2
[Tree: leaves drawn in the order 1, 2, 3, 4]
Resampled data set 3:
1- L P Y D A D D P T G
2- - P Y E V D E P T G
3- L P Y D A D D P T G
4- - P Y E V D E P T C
     1    2    3
2    4
3    0    4
4    5    1    5
[Tree: leaves drawn in the order 1, 3, 2, 4]
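The resampling step can be sketched directly on the 4-sequence alignment from the slide. Columns are drawn with replacement, the pairwise differences are recounted (a gap against a residue counts as a difference, as in the slide's matrices), and the closest pair is recorded. Using "the closest pair" as a stand-in for a full tree-building step is a simplification made here, and all function names are hypothetical.

```python
import random

ALIGNMENT = ["GADDYTTKLP", "GVEDYTTK-P", "GADDYTTRLP", "CVEDYTTR-P"]


def pairwise_differences(seqs):
    """Matrix of mismatch counts between every pair of aligned sequences."""
    n = len(seqs)
    return [[sum(x != y for x, y in zip(seqs[i], seqs[j])) for j in range(n)]
            for i in range(n)]


def bootstrap_replicate(seqs, rng):
    """Resample alignment columns with replacement, keeping rows intact."""
    length = len(seqs[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in seqs]


def closest_pair(matrix):
    """Indices (i, j), i < j, of the smallest distance; ties go to the first found."""
    n = len(matrix)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: matrix[p[0]][p[1]])


def pair_support(seqs, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(pairwise_differences(bootstrap_replicate(seqs, rng))) == pair
        for _ in range(replicates)
    )
    return hits / replicates


print(pairwise_differences(ALIGNMENT))
print("support for sequences 1 + 3:", pair_support(ALIGNMENT, (0, 2)))
```

In a real analysis each replicate would be passed through the same tree-building method as the original data, and the support for a node is the percentage of replicate trees that contain it.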
Bootstrapping and Condensed Trees
In this example, bear and raccoon form a pair in 50% of the data sets
We can choose to present a tree that condenses branches of less than some threshold bootstrap support - a condensed tree

Consensus Trees
Some tree building methods will produce multiple equally “good” trees
A consensus tree shows the features that are shared by all or some trees.
A strict consensus tree only includes features found in all trees
A majority-rule consensus tree includes features found in at least a set percentage of the trees
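Both kinds of consensus come down to counting how often each grouping (clade) appears across a set of trees. The sketch below represents each tree simply as a set of clades, which is an assumption made for brevity; the three example trees are hypothetical and only reuse animal names from the Newick example.

```python
from collections import Counter


def clade_frequencies(trees):
    """Fraction of trees in which each clade (a frozenset of taxa) appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def consensus_clades(trees, min_fraction=1.0):
    """Clades found in at least `min_fraction` of the trees.

    min_fraction=1.0 gives the strict consensus; a value above 0.5 gives a
    majority-rule consensus whose clades cannot conflict with each other.
    """
    return {clade for clade, freq in clade_frequencies(trees).items()
            if freq >= min_fraction}


def clade(*taxa):
    return frozenset(taxa)


# Three hypothetical trees, each listed by the clades it contains.
trees = [
    {clade("raccoon", "bear"), clade("raccoon", "bear", "weasel")},
    {clade("raccoon", "bear"), clade("weasel", "dog")},
    {clade("bear", "weasel"), clade("raccoon", "bear", "weasel")},
]

print("strict:", consensus_clades(trees))
print("majority rule:", consensus_clades(trees, min_fraction=0.6))
```

The same frequency table also gives bootstrap support when the input trees are bootstrap replicates, and dropping clades below a threshold is what produces a condensed tree.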

Reconciled Trees
Reconciled trees attempt to combine gene trees and species trees, clearly identifying both speciation and duplication events
[Figure: tree showing duplications; species tree]
[Figure: species tree indicating locations of duplication events; tree showing information on speciation, duplication and gene loss]

Analogous Genes
Not all proteins with similar function have a common evolutionary history
Nonhomologous genes can evolve similar functions through convergent evolution
Sequence and structural similarity outside of functional sites is expected to be low - here the catalytic residues and overall structures of chymotrypsin (yellow) and subtilisin (green), which are analogous enzymes

Homoplasy
Sequence similarity not due to homology is homoplasy
Homoplasy can result from convergent evolution, parallel evolution or evolutionary reversal
1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P
     1    2    3
2    3
3    1    4
4    5    2    4
[Tree: leaves drawn in the order 1, 3, 2, 4]

HGT or LGT
Transfer of genes from one species to another, horizontal gene transfer (HGT) or lateral gene transfer (LGT), will confuse phylogenetic analysis - it results in a tangled tree with branches that join
HGT is more common in bacteria and archaea, but is also found in eukaryotes

Xenologous Genes
After transfer, the gene in the donor and recipient species will be very similar - xenologous genes
Phylogenetic analysis of these sequences will indicate that the recipient is more closely related to the donor than it truly is.
Here, 80% sequence identity between a eukaryotic gene and its likely bacterial source
Outgroup consists of members of the same gene superfamily
Within the ingroup, all sequences are bacterial except that of Trichomonas vaginalis, a parasitic protozoan

Orthologous genes
The number of orthologous, homologous and unique genes in the human, chicken and puffer fish genomes - BLAST analysis
Core orthologs = single-copy orthologs in dark blue; genes present in all 3 but duplicated in at least 1 are in lighter blue
Pairwise orthologs = orthologs found in only 2 species
Homologous genes for which orthology/paralogy cannot be determined are in yellow
Unique genes are in gray

Duplication within genes
Duplication within a gene can result in complex proteins with repeated domains
These may be identifiable on a dot-plot
Here BRCA2 is plotted against itself; repeats are visible with window analysis
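A windowed self-comparison dot-plot can be sketched as follows. The toy sequence is hypothetical (it is not BRCA2) and contains one repeated segment, and the `window` and `min_matches` parameters are illustrative choices.

```python
def dot_plot(a, b, window=4, min_matches=4):
    """Return (i, j) positions where the windows starting at a[i] and b[j]
    share at least `min_matches` identical residues."""
    points = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= min_matches:
                points.append((i, j))
    return points


def render(points, n_rows, n_cols):
    """Draw the plot as text: '*' for a matching window, '.' otherwise."""
    grid = [["."] * n_cols for _ in range(n_rows)]
    for i, j in points:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


# Hypothetical protein with the segment MKTLED repeated.
seq = "MKTLEDAAGCMKTLEDWY"
points = dot_plot(seq, seq)
size = len(seq) - 4 + 1
print(render(points, size, size))
```

The main diagonal is the sequence matching itself; the short diagonals parallel to it mark the internal repeat. A larger window with a lower match threshold suppresses chance matches while still showing diverged repeats.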

Synteny
With complete genome sequences, entire genomes can be compared to identify equivalent (syntenic) regions and orthologous genes - except that large-scale rearrangements are common
Genes are lost and duplicated, and inverted or moved between chromosomes
The local genomic environment tends to be similar between orthologs, but the large-scale structures differ

Comparative Genomics
Synteny is inversely correlated with time since the last common ancestor
Of 500 zebrafish genes, 50-80% occur in conserved homology segments: 2 or more genes in the same order as in humans
Approximately 1/2 of the chromosomes retain nearly complete synteny between cats and humans

Orthologs and Paralogs
Orthologs must be distinguished from paralogs for phylogenetic reconstruction and assignment of possible function
Pseudogenes must be distinguished from both

Orthologs, Paralogs and Gene Loss
Gene loss can eliminate orthologs from two species - this is especially difficult with large (similar) gene families
Gene trees ≠ species trees, but multiple genes may
A, B, C, D are species
and are paralogs
Evolutionary history
Incorrect species tree based on gene tree

Clustering of Orthologs and Paralogs
BLAST can be used to identify orthologs and paralogs between 2 genomes
Mask low-complexity and commonly occurring domains
All gene sequences from one genome are then scanned against another, noting the best-scoring BLAST hits (BeTs) - repeat for all possible pairs of genomes
Paralogous genes resulting from a duplication since the divergence between two species will be each other’s BeTs
Orthologs form groups from different genomes with reciprocal BeTs
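The reciprocal-best-hit test between two genomes can be sketched as below. The gene names and scores are hypothetical, and the input is assumed to be a table of (query, subject) scores for each search direction, such as could be collected from BLAST output.

```python
def best_hits(scores):
    """Map each query gene to its best-scoring subject gene.

    `scores` maps (query, subject) -> similarity score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical scores: genome A genes a1, a2 searched against genome B, and back.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90,
          ("a2", "b1"): 180, ("a2", "b2"): 60}
b_vs_a = {("b1", "a1"): 245, ("b1", "a2"): 175,
          ("b2", "a1"): 95, ("b2", "a2"): 55}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2's best hit is b1, but b1's best hit is a1, so only a1 and b1 are paired. Extending the same test to every pair of genomes and linking the reciprocal pairs yields the clusters described on the following slides.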

Clustering of Orthologs and Paralogs
The Clusters of Orthologous Groups (COG) and euKaryotic Orthologous Groups (KOG) databases have been constructed to identify large numbers of orthologs
Here all 3 genes from 3 different genomes are each other’s BeTs in pairwise comparisons between the three genomes
Members of COGs or KOGs are assumed to have related functions
This type of analysis is an alternative to exhaustive phylogenetic trees for large data sets (many species or genes)

Clustering of Orthologs and Paralogs
This method identifies orthologs and paralogs in this case
With a sufficient number of genomes, 2 COGs will form, one associated w/ the part and the other with the part of the tree

Clustering of Orthologs and Paralogs
Gene loss can still be problematic
Comparison of only species A and B would incorrectly group and genes