
Gene Families and Functional Annotation

Once genes have been identified, they need to be functionally annotated

A computational first step is to group genes with other genes - some of which will hopefully have known functions

Once genes are classified, we can begin to examine whether certain genes are missing or overrepresented in the given genome - possibly reflecting the niche of the organism

As with earlier computational analyses, functional annotation based solely on in silico analyses is only a first step

Gene Families and Functional Annotation

Sequence-similarity searches are a first pass in classification

BLAST - Basic Local Alignment Search Tool

BLASTn - nucleotide

BLASTp - protein

BLASTx - translates a nucleotide sequence into all possible reading frames and scans these against a protein database

All give an Expectation (E) value - used to evaluate the significance of the match

In both eukaryotes and prokaryotes, 1/3 to 1/2 of searched genes do not match any known protein = orphan genes
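As a rough illustration of how an E value depends on the score and on the size of the search space, the sketch below uses the standard relation E = m x n x 2^(-S') for a normalized bit score S', a query of length m and a database of total length n. The function name and the example numbers are assumptions made for illustration; real BLAST uses effective lengths that correct for edge effects, so its reported values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), where S' is the bit score. Raw lengths are
    used here; BLAST itself substitutes effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against a 10-million-residue database.
for score in (30, 50, 100):
    print(score, expected_hits(score, 300, 10_000_000))
```

Each additional bit of score halves the E value, and doubling the database size doubles it, which is why the same alignment looks less significant in a larger database.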

Protein Structural Domains

Proteins are made up of combinations of distinct structural units or domains

Genes can be grouped based on the domains they contain

These groupings depend on structural similarity - sequence similarity alone may be insufficient

Gene clustering by sequence similarity

BLAST searches generally return matches from more than one protein from more than one species

This happens if the query protein is part of a gene (protein) family or contains multiple domains found in other proteins

BLAST output can be interpreted as a match to one or more protein domains - searches of closely related species often identify genes/proteins with similar domain structure

Domains shuffle over evolutionary time and are often found in different combinations across more distant comparisons

Domains do tend to follow biologically reasonable patterns - DNA binding domains with other DNA binding domains, transmembrane domains with intra- and extracellular domains


Genes can be classified by domain content

The Enzyme Commission (EC) hierarchical classification of enzymes - each enzyme is assigned a number that reflects sub-classification of function, e.g. ADH is EC 1.1.1.1

Other classification schemes are not as obvious - protein function is often context-specific

Pfam - a database of protein families and domains that gives access to the biochemical properties of predicted proteins


InterPro - classifies individual protein domains


Protein functional prediction ≠ assignment of genes to families

Protein function prediction allows general conclusions about protein function and genome content based on protein domains

Classification of gene families involves distinguishing between paralogs and orthologs

Major Classes of Protein Function

Enzymes

Signal transduction (receptors and kinases)

Nucleic acid binding (transcription factors, nucleic acid enzymes)

Structural (cytoskeletal, extracellular matrix, motor proteins)

Channel (voltage and chemically gated)

Immunoglobulins

Calcium-binding proteins

Transporters

Subclasses vary - as does their representation within each genome

Gene Clusters

Alignment searches (BLAST) identify genes with similar sequence to the query

If searches identify a single gene, or genes with a single function, then functional assignment to the query sequence is simple - but searches often identify a large number of sequences with multiple functions

The most similar sequence is not necessarily the sequence with which the query sequence shares a function

Gene Clusters

One approach is to try to define as large a protein family as possible (including many possible functions)

PSI-BLAST can be used to identify a large set of potential protein family members

A BLAST search is conducted to create an initial protein sequence alignment - which is then used to initiate a fresh search

The process is then iterated until no further matches are identified - this reduces the degree of sequence similarity required for inclusion in the family

A “true” family of genes ought to be bounded by a significance cut-off to limit the proteins included

Gene Clusters

Clusters of orthologous genes, COGs, can be used to classify proteins

COGs are created by identifying the best hit for each gene in complete pairwise comparisons across a set of genomes

Gene Clusters

Analysis of 185,000 proteins from 66 microbial genomes identified 4,873 COGs - covering 75% of all predicted microbial proteins

50% of 110,000 proteins from fly, nematode, human, Arabidopsis, yeasts and a microsporidian form 4,852 COGs

COG0837

Gene Clusters

COGs include both orthologs and paralogs

In (a) HuA and HuA’ are paralogs - distinguishing which retains the ancestral function is not as simple as determining which has the most similar sequence

Gene Clusters

HuA and MmA differ in 5 amino acids, none of which affect function

HuA’ and MmA differ in 4 amino acids, but one of these changes the charge of a critical residue

Clustering based on similarity alone would lead to an erroneous functional classification

Gene Phylogenies

Clustering groups genes by sequence similarity

Phylogenetic analyses ascertain how groups of similar genes are related by descent

In the HuA, MmA example, the 2 A’ genes can result from either one (orthologs) or two (paralogs) duplication events

Paralogs are less likely to share a function

Gene Phylogenies

Often gene function can be inferred from phylogenetic analysis

The first step is aligning the sequences

A gene tree is then constructed using some algorithm

Duplications and gene relatedness are then ascertained

In the example on the left, an ancient duplication splits 2 functional groups; on the right, protein 2 likely has the same function as 5 and 6

Gene Ontology

Molecular function alone may not predict/describe biological function (think crystallins)

The Gene Ontology (GO) annotates and groups genes using a multi-character approach including cell biological and molecular function and/or subcellular localization

The GO project uses a defined vocabulary and a hierarchical structure to classify genes, and includes links indicating the type of evidence for the classification

GO network

In this example, the gene INNER NO OUTER is at the center with the 3 separate classifications radiating out from it

Gene Ontology

The GO vocabulary includes 7000 terms describing molecular function and 5000 describing biological process; some annotations include as many as 12 levels within the hierarchy of terms

This is too deep for efficient computational searches - other, simplified systems are also being developed to allow genes to be screened and classified computationally

Molecular Phylogenetics

• Homology = similarity due to common ancestry

• The Gpdh gene sequences from two different species are homologous sequences

• All comparisons made in molecular evolution (biology) are based on comparing homologous sequences = apples to apples

• Sequences must be aligned to allow comparison = homologous bases lined up in columns

Human   MVHLTP
Baboon  MVHLTP
Cow     MLTP
Sheep   MLTP
Mouse   MVHLTP

The cow and sheep β globin proteins are 2 amino acids shorter than the other sequences, so gaps are added to align the sequences

Human   MVHLTP
Baboon  ......
Cow     .--...
Sheep   .--...
Mouse   ......
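The dot notation used in the aligned β globin display can be produced mechanically: every residue that matches the reference (here the human sequence) is written as a dot, while differences and gaps are shown as they are. A minimal sketch, assuming the sequences are already aligned to equal length; the function name `dot_view` is made up for this example.

```python
def dot_view(aligned: dict, reference: str) -> dict:
    """Rewrite each aligned sequence relative to the reference sequence.

    Positions identical to the reference become '.', anything else
    (substitutions and '-' gap characters) is kept unchanged.
    """
    ref = aligned[reference]
    view = {}
    for name, seq in aligned.items():
        if name == reference:
            view[name] = seq
        else:
            view[name] = "".join("." if a == b else a for a, b in zip(seq, ref))
    return view


aligned = {
    "Human": "MVHLTP",
    "Baboon": "MVHLTP",
    "Cow": "M--LTP",   # gaps already inserted by the alignment step
    "Sheep": "M--LTP",
    "Mouse": "MVHLTP",
}
for name, row in dot_view(aligned, "Human").items():
    print(f"{name:<7}{row}")
```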

Gene Trees

• Accumulation of sequence differences through time is the basis of molecular systematics, which analyses them in order to infer evolutionary relationships

• A gene tree is a diagram of the inferred ancestral history of a group of sequences

• A gene tree is only an estimate of the true pattern of evolutionary relations

• UPGMA and Neighbor joining = simple ways to estimate a gene tree

• Bootstrapping = sampling with replacement, a common technique for assessing the reliability of a node in a gene tree

• Taxon = the source of each sequence

Rooted and Unrooted Trees

Analyses of a set of genes produce an unrooted tree

Trees can be rooted (assigned polarity) by assignment of an outgroup - a sequence that is known to be more distantly related than any within the rest of the analysis (the ingroup)

Tree branch length denotes the amount of change along that branch in some tree building methods

Figure: 3 distinct unrooted trees

Tree Building methods

The 3 primary methods (algorithms) for building gene trees are:

1. Parsimony - a character-based approach that surveys every possible tree topology. The most parsimonious tree is the topology that requires the minimum number of steps (changes) in a data set

Position 1 of this example - tree 1 requires 1 change, tree 2 requires 2 changes and tree 3 requires 2 changes. When the 4 positions are summed, tree 3 is found to be the best (shortest)
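Counting the changes a topology requires at one alignment position can be done with Fitch's small-parsimony procedure, sketched below. This is an illustrative sketch rather than the slide's own worked example: the four-taxon alignment is invented, trees are written as nested pairs of leaf names, and the function names are assumptions. The root placement is arbitrary and does not affect the count.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes one site requires on a binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


def tree_length(tree, alignment):
    """Sum the per-site change counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical data: four taxa, two aligned positions.
alignment = {"a": "AC", "b": "AC", "c": "GC", "d": "GT"}
topologies = {
    "tree1": (("a", "b"), ("c", "d")),
    "tree2": (("a", "c"), ("b", "d")),
    "tree3": (("a", "d"), ("b", "c")),
}
for name, topology in topologies.items():
    print(name, tree_length(topology, alignment))
```

The most parsimonious topology is the one with the smallest total; for more than a handful of taxa, enumerating every topology in this way quickly becomes impractical.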



2. Maximum Likelihood - also a character-based approach, surveys every possible tree topology and assigns each topology a maximum likelihood estimate (score) based on a model of evolution describing the probability of changes (mutation) through time. The ML tree is the one with the highest likelihood, i.e. the one under which the observed data are most probable

This method can be accurate, but is computationally expensive



3. Distance Methods - are not character based; instead they calculate pairwise distances across entire aligned sequences and construct data matrices. Trees are built by grouping pairs with the shortest distances between them. These methods can also incorporate complex evolutionary models

This method is computationally cheap and will always return an answer, but is not always accurate.

The simplest distance method, Unweighted Pair Group Method with Arithmetic Mean (UPGMA), simply counts the number of sequence changes in all pairwise comparisons and repeatedly joins the closest pair

UPGMA Tree Construction

Starting distance matrix:

       Ba    Co    Sh    Mo    Ha    Ch
Hu      2     6     9     8     9    13
Ba            7    10     7    10    13
Co                  3    11    12    16
Sh                       12     9    15
Mo                              7    16
Ha                                   14

Step 1 - join Hu and Ba (distance 2); each branch is 2/2 = 1.0

       Co    Sh    Mo    Ha    Ch
HuBa  6.5   9.5   7.5   9.5    13
Co            3    11    12    16
Sh                 12     9    15
Mo                        7    16
Ha                             14

Step 2 - join Co and Sh (distance 3); each branch is 3/2 = 1.5

       CoSh    Mo    Ha    Ch
HuBa      8   7.5   9.5    13
CoSh         11.5  10.5  15.5
Mo                    7    16
Ha                         14

Step 3 - join Mo and Ha (distance 7); each branch is 7/2 = 3.5

       CoSh  MoHa    Ch
HuBa      8   8.5    13
CoSh           11  15.5
MoHa                 15

Step 4 - join HuBa and CoSh (distance 8); node depth is 8/2 = 4, so the internal branches are 4 - 1.0 = 3.0 (to HuBa) and 4 - 1.5 = 2.5 (to CoSh)

                 MoHa     Ch
((HuBa)(CoSh))   9.75  14.25
MoHa                      15

Step 5 - join ((HuBa)(CoSh)) and MoHa (distance 9.75); node depth is 9.75/2 = 4.875, so the internal branches are 4.875 - 4 = 0.875 and 4.875 - 3.5 = 1.375

                            Ch
((HuBa)(CoSh))(MoHa)    14.625

Step 6 - join Ch (distance 14.625); node depth is 14.625/2 = 7.3125, so the Ch branch is 7.3125 and the remaining internal branch is 7.3125 - 4.875 = 2.4375

Final UPGMA Tree (taxa relabelled a-g, 2.4375 rounded to 2.44):

((((a:1.0,b:1.0):3.0,(c:1.5,d:1.5):2.5):0.875,(e:3.5,f:3.5):1.375):2.44,g:7.3125);
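The worked example can be reproduced with a short clustering routine. This is a sketch under stated assumptions: the function name, the data layout and the `size_weighted` switch are all invented for illustration. One point is worth flagging. The slides update distances by taking the plain mean of the two merged rows. That equals the cluster-size-weighted mean of textbook UPGMA whenever the two merged clusters are the same size, which holds for every step except the last. In the last step a four-taxon cluster and a two-taxon cluster are averaged equally, giving 14.625, whereas size-weighted UPGMA gives 14.5 (the mean of all six distances to Ch).

```python
from itertools import combinations


def cluster(dist, size_weighted=True):
    """Agglomerative clustering; returns [(merged_cluster, node_depth), ...].

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    size_weighted=True  : textbook UPGMA (mean over all leaf pairs).
    size_weighted=False : plain mean of the two merged rows, as on the slides.
    """
    d = dict(dist)
    active = set().union(*d)               # all leaf names
    size = {name: 1 for name in active}
    merges = []
    while len(active) > 1:
        # Pick the closest pair among the clusters that are still active.
        a, b = min(combinations(sorted(active, key=str), 2),
                   key=lambda pair: d[frozenset(pair)])
        new = (a, b)
        merges.append((new, d[frozenset((a, b))] / 2))
        active -= {a, b}
        for c in active:                   # distance from new cluster to the rest
            da, db = d[frozenset((a, c))], d[frozenset((b, c))]
            if size_weighted:
                d[frozenset((new, c))] = (size[a] * da + size[b] * db) / (size[a] + size[b])
            else:
                d[frozenset((new, c))] = (da + db) / 2
        size[new] = size[a] + size[b]
        active.add(new)
    return merges


labels = ["Hu", "Ba", "Co", "Sh", "Mo", "Ha", "Ch"]
upper = [[2, 6, 9, 8, 9, 13], [7, 10, 7, 10, 13], [3, 11, 12, 16],
         [12, 9, 15], [7, 16], [14]]
dist = {frozenset((labels[i], labels[i + 1 + k])): float(v)
        for i, row in enumerate(upper) for k, v in enumerate(row)}

for merged, depth in cluster(dist, size_weighted=False):
    print(depth, merged)
```

With `size_weighted=False` the node depths match the slides (1.0, 1.5, 3.5, 4.0, 4.875, 7.3125); with `size_weighted=True` only the final depth changes, to 7.25.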

Phylogenetic Trees

Phylogenetic trees are representations summarizing a reconstructed evolutionary history

A phylogenetic tree is a diagram that proposes a hypothesis for reconstructed evolutionary relationships among a set of objects (taxa or OTUs)

Phylogenetic trees can represent relationships between species or genes

Phylogenetic Trees

OTUs are connected by a set of lines - branches or edges

External nodes or leaves are existing OTUs or extinct objects that did not give rise to descendants

Internal nodes represent ancestral states hypothesized to have occurred during evolution

Internal nodes can represent speciation or gene duplication events

A gene tree does not necessarily coincide with a species tree

Gene duplications will cause a gene tree to differ from a species tree

[Figure: tree of human, monkey, rat, mouse, sturgeon, chicken, zebrafish, platy, lamprey and hagfish]

Resolution

Trees may be fully or only partially resolved

Every node in a fully resolved tree is bifurcating or dichotomous

Some nodes in unresolved trees are multifurcating or polytomous

[Figure: two example trees of human, monkey, rat, mouse, sturgeon, chicken, zebrafish, platy, lamprey and hagfish]

Rooting

Unrooted trees establish the relationships among taxa, but not the evolutionary pathway

For 4 taxa there are 3 unrooted trees, but 15 rooted trees
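The counts for 4 taxa follow from the standard formulas for fully resolved (bifurcating) trees: (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n taxa, where !! is the double factorial. A small sketch (function names are chosen for this example) shows how quickly the numbers grow:

```python
def double_factorial(k: int) -> int:
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies (n_taxa >= 3)."""
    return double_factorial(2 * n_taxa - 5)


def n_rooted(n_taxa: int) -> int:
    """Number of rooted bifurcating topologies (n_taxa >= 2)."""
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10):
    print(n, n_unrooted(n), n_rooted(n))
```

Each unrooted tree of n taxa has 2n - 3 branches, and a root can be placed on any one of them, which is why 3 unrooted trees of 4 taxa become 3 x 5 = 15 rooted trees.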

[Figure: example unrooted and rooted trees for human, monkey, rat and mouse, with chicken as an additional taxon]

Types of Trees

Cladograms show the genealogy of taxa, but do not include timing or divergence (branch lengths have no meaning)

[Figure: two cladograms of human, monkey, rat and mouse]

Additive trees show the genealogy of taxa, and branch lengths represent divergence between taxa

Comparison of branch lengths gives a meaningful estimate of evolutionary divergence

[Figure: additive tree of human, monkey, rat and mouse]

Ultrametric trees are similar to additive trees, but assume a constant rate of change in the characters used to build the tree - a molecular clock

Comparison of branch lengths gives a meaningful estimate of evolutionary divergence

Ultrametric trees are always rooted

[Figure: ultrametric tree of human, monkey, rat and mouse drawn against a time axis]

Outgroups

The most accurate way to root a tree is to use an “outgroup” - a taxon or group of taxa more distantly related than any member of the “ingroup”

Representing Phylogenies

Phylogenetic relationships can be represented as graphical trees, tables or parenthetical statements (Newick or New Hampshire format)

((raccoon, bear),((sea_lion, seal), ((monkey,cat), weasel)), dog);

((raccoon:0.20, bear:0.07):0.01,((sea_lion:0.12, seal:0.12):0.08,((monkey:1.00,cat:0.47), weasel:0.18)), dog:0.25);
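The parenthetical notation maps directly onto nested groupings: each pair of parentheses is an internal node and each name is a leaf. As a minimal sketch (the nested-tuple representation and the function name are assumptions made for this example, and branch lengths are left out), a tree held as nested tuples can be written out as a Newick string like this:

```python
def to_newick(node) -> str:
    """Serialize a nested-tuple tree; strings are leaves, tuples are internal nodes."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


tree = (
    ("raccoon", "bear"),
    (("sea_lion", "seal"), (("monkey", "cat"), "weasel")),
    "dog",
)
print(to_newick(tree) + ";")   # the trailing semicolon ends a Newick tree
```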

Bootstrapping

Many tree building algorithms will give a single, fully resolved tree from any data set.

Nodes will all be equally represented even if one is supported by many characters and another by very few.

How to quantify support for any given tree? We can’t re-run evolution. We can sample many different genes and we can bootstrap our data.

Bootstrapping is sampling a data set, with replacement, to generate a new data set. We then use this new set in a phylogenetic analysis - and repeat this process hundreds or thousands of times.

We can then present bootstrap scores at each node: the % of bootstrap trees that contained that specific node
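The resampling step can be sketched as follows: alignment columns are drawn with replacement until the replicate has as many columns as the original, and each replicate is then analysed like the original data (here only the pairwise difference counts are computed). The function names are assumptions made for this sketch; the four sequences are the example alignment used in these slides, and the tree-building step is left out.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def pairwise_differences(alignment):
    """Count differing positions (gaps included) for every pair of sequences."""
    names = list(alignment)
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


alignment = {
    "1": "GADDYTTKLP",
    "2": "GVEDYTTK-P",
    "3": "GADDYTTRLP",
    "4": "CVEDYTTR-P",
}
rng = random.Random(0)            # fixed seed so a run can be repeated
print(pairwise_differences(alignment))
for _ in range(3):
    replicate = bootstrap_columns(alignment, rng)
    print(replicate, pairwise_differences(replicate))
```

Because whole columns are resampled, every sequence receives the same positions in each replicate, so the replicate is still a valid alignment.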

Bootstrapping

Original data set:

1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P

Bootstrap data sets (columns sampled with replacement):

1- T K L L T P D A D G
2- T K - - T P E V D G
3- T R L L T P D A D G
4- T R - - T P E V D C

1- G P K D K K T P D P
2- G P K D K K T P E P
3- G P R D R R T P D P
4- C P R D R R T P E P

1- L P Y D A D D P T G
2- - P Y E V D E P T G
3- L P Y D A D D P T G
4- - P Y E V D E P T C

Pairwise distance matrices and the resulting trees:

    1  2  3
2   3
3   1  4
4   5  2  4
[Figure: tree with leaf order 1, 3, 2, 4]

    1  2  3
2   4
3   1  5
4   6  2  5
[Figure: tree with leaf order 1, 3, 2, 4]

    1  2  3
2   4
3   0  4
4   4  1  5
[Figure: tree with leaf order 1, 3, 2, 4]

    1  2  3
2   1
3   3  4
4   5  4  2
[Figure: tree with leaf order 1, 2, 3, 4]

Bootstrapping and Condensed Trees

In this example, bear and raccoon form a pair in 50% of the data sets

We can choose to present a tree that condenses branches of less than some threshold bootstrap support - a condensed tree

Consensus Trees

Some tree building methods will produce multiple equally “good” trees

A consensus tree shows the features that are shared by all or some trees.

A strict consensus tree only includes features found in all trees

A majority-rule consensus tree includes features found in ≥ a set % of trees
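Both kinds of consensus reduce to counting how often each grouping (clade) appears across the input trees. The sketch below assumes each tree has already been reduced to a set of clades, each clade being a frozenset of taxon names; the function names and the four example trees are hypothetical. The same counting gives bootstrap support when the input trees come from bootstrap replicates.

```python
from collections import Counter


def clade_support(trees):
    """Fraction of trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}


def consensus(trees, min_fraction):
    """Clades found in at least `min_fraction` of the trees.

    min_fraction=1.0 gives the strict consensus. Thresholds above 0.5
    guarantee that the retained clades are compatible with one another.
    """
    return {c for c, s in clade_support(trees).items() if s >= min_fraction}


# Hypothetical input: three trees group taxa 1 and 3, one groups 1 and 2.
trees = [
    {frozenset("13"), frozenset("123")},
    {frozenset("13"), frozenset("123")},
    {frozenset("13"), frozenset("123")},
    {frozenset("12"), frozenset("123")},
]
print(consensus(trees, 1.0))    # strict consensus
print(consensus(trees, 0.75))   # majority-rule consensus at 75%
```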

Reconciled Trees

Reconciled trees attempt to combine gene trees and species trees, clearly identifying both speciation and duplication events

[Figure panels: tree showing duplications; species tree; species tree indicating locations of duplication events; tree showing information on speciation, duplication and gene loss]

Analogous Genes

Not all proteins with similar function have a common evolutionary history

Nonhomologous genes can evolve similar functions through convergent evolution

Sequence and structural similarity outside of functional sites is expected to be low

Figure: catalytic residues and overall structure of chymotrypsin (yellow) and subtilisin (green) = analogous enzymes

Homoplasy

Sequence similarity not due to homology is homoplasy

Homoplasy can result from convergent evolution, parallel evolution or evolutionary reversal

1- G A D D Y T T K L P
2- G V E D Y T T K - P
3- G A D D Y T T R L P
4- C V E D Y T T R - P

    1  2  3
2   3
3   1  4
4   5  2  4

[Figure: tree with leaf order 1, 3, 2, 4]

HGT or LGT

Transfer of genes from one species to another, horizontal gene transfer (HGT) or lateral gene transfer (LGT), will confuse phylogenetic analysis - it results in a tangled tree, with branches that join

HGT is more common in bacteria and archaea, but is also found in eukaryotes

Xenologous Genes

After transfer, the gene in the donor and recipient species will be very similar - xenologous genes

Phylogenetic analysis of these sequences will indicate that the recipient is more closely related to the donor than it truly is.

Here, 80% sequence identity between a eukaryotic gene and its likely bacterial source

Outgroup consists of members of the same gene superfamily

Within the ingroup, all sequences are bacterial except Trichomonas vaginalis, a parasitic protozoan

Orthologous genes

The number of orthologous, homologous and unique genes in human, chicken and puffer fish genomes - BLAST analysis

Core orthologs = single-copy orthologs in dark blue; genes present in all 3 but duplicated in at least 1 are in lighter blue

Pairwise orthologs = orthologs found in only 2 species

Homologous genes for which orthology/paralogy cannot be determined in yellow

Unique genes in gray

Duplication within genes

Duplication within a gene can result in complex proteins with repeated domains

These may be identifiable on a dot-plot

Here BRCA2 is plotted against itself; repeats are visible with window analysis
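A windowed dot-plot can be sketched in a few lines: a dot is recorded at (i, j) whenever the window starting at position i of one sequence matches the window starting at position j of the other in at least a chosen number of positions. Plotting a sequence against itself always gives the main diagonal, and internal repeats show up as extra hits off that diagonal. The function name, the window settings and the toy sequence are assumptions for illustration; this is not BRCA2 data.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return (i, j) pairs where windows of the two sequences match well enough."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


toy = "ABCDXXABCD"                    # hypothetical sequence with a repeated "ABCD"
hits = dot_plot(toy, toy, window=4, min_matches=4)
off_diagonal = [(i, j) for i, j in hits if i != j]
print(off_diagonal)                   # positions where the repeat aligns with itself
```

Lowering `min_matches` below `window` tolerates mismatches inside each window, which helps when the repeated domains have diverged.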

Synteny

With complete genome sequences, entire genomes can be compared to identify equivalent (syntenic) regions and orthologous genes - except that large-scale rearrangements are common

Genes are lost and duplicated - and inverted or moved between chromosomes

The local genomic environment tends to be similar between orthologs, but the large-scale structures differ

Comparative Genomics

Synteny is inversely correlated with time since last common ancestor

Of 500 zebrafish genes, 50-80% occur in conserved homology segments (2 or more genes in the same order as in humans)

Approx. 1/2 of the chromosomes retain ~ complete synteny between cats and humans

Orthologs and Paralogs

Orthologs must be distinguished from paralogs for phylogenetic reconstruction and assignment of possible function

Pseudogenes must be distinguished from both

Orthologs, Paralogs and Gene Loss

Gene loss can eliminate orthologs from two species - this is especially difficult with large (similar) gene families

Gene trees species trees, but multiple genes may

A,B,C,D are species

and are paralogs

Evolutionary history

Incorrect species tree based on

gene tree

Clustering of Orthologs and Paralogs

BLAST can be used to identify orthologs and paralogs between 2 genomes

Mask low complexity and commonly occurring domains

All gene sequences from one genome are then scanned against another, noting best-scoring BLAST hits (BeTs) - repeat for all possible pairs of genomes

Paralogous genes resulting from a duplication since the divergence between two species will be each other's BeTs

Orthologs form groups from different genomes with reciprocal BeTs
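The reciprocal best-hit idea can be sketched as follows for a single pair of genomes. The input is assumed to be two tables of BLAST scores, one for each search direction, keyed by (query, subject); the gene names, scores and function names are hypothetical. A full COG-style analysis would repeat this over every pair of genomes and then merge the reciprocal pairs into clusters.

```python
def best_hits(scores):
    """Map each query gene to its best-scoring subject gene (its BeT)."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Gene pairs that are each other's best hit in both search directions."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: a2 hits b1 best, but b1's best hit is a1, so only
# (a1, b1) is reciprocal.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 120, ("b2", "a1"): 150, ("b2", "a2"): 90}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```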

Clusters of Orthologous Groups (COG) and euKaryotic Orthologous Groups (KOG) databases have been constructed to identify large numbers of orthologs

Here all 3 genes from 3 different genomes are each other's BeT in pairwise comparisons between the three genomes

Members of COGs or KOGs are assumed to have related functions

This type of analysis is an alternative to exhaustive phylogenetic trees for large data sets (many species or genes)


This method identifies orthologs and paralogs in this case

With sufficient # of genomes - 2 COGS will form, one associated w/ the part and the other with the part of the tree


Gene loss can still be problematic

Comparison of only species A and B would incorrectly group and genes

