
Computational Prediction of Orthologs

Melvin Zhang

School of Computing, National University of Singapore

May 4, 2011

A gene is a unit of heredity in a living organism

One gene may encode multiple proteins

Two genes are homologous if they descended from a common ancestral gene [1]

In practice, homology is determined using sequence alignment.

Figure: A sequence alignment of two proteins

Have you seen phrases like “high homology”, “significant homology”, or “35% homology”? Homology is a yes-or-no property; such percentages describe sequence identity or similarity.

[1] With respect to a specific speciation event

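Orthologs are due to speciation, paralogs are due to duplication

Figure: Genomes G and H and their most recent common ancestor (MRCA), with a speciation event and a duplication event relating genes g, h, and h′. Labeled relations: main orthologs, orthologs, paralogs.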

Orthologs maintain their function

Annotate genes with unknown functions.

Infer protein-protein interactions.

Orthologs are not one-to-one due to lineage-specific gene duplications

Main orthologs are orthologs that have retained their ancestral position. [2]

Figure: Genomes G and H and their most recent common ancestor (MRCA), with a speciation event and a duplication event relating genes g, h, and h′. Labeled relations: main orthologs, orthologs, paralogs.

[2] Burgetz et al., Evolutionary Bioinformatics 2006

Problem of identifying main orthologs

Input: Positions and sequences of genes in two genomes, G and H

Output: For each gene in their common ancestor, find its direct descendant in G and H

Complications

- gene duplication

- gene loss

- horizontal gene transfer

- gene fusion, fission

Three main approaches for finding orthologs

Graph-based, tree-based, rearrangement-based

Bidirectional Best Hit and variants

Most popular approach. High level of functional relatedness. [a]

Reciprocal smallest distance: use evolutionary distance estimates instead of BLAST scores

OMA stable pairs: introduce a tolerance interval and stable matching

[a] Altenhoff et al., PLoS CB 2009
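The basic bidirectional best hit rule can be sketched in a few lines. This is a minimal illustration, not the procedure of any specific tool: the gene names and scores are made up, the `scores` dictionary stands in for whatever similarity measure is used (for example BLAST scores), and ties are broken by whichever pair is seen first.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (gene in G, gene in H) -> similarity


def best_hits(scores: Scores, from_g: bool) -> Dict[str, str]:
    """Map each gene to its highest-scoring partner in the other genome."""
    best: Dict[str, Tuple[str, float]] = {}
    for (g, h), s in scores.items():
        key, partner = (g, h) if from_g else (h, g)
        if key not in best or s > best[key][1]:
            best[key] = (partner, s)
    return {key: partner for key, (partner, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> List[Tuple[str, str]]:
    """Keep (g, h) only if h is g's best hit and g is h's best hit."""
    g_to_h = best_hits(scores, from_g=True)
    h_to_g = best_hits(scores, from_g=False)
    return sorted((g, h) for g, h in g_to_h.items() if h_to_g.get(h) == g)


# Hypothetical scores between genes of G (g1..g3) and genes of H (h1, h2).
scores = {
    ("g1", "h1"): 90, ("g1", "h2"): 40,
    ("g2", "h1"): 70, ("g2", "h2"): 65,
    ("g3", "h2"): 80,
}
bbh_pairs = bidirectional_best_hits(scores)
print(bbh_pairs)
```

Here g2 prefers h1, but h1 prefers g1, so g2 is left unpaired. This is the kind of case the variants listed above try to handle more carefully.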

EnsemblCompara GeneTrees [3]

Figure: Species tree for 4 species on top gene tree for gene A

Based on reconciliation of gene trees with the species tree.

1. Partition genes into families and construct gene trees

2. Reconcile each gene tree with the species tree

[3] Vilella et al., Genome Res 2009
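To make the tree-based idea concrete, here is a small sketch that labels the internal nodes of a gene tree and reads orthologs off the speciation nodes. It uses the simple species-overlap heuristic (a node whose two subtrees share a species is called a duplication), which is a stand-in chosen for brevity and not the EnsemblCompara reconciliation procedure. The nested-tuple tree encoding and the gene names are assumptions for illustration.

```python
from itertools import product

# Assumed encoding: a leaf is (species, gene); an internal node is a pair of subtrees.


def is_leaf(node):
    return isinstance(node[0], str)


def leaves(node):
    """All (species, gene) leaves below a node."""
    if is_leaf(node):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def ortholog_pairs(node, pairs=None):
    """Collect gene pairs whose last common ancestor is a speciation node."""
    if pairs is None:
        pairs = set()
    if is_leaf(node):
        return pairs
    left, right = leaves(node[0]), leaves(node[1])
    left_species = {sp for sp, _ in left}
    right_species = {sp for sp, _ in right}
    if not (left_species & right_species):
        # No shared species: treat the node as a speciation event.
        for (_, a), (_, b) in product(left, right):
            pairs.add(tuple(sorted((a, b))))
    # Shared species: treat as a duplication; pairs across it are paralogs.
    ortholog_pairs(node[0], pairs)
    ortholog_pairs(node[1], pairs)
    return pairs


# Gene g in genome G; genes h and h' in genome H arose by duplication.
gene_tree = (("G", "g"), (("H", "h"), ("H", "h'")))
orthologs = ortholog_pairs(gene_tree)
print(sorted(orthologs))
```

On this toy tree both h and h′ come out as orthologs of g, while h and h′ are paralogs of each other. The tree alone does not say which of the two is the main ortholog.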

MSOAR2 [4]

Figure: Rearrangement scenario between human and mouse

1. Partition genes into families and assign a unique symbol

2. Reconstruct the most parsimonious rearrangement (inversion, translocation, fusion, fission, duplication)

3. Extract the corresponding orthologs

[4] Fu et al., JCB 2007

Can conserved gene neighborhood improve ortholog predictions?

Human-mouse synteny blocks

Conserved synteny blocks between the human and mouse genomes, generated by the Cinteny web server [5]

[5] Sinha and Meller, BMC Bioinformatics 2007

Local synteny criteria [6]

Figure: Local synteny: more than one unique match within +/- 3 genes. Homology defined as BLASTP E-value < 1e-5

94% of sampled inter-species pairs are identified as orthologs by both Inparanoid (based on BBH) and the local synteny criteria.

[6] Jin Jun et al., BMC Genomics 2009

Local synteny score (LC)

Figure: Gene g in genome G and gene h in genome H

The local synteny score of g and h is 4 since there are 4 edges in the maximum matching.
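A maximum-matching score of this kind can be sketched as follows. The slide does not spell out how the bipartite graph is built, so this sketch assumes it: each side is the window of up to 3 genes upstream and downstream of g (or h), including g and h themselves, and an edge joins two genes when a user-supplied `homologous` predicate says so. The gene names and families are hypothetical.

```python
def local_synteny_score(G, H, i, j, homologous, window=3):
    """Size of a maximum matching between the neighborhoods of G[i] and H[j].

    G and H are gene lists in chromosomal order. The window size and the
    inclusion of G[i] and H[j] themselves are assumptions of this sketch.
    """
    left = G[max(0, i - window): i + window + 1]
    right = H[max(0, j - window): j + window + 1]
    match_of = {}  # index in `right` -> index in `left`

    def augment(u, seen):
        # Try to match left[u], re-routing earlier matches if needed.
        for v, gene in enumerate(right):
            if v not in seen and homologous(left[u], gene):
                seen.add(v)
                if v not in match_of or augment(match_of[v], seen):
                    match_of[v] = u
                    return True
        return False

    return sum(augment(u, set()) for u in range(len(left)))


# Hypothetical gene families; two genes are homologous if in the same family.
family = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C",
          "g": "T", "h": "T", "d1": "D", "e1": "E", "x1": "X"}
G = ["a1", "b1", "c1", "g", "d1", "e1"]
H = ["b2", "a2", "h", "c2", "x1"]
score = local_synteny_score(G, H, G.index("g"), H.index("h"),
                            lambda x, y: family[x] == family[y])
print(score)
```

The matching is found by repeated augmenting-path search, which is enough for windows this small. Because a matching uses each gene at most once, a neighbor with several homologs on the other side still contributes at most one edge.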

Smith-Waterman alignment score (SW)
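For reference, a bare-bones Smith-Waterman local alignment score looks like this. The scoring scheme is an assumption made for readability (fixed match and mismatch scores, linear gap penalty); protein comparisons normally use a substitution matrix and affine gap penalties, and the slides do not say which parameters were used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses a simple match/mismatch score and a linear gap penalty
    (assumed parameters, chosen only for illustration).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Only the shared stretch "ACG" contributes in the example, which is the point of a local alignment: unrelated flanking sequence does not drag the score down.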

BBH-LS: bidirectional best hits based on a linear combination of SW and LC

sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)

Here f is the weight given to local synteny (LC) relative to sequence similarity (SW).
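The effect of the weight f can be seen on a toy case. The sketch below assumes that SW and LC are first rescaled to [0, 1] by dividing by their maxima, since the raw scores live on very different scales; the slides do not state how (or whether) the scores are normalized, and the numbers are hypothetical.

```python
def combined_similarity(sw, lc, f):
    """sim(g, h) = (1 - f) * SW(g, h) + f * LC(g, h) on rescaled scores.

    sw and lc are dicts keyed by (g, h). Dividing by the maximum is an
    assumption of this sketch, not something taken from the slides.
    """
    max_sw = max(sw.values())
    max_lc = max(lc.values()) or 1  # avoid dividing by zero
    return {pair: (1 - f) * sw[pair] / max_sw + f * lc.get(pair, 0) / max_lc
            for pair in sw}


def best_partner(sim, g):
    """Highest-scoring partner of g under the given similarity."""
    candidates = {h: s for (gg, h), s in sim.items() if gg == g}
    return max(candidates, key=candidates.get)


# Hypothetical gene g with two candidates: h1 has the better alignment
# score, h2 has the better conserved neighborhood.
sw = {("g", "h1"): 500, ("g", "h2"): 300}
lc = {("g", "h1"): 1, ("g", "h2"): 5}

for f in (0.0, 0.5):
    print(f, best_partner(combined_similarity(sw, lc, f), "g"))
```

With f = 0 the choice is driven by sequence alone and h1 wins; with f = 0.5 the conserved neighborhood tips the choice to h2. The bidirectional best hit step is then applied to sim instead of the raw alignment score.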

Human-Mouse-Rat dataset

Input: Human, mouse, and rat genes downloaded from Ensembl.

Benchmark: No “golden” benchmark for true orthology. Assume that orthologs are assigned the same gene symbol.

Tuning the BBH-LS method

sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)

Figure: Performance of BBH-LS for different ratios of spatial similarity to sequence similarity on the human-mouse dataset.

Results for various methods on Human-Mouse

Figure: TP: same gene symbols, FP: different gene symbols

BBH-LS finds more true positives and fewer false positives than MSOAR2.

Results for various methods on Human-Rat

Figure: TP: same gene symbols, FP: different gene symbols

Results for various methods on Mouse-Rat

Figure: TP: same gene symbols, FP: different gene symbols
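The TP/FP counting used in these comparisons is straightforward to reproduce. The sketch below assumes predictions are given as pairs of gene identifiers with a symbol lookup per species, and that symbols are compared case-insensitively (human symbols are usually upper case, mouse and rat symbols are not). The identifiers and the predicted pairs are invented for illustration.

```python
def count_tp_fp(pairs, symbol_a, symbol_b):
    """Count predicted pairs with the same gene symbol (TP) or different symbols (FP).

    Case-insensitive comparison is an assumption of this sketch.
    """
    tp = sum(1 for a, b in pairs
             if symbol_a[a].upper() == symbol_b[b].upper())
    return tp, len(pairs) - tp


# Hypothetical identifiers and predictions.
human_symbol = {"hs1": "RASGRF1", "hs2": "CTSH", "hs3": "MSH3"}
mouse_symbol = {"mm1": "Rasgrf1", "mm2": "Ctsh", "mm3": "Ckmt2"}
predicted = [("hs1", "mm1"), ("hs2", "mm2"), ("hs3", "mm3")]

tp, fp = count_tp_fp(predicted, human_symbol, mouse_symbol)
print(tp, fp)
```

Since gene symbols are only a proxy for true orthology, these counts are best read as a relative comparison between methods, not as absolute accuracy.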

How local synteny helps

Figure: Gene neighborhoods on human chr 15, human chr 5, mouse chr 9, and mouse chr 13 involving CTSH, MSH3, CKMT2, RASGRF1, RASGRF2, and ANKRD34C. Edge labels: sw = 5265, ls = 1; sw = 2003, ls = 5; sw = 2466, ls = 5.

Bold edges are the pairing from BBH-LS, thin edges are the pairing from BBH.

BBH paired RASGRF2 (human) to RASGRF1 (mouse) due to high SW; BBH-LS corrected this using LC.

Summary: Identifying main orthologs

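Figure: Genomes G and H and their most recent common ancestor (MRCA), with a speciation event and a duplication event relating genes g, h, and h′. Labeled relations: main orthologs, orthologs, paralogs.

For each gene in their common ancestor, find its direct descendant in G and H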

Summary: Three approaches

Graph-based, tree-based, rearrangement-based

BBH-LS: bidirectional best hits based on a linear combination of SW and LC
