bits - comparative genomics: gene family analysis

Upload: bits

Post on 10-May-2015

DESCRIPTION

This is the second presentation of the BITS training on 'Comparative genomics'. It reviews the different methods of investigating sequence homology at the gene family level. Thanks to Klaas Vandepoele of the PSB department.

TRANSCRIPT

Page 1: BITS - Comparative genomics: gene family analysis

Comparative genomics in eukaryotes

Gene family analysis

Klaas Vandepoele, PhD

Professor, Ghent University

Comparative & Integrative Genomics

VIB – Ghent University, Belgium

http://www.bits.vib.be

Page 2: BITS - Comparative genomics: gene family analysis

Workflow

Page 3: BITS - Comparative genomics: gene family analysis

Applications of clustering the proteome(s)

Gene families form the basis for the evolutionary (or phylogenetic) analysis of:

• Detection of orthologs and paralogs
• Gene duplication, family expansions, pseudogene formation and gene loss
• Species taxonomies
• Horizontal Gene Transfer (HGT)
• Evolution of gene structure
  • Introns
  • Protein domain organisation & (re)arrangements
• Base composition and codon usage

Page 4: BITS - Comparative genomics: gene family analysis

I. Structural annotation: genome-wide versus family-wise

Rationale for family-wise annotation: since every gene has different (sequence) characteristics and different genes evolve at different rates, using these characteristics to determine homologous gene models will improve the overall structural annotation quality.

Properties:

• Slow & nearly-manual procedure
• High-quality gene models revealing novel biological findings

Page 5: BITS - Comparative genomics: gene family analysis

Workflow family-wise annotation procedure

[Figure: workflow diagram. Recoverable box labels: collecting experimental representatives (EST/cDNA, BLAST); MSA of experimental representatives; HMMbuild; family HMM profile; species X proteome; HMMsearch; putative homologs; protein motifs; ab initio gene prediction; correction of gene model; classification using phylogenetic trees; detailed characterization.]

HMMER: http://hmmer.janelia.org/

Page 6: BITS - Comparative genomics: gene family analysis

Experimental representatives

[Figure panels: InterProScan; ClustalW + JalView; PFAM HMM logo]

Page 7: BITS - Comparative genomics: gene family analysis

BLAST / HMMsearch

1. Use multiple sequence alignment to create HMM profile

2. Use HMM profile to search for similar proteins
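
The two steps above correspond to the `hmmbuild` and `hmmsearch` programs of HMMER. The following is a minimal sketch, assuming HMMER 3 is installed and on the PATH. The helper names, the file names passed in, and the E-value cutoff are hypothetical and not taken from the slides.

```python
import subprocess


def hmmer_commands(msa, hmm, proteome, tblout):
    """Build the two command lines (all file names are caller-supplied placeholders)."""
    build = ["hmmbuild", hmm, msa]  # step 1: alignment -> HMM profile
    search = ["hmmsearch", "--tblout", tblout, hmm, proteome]  # step 2: profile vs proteome
    return build, search


def parse_tblout(text, max_evalue=1e-5):
    """Return (target, E-value, bit score) for hits in hmmsearch --tblout output.

    The cutoff is an illustrative assumption, not a recommended value.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Columns 5 and 6 of the table hold the full-sequence E-value and score.
        target, evalue, score = fields[0], float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return hits


def run_search(msa, hmm, proteome, tblout):
    """Run both steps and return the putative homologs that pass the cutoff."""
    for command in hmmer_commands(msa, hmm, proteome, tblout):
        subprocess.run(command, check=True)
    with open(tblout) as handle:
        return parse_tblout(handle.read())
```

Keeping the parser separate from the subprocess calls makes it possible to inspect an existing `--tblout` file without rerunning the search.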

Page 8: BITS - Comparative genomics: gene family analysis

Representatives + putative homologs

Multiple sequence alignments assist in the detection and correction of errors in the structural annotation (missed exon)

The suffix finalcds indicates the corrected gene model, as opposed to the original gene model generated by ab initio gene prediction

BioEdit Sequence Editor

Page 9: BITS - Comparative genomics: gene family analysis

Representatives + putative homologs

Multiple sequence alignments assist in the detection of errors in the structural annotation (false first exon)

The suffix finalcds indicates the corrected gene model, as opposed to the original gene model generated by ab initio gene prediction
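
One way an alignment can point at a missed exon is a long gap in a predicted protein where the experimental representatives all have residues. The sketch below illustrates that idea only. It is not the procedure used for the slides, and the function name, thresholds and toy sequences are hypothetical.

```python
def long_gap_runs(alignment, target, min_len=10, min_support=0.8):
    """Return half-open (start, end) column ranges where `target` is gapped
    while at least `min_support` of the other sequences have a residue."""
    others = [seq for name, seq in alignment.items() if name != target]
    target_seq = alignment[target]
    runs, start = [], None
    for col, residue in enumerate(target_seq):
        support = sum(seq[col] != "-" for seq in others) / len(others)
        flagged = residue == "-" and support >= min_support
        if flagged and start is None:
            start = col  # a candidate run begins here
        elif not flagged and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    if start is not None and len(target_seq) - start >= min_len:
        runs.append((start, len(target_seq)))
    return runs


# Toy alignment (made-up sequences): the predicted model lacks an internal stretch.
toy = {
    "rep1": "MKTAYIAKQRQISFVK",
    "rep2": "MKTAYLAKQRQLSFVK",
    "model": "MKT----------FVK",
}
candidate_regions = long_gap_runs(toy, "model", min_len=5)
```

A flagged region is only a hint: it can also reflect a genuine deletion, so the genomic sequence still has to be inspected before the gene model is corrected.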

Page 10: BITS - Comparative genomics: gene family analysis

Examples of family-specific protein motifs

• B-type cyclins have HxKF signature
• Cyclin destruction boxes (B1-type cyclin R-[AV]LGDIGN)

Page 11: BITS - Comparative genomics: gene family analysis

Examples of family-specific protein motifs

• D-type cyclins contain LxCxE Rb-binding motif
• Low conservation of phylogenetic signal at primary sequence level
• General rules are rarely general: exceptions (i.e. missing protein motifs) are frequent and might indicate functional divergence

[Figure panels: Arabidopsis; Rice]
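
Short motifs such as HxKF and LxCxE, where x stands for any residue, can be screened for with regular expressions. This is a minimal sketch: the function name and the test sequence are made up, and the destruction-box pattern is left out because its exact notation is not recoverable from the slide text.

```python
import re

# "x" in the motif notation means any amino acid, i.e. "." in a regular expression.
MOTIFS = {
    "HxKF": re.compile(r"H.KF"),    # B-type cyclin signature
    "LxCxE": re.compile(r"L.C.E"),  # Rb-binding motif of D-type cyclins
}


def find_motifs(protein):
    """Return the 0-based start positions of each motif in a protein sequence."""
    return {
        name: [match.start() for match in pattern.finditer(protein)]
        for name, pattern in MOTIFS.items()
    }


# Made-up sequence containing one copy of each motif.
hits = find_motifs("MSTHAKFGGLYCDEPP")
```

An empty list for a family member is the interesting case here: as noted above, a missing motif may indicate functional divergence, or simply an error in the gene model.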

Page 12: BITS - Comparative genomics: gene family analysis

Classification using phylogenetic tree construction

D-type cyclins are G1-specific

A- and B-type cyclins are mitotic cyclins

H-type cyclins regulate activity of CDK-activating kinases

• The complexity of the cyclin gene family appears to be higher in plants than in mammals
• Whether there is functional redundancy within A- and B-type cyclins or different regulation (and expression) of some cyclin subclasses remains to be analyzed

Page 13: BITS - Comparative genomics: gene family analysis

Unraveling functional divergence using large-scale expression compendia

[Figure: expression data; axis labels: Plant tissues, Genes]

Page 14: BITS - Comparative genomics: gene family analysis

Unraveling functional divergence using large-scale expression compendia

[Figure: expression data from Genevestigator for A-type, B-type and D-type cyclins; axis labels: Plant tissues, Genes]
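
A simple way to quantify divergence between family members is to correlate their expression profiles across tissues. The sketch below uses the Pearson correlation. The gene names and expression values are invented for illustration and are not taken from Genevestigator.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression values across four tissues.
profiles = {
    "cycA_1": [1.0, 2.0, 3.0, 4.0],
    "cycA_2": [2.0, 4.0, 6.0, 8.0],  # same pattern as cycA_1
    "cycA_3": [4.0, 3.0, 2.0, 1.0],  # opposite pattern
}
similar = pearson(profiles["cycA_1"], profiles["cycA_2"])
diverged = pearson(profiles["cycA_1"], profiles["cycA_3"])
```

Paralogs with a correlation near 1 are candidates for redundancy, while low or negative values suggest that their regulation has diverged.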

Page 15: BITS - Comparative genomics: gene family analysis

II. Orthology & paralogy

A major goal of sequence analysis is evolutionary reconstruction. It is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications.

Orthologs are homologous genes that evolved through speciation (i.e. evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two species being compared).

Paralogs are homologous genes that evolved through duplication within the same (perhaps ancestral) genome.

These definitions were first introduced by Fitch (1970).

Page 16: BITS - Comparative genomics: gene family analysis

Orthology & paralogy inference

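[Figure: gene phylogenies compared with the organism phylogeny (species tree) for species A, B and C, with genes a1, a2, b1, b2, c1 and c2. Panels a) and b) mark gene duplication and speciation events and label the resulting outparalogs and inparalogs.]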

Page 17: BITS - Comparative genomics: gene family analysis

In- and outparalogy

Sonnhammer & Koonin: Orthology, paralogy and proposed classification for paralog subtypes

Page 18: BITS - Comparative genomics: gene family analysis

Tree reconciliation

The automatic detection of speciation and duplication events using a species tree and gene family tree
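
A minimal sketch of reconciliation by last-common-ancestor (LCA) mapping: each gene tree node is mapped to the smallest species tree clade that contains all of its species, and it is called a duplication when it maps to the same clade as one of its children, otherwise a speciation. The nested-tuple tree representation, the function names and the rule that the first letter of a gene name gives its species are assumptions for illustration. Real tools also account for gene loss and tree uncertainty.

```python
def clades(tree):
    """All clades of a nested-tuple species tree, each as a frozenset of species."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    found, members = set(), frozenset()
    for child in tree:
        child_clades = clades(child)
        found |= child_clades
        members |= max(child_clades, key=len)  # the child's own (largest) clade
    found.add(members)
    return found


def reconcile(gene_tree, species_clades, species_of, events):
    """Map a gene tree node onto the species tree.

    Appends (sorted gene names, event type) to `events` for every internal node.
    """
    if isinstance(gene_tree, str):
        return frozenset([species_of(gene_tree)]), [gene_tree]
    child_maps, genes = [], []
    for child in gene_tree:
        mapped, child_genes = reconcile(child, species_clades, species_of, events)
        child_maps.append(mapped)
        genes.extend(child_genes)
    covered = frozenset().union(*child_maps)
    # LCA mapping: the smallest species tree clade containing all child species.
    mapped = min((c for c in species_clades if covered <= c), key=len)
    event = "duplication" if mapped in child_maps else "speciation"
    events.append((sorted(genes), event))
    return mapped, genes


# Species A, B, C and a gene family duplicated before the three species diverged.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
events = []
reconcile(gene_tree, clades(species_tree), lambda gene: gene[0].upper(), events)
```

In this example the root is labeled a duplication and the four remaining internal nodes speciations, so a1, b1 and c1 are mutual orthologs, while a1 and a2 are paralogs.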

Page 19: BITS - Comparative genomics: gene family analysis

III. Types of proteome analysis

Page 20: BITS - Comparative genomics: gene family analysis

The evolution of multi-domain proteins

Page 21: BITS - Comparative genomics: gene family analysis

Interpreting the output of an all-against-all similarity search

Metrics for sequence similarity:

• E-value, bit score or percent identity
• Alignment coverage
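
These metrics can be read directly from BLAST tabular output. The sketch below assumes the default 12-column layout of `-outfmt 6` and a separately supplied table of query lengths, which is needed to compute coverage. The function name and both cutoffs are illustrative assumptions.

```python
def filter_hits(blast_tab, query_lengths, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits that pass an E-value cutoff and a query-coverage cutoff.

    Expects tab-separated lines with the default -outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, qstart, qend = float(fields[2]), int(fields[6]), int(fields[7])
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Fraction of the query protein covered by the aligned region.
        coverage = (qend - qstart + 1) / query_lengths[query]
        if evalue <= max_evalue and coverage >= min_coverage:
            kept.append((query, subject, pident, round(coverage, 2), evalue, bitscore))
    return kept
```

Filtering on coverage as well as E-value helps to avoid linking proteins that share only a single domain, which matters for multi-domain proteins.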

Page 22: BITS - Comparative genomics: gene family analysis

Clustering of similar sequences

Proteins = vertices ~ nodes

Sequence similarity relationship = edges
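
With proteins as nodes and similarity hits as edges, the simplest clustering is single linkage, where each connected component of the graph becomes one family. The following is a minimal union-find sketch of that idea. The function name and protein identifiers are hypothetical, and this is not a substitute for the more advanced methods discussed later.

```python
def cluster(edges):
    """Single-linkage clustering: connected components of the similarity graph."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(members) for members in families.values()), key=lambda m: m[0])


families = cluster([("p1", "p2"), ("p2", "p3"), ("p4", "p5")])
```

Proteins without any hit never appear in `edges` and are therefore not reported; they would form singleton families. Single linkage also merges families that are bridged by one spurious hit, which is a common reason to prefer graph-based methods that weigh edges.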

Page 23: BITS - Comparative genomics: gene family analysis

Clustering of similar sequences

Page 24: BITS - Comparative genomics: gene family analysis

Advanced methods for protein (orthology) clustering

Sequence similarity-based:

• COG (RBH) [Tatusov 1997]
• InParanoid [Remm et al., 2001]
• Tribe-MCL [Van Dongen 2000]
• OrthoMCL [Li et al., 2003]

Phylogenetic tree-based:

• PhylomeDB [Huerta-Cepas et al., 2007]
• Ensembl Compara [Vilella et al., 2008]
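
The reciprocal best hit (RBH) criterion underlying several similarity-based methods can be sketched in a few lines: two proteins from different species are paired when each is the other's top-scoring hit. The input format (a dictionary of bit scores per query and subject pair) and the function names are assumptions for illustration; none of the listed tools is reimplemented here.

```python
def best_hits(scores):
    """scores maps (query, subject) to a bit score; return the best subject per query."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores between proteins of species A and species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 175, ("b2", "a1"): 150}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a3 finds b1 as its best hit, but b1 prefers a1, so a3 is left unpaired. This is how RBH misses inparalogs, the case that methods such as InParanoid address.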

Page 25: BITS - Comparative genomics: gene family analysis

Overview methodologies

[Figure from Gabaldon, 2008. Labels: BBH, COG, InParanoid, reconciliation, species overlap]

Page 26: BITS - Comparative genomics: gene family analysis

IV. Resources

Page 27: BITS - Comparative genomics: gene family analysis

Resources (bis)

• Ensembl (Vertebrates)
• Ensembl Genomes (Metazoa, Protists, Fungi, Plants & Bacteria)
• OrthoMCLDB 5 (150 genomes)
• YGOB (>15 Fungi)

Page 28: BITS - Comparative genomics: gene family analysis

Hands-on

Goal: identify and characterize gene family members encoding talin 2 (TLN2)

1. Select query gene

2. Retrieve homologs/orthologs

3. Create multiple sequence alignment

4. Identify conserved positions

5. Create phylogenetic tree and identify ortho/paralogous genes
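
Step 4 above can be sketched as a column scan over the alignment produced in step 3. The function name, the conservation threshold and the toy alignment are hypothetical; the sequences are made up and are not talin sequences.

```python
from collections import Counter


def conserved_columns(aligned, min_fraction=1.0):
    """Return 0-based columns where the most common residue (not a gap)
    is present in at least `min_fraction` of the aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    conserved = []
    for col in range(length):
        residue, count = Counter(seq[col] for seq in aligned).most_common(1)[0]
        if residue != "-" and count / len(aligned) >= min_fraction:
            conserved.append(col)
    return conserved


# Toy alignment of three made-up sequences.
toy_alignment = ["MKV-LA", "MRVQLA", "MKVQLG"]
strict = conserved_columns(toy_alignment)
relaxed = conserved_columns(toy_alignment, min_fraction=0.6)
```

With the default threshold only fully identical columns are reported. Lowering `min_fraction` tolerates occasional substitutions, which is useful once more distant homologs are included.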
