Global Classification of (Plant) Proteins across Multiple Species

Kerr Wall, Jim Leebens-Mack, Naomi Altman, Victor Albert, Dawn Field, Hong Ma, Claude dePamphilis

Global Classification of Proteins

• The protein classification problem

• A method for global classification

• “Bootstrap” support for global classification

• Structure within clusters

• Structure between clusters

• Results from complete proteome classification: Arabidopsis, Oryza and Populus

The protein classification problem

• Genomic sequence can be translated into protein sequence but …

• The function of most proteins is unknown.

• Protein classification is used to:

– infer protein folding structure

– infer protein function

– infer evolutionary relationships

Similarity of Protein Sequence

FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP

• Each row represents a different protein.

• Each letter represents an amino acid.

• Each “-” represents a gap: a position that is empty in this sequence but occupied in at least one other protein in the set.

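• For closely related proteins, the distance between two proteins can be taken as the number of mismatches in the alignment.

• For distantly related proteins, each pair of sequences is given a score, often a measure of how likely a random sequence is to match as well by chance (e.g. the BLAST E-value, the expected number of chance matches that score at least as well).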
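The mismatch count for aligned sequences can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the talk. The function name `mismatch_distance` is hypothetical, and the sketch assumes that a gap opposite a residue counts as a mismatch and that two gaps in the same column count as a match.

```python
# Rows of the alignment shown on the slide (all have the same aligned length).
ALIGNMENT = [
    "FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---",
    "FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---",
    "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP",
    "FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---",
    "FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ",
    "FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP",
]


def mismatch_distance(a, b):
    """Count aligned columns in which the two rows differ.

    Assumption for this sketch: gap vs. residue is a mismatch,
    gap vs. gap is a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(1 for x, y in zip(a, b) if x != y)


# Pairwise distance matrix for the six rows.
distances = [[mismatch_distance(a, b) for b in ALIGNMENT] for a in ALIGNMENT]
for row in distances:
    print(row)
```

A count like this is only meaningful when the alignment itself is trustworthy, which is why score-based measures such as the E-value are used for distant relatives.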

Inferring Evolutionary Relationships

Main methods:

• Statistical phylogeny based on sequence alignment and evolutionary models

– requires a high degree of sequence similarity

– good alignments use slow algorithms and often a lot of manual intervention

• Manual curation

– requires a large amount of manual intervention

– can incorporate sequence, folding structure and function

These methods work well for hundreds of genes.

Global Classification of Proteins

Very high throughput:

Arabidopsis 26,207

Rice 57,915

Poplar 45,555

Total 129,677

Our goal: The joint classification of all known plant proteins using a “scaffold” derived from the 3 completely sequenced species

A method for global classification

• Clustering based on a similarity (or distance) matrix is commonly used.

• Our similarity matrix is 129,677 x 129,677, so we need:

– a quick method for computing distance (BLAST E-values are often used; we use -log(E-value) as the similarity measure)

– a quick method for clustering (sparse matrix computations are often used)
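As a rough illustration of how BLAST hits can become a sparse similarity matrix, the sketch below stores only pairs that pass an E-value cutoff, in a dictionary of dictionaries. Several details are assumptions made for the example and are not stated in the slides: the base-10 logarithm, the cap `MAX_SIM` for E-values reported as 0, the rule of keeping the larger score when the two directions of a hit disagree, and the names `similarity` and `build_sparse_similarity`.

```python
import math

E_CUTOFF = 1e-5   # illustrative cutoff; hits with a larger E-value are dropped
MAX_SIM = 200.0   # assumed cap, needed because BLAST can report an E-value of 0


def similarity(evalue):
    """Convert an E-value to a similarity score, -log10(E), capped at MAX_SIM."""
    if evalue <= 0.0:
        return MAX_SIM
    return min(MAX_SIM, -math.log10(evalue))


def build_sparse_similarity(hits):
    """Build a symmetric sparse similarity matrix from (query, subject, evalue) hits.

    Only pairs passing the cutoff are stored, so all other entries are implicitly 0.
    """
    sim = {}
    for query, subject, evalue in hits:
        if evalue > E_CUTOFF:
            continue
        score = similarity(evalue)
        # Assumption: symmetrize by keeping the larger of the two directional scores.
        for a, b in ((query, subject), (subject, query)):
            row = sim.setdefault(a, {})
            row[b] = max(row.get(b, 0.0), score)
    return sim


# Toy hits (made-up identifiers and E-values, for illustration only).
hits = [("A", "B", 1e-50), ("B", "A", 1e-40), ("A", "C", 0.01), ("C", "C", 0.0)]
sparse = build_sparse_similarity(hits)
print(sparse)
```

Storing only the retained hits is what keeps a 129,677 x 129,677 problem tractable: most protein pairs have no significant hit and never occupy memory.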

TribeMCL Clustering Algorithm

Predicted protein sequences from the fully sequenced genomes of Arabidopsis thaliana Columbia (26,207) and Oryza sativa japonica (57,915) were downloaded from TIGR. Populus trichocarpa (45,555) was downloaded from JGI.

All sequences were compared against each other using BLASTp 2.4 with an E-value cutoff of 1 x 10^-5.

The TribeMCL package was used to predict putative protein families at low, medium and high stringencies (I = 1.2, 3 and 5).

The results are stored at http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi

TribeMCL Method

Enright, Van Dongen and Ouzounis (2002)

• Similarity is measured by

-log10(BLAST E-value)

• Clustering is done by MCL Method

Suppose S is the similarity matrix.

1. Normalize the rows of S to sum to 1.

2. Raise each entry to the power r > 1 (r is the “stringency”) and renormalize, giving S(r).

3. Take a “Markov step”: replace S(r) by S(r)’S(r).

4. Iterate to convergence.

MCL Algorithm

van Dongen, 2000

It is very fast because low similarities are truncated to zero and sparse matrix methods can then be used.
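The iteration can be sketched in plain Python on a tiny dense matrix. This is a minimal illustration, not the TribeMCL implementation. It assumes that the “Markov step” is the usual MCL expansion (multiplying the row-stochastic matrix by itself), that entries below a small `prune` threshold are truncated to zero, and that each node is assigned to the column holding most of its row's mass at convergence. All function names and default parameter values are hypothetical.

```python
def normalize_rows(m):
    """Scale each row to sum to 1 (rows that sum to 0 are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def inflate(m, r):
    """Raise every entry to the power r and renormalize the rows."""
    return normalize_rows([[x ** r for x in row] for row in m])


def expand(m):
    """Markov step, assumed here to be the matrix product of m with itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(sim, r=2.0, prune=1e-6, max_iter=100, tol=1e-9):
    """Iterate inflation and expansion until the matrix stops changing."""
    m = normalize_rows(sim)
    for _ in range(max_iter):
        nxt = expand(inflate(m, r))
        # Truncate small entries to zero (a real implementation would
        # use sparse storage here), then renormalize.
        nxt = normalize_rows([[x if x >= prune else 0.0 for x in row]
                              for row in nxt])
        delta = max(abs(a - b) for ra, rb in zip(nxt, m) for a, b in zip(ra, rb))
        m = nxt
        if delta < tol:
            break
    return m


def clusters(m):
    """Group nodes by the column that holds the largest share of their row."""
    groups = {}
    for i, row in enumerate(m):
        attractor = max(range(len(row)), key=row.__getitem__)
        groups.setdefault(attractor, []).append(i)
    return sorted(groups.values())


# Toy similarity matrix: two tight groups of three joined by one weak link.
toy = [
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 1, 0, 0],
    [0, 0, 1, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
]
print(clusters(mcl(toy, r=2.0)))
```

On this toy input the weak link between the two groups is squeezed out after a few iterations, which is the behaviour the heuristic description of MCL relies on.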

A Heuristic for MCL

We take a random walk on the graph described by the similarity matrix

BUT

After each step we weaken the links between distant nodes and strengthen the links between nearby nodes

Graphic from van Dongen, 2000

Cluster pattern at convergence as a function of r

[Figure: similarity matrix (similarity levels 16, 40 and 60) with the cluster patterns at convergence for r = 2.0, 2.6, 2.8 and 2.9]

Small groups break apart first.

The pattern is quite robust to changes in the similarity of the green region.

[Figure: similarity matrix (similarity levels 16, 40, 50 and 60) with the cluster patterns at convergence for r = 2.0, 2.8 and 3.1]

At r = 3.6 all units separate.

The additional similarity indicated by pink has a profound effect.

[Figure: similarity matrix (similarity levels 30, 40 and 60) with the cluster patterns at convergence for r = 2.0, 2.7 and 2.8]

More strongly connecting the “background” disrupts the pattern until r = 2.7, after which we quickly cycle through the pattern (r = 2.9 turns the center group into singletons and r = 3.0 turns everything into singletons).

[Figure: similarity matrix (similarity levels 16, 30 and 60) with the cluster patterns at convergence for r = 2.0 and 2.3]

Weakening the within-cluster similarity accelerates the breakdown into singletons.

[Figure: similarity matrix (similarity levels 25, 30 and 60) with the cluster pattern at convergence for r = 2.0]

Strengthening the “background” while weakening the within-cluster similarity makes it difficult to pick out the clusters.

Some Summary Statistics for the Clusters

Protein Set                    Number of Proteins    Number of Clusters at r=3    Percent of Singletons
Arabidopsis                    26,207                11,467 (44%)                 69%
Arabidopsis + Rice             84,122                28,175 (33%)                 68%
Arabidopsis + Rice + Poplar    129,677               35,873 (28%)                 67%

% Singletons

Cluster    ATH    Rice    Poplar
ATH        30%    -       -
+Rice      17%    25%     -
+Poplar    12%    24%     15%
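Summary figures of this kind can be computed directly from a cluster assignment. The sketch below is illustrative only. The function name `cluster_summary` and the toy assignment are made up, and it reports singletons both as a share of clusters and as a share of proteins, since the two tables appear to use different denominators.

```python
from collections import Counter


def cluster_summary(assignment):
    """Summarize a clustering given as a dict mapping protein id -> cluster id."""
    sizes = Counter(assignment.values())          # cluster id -> number of members
    n_proteins = len(assignment)
    n_clusters = len(sizes)
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "proteins": n_proteins,
        "clusters": n_clusters,
        # Number of clusters as a percentage of the number of proteins.
        "clusters_pct_of_proteins": 100.0 * n_clusters / n_proteins,
        # Singletons as a percentage of all clusters.
        "singleton_pct_of_clusters": 100.0 * n_singletons / n_clusters,
        # Singletons as a percentage of all proteins.
        "singleton_pct_of_proteins": 100.0 * n_singletons / n_proteins,
    }


# Toy assignment: one cluster of two proteins and two singletons.
toy_assignment = {"p1": 0, "p2": 0, "p3": 1, "p4": 2}
summary = cluster_summary(toy_assignment)
print(summary)
```

For the Arabidopsis-only row, 69% of 11,467 clusters is roughly 7,900 singletons, or about 30% of the 26,207 proteins, which matches the ATH entry in the second table.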

Comparing Tribes to Phylogenetic Trees from Sequence Alignment

Tribes for large gene families show some, but not complete, correspondence to inferred phylogenetic relationships. Tribes containing MADS genes, formed at low, medium and high stringencies, are mapped onto a recently published Arabidopsis MADS gene phylogeny (Martinez-Castilla & Alvarez-Buylla 2003).

Comparisons with curated gene families

• Added tribe information to TAIR’s gene families

– www.floralgenome.org/cgi-bin/tair/tair.cgi

– E.g. Cytochrome P450

“Bootstrap” Support for Clusters

To determine the stability of the clusters, we need some type of perturbation of the system. We use the “0.632 jackknife” instead of the bootstrap (as we want a set of unique proteins).

We clustered 100 samples, each a random selection of 63.2% of the proteins.

For each tribe, we count 1 each time all of its genes that were selected for the sample are clustered together.
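The resampling scheme can be sketched as follows. This is a hedged illustration rather than the code used for the talk. `cluster_fn` stands in for whatever clustering procedure is applied to each subsample, the function name `jackknife_support` and the fixed seed are hypothetical, and a tribe is counted as supported here only if at least one of its members was drawn and all drawn members receive the same cluster label.

```python
import random


def jackknife_support(proteins, tribes, cluster_fn,
                      n_samples=100, fraction=0.632, seed=0):
    """Count, for each tribe, in how many subsamples its drawn members co-cluster.

    proteins   : list of protein ids
    tribes     : dict mapping tribe name -> list of member protein ids
    cluster_fn : placeholder for the clustering step; takes a list of protein
                 ids and returns a dict mapping protein id -> cluster label
    """
    rng = random.Random(seed)
    k = round(fraction * len(proteins))
    support = {name: 0 for name in tribes}
    for _ in range(n_samples):
        sample = rng.sample(proteins, k)   # drawn without replacement: unique proteins
        labels = cluster_fn(sample)
        for name, members in tribes.items():
            drawn = [p for p in members if p in labels]
            # Supported if every drawn member lands in one and the same cluster.
            if drawn and len({labels[p] for p in drawn}) == 1:
                support[name] += 1
    return support


# Toy example: the stand-in clusterer groups proteins by their first letter.
toy_proteins = ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"]
toy_tribes = {"A": toy_proteins[:5], "B": toy_proteins[5:]}
toy_support = jackknife_support(
    toy_proteins, toy_tribes, lambda sample: {p: p[0] for p in sample}
)
print(toy_support)
```

Dividing each count by the number of samples gives a support value between 0 and 1 that can be read much like bootstrap support on a tree.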

From Tribes to Phylogenetics

• Within each tribe of 3 or more proteins we can do hierarchical clustering using the similarity matrix (Harlow, Gogarten, Ragan, 2004), or form a careful alignment and build a phylogenetic tree.

• We can also form SuperTribes, by clustering the tribes. Because we still have a large set of objects to cluster, we continue to use MCL.

• Within a SuperTribe, we can do hierarchical clustering.

• The SuperTribe for the MADS family shown earlier includes all the MADS sequences

Single Linkage TribeMCL

• Define the distance between tribes as the smallest pairwise E-value.

• Use TribeMCL on the resulting similarity matrix.

• Use hierarchical clustering within supertribes.
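The single-linkage distance between two tribes can be sketched like this. The lookup table `evalues`, the function name `tribe_distance` and the choice to return `None` when no pair of members has a recorded hit are all assumptions made for the example.

```python
def tribe_distance(tribe_a, tribe_b, evalues):
    """Single-linkage distance: the smallest E-value over all cross-tribe pairs.

    evalues maps (protein, protein) pairs to E-values; a pair may be stored
    in either order. Returns None if no cross-tribe pair has a recorded hit.
    """
    best = None
    for p in tribe_a:
        for q in tribe_b:
            e = evalues.get((p, q), evalues.get((q, p)))
            if e is not None and (best is None or e < best):
                best = e
    return best


# Toy data (made-up identifiers and E-values).
toy_evalues = {("a1", "b1"): 1e-8, ("b2", "a2"): 1e-20, ("a1", "c1"): 1e-6}
print(tribe_distance(["a1", "a2"], ["b1", "b2"], toy_evalues))
print(tribe_distance(["b1", "b2"], ["c1"], toy_evalues))
```

The resulting tribe-by-tribe values can then be converted to similarities and clustered again, in the same way as the protein-level matrix.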

[Diagram: Single Linkage TribeMCL, followed by hierarchical clustering or phylogenetic trees]

Floral Genome Project and Plant Protein Classification

Use of the Global Classification

• Project goal is to understand the evolution of flowers.

• Data has been collected to various degrees of intensity on 15 non-model species across the phylogeny of flowering plants and merged with data from other projects.

• PlantTribes will be used to assist in placing these proteins into families to infer evolutionary relationships.

And many thanks to:

• Kerr Wall – FGP Bioinformatics (PSU)

• Claude dePamphilis – FGP PI (PSU)

• Jim Leebens-Mack – FGP Project Director (PSU)

• Hong Ma – FGP co-PI (PSU)

• Victor Albert – collaborator (U. Oslo)

• Dawn Field – collaborator (Oxford U.)

And FGP collaborators at PSU, UFL and Cornell.

And especially

NSF – Plant Genome Research Program
