Global Classification of (Plant) Proteins across Multiple Species
![Page 1: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/1.jpg)
Global Classification of (Plant) Proteins across Multiple Species
Kerr Wall, Jim Leebens-Mack, Naomi Altman, Victor Albert, Dawn Field, Hong Ma, Claude dePamphilis
![Page 2: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/2.jpg)
Global Classification of Proteins
• The protein classification problem
• A method for global classification
• “Bootstrap” support for global classification
• Structure within clusters
• Structure between clusters
• Results from complete proteome classification: Arabidopsis, Oryza and Populus
![Page 3: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/3.jpg)
The protein classification problem
• Genomic sequence can be translated into protein sequence but …
• The function of most proteins is unknown.
• Protein classification is used to:
– infer protein folding structure
– infer protein function
– infer evolutionary relationships **
![Page 4: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/4.jpg)
Similarity of Protein Sequence
FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---
FFHPLDCGPTLQMGYPSDSLTAEAAASVAGPS--C--S---
FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP
FFHPIECEPTLQMGYQQDQIT-VAAA--AGPSMTMN-S---
FFQHIECEPTLHIGYQPDQIT-VAA---AGPS--MN-NYMQ
FFHPLECEPTLQIGYQHDQIT-IAA---PGPS--VS-NYMP
• Each row represents a different protein.
• Each letter represents an amino acid.
• Each “–” represents a gap: a position that is empty in this sequence but occupied in a different protein in this set.
• In closely related proteins, the distance between two proteins is the number of mismatched positions in the alignment.
• In distantly related proteins, the sequences are given a score, often based on how likely a random sequence is to match as well (e.g. the BLAST E-value, the expected number of chance matches that score at least as well).
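The mismatch count for closely related proteins can be sketched in a few lines. This is only an illustration: the function name `mismatch_distance` is hypothetical, and treating a gap opposite a residue as a mismatch is an assumption, since the slide does not say how gaps are scored. The two sequences are the first and third rows of the alignment on this slide.

```python
def mismatch_distance(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where two equal-length sequences differ.

    Assumption: a gap ('-') opposite a residue counts as a mismatch,
    and a gap opposite a gap counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Rows 1 and 3 of the alignment shown on the slide.
row1 = "FFHPLECEPTLQMGFHSDQIS-VAA---AGPS--VNNN---"
row3 = "FFHPLECEPTLQIGYQPDPIT-VAA---AGPS--VN-NYMP"
distance = mismatch_distance(row1, row3)
print(distance)
```

A count like this is only meaningful once the sequences are aligned, which is why it is restricted to closely related proteins.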
![Page 5: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/5.jpg)
Inferring Evolutionary Relationships
Main methods:
statistical phylogeny based on sequence alignment and evolutionary models
- requires a high degree of sequence similarity
- good alignments use slow algorithms and often lots of manual intervention
manual curation
- requires a large amount of manual intervention
- can incorporate sequence, folding structure and function
These methods are good for hundreds of genes.
![Page 6: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/6.jpg)
Global Classification of Proteins
Very high throughput:
Arabidopsis 26,207
Rice 57,915
Poplar 45,555
Total 129,677
Our goal: The joint classification of all known plant proteins using a “scaffold” derived from the 3 completely sequenced species
![Page 7: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/7.jpg)
A method for global classification
• Clustering based on a similarity (or distance) matrix is commonly used.
• Our similarity matrix is 129,677 × 129,677, so we need:
– A quick method for computing distance (BLAST E-values are often used; we use -log(E-value) as the similarity measure)
– A quick method for clustering (sparse matrix computations are often used)
![Page 8: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/8.jpg)
TribeMCL Clustering Algorithm
Predicted protein sequences from the fully sequenced genomes of Arabidopsis thaliana Columbia (26,207) and Oryza sativa japonica (57,915) were downloaded from TIGR. Populus trichocarpa (45,555) was downloaded from JGI.
All sequences were compared against each other using BLASTp 2.4 with an E-value cutoff of 1 × 10^-5.
The TribeMCL package was used to predict putative protein families at low, medium, and high stringencies (I = 1.2, 3, 5).
The results are stored at http://www.floralgenome.org/cgi-bin/tribedb/tribe.cgi
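As a rough sketch of how all-against-all BLAST hits could be turned into a sparse similarity table, the snippet below applies the 1 × 10^-5 cutoff and the -log10(E-value) transform. Several things here are assumptions rather than the authors' pipeline: the hits are taken to be in the 12-column tabular BLAST format (E-value in column 11), E-values reported as 0 are floored at a hypothetical 1e-200, self-hits are skipped, and the two directions of a pair are merged by keeping the larger score. The function name `similarities_from_blast` is illustrative.

```python
import math


def similarities_from_blast(lines, cutoff=1e-5, floor=1e-200):
    """Build {(protein_a, protein_b): -log10(E)} from tabular BLAST hits.

    Only pairs that pass the cutoff are stored, so the table stays sparse.
    """
    sim = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > cutoff:
            continue  # assumption: ignore self-hits and weak hits
        score = -math.log10(max(evalue, floor))  # floor avoids log10(0)
        key = tuple(sorted((query, subject)))    # treat the pair as unordered
        sim[key] = max(score, sim.get(key, 0.0))
    return sim


# Hypothetical hits, for illustration only.
example_hits = [
    "A\tB\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t380",
    "B\tA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-40\t350",
    "A\tC\t30.0\t80\t50\t2\t1\t80\t5\t84\t1e-3\t40",
    "C\tD\t99.0\t500\t5\t0\t1\t500\t1\t500\t0.0\t900",
]
sims = similarities_from_blast(example_hits)
print(sims)
```

Storing only the pairs that pass the cutoff is what keeps a 129,677 × 129,677 problem tractable: most protein pairs have no significant hit and never appear in the table.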
![Page 9: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/9.jpg)
![Page 10: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/10.jpg)
![Page 11: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/11.jpg)
![Page 12: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/12.jpg)
TribeMCL Method
Enright, Van Dongen and Ouzounis (2002)
• Similarity is measured by -log10(BLAST E-value)
• Clustering is done by the MCL method
![Page 13: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/13.jpg)
MCL Algorithm
van Dongen, 2000
Suppose S is the similarity matrix.
1. Normalize the rows of S to sum to 1.
2. Raise each entry to the power r > 1 (r is the “stringency”) and renormalize to get S(r).
3. Take a “Markov step”: replace S with S(r)’S(r).
4. Iterate to convergence.
It is very fast because low similarities are truncated to zero and sparse matrix methods can then be used.
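The four steps can be sketched with dense NumPy arrays. This is a minimal illustration under stated assumptions, not the MCL or TribeMCL implementation: the “Markov step” is taken to be the ordinary matrix product of the row-normalized matrix with itself, no truncation of small entries or sparse storage is used, every row is assumed to have at least one positive entry, and clusters are read off by grouping nodes whose rows end up concentrated on the same columns. The function name `mcl` and the toy matrix are hypothetical.

```python
import numpy as np


def mcl(similarity, r=2.0, max_iter=200, tol=1e-9):
    """Dense sketch of Markov clustering on a symmetric similarity matrix."""
    m = np.asarray(similarity, dtype=float)
    m = m / m.sum(axis=1, keepdims=True)           # step 1: rows sum to 1
    for _ in range(max_iter):
        inflated = m ** r                          # step 2: entrywise power r
        inflated /= inflated.sum(axis=1, keepdims=True)
        new = inflated @ inflated                  # step 3: Markov step
        converged = np.abs(new - m).max() < tol    # step 4: stop when stable
        m = new
        if converged:
            break
    # Nodes whose rows put their mass on the same columns share a cluster.
    groups = {}
    for node, row in enumerate(m):
        attractors = tuple(np.flatnonzero(row > 1e-6))
        groups.setdefault(attractors, []).append(node)
    return sorted(groups.values())


# Toy example: two tight groups of three joined by one weak link (2-3).
S = np.array([
    [10, 10, 10,  0,  0,  0],
    [10, 10, 10,  0,  0,  0],
    [10, 10, 10,  1,  0,  0],
    [ 0,  0,  1, 10, 10, 10],
    [ 0,  0,  0, 10, 10, 10],
    [ 0,  0,  0, 10, 10, 10],
])
clusters = mcl(S, r=2.0)
print(clusters)
```

In this toy case the weak 2-3 link is squeezed out by the entrywise power, so the two groups separate. Larger values of `r` penalize weak links more strongly and tend to give smaller clusters.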
![Page 14: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/14.jpg)
A Heuristic for MCL
We take a random walk on the graph described by the similarity matrix
BUT
After each step we weaken the links between distant nodes and strengthen the links between nearby nodes
Graphic from van Dongen, 2000
![Page 15: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/15.jpg)
Cluster pattern at Convergence as a function of r
[Figure: similarity matrix (labelled values 16, 40, 60) and cluster patterns at convergence for r = 2.0, 2.6, 2.8 and 2.9]
Small groups break apart first.
The pattern is quite robust to changes in the similarity of the green region.
![Page 16: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/16.jpg)
Cluster pattern at Convergence as a function of r
[Figure: similarity matrix (labelled values 16, 40, 50, 60) and cluster patterns at convergence for r = 2.0, 2.6, 2.8 and 3.1]
At r=3.6 all units separate.
The additional similarity indicated by pink has a profound effect.
![Page 17: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/17.jpg)
Cluster pattern at Convergence as a function of r
[Figure: similarity matrix (labelled values 30, 40, 60) and cluster patterns at convergence for r = 2.0, 2.6, 2.7 and 2.8]
More strongly connecting the “background” disrupts the pattern until r=2.7, after which we quickly cycle through the pattern (2.9 turns the center group into singletons and 3.0 turns everything into singletons).
![Page 18: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/18.jpg)
Cluster pattern at Convergence as a function of r
[Figure: similarity matrix (labelled values 16, 30, 60) and cluster patterns at convergence for r = 2.0, 2.1 and 2.3]
Weakening the within-cluster similarity accelerates the breakdown into singletons.
![Page 19: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/19.jpg)
Cluster pattern at Convergence as a function of r
[Figure: similarity matrix (labelled values 25, 30, 60) and cluster patterns at convergence for r = 2.0 and 2.3]
Strengthening the “background” while weakening the within-cluster similarity makes it difficult to pick out the clusters.
![Page 20: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/20.jpg)
Some Summary Statistics for the Clusters
| Protein Set | Number of Proteins | Number of Clusters at r=3 | Percent of Singletons |
| --- | --- | --- | --- |
| Arabidopsis | 26,207 | 11,467 (44%) | 69% |
| Arabidopsis + Rice | 84,122 | 28,175 (33%) | 68% |
| Arabidopsis + Rice + Poplar | 129,677 | 35,873 (28%) | 67% |
![Page 21: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/21.jpg)
![Page 22: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/22.jpg)
%Singletons

| Cluster | ATH | Rice | Poplar |
| --- | --- | --- | --- |
| ATH | 30% | - | - |
| +Rice | 17% | 25% | - |
| +Poplar | 12% | 24% | 15% |
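The two singleton tables use different denominators, and a quick arithmetic check shows how they might fit together. The reading used here is an inference, not something the slides state: “Percent of Singletons” in the summary table is taken as the share of clusters that are singletons, and “%Singletons” in the per-species table as the share of each species' proteins that sit in singleton clusters. Under that assumption, both tables should imply roughly the same number of singleton proteins for the full three-species set.

```python
# Protein counts per species, from the slides.
proteins = {"Arabidopsis": 26207, "Rice": 57915, "Poplar": 45555}

# Summary table, Arabidopsis + Rice + Poplar row.
clusters_all = 35873
singleton_share_of_clusters = 0.67

# Per-species table, "+Poplar" row.
singleton_share_of_proteins = {"Arabidopsis": 0.12, "Rice": 0.24, "Poplar": 0.15}

# Singleton count implied by each table (assumed reading, see above).
from_clusters = clusters_all * singleton_share_of_clusters
from_proteins = sum(proteins[s] * singleton_share_of_proteins[s] for s in proteins)

relative_gap = abs(from_clusters - from_proteins) / from_clusters
print(round(from_clusters), round(from_proteins), round(relative_gap, 4))
```

Both estimates come out near 24,000, within the rounding of the published percentages, which is consistent with the assumed reading but does not prove it.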
![Page 23: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/23.jpg)
![Page 24: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/24.jpg)
![Page 25: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/25.jpg)
Comparing Tribes to Phylogenetic Trees from Sequence Alignment
Tribes for large gene families show some, but not complete, correspondence to inferred phylogenetic relationships. Tribes with MADS genes formed at low, medium and high stringencies are mapped onto a recently published Arabidopsis MADS gene phylogeny (Martinez-Castilla & Alvarez-Buylla 2003).
![Page 26: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/26.jpg)
Comparisons with curated gene families
• Added tribe information to TAIR’s gene families
– www.floralgenome.org/cgi-bin/tair/tair.cgi
– E.g. Cytochrome P450
![Page 27: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/27.jpg)
![Page 28: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/28.jpg)
“Bootstrap” Support for Clusters
To determine the stability of the clusters, we need some type of perturbation of the system. We use the “0.632 jackknife” instead of the bootstrap (as we want a set of unique proteins).
We clustered 100 samples, each a random selection of 63.2% of the proteins.
For each tribe, we count “1” each time all of the genes in the tribe that were selected for the sample are clustered together.
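The resampling scheme can be sketched as follows. This is an illustration under assumptions, not the authors' code: the clustering step is a stand-in (connected components over a fixed edge list) rather than TribeMCL, a tribe with zero or one sampled member is counted as supported, and the names `connected_components` and `subsample_support` are hypothetical.

```python
import random


def connected_components(nodes, edges):
    """Stand-in clustering: connected components among the given nodes."""
    nodes = set(nodes)
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            n = stack.pop()
            if n not in members:
                members.add(n)
                stack.extend(neighbours[n] - members)
        seen |= members
        clusters.append(frozenset(members))
    return clusters


def subsample_support(items, cluster_fn, n_samples=100, fraction=0.632, seed=0):
    """Fraction of subsamples in which each full-data tribe stays together."""
    rng = random.Random(seed)
    items = list(items)
    tribes = cluster_fn(items)                 # tribes from the full data set
    hits = dict.fromkeys(tribes, 0)
    size = round(fraction * len(items))        # 63.2% sample, no replacement
    for _ in range(n_samples):
        sample = rng.sample(items, size)
        label = {}
        for idx, cluster in enumerate(cluster_fn(sample)):
            for member in cluster:
                label[member] = idx
        for tribe in tribes:
            # Cluster labels of the tribe members that made it into the sample.
            labels = {label[g] for g in tribe if g in label}
            if len(labels) <= 1:               # all sampled members together
                hits[tribe] += 1
    return {tribe: hits[tribe] / n_samples for tribe in tribes}


# Toy data: a chain a-b-c (held together only through b) and a pair d-e.
EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
support = subsample_support("abcde", lambda s: connected_components(s, EDGES))
print(support)
```

In the toy data the chain loses support whenever `b` is left out of a sample, while the directly linked pair never does, which is the kind of instability the support value is meant to expose.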
![Page 29: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/29.jpg)
![Page 30: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/30.jpg)
![Page 31: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/31.jpg)
From Tribes to Phylogenetics
• Within each tribe of 3 or more proteins we can do hierarchical clustering using the similarity matrix (Harlow, Gogarten, Ragan, 2004), or form a careful alignment and build a phylogenetic tree.
• We can also form SuperTribes, by clustering the tribes. Because we still have a large set of objects to cluster, we continue to use MCL.
• Within a SuperTribe, we can do hierarchical clustering.
• The SuperTribe for the MADS family shown earlier includes all the MADS sequences
![Page 32: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/32.jpg)
Single Linkage TribeMCL
• Define the distance between tribes as the smallest pairwise E-value.
• Use TribeMCL on the resulting similarity matrix.
• Use hierarchical clustering within supertribes.
[Figure labels: Single Linkage TribeMCL; hierarchical clustering or phylogenetic trees]
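A sketch of the single-linkage step: for every pair of tribes, keep the smallest E-value seen between any of their member proteins, then convert it to a -log10 similarity that a second round of clustering could use. The data layout (a dict of protein-pair E-values and a protein-to-tribe map), the floor of 1e-200 for E-values of 0, and the function name `tribe_similarity` are all assumptions made for illustration.

```python
import math


def tribe_similarity(evalues, tribe_of, floor=1e-200):
    """Single-linkage similarity between tribes from protein-level E-values.

    evalues:  {(protein_a, protein_b): E-value}
    tribe_of: {protein: tribe id}
    """
    best = {}
    for (p, q), e in evalues.items():
        a, b = tribe_of[p], tribe_of[q]
        if a == b:
            continue                      # hits within a tribe are ignored
        key = (min(a, b), max(a, b))      # unordered pair of tribes
        best[key] = min(e, best.get(key, math.inf))
    # Smallest E-value between two tribes becomes their similarity.
    return {key: -math.log10(max(e, floor)) for key, e in best.items()}


# Hypothetical proteins and tribes, for illustration only.
tribe_of = {"p1": "T1", "p2": "T1", "q1": "T2", "r1": "T3"}
evalues = {
    ("p1", "q1"): 1e-10,
    ("p2", "q1"): 1e-30,   # strongest T1-T2 link, so it defines the pair
    ("p1", "p2"): 1e-50,   # within T1, ignored
    ("q1", "r1"): 1e-6,
}
super_sim = tribe_similarity(evalues, tribe_of)
print(super_sim)
```

Single linkage lets one strong hit join two tribes, so it is a permissive way to build supertribes. The slides leave the finer structure to hierarchical clustering or phylogenetic trees within each supertribe.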
![Page 33: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/33.jpg)
Floral Genome Project and Plant Protein Classification
![Page 34: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/34.jpg)
Use of the Global Classification
• Project goal is to understand the evolution of flowers.
• Data has been collected to various degrees of intensity on 15 non-model species across the phylogeny of flowering plants and merged with data from other projects.
• PlantTribes will be used to assist in placing these proteins into families to infer evolutionary relationships.
![Page 35: Global Classification of (Plant) Proteins across Multiple Species](https://reader033.vdocument.in/reader033/viewer/2022051623/5681591d550346895dc6448c/html5/thumbnails/35.jpg)
And many thanks to:
• Kerr Wall – FGP Bioinformatics (PSU)
• Claude dePamphilis – FGP PI (PSU)
• Jim Leebens-Mack – FGP Project Director (PSU)
• Hong Ma – FGP co-PI (PSU)
• Victor Albert – collaborator (U. Oslo)
• Dawn Field – collaborator (Oxford U.)
And FGP collaborators at PSU, UFL and Cornell.
And especially
NSF – Plant Genome Research Program