testing statistical significance scores of sequence comparison methods with structure similarity
DESCRIPTION
Testing statistical significance scores of sequence comparison methods with structure similarity. Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27. Introduction. Sequence comparison: Important for finding similar proteins (homologs) for a protein with unknown function - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/1.jpg)
Testing statistical significance scores of sequence comparison methods with structure similarity
Tim HulsenNCMLS PhD Two-Day Conference
2006-04-27
![Page 2: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/2.jpg)
Introduction
• Sequence comparison: Important for finding similar proteins
(homologs) for a protein with unknown function
• Algorithms: BLAST, FASTA, Smith-Waterman
• Statistical scores: E-value (standard), Z-value
![Page 3: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/3.jpg)
E-value or Z-value?
• Smith-Waterman sequence comparison with Z-value statistics:100 randomized shuffles to test significance of SW score
O. MFTGQEYHSV
shuffle
1. GQHMSVFTEY2. YMSHQFTVGEetc.
# seqs
SW score
rnd ori: 5*SD Z = 5
![Page 4: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/4.jpg)
E-value or Z-value?
• Z-value calculation takes much time (2x100 randomizations)
• Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value
• BUT Advantage of Z-value has never been proven by experimental results
![Page 5: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/5.jpg)
How to compare?
• Structural comparison is better than sequence comparison
• ASTRAL SCOP: Structural Classification Of Proteins
• e.g. a.2.1.3, c.1.2.4; same number ~ same structure
• Use structural classification as benchmark for sequence comparison methods
![Page 6: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/6.jpg)
ASTRAL SCOP statisticsmax. % identity members families avg. fam. size max. fam. size families =1 families >1
10% 3631 2250 1.614 25 1655 595
20% 3968 2297 1.727 29 1605 692
25% 4357 2313 1.884 32 1530 783
30% 4821 2320 2.078 39 1435 885
35% 5301 2322 2.283 46 1333 989
40% 5674 2322 2.444 47 1269 1053
50% 6442 2324 2.772 50 1178 1146
70% 7551 2325 3.248 127 1087 1238
90% 8759 2326 3.766 405 1023 1303
95% 9498 2326 4.083 479 977 1349
![Page 7: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/7.jpg)
Methods (1)• Smith-Waterman algorithms: dynamic programming; computationally
intensive– Paracel with e-value (PA E):
• SW implementation of Paracel– Biofacet with z-value (BF Z):
• SW implementation of Gene-IT– ParAlign with e-value (PA E):
• SW implementation of Sencel– SSEARCH with e-value (SS E):
• SW implementation of FASTA (see next page)
![Page 8: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/8.jpg)
Methods (2)
• Heuristic algorithms:– FASTA (FA E)
• Pearson & Lipman, 1988• Heuristic approximation; performs better than
BLAST with strongly diverged proteins– BLAST (BL E):
• Altschul et al., 1990• Heuristic approximation; stretches local alignments
(HSPs) to global alignment• Should be faster than FASTA
![Page 9: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/9.jpg)
Method parameters- all:
- matrix: BLOSUM62- gap open penalty: 12- gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
![Page 10: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/10.jpg)
Receiver Operating Characteristic
• R.O.C.: statistical value, mostly used in clinical medicine
• Proposed by Gribskov & Robinson (1996) to be used for sequence comparison analysis
![Page 11: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/11.jpg)
ROC50 Examplequery d1c75a_ a.3.1.1
hit # pc e
1 d1gcya1 b.71.1.1 0.31
2 d1h32b_ a.3.1.1 0.4
3 d1gks__ a.3.1.1 0.52
4 d1a56__ a.3.1.1 0.52
5 d1kx2a_ a.3.1.1 0.67
6 d1etpa1 a.3.1.4 0.67
7 d1zpda3 c.36.1.9 0.87
8 d1eu1a2 c.81.1.1 0.87
9 d451c__ a.3.1.1 1.1
10 d1flca2 c.23.10.2 1.1
11 d1mdwa_ d.3.1.3 1.1
12 d2dvh__ a.3.1.1 1.5
13 d1shsa_ b.15.1.1 1.5
14 d1mg2d_ a.3.1.1 1.5
15 d1c53__ a.3.1.1 2.4
16 d3c2c__ a.3.1.1 2.4
17 d1bvsa1 a.5.1.1 6.8
18 d1dvva_ a.3.1.1 6.8
19 d1cyi__ a.3.1.1 6.8
20 d1dw0a_ a.3.1.1 6.8
21 d1h0ba_ b.29.1.11 6.8
22 d3pfk__ c.89.1.1 6.8
23 d1kful3 d.3.1.3 6.8
24 d1ixrc1 a.4.5.11 14
25 d1ixsb1 a.4.5.11 14
- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family- For each of first 50 false positives: calculate number of true positives higher in list (0,4,4,4,5,5,6,9,12,12,12,12,12)- Divide sum of these numbers by number of false positives (50) and by total number of possible true positives (size of family -1) = ROC50
(0,167)- Take average of ROC50 scores for all entries
![Page 12: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/12.jpg)
ROC50 results
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095
ASTRAL SCOP set
mea
n R
OC
50 pc ebf zbl efa ess epa e
![Page 13: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/13.jpg)
Coverage vs. Error
• C.V.E. = Coverage vs. Error (Brenner et al., 1998)
• E.P.Q. = selectivity indicator (how much false positives?)
• Coverage = sensitivity indicator (how much true positives of total?)
![Page 14: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/14.jpg)
CVE Examplequery d1c75a_ a.3.1.1
hit # pc e
1 d1gcya1 b.71.1.1 0.31
2 d1h32b_ a.3.1.1 0.4
3 d1gks__ a.3.1.1 0.52
4 d1a56__ a.3.1.1 0.52
5 d1kx2a_ a.3.1.1 0.67
6 d1etpa1 a.3.1.4 0.67
7 d1zpda3 c.36.1.9 0.87
8 d1eu1a2 c.81.1.1 0.87
9 d451c__ a.3.1.1 1.1
10 d1flca2 c.23.10.2 1.1
11 d1mdwa_ d.3.1.3 1.1
12 d2dvh__ a.3.1.1 1.5
13 d1shsa_ b.15.1.1 1.5
14 d1mg2d_ a.3.1.1 1.5
15 d1c53__ a.3.1.1 2.4
16 d3c2c__ a.3.1.1 2.4
17 d1bvsa1 a.5.1.1 6.8
18 d1dvva_ a.3.1.1 6.8
19 d1cyi__ a.3.1.1 6.8
20 d1dw0a_ a.3.1.1 6.8
21 d1h0ba_ b.29.1.11 6.8
22 d3pfk__ c.89.1.1 6.8
23 d1kful3 d.3.1.3 6.8
24 d1ixrc1 a.4.5.11 14
25 d1ixsb1 a.4.5.11 14
- Vary threshold above which a hit is seen as a positive: e.g. e=10,e=1,e=0.1,e=0.01- True positives: in same SCOP family, or false positives: not in same family- For each threshold, calculate coverage: number of true positives divided by total number of possible true positives- For each threshold, calculate errors-per-query: number of false positives divided by number of queries- Plot coverage on x-axis and errors-per-query on y-axis; right-bottom is best
![Page 15: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/15.jpg)
CVE results
(for PDB010)
-
+
![Page 16: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/16.jpg)
Mean Average Precision
• A.P.: borrowed from information retrieval search (Salton, 1991)
• Recall: true positives divided by number of homologs
• Precision: true positives divided by number of hits
• A.P. = approximate integral to calculate area under recall-precision curve
![Page 17: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/17.jpg)
Mean AP Examplequery d1c75a_ a.3.1.1
hit # pc e
1 d1gcya1 b.71.1.1 0.31
2 d1h32b_ a.3.1.1 0.4
3 d1gks__ a.3.1.1 0.52
4 d1a56__ a.3.1.1 0.52
5 d1kx2a_ a.3.1.1 0.67
6 d1etpa1 a.3.1.4 0.67
7 d1zpda3 c.36.1.9 0.87
8 d1eu1a2 c.81.1.1 0.87
9 d451c__ a.3.1.1 1.1
10 d1flca2 c.23.10.2 1.1
11 d1mdwa_ d.3.1.3 1.1
12 d2dvh__ a.3.1.1 1.5
13 d1shsa_ b.15.1.1 1.5
14 d1mg2d_ a.3.1.1 1.5
15 d1c53__ a.3.1.1 2.4
16 d3c2c__ a.3.1.1 2.4
17 d1bvsa1 a.5.1.1 6.8
18 d1dvva_ a.3.1.1 6.8
19 d1cyi__ a.3.1.1 6.8
20 d1dw0a_ a.3.1.1 6.8
21 d1h0ba_ b.29.1.11 6.8
22 d3pfk__ c.89.1.1 6.8
23 d1kful3 d.3.1.3 6.8
24 d1ixrc1 a.4.5.11 14
25 d1ixsb1 a.4.5.11 14
- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family-For each of the true positives: divide the positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the true positive rank (2,3,4,5,9,12,14,15,16,18,19,20)- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)- Take average of AP scores for all entries = mean AP
![Page 18: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/18.jpg)
Mean AP results
0.00
0.03
0.06
0.09
0.12
0.15
0.18
0.21
0.24
0.27
0.30
pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095
ASTRAL SCOP set
mea
n A
P
pc ebf zbl efa ess epa e
![Page 19: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/19.jpg)
Time consumption
• PDB095 all-against-all comparison:– Biofacet: multiple days (Z-value calc.!)– SSEARCH: 5h49m– ParAlign: 47m– FASTA: 40m– BLAST: 15m
![Page 20: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/20.jpg)
Conclusions
• e-value better than Z-value(!)• SW implementations are (more or less)
the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all
• Use FASTA/BLAST only when time is important
• Larger structural comparison database needed for better analysis
![Page 21: Testing statistical significance scores of sequence comparison methods with structure similarity](https://reader035.vdocument.in/reader035/viewer/2022062521/56813848550346895d9ff64a/html5/thumbnails/21.jpg)
Credits
Peter Groenen
Wilco Fleuren
Jack Leunissen