Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary
Dept. of Electrical Engg. and Computer Science
Northwestern University
June 21, 2010
• Motivation
• Pairwise Statistical Significance
• MPIPairwiseStatSig
• Experiments and Results
• Future Work
Motivation
Sequence-Comparison Applications
• Multiple Sequence Alignment
• Database Search
• Protein Structure Prediction
• Phylogenetic Tree Construction
• Genome Assembly
• …
Pairwise Local Sequence Alignment
DNA: A, G, C, T
Protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
  KQTGKG
  |  |||
  KSAGKG

  CTGTCG-CTGC
   ||     ||
  -TGC-CG-TG-
   C   T   G   T   C   G   -   C   T   G   C
   -   T   G   C   -   C   G   -   T   G   -
  -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Alignment Score = 11
Match score: 10
Mismatch score: -2
Gap score: -5
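The column-by-column scoring above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `score_alignment` and the use of `-` for every gap symbol are assumptions made for illustration, while the match, mismatch, and gap scores are the ones given on the slide.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, gap=-5):
    """Return the per-column scores of an already-aligned sequence pair.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(gap)        # a residue aligned against a gap
        elif a == b:
            scores.append(match)      # identical residues
        else:
            scores.append(mismatch)   # different residues
    return scores


columns = score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-")
total = sum(columns)
print(columns)  # per-column scores
print(total)    # alignment score
```

Summing the per-column scores gives the alignment score of 11 shown on the slide.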
http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
Central application of pairwise sequence alignment
• Identifying related sequence pairs (evolved from a common ancestor, also known as homologs)
• More related sequence pairs should have higher alignment scores
Alignment Score Distribution
Alignment score distribution depends on:
• Alignment program
• Scoring scheme
• Sequence lengths
• Sequence compositions
P-value
x < y, but x is more statistically significant than y
Compared to alignment score, statistical significance is a better indicator of biological significance
[Karlin and Altschul, 1990] described a rigorous statistical theory for ungapped alignment scores, which follow an extreme value distribution (EVD). In the limit of large sequence lengths m and n, the statistics of HSP (high-scoring segment pair, corresponding to a local sequence alignment) scores are characterized by K and λ.
$$E = K m n e^{-\lambda x}$$
$$\Pr(S > x) \approx 1 - e^{-E}$$
Statistical parameters K and λ characterize the EVD curve
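As a worked illustration of the two formulas above, the following sketch computes E and the P-value for a given score. The function name `karlin_altschul_pvalue` and the numeric values of K, λ, m, n, and the score are hypothetical and chosen only to show the arithmetic; they are not taken from the talk.

```python
import math


def karlin_altschul_pvalue(score, m, n, K, lam):
    """Return (E, P) for an alignment score under the EVD formulas.

    E = K * m * n * exp(-lam * score)
    P = 1 - exp(-E)
    """
    e_value = K * m * n * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    p_value = -math.expm1(-e_value)
    return e_value, p_value


# Hypothetical parameter values, for illustration only.
e_value, p_value = karlin_altschul_pvalue(score=50, m=200, n=300, K=0.1, lam=0.3)
print(e_value, p_value)
```

For small E the P-value is approximately equal to E, and both fall exponentially as the score grows.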
Statistical/Biological Significance
P-value
Statistical Significance of Local Alignment Scores
P-value
P-value of an alignment score: The probability that an alignment with this score or higher occurs by chance alone
Characteristics of a good pairwise-alignment-based sequence-comparison strategy
Sequence-Specificity
• Report statistical significance taking into account the properties and features of the specific sequence pair being aligned.
Statistical Significance Accuracy
• Accurate estimation of P-values for high scores in the right tail region.
Retrieval Accuracy
• Ability to identify related sequences.
• Should assign lower P-values to pairs of related sequences than to pairs of unrelated sequences.
Speed
• Fast enough to be usable in practice.
• Dependent on the database.
• Not sequence specific.
• BLAST2.0 [Altschul et al., 1997]: Likelihood that a similarity as good or better would be obtained by two random sequences with average amino-acid composition and lengths similar to the sequences that produced the score.
• FASTA [Pearson, 2000]: Expectation that a sequence would obtain a similarity score against an unrelated sequence drawn at random from the sequence database that was searched.
• Database-independent
• Sequence specific
• Useful to evaluate alignments generated by an alignment program, independent of any database.
• Better retrieval accuracy than database statistical significance.
• Good for a few sequence pairs, but would be extremely slow for a large number of sequence pairs.
Pairwise Statistical Significance vs. Database Statistical Significance
Agrawal and Huang (2009) Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices, IEEE/ACM Transactions on Computational Biology and Bioinformatics
Acceleration of pairwise statistical significance (PSS) estimation using HPC methods
• All-pervasive application of sequence alignment methods in bioinformatics
• Ever-increasing sequence data
• PSS gives biologically better estimates of statistical significance than database statistical significance (DSS)
• PSS is too slow when applied to many sequence pairs.
Pairwise Statistical Significance
The PairwiseStatSig function
• Generates a score distribution by aligning Seq1 with N shuffled versions of Seq2 using scoring scheme SC
• Fits an EVD to the empirical score distribution using censored maximum likelihood fitting
• Reports the pairwise statistical significance estimate of the pairwise alignment score between Seq1 and Seq2 using the EVD formula for P-value with estimated K and λ
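The three steps above can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation: it assumes a simple match/mismatch scoring scheme with a linear gap penalty instead of a substitution matrix, and it swaps the censored maximum likelihood fit for a plain method-of-moments Gumbel fit to keep the example short. All function names are hypothetical.

```python
import math
import random


def local_alignment_score(a, b, match=10, mismatch=-2, gap=-5):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def fit_gumbel_moments(scores):
    """Method-of-moments EVD fit (assumption: stands in for censored ML).

    Uses mean = mu + gamma / lam and variance = pi^2 / (6 * lam^2).
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    lam = math.pi / math.sqrt(6.0 * var)
    mu = mean - 0.5772156649 / lam  # Euler-Mascheroni constant
    return lam, mu


def pairwise_stat_sig(seq1, seq2, n_shuffles=1000, seed=0):
    """Return (score, P-value, K, lambda) for the pair (seq1, seq2)."""
    rng = random.Random(seed)
    observed = local_alignment_score(seq1, seq2)

    # Step 1: score distribution from Seq1 vs. N shuffled versions of Seq2.
    letters = list(seq2)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(local_alignment_score(seq1, "".join(letters)))

    # Step 2: fit an EVD to the empirical score distribution.
    lam, mu = fit_gumbel_moments(null_scores)
    mn = len(seq1) * len(seq2)
    k = math.exp(lam * mu) / mn  # from mu = ln(K * m * n) / lam

    # Step 3: P-value of the observed score from the EVD formula.
    e_value = k * mn * math.exp(-lam * observed)
    p_value = -math.expm1(-e_value)  # 1 - exp(-E)
    return observed, p_value, k, lam


if __name__ == "__main__":
    s1 = "ACGTTGCAAGCTTAGGCTAACGTA"
    s2 = "ACGTTGCATGCTTAGGCTTACGTA"
    print(pairwise_stat_sig(s1, s2, n_shuffles=200))
```

Each call performs N + 1 alignments, so the alignment step dominates the cost as the number of sequence pairs grows.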
Pairwise Statistical Significance
Scoring scheme
Number of shuffles
Pairwise Statistical Significance
Execution time break-up for different stages of pairwise statistical significance estimation
MPIPairwiseStatSig
Experiments and Results
• Sequence pairs of length 100, 200, 400, 800, 1600.
• Number of processors: 2, 4, 8, 16, 32, 64.
• Processors: 2.8GHz dual Intel Xeon nodes
• Substitution matrix: BLOSUM50
• Gap penalty: 10+2k for a gap of length k.
• Number of shuffles, N = 1000.
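To make the gap penalty concrete, here is a small sketch of the 10+2k rule from the setup above. The function and parameter names (`gap_penalty`, `gap_open`, `gap_extend`) are labels chosen for illustration and do not come from the talk.

```python
def gap_penalty(k, gap_open=10, gap_extend=2):
    """Penalty for a gap of length k under the 10 + 2k scheme."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


penalties = [gap_penalty(k) for k in (1, 2, 5)]
print(penalties)
```

A gap of length 1 costs 12 and a gap of length 5 costs 20, so one long gap is cheaper than several short gaps of the same total length.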
Future Work
Improving MPIPairwiseStatSig
• Reducing running time
  • Explore combining inter- and intra-task parallelism using other HPC techniques such as FPGAs, GPUs, etc.
  • Using heuristics to reduce the time for constructing the alignment score distribution.
MPIPairwiseStatSig Applications
• Any application that needs to judge the relatedness of two sequences based on sequence data alone
  • Database search
  • Progressive multiple sequence alignment
  • Phylogenetic tree construction
Thank You!