
Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary

Dept. of Electrical Engg. and Computer Science

Northwestern University

June 21, 2010

• Motivation
• Pairwise Statistical Significance
• MPIPairwiseStatSig
• Experiments and Results
• Future Work


Motivation


Sequence-Comparison Applications

Multiple Sequence Alignment

Database Search

Protein Structure Prediction

Phylogenetic Tree Construction

Genome Assembly

…


http://www.dnabaser.com/articles/phylogenetic-tree/phylogenetic-tree-big.jpg

Pairwise Local Sequence Alignment

DNA: A, G, C, T

Protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

KQTGKG
|  |||
KSAGKG

CTGTCG-CTGC
 ||     ||
-TGC-CG-TG-


C   T   G   T   C   G   -   C   T   G   C
-   T   G   C   -   C   G   -   T   G   -
-5  10  10  -2  -5  -2  -5  -5  10  10  -5

Alignment Score = 11

Match score: 10
Mismatch score: -2
Gap score: -5
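The column-by-column scoring above can be checked with a short sketch. The function name `alignment_score` and the use of "-" as the gap character are assumptions made for illustration; the match, mismatch, and gap scores are the ones given on the slide.

```python
def alignment_score(row1, row2, match=10, mismatch=-2, gap=-5):
    """Sum per-column scores of two already-aligned sequences.

    Both rows must have the same length; '-' marks a gap (an assumption
    of this sketch, normalizing the dashes used on the slide).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # one residue aligned against a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# The DNA alignment from the slide.
score = alignment_score("CTGTCG-CTGC", "-TGC-CG-TG-")
print(score)
```

Four matches, two mismatches, and five gap columns give 4(10) + 2(-2) + 5(-5) = 11, which agrees with the alignment score stated above.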


http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

Central application of pairwise sequence alignment
• Identifying related sequence pairs (evolved from a common ancestor, also known as homologs)
• More closely related sequence pairs should have higher alignment scores


Alignment Score Distribution

Alignment score distribution depends on:
• Alignment program
• Scoring scheme
• Sequence lengths
• Sequence compositions

P-value

x < y, but x is more statistically significant than y

Compared to the alignment score, statistical significance is a better indicator of biological significance.

[Karlin and Altschul, 1990] described a rigorous statistical theory for ungapped alignment scores, which follow an extreme value distribution (EVD). In the limit of large sequence lengths m and n, the statistics of HSP scores (High-Scoring Segment Pairs, which correspond to local sequence alignments) are characterized by two parameters, K and λ:

$$E = K m n \, e^{-\lambda x}$$

$$\Pr(S > x) \approx 1 - e^{-E}$$

Here E is the expected number of HSPs with score at least x. The statistical parameters K and λ characterize the EVD curve.
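As a rough illustration of how these two formulas combine, the sketch below computes E and the P-value for a given score. The function name and the numeric values of K and λ are hypothetical and chosen only to show the arithmetic; in practice K and λ depend on the scoring scheme and sequence composition.

```python
import math


def evd_pvalue(score, m, n, K, lam):
    """Return (E, P) for an alignment score under the EVD formulas.

    E = K * m * n * exp(-lam * score)
    P = 1 - exp(-E), computed with expm1 to stay accurate for tiny E.
    """
    e_value = K * m * n * math.exp(-lam * score)
    p_value = -math.expm1(-e_value)
    return e_value, p_value


# Hypothetical parameters, for illustration only.
e_value, p_value = evd_pvalue(score=50, m=200, n=300, K=0.1, lam=0.3)
print(f"E = {e_value:.3e}, P = {p_value:.3e}")
```

For small E the P-value is nearly equal to E, and raising the score by a fixed amount shrinks both by a constant factor determined by λ.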


Statistical Significance of Local Alignment Scores

[Figure labels: Statistical/Biological Significance; P-value]

P-value of an alignment score: The probability that an alignment with this score or higher occurs by chance alone


Characteristics of a good pairwise-alignment-based sequence-comparison strategy

Sequence-Specificity
• Report statistical significance taking into account the properties and features of the specific sequence pair being aligned.

Statistical Significance Accuracy
• Accurate estimation of P-values for high scores in the right tail region.

Retrieval Accuracy
• Ability to identify related sequences.
• Should assign lower P-values to pairs of related sequences than to pairs of unrelated sequences.

Speed
• Fast enough to be usable in practice.


• Dependent on the database.
• Not sequence specific.
• BLAST2.0 [Altschul et al., 1997]: Likelihood that a similarity as good or better would be obtained by two random sequences with average amino-acid composition and lengths similar to the sequences that produced the score.
• FASTA [Pearson, 2000]: Expectation that a sequence would obtain a similarity score against an unrelated sequence drawn at random from the sequence database that was searched.


• Database-independent
• Sequence specific
• Useful to evaluate alignments generated by an alignment program, independent of any database.
• Better retrieval accuracy than database statistical significance.
• Good for a few sequence pairs, but would be extremely slow for a large number of sequence pairs.


Pairwise Statistical Significance vs. Database Statistical Significance

Agrawal and Huang (2009) Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices, IEEE/ACM Transactions on Computational Biology and Bioinformatics

Acceleration of pairwise statistical significance (PSS) estimation using HPC methods

• All-pervasive application of sequence alignment methods in bioinformatics
• Ever-increasing sequence data
• PSS gives biologically better estimates of statistical significance than database statistical significance (DSS)
• PSS is too slow when applied to many sequence pairs


Pairwise Statistical Significance


The PairwiseStatSig function
• Generates a score distribution by aligning Seq1 with N shuffled versions of Seq2 using scoring scheme SC
• Fits an EVD to the empirical score distribution using censored maximum likelihood fitting
• Reports the pairwise statistical significance estimate of the pairwise alignment score between Seq1 and Seq2 using the EVD formula for P-value with estimated K and λ

Pairwise Statistical Significance

[Figure labels: Scoring scheme; Number of shuffles; Pairwise Statistical Significance]

[Figure: Execution time break-up for different stages of pairwise statistical significance estimation]

MPIPairwiseStatSig


Experiments and Results

• Sequence pairs of length 100, 200, 400, 800, 1600
• Number of processors: 2, 4, 8, 16, 32, 64
• Processors: 2.8 GHz dual Intel Xeon nodes
• Substitution matrix: BLOSUM50
• Gap penalty: 10 + 2k for a gap of length k
• Number of shuffles: N = 1000
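The gap penalty in this setup grows linearly with gap length on top of a fixed cost. A minimal sketch follows, where the function name and the split into `gap_open` and `gap_extend` arguments are assumptions made for illustration:

```python
def affine_gap_penalty(k, gap_open=10, gap_extend=2):
    """Penalty for a single gap of length k: gap_open + gap_extend * k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


for k in (1, 2, 5):
    print(k, affine_gap_penalty(k))
```

Under this scheme one gap of length 5 costs 20, while five separate gaps of length 1 cost 60, so alignments with fewer, longer gaps are favored.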


Future Work

Improving MPIPairwiseStatSig
• Reducing running time
  - Explore combining inter- and intra-task parallelism using other HPC techniques such as FPGAs and GPUs.
  - Use heuristics to reduce the time needed to construct the alignment score distribution.

MPIPairwiseStatSig Applications
• Any application that needs to judge the relatedness of two sequences based on sequence data alone
  - Database search
  - Progressive multiple sequence alignment
  - Phylogenetic tree construction

Thank You!
