A unified statistical framework for sequence comparison and structure comparison Michael Levitt Mark Gerstein


Post on 04-Jan-2016


Page 1: A unified statistical framework for sequence comparison and structure comparison

A unified statistical framework for sequence comparison and structure comparison

Michael Levitt Mark Gerstein


Statistics Introduction

• Statistics is the discipline which deals with inference in the presence of variation

• Given a score, how significant is it?

• H0, HA, critical region, P-value

• Extreme value distribution: the maximum over all sequence scores is distributed as an extreme value distribution

• Why the extreme value distribution is useful: the score is maximized over all possible random alignments
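The statement that a maximum over many random scores follows an extreme value distribution can be checked numerically. The sketch below is a toy illustration only: it assumes each "alignment score" is an independent exponential random variable, which is not how real alignment scores arise, and the function names are hypothetical. Under that assumption, the maximum of n draws minus ln(n) approaches the standard extreme value (Gumbel) distribution, whose mean is the Euler-Mascheroni constant (about 0.577).

```python
import math
import random

def centered_maxima(n_scores, n_trials, seed=0):
    """Return n_trials values of max(n_scores exponential draws) - ln(n_scores)."""
    rng = random.Random(seed)
    shift = math.log(n_scores)
    return [
        max(rng.expovariate(1.0) for _ in range(n_scores)) - shift
        for _ in range(n_trials)
    ]

def gumbel_cdf(z):
    """Cumulative distribution of the standard extreme value distribution."""
    return math.exp(-math.exp(-z))

maxima = centered_maxima(n_scores=200, n_trials=2000)
sample_mean = sum(maxima) / len(maxima)
# Fraction of simulated maxima at or below 0, to compare with gumbel_cdf(0) = exp(-1).
fraction_below_zero = sum(1 for z in maxima if z <= 0.0) / len(maxima)
print(f"sample mean = {sample_mean:.3f} (Gumbel mean is about 0.577)")
print(f"P(Z <= 0): simulated {fraction_below_zero:.3f}, Gumbel {gumbel_cdf(0.0):.3f}")
```

The two printed pairs should agree to within sampling noise; larger `n_trials` tightens the agreement.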


Introduction

• Given sequence and structural scores, develop hypothesis testing framework

• H0: Two proteins compared are unrelated

• Distribution of scores of unrelated proteins determined empirically using PDB data at 40% sequence identity

• No assumption of background distribution


Sequence Comparison Framework

• Sequence score determined by SSEARCH and BLOSUM 50 substitution matrix

• The p.d.f. is a function of Sseq (sequence score) and of n and m (lengths of the two sequences compared)

• Compared all possible pairs to determine empirically the p.d.f.
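One way to picture an empirically determined p.d.f. over Sseq and ln(nm) is a two-dimensional histogram that is normalised within each ln(nm) slice. The sketch below is an assumption about how such a density could be tabulated, not the procedure used in the talk; the function name, the bin widths, and the toy records are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def empirical_density(records, score_bin=5.0, length_bin=0.5):
    """Bin (score, n, m) records and normalise within each ln(nm) slice.

    Returns {length_bin_index: {score_bin_index: density}}. The density
    values in a slice sum to 1 / score_bin, so each slice integrates to 1.
    """
    counts = defaultdict(Counter)
    for score, n, m in records:
        length_index = math.floor(math.log(n * m) / length_bin)
        score_index = math.floor(score / score_bin)
        counts[length_index][score_index] += 1
    density = {}
    for length_index, slice_counts in counts.items():
        total = sum(slice_counts.values())
        density[length_index] = {
            score_index: count / (total * score_bin)
            for score_index, count in slice_counts.items()
        }
    return density

# Made-up (score, n, m) triples, purely to exercise the function.
toy_records = [(32.0, 100, 120), (41.0, 100, 120), (33.5, 110, 110), (58.0, 300, 280)]
toy_density = empirical_density(toy_records)
for length_index, slice_density in sorted(toy_density.items()):
    print(length_index, dict(sorted(slice_density.items())))
```

Each printed row corresponds to one cross section at roughly constant ln(nm).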


P.D.F. for Sequence Score


Cross Section of p.d.f. for constant ln(nm)


Density Distribution for constant ln(nm)

• Density distribution follows the extreme value distribution: p^c_seq(Z) = exp(-Z - exp(-Z))

• Z = (Sseq - µseq)/σseq

• µseq = a ln(nm) + b (model average); a and b fitted to the observed density by least squares

• σseq = a
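The definitions on this slide translate directly into a few lines of code. The sketch below is a minimal illustration: the function names are hypothetical, and the values used for a and b are placeholders chosen only so the example runs, not the fitted values from the talk.

```python
import math

def z_score(s_seq, n, m, a, b):
    """Z = (S_seq - mu_seq) / sigma_seq with mu_seq = a*ln(nm) + b and sigma_seq = a."""
    mu_seq = a * math.log(n * m) + b
    sigma_seq = a
    return (s_seq - mu_seq) / sigma_seq

def evd_density(z):
    """Extreme value density exp(-Z - exp(-Z))."""
    return math.exp(-z - math.exp(-z))

# Placeholder parameters, NOT the fitted values from the talk.
A_EXAMPLE, B_EXAMPLE = 6.0, -25.0
z = z_score(s_seq=60.0, n=150, m=200, a=A_EXAMPLE, b=B_EXAMPLE)
print(f"Z = {z:.3f}, density = {evd_density(z):.4f}")

# Sanity check: the density integrates to about 1 (trapezoid rule on [-10, 30]).
step = 0.01
grid = [-10.0 + i * step for i in range(4001)]
area = sum(0.5 * (evd_density(u) + evd_density(u + step)) * step for u in grid[:-1])
print(f"area under density = {area:.4f}")
```

Because the score enters only through Z, one curve describes every cross section at constant ln(nm) once a and b are known.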


Comparison to BLAST and FASTA statistics

• Critical region to determine p-value for model: Pseq(z>Z)

• Comparing model p-values with BLAST p-values: the BLAST p-values are higher than the model's

• FASTA statistic: better coverage but more errors than the model
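For the extreme value density exp(-Z - exp(-Z)), the cumulative distribution is exp(-exp(-Z)), so the upper-tail probability P(z > Z) is 1 - exp(-exp(-Z)). The sketch below computes that tail; the function name is hypothetical, and Z is assumed to be already normalised as (Sseq - µseq)/σseq.

```python
import math

def evd_p_value(z):
    """P(z > Z) for the standard extreme value distribution.

    The cumulative distribution is exp(-exp(-Z)), so the upper tail is
    1 - exp(-exp(-Z)); expm1 keeps precision when the tail is tiny.
    """
    return -math.expm1(-math.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"Z = {z:4.1f}  P(z > Z) = {evd_p_value(z):.3e}")
```

For large Z the tail behaves like exp(-Z), so each additional unit of Z lowers the p-value by roughly a factor of e.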


Structure Comparison Algorithm


Structure Comparison Framework

• The score obtained from the structure comparison algorithm is Sstr

• The p.d.f. for Sstr is a function of N (number of residues matched) and Sstr (pairs which scored high were removed)

• For each fixed N, an extreme value distribution was fitted to the density; this was done for all N


Comparison with RMS

• RMS deviation of alpha-carbon positions after a least-squares fit is the traditional method

• RMS score used to determine p.d.f. with ln(RMS score) and N

• Comparison of RMS with Sstr found RMS worse than Sstr in coverage and accuracy


Comparison with RMS (cont.)

• Three reasons:

• Sstr depends most strongly on best-fitting atoms; RMS depends most on worst-fitting atoms

• Sstr penalizes gaps; RMS does not

• Sstr is analogous to Sseq in the sense that both use dynamic-programming
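The first reason above can be illustrated with a small numerical comparison. The sketch below assumes the matched atoms are already superposed and works only with their pairwise distances. The bounded per-pair term 1/(1 + (d/d0)^2) and the value d0 = 5 are assumptions chosen to mimic a score dominated by well-fitting atoms; they are not taken from these slides, and gap penalties are ignored.

```python
import math

def rms_deviation(distances):
    """Root-mean-square of the distances between matched, superposed atoms."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def bounded_similarity(distances, d0=5.0):
    """Sum of per-pair terms 1 / (1 + (d/d0)^2); each term lies in (0, 1]."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)

# Made-up distances (in angstroms) for ten matched alpha-carbon pairs.
good_fit = [0.5] * 10
one_outlier = [0.5] * 9 + [15.0]

for name, dists in (("good fit", good_fit), ("one outlier", one_outlier)):
    print(f"{name:12s} RMS = {rms_deviation(dists):.2f}  "
          f"similarity = {bounded_similarity(dists):.2f}")
```

A single badly fitting pair multiplies the RMS several times over, while the bounded similarity loses at most one pair's contribution.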


Comparison of Structure and Sequence Comparison


Concluding Remarks

• Significance of sequence structure score can be calculated from any structural alignment program

• This method of statistical significance is between FASTA and BLAST methods


Efficient Detection of Three-Dimensional Structural Motifs in Biological Macromolecules By Computer Vision Techniques

Ruth Nussinov Haim J. Wolfson


Introduction

• One of the earlier papers addressing structure comparison

• Based on computer vision techniques (geometric hashing paradigm)

• No a priori predefined motif assumed

• Advantage: Can be parallelized


Problem

• Given 3D coordinates of atoms of two molecules, find a rigid transformation (rotation and translation allowed) so that a large number of atoms of one molecule match the atoms of the other molecule

• Closely related to 3D rigid object recognition


Geometric Hashing Paradigm: Representation of Geometric Constraints

• Proteins represented as points using coordinate frames (minimal representation of coordinate frames)

• Pick three noncollinear points to define a plane (RS) and construct an orthogonal 3D coordinate system based on the RS


Representation of Geometric Constraints (cont.)

• Define orthonormal vectors w.r.t. RS so that any point can be represented as a linear combination of the orthonormal vectors

• To remove dependence on a particular RS (which may preclude recognition if at least one of the RS points does not match the input substructure), represent the m points in all basis triplets (i.e., all orthonormal bases), using every possible RS
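The construction of a coordinate frame from an RS can be sketched as follows. This is one possible convention (first point as origin, x axis toward the second point, z axis normal to the RS plane), assumed here for illustration; the helper names are hypothetical and the slides do not fix a particular convention.

```python
import math

def sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

def dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def unit(p):
    length = math.sqrt(dot(p, p))
    if length < 1e-9:
        raise ValueError("degenerate reference set (points coincide or are collinear)")
    return (p[0] / length, p[1] / length, p[2] / length)

def frame_from_rs(p0, p1, p2):
    """Origin and orthonormal axes built from three noncollinear points."""
    x_axis = unit(sub(p1, p0))
    z_axis = unit(cross(x_axis, sub(p2, p0)))  # normal to the RS plane
    y_axis = cross(z_axis, x_axis)             # completes a right-handed frame
    return p0, (x_axis, y_axis, z_axis)

def local_coordinates(point, frame):
    """Coordinates of a point in the frame returned by frame_from_rs."""
    origin, axes = frame
    rel = sub(point, origin)
    return tuple(dot(rel, axis) for axis in axes)

def rigid_move(p):
    # Rotate 90 degrees about the z axis, then translate by (1, 2, 3).
    return (-p[1] + 1.0, p[0] + 2.0, p[2] + 3.0)

rs = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.5)]
atom = (1.0, 1.0, 1.0)
before = local_coordinates(atom, frame_from_rs(*rs))
after = local_coordinates(rigid_move(atom), frame_from_rs(*[rigid_move(p) for p in rs]))
print(before)
print(after)
```

The two printed tuples agree up to rounding: coordinates expressed in an RS frame do not change when the whole molecule is rotated and translated, which is what makes them usable as hash addresses.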


Algorithm for Representation of Geometric Constraints

For each RS {

• Compute orthonormal 3D basis associated with each RS

• Compute coordinates of all other points in coordinate frame defined by 3D basis

• For each point define address of hash table with labels and measurements

• Use each address to enter hash table with pair (model, RS) }
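The loop above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions: every ordered triplet of atoms is tried as an RS, the hash address is the atom label plus the frame coordinates rounded down to a grid of width `cell`, and all names (`frame`, `address`, `add_model`, the toy model) are hypothetical. A practical implementation would also have to handle points that fall near cell boundaries.

```python
import math
from collections import defaultdict
from itertools import permutations

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1], p[2] * q[0] - p[0] * q[2], p[0] * q[1] - p[1] * q[0])

def _unit(p):
    length = math.sqrt(_dot(p, p))
    return None if length < 1e-9 else tuple(a / length for a in p)

def frame(p0, p1, p2):
    """Orthonormal frame from an ordered RS, or None if the RS is degenerate."""
    x = _unit(_sub(p1, p0))
    if x is None:
        return None
    z = _unit(_cross(x, _sub(p2, p0)))
    if z is None:
        return None
    return p0, (x, _cross(z, x), z)

def address(point, label, fr, cell=1.0):
    """Hash address: atom label plus quantised coordinates in the RS frame."""
    origin, axes = fr
    rel = _sub(point, origin)
    return (label,) + tuple(math.floor(_dot(rel, ax) / cell) for ax in axes)

def add_model(table, model_name, atoms, cell=1.0):
    """atoms is a list of (label, (x, y, z)); every ordered triplet is tried as RS."""
    for rs in permutations(range(len(atoms)), 3):
        fr = frame(*(atoms[i][1] for i in rs))
        if fr is None:
            continue  # skip collinear or coincident triplets
        for i, (label, point) in enumerate(atoms):
            if i not in rs:
                table[address(point, label, fr, cell)].append((model_name, rs))

table = defaultdict(list)
toy_model = [("N", (0.0, 0.0, 0.0)), ("CA", (1.5, 0.0, 0.0)),
             ("C", (2.0, 1.4, 0.0)), ("O", (1.2, 2.3, 0.6)), ("CB", (2.1, -0.8, 1.2))]
add_model(table, "M1", toy_model)
print(sum(len(entries) for entries in table.values()), "entries")
```

Each stored entry records which model and which RS produced that address, which is the information the matching stage later votes on.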


Determining Hash Table Entries with Model M1 and Points 4 and 1 as Basis


Locations of Hash Table Entries for Model M1 after all bases, RS


Geometric Hashing: Matching

Given observed object:

1. Choose an RS and compute the 3D basis associated with the RS

2. Compute the coordinates of the other observed object points in the 3D basis

3. For each point, enter the hash table at the address defined by the labels and measurements and the label and coordinate of the new point


Geometric Hashing: Matching (cont.)

• For step 3: Tally a vote for model and RS for each entry found at address; can histogram all hash table entries which received one or more votes

4. If no pair scores high (as determined by a threshold), go to step 1 and begin with a different RS of the observed object


Geometric Hashing: Matching (cont.)

5. Consider all the models from step 4 and find rigid motion that gives best least squares match

6. Transform the model point set according to the transformation of step 5 and check consistency of all biological information (i.e., matching labels)
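The voting in steps 3 and 4 can be sketched independently of the geometry. The example below assumes the hash table is a plain dictionary from address to a list of (model, RS) pairs and that the observed addresses have already been computed for one choice of RS; the table contents, the function names, and the threshold value are made up for illustration. The least-squares fit of step 5 and the verification of step 6 are not shown.

```python
from collections import Counter

def tally_votes(table, observed_addresses):
    """Count one vote per (model, RS) entry found at each observed address."""
    votes = Counter()
    for addr in observed_addresses:
        for entry in table.get(addr, ()):
            votes[entry] += 1
    return votes

def candidates(votes, threshold):
    """(model, RS) pairs whose vote count reaches the threshold, best first."""
    return [(entry, n) for entry, n in votes.most_common() if n >= threshold]

# Hand-made table: address -> list of (model, RS) pairs stored in preprocessing.
toy_table = {
    ("CA", 1, 0, 0): [("M1", (0, 1, 2)), ("M2", (3, 4, 5))],
    ("C", 2, 1, 0): [("M1", (0, 1, 2))],
    ("O", 1, 2, 0): [("M1", (0, 1, 2))],
    ("N", 0, 3, 1): [("M2", (3, 4, 5))],
}
# Addresses computed from the observed object for one choice of RS.
observed = [("CA", 1, 0, 0), ("C", 2, 1, 0), ("O", 1, 2, 0), ("S", 9, 9, 9)]

votes = tally_votes(toy_table, observed)
print(candidates(votes, threshold=3))
```

An empty candidate list corresponds to step 4: no pair scored high, so a different RS of the observed object is tried.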


Modifications to Algorithm

• Could modify the voting scheme

• Could change the representation of coordinate axes to 2D coordinate axes (reduces the worst-case running time)

• Could restrict the representation of atoms to alpha-carbons only (no labeling allowed)

• Could group atoms together into a single unit and analyze structures using these atom groups


Algorithm Performance

• Experimented with bacterial proteins, bovine pancreas protein, calcium binding protein, bovine liver protein, and protein from hen egg

• All experiments gave “favorable” to “excellent” results in terms of fit


Conclusion

• Algorithm needs O(N × m^4) for the hash table (can be big for large N, m)

• Running time for algorithm can also be long

• Can be parallelized (i.e., representation stage independent of matching stage)

• Sequence order independent (i.e., insensitive to gaps, insertions, deletions)
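The O(N × m^4) figure for the hash table can be made concrete with a short count. The sketch below assumes N is the number of models, m is the number of points per model, and every ordered triplet of points is used as an RS; these readings of N and m are assumptions, since the slides do not define them, and the function name is hypothetical. Skipping collinear triplets or exploiting symmetry would lower the count.

```python
from math import perm

def hash_table_entries(num_models, points_per_model):
    """Entries stored if every ordered triplet of points serves as an RS.

    Each of the m*(m-1)*(m-2) ordered triplets contributes one entry for
    each of the remaining m-3 points, so the count grows like N * m^4.
    """
    m = points_per_model
    return num_models * perm(m, 3) * (m - 3)

for m in (10, 50, 100):
    exact = hash_table_entries(num_models=1, points_per_model=m)
    print(f"m = {m:3d}: {exact:>12,d} entries per model (m^4 = {m**4:,d})")
```

The exact count approaches m^4 per model as m grows, which is why the table becomes large for structures with many atoms.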