Inverse Alignment
CS 374, Bahman Bahmani, Fall 2006


The Papers To Be Presented

Sequence Comparison - Alignment

Alignments can be thought of as two sequences differing due to mutations that happened during evolution.

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Alignments

Alignments are based on three basic operations:

1. Substitutions

2. Insertions

3. Deletions

A score is assigned to each single operation (resulting in a scoring matrix and also in gap penalties). Alignments are then scored by adding the scores of their operations.

Standard formulations of string alignment optimize the above score of the alignment.
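As a small sketch of this additive scoring (with a made-up toy scoring scheme, not one of the matrices discussed below), an alignment can be scored column by column:

```python
# Score a pairwise alignment column by column: substitutions are scored by
# a toy, made-up scoring matrix, and each gap character pays a flat gap
# penalty. A sketch only; real schemes use PAM/BLOSUM matrices and
# affine gap penalties.

SCORES = {("A", "A"): 5, ("G", "G"): 5, ("C", "C"): 5, ("T", "T"): 5,
          ("A", "G"): -2, ("G", "A"): -2, ("A", "C"): -1, ("C", "A"): -1,
          ("C", "T"): -1, ("T", "C"): -1}
GAP = -4  # flat per-character gap penalty (toy value)

def score_alignment(row1, row2):
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP            # insertion or deletion column
        else:
            total += SCORES[(a, b)]  # substitution (or match) column
    return total

print(score_alignment("AG-C", "A-GC"))  # 5 + (-4) + (-4) + 5 = 2
```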

An Example Of Scoring an Alignment Using a Scoring Matrix

      A    R    N    K
A     5   -2   -1   -1
R     -    7   -1    3
N     -    -    7    0
K     -    -    -    6

Scoring Matrices in Practice

Some choices for substitution scores are now common, largely due to convention.

Most commonly used amino-acid substitution matrices:

PAM (Percent Accepted Mutation)
BLOSUM (Blocks Amino Acid Substitution Matrix)

BLOSUM50 Scoring Matrix

Gap Penalties

Inclusion of gaps and gap penalties is necessary to obtain the best alignment.

If the gap penalty is too high, gaps will never appear in the alignment:

AATGCTGC
ATGCTGCA

If the gap penalty is too low, gaps will appear everywhere in the alignment:

AATGCTGC----
A----TGCTGCA

Gap Penalties (Cont'd)

Separate penalties for gap opening and gap extension:

Opening: the cost to introduce a gap
Extension: the cost to elongate a gap

Opening a gap is costly, while extending a gap is cheap.

Unlike substitution scores, no gap penalties are commonly agreed upon.

LETVGYW----L

-5 -1 -1 -1
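This affine scheme can be sketched as a tiny helper (ours, using the -5 open and -1 extension penalties from the example above):

```python
OPEN, EXTEND = -5, -1  # gap-open and gap-extension penalties from the example

def gap_score(length):
    """Affine cost of one gap: an opening charge for the first gap
    character, then a cheaper extension charge for each further one."""
    if length == 0:
        return 0
    return OPEN + EXTEND * (length - 1)

print(gap_score(4))  # the 4-column gap in LETVGYW----L: -5 -1 -1 -1 = -8
```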

Parametric Sequence Alignment

For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search).

A correct alignment is then used to find the best parameter values.

However, this method is very inefficient if the number of parameters is large.

Inverse Parametric Alignment

INPUT: an alignment of a pair of strings.

OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.

From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.

Inverse Optimal Alignment

Definition (Inverse Optimal Alignment):

INPUT: alignments A_1, A_2, …, A_k of strings,
       an alignment scoring function f_w with parameters w = (w_1, w_2, …, w_p).

OUTPUT: values x = (x_1, x_2, …, x_p) for w.

GOAL: each input alignment is an optimal alignment of its strings under f_x.

ATTENTION: this problem may have no solution!

Inverse Near-Optimal Alignment

When minimizing the scoring function f, we say an alignment A of a set of strings S is ε-optimal, for some ε ≥ 0, if:

    f(A) ≤ (1 + ε) f(A*)

where A* is the optimal alignment of S under f.

Inverse Near-Optimal Alignment (Cont'd)

Definition (Inverse Near-Optimal Alignment):

INPUT: alignments A_i,
       a scoring function f,
       a real number ε ≥ 0.

OUTPUT: parameter values x.

GOAL: each alignment A_i is ε-optimal under f_x.

The smallest possible ε can be found within accuracy τ using O(log(1/τ)) calls to the algorithm.
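The accuracy claim can be read as a binary search over ε, where each call to the inverse near-optimal algorithm acts as a feasibility test. A sketch, with a stand-in `is_feasible` oracle (ours, not the papers'):

```python
def smallest_epsilon(is_feasible, hi=1.0, tol=1e-3):
    """Binary-search the smallest eps in [0, hi] for which the inverse
    near-optimal problem is feasible, to within accuracy tol.
    `is_feasible(eps)` stands in for one call to the algorithm; feasibility
    is monotone in eps (a larger eps only relaxes the constraints)."""
    lo = 0.0
    while hi - lo > tol:        # O(log(hi / tol)) oracle calls
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid            # feasible: try a smaller eps
        else:
            lo = mid            # infeasible: need a larger eps
    return hi

# Toy oracle: pretend the problem is feasible exactly when eps >= 0.3.
eps = smallest_epsilon(lambda e: e >= 0.3)
print(abs(eps - 0.3) < 1e-3)  # True
```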

Inverse Unique-Optimal Alignment

When minimizing the scoring function f, we say an alignment A of a set of strings S is δ-unique, for some δ ≥ 0, if:

    f(B) ≥ f(A) + δ

for every alignment B of S other than A.

Inverse Unique-Optimal Alignment (Cont'd)

Definition (Inverse Unique-Optimal Alignment):

INPUT: alignments A_i,
       a scoring function f,
       a real number δ ≥ 0.

OUTPUT: parameter values x.

GOAL: each alignment A_i is δ-unique under f_x.

The largest possible δ can be found within accuracy τ using O(log(1/τ)) calls to the algorithm.

Let There Be Linear Functions …

For most standard forms of alignment, the alignment scoring function f is a linear function of its parameters:

    f(A) := f_0(A) + f_1(A) w_1 + … + f_p(A) w_p

where each f_i measures one of the features of A.

Let There Be Linear Functions … (Example I)

With fixed substitution scores and two parameters, the gap-open penalty γ and the gap-extension penalty λ, we have p = 2 and:

    f(A) = s(A) + γ g(A) + λ l(A)

where:
g(A) = number of gaps
l(A) = total length of gaps
s(A) = total score of all substitutions
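Extracting the features g(A), l(A), and s(A) from a gapped alignment might look like this sketch (toy substitution scores and helper functions of our own):

```python
def gap_runs(row):
    """Lengths of maximal runs of '-' in one alignment row."""
    runs, n = [], 0
    for c in row:
        if c == "-":
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs

def features(row1, row2, subst):
    """Return (g, l, s): gap count, total gap length, substitution score."""
    gaps = gap_runs(row1) + gap_runs(row2)
    g = len(gaps)                      # g(A): number of gaps
    l = sum(gaps)                      # l(A): total length of gaps
    s = sum(subst[(a, b)]              # s(A): total substitution score
            for a, b in zip(row1, row2) if a != "-" and b != "-")
    return g, l, s

SUBST = {("A", "A"): 5, ("C", "C"): 5, ("A", "C"): -1}  # toy scores
g, l, s = features("AA--C", "A-AAC", SUBST)
print(g, l, s)  # 2 gaps of total length 3; substitutions score 5 + 5 = 10
# The linear score is then f(A) = s + gamma * g + lam * l.
```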

Let There Be Linear Functions … (Example II)

With no parameters fixed, the substitution scores σ_ab are also among the parameters and:

    f(A) = Σ_{a,b} σ_ab h_ab(A) + γ g(A) + λ l(A)

where:

a and b range over all letters in the alphabet

h_ab(A) = number of substitutions in A replacing a by b

Linear Programming Problem

INPUT: variables x = (x_1, x_2, …, x_n),
       a system of linear inequalities in x,
       a linear objective function in x.

OUTPUT: an assignment x* of real values to x.

GOAL: satisfy all the inequalities and minimize the objective:

    x* = argmin_x { c·x : Ax ≥ b }

In general, the program can be infeasible, bounded, or unbounded.

Reducing The Inverse Alignment Problems To Linear Programming

Inverse Optimal Alignment: for each A_i and every alignment B of the set S_i, we have an inequality:

    f_x(B) ≥ f_x(A_i)

or equivalently:

    Σ_{j=1..p} (f_j(B) − f_j(A_i)) x_j ≥ f_0(A_i) − f_0(B)

The number of alignments of a pair of strings of length n is roughly (3 + 2√2)^n / n^{1/2} ≥ 4^n, hence a total of Ω(k·4^n) inequalities in p variables. Also, there is no specific objective function.

Separation Theorem

Some definitions:

1. Polyhedron: intersection of half-spaces

2. Rational polyhedron: described by inequalities with only rational coefficients

3. Bounded polyhedron: no infinite rays

Separation Theorem (Cont'd)

Optimization Problem for a rational polyhedron P in R^d:

INPUT: rational coefficients c specifying the objective.

OUTPUT: a point x in P minimizing c·x, or a determination that P is empty.

Separation Problem for P:

INPUT: a point y in R^d.

OUTPUT: rational coefficients w and b such that w·x ≤ b for all points x in P, but w·y > b (a violated inequality), or a determination that y is in P.

Separation Theorem (Cont'd)

Theorem (Equivalence of Separation and Optimization): the optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.

That is, for bounded rational polyhedra:

Optimization ⇔ Separation

Cutting-Plane Algorithm

1. Start with a small subset S of the set L of all inequalities.

2. Compute an optimal solution x under the constraints in S.

3. Call the separation algorithm for L on x.

4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2).
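The loop above can be sketched on a toy one-dimensional problem (ours, not the alignment LP itself): minimize x subject to x ≥ a for a family of lower bounds L kept implicit behind a separation oracle.

```python
def separation_oracle(x, lower_bounds):
    """Return a violated constraint (a lower bound x fails), or None."""
    for a in lower_bounds:
        if x < a:
            return a          # the violated inequality x >= a
    return None               # x satisfies all of L

def cutting_plane(lower_bounds):
    S = [0.0]                 # step 1: a small subset of constraints
    while True:
        x = max(S)            # step 2: solve "min x s.t. x >= a for a in S"
        cut = separation_oracle(x, lower_bounds)  # step 3
        if cut is None:
            return x          # step 4: x satisfies every constraint in L
        S.append(cut)         # otherwise add the cut and re-solve

L = [1.5, 3.0, 2.25, 0.5]
print(cutting_plane(L))       # -> 3.0
```

In the inverse-alignment setting, the oracle's role is played by an optimal-alignment computation: a sub-optimal training alignment yields a violated inequality.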

Complexity of Inverse Alignment

Theorem: Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:

1. the alignment scoring function is linear,

2. the parameter values can be bounded,

3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.

Inverse Unique-Optimal Alignment can be solved in polynomial time if, in addition:

3′. for any fixed parameter choice, a next-best alignment can be found in polynomial time.

Application to Global Alignment

Initializing the Cutting-Plane Algorithm: we consider the problem in two cases:

1. All scores and penalties varying: the parameter space can be made bounded.

2. Substitution costs are fixed: then either (1) a bounding inequality, or (2) a pair of inequalities, one a downward half-space and the other an upward half-space with the slope of the former less than the slope of the latter, can be found in O(1) time, if they exist.

Application to Global Alignment (Cont'd)

Choosing an Objective Function: again we consider two cases:

1. Fixed substitution scores: in this case we choose the objective

    max { γ }

2. Varying substitution scores: in this case we choose the objective

    max { s − i }

where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.

Application to Global Alignment (Cont'd)

For every objective, two extreme solutions exist: x_large and x_small. Then for every α with 0 ≤ α ≤ 1 we have a corresponding solution:

    x_α := α x_large + (1 − α) x_small

x_{1/2} is expected to generalize better to alignments outside the training set.

Computational Results

Computational Results (Cont'd)

Computational Results (Cont'd)

CONTRAlign

What: an extensible and fully automatic parameter-learning framework for protein pairwise sequence alignment

How: pair conditional random fields (pair-CRFs)

Who:

Pair-HMMs for Sequence Alignment

The joint probability of an alignment a of sequences x and y is a product of transition and emission probabilities along the state path, e.g.:

    P(a, x, y) = π_M · e_M(x_1, y_1) · t_MM · e_M(x_2, y_2) · … · t_MI · e_I(x_j, −) · …

Pair-HMMs … (Cont'd)

If

    w = [log π_M, log t_MM, …, log e_M(G, G), …]^T

then:

    P(a, x, y; w) = exp(w^T f(a, x, y))

where f(a, x, y) counts how often each parameter is used:

    f(a, x, y) = ( # of times the alignment starts in state M,
                   # of times the alignment follows the M→M transition,
                   …,
                   # of times the alignment generates (G, G) in state M,
                   … )
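A minimal numeric sketch of this log-linear identity, with hypothetical parameter values of our own choosing: the product of pair-HMM probabilities equals exp(w·f) of the feature counts.

```python
import math

# Hypothetical parameters for a tiny pair-HMM: the probability of starting
# in M, the M->M transition, and emitting the pair (G, G) from state M.
# w holds their logs, as in the slide.
w = [math.log(0.9), math.log(0.8), math.log(0.05)]

# Feature counts f(a, x, y) for some alignment a: it starts in M once,
# follows M->M twice, and emits (G, G) from M once.
f = [1, 2, 1]

# The log-linear form: P(a, x, y; w) = exp(w . f)
p_loglinear = math.exp(sum(wi * fi for wi, fi in zip(w, f)))

# The same probability computed directly as a product of parameters.
p_product = 0.9 * 0.8 * 0.8 * 0.05

print(abs(p_loglinear - p_product) < 1e-12)  # True: the two forms agree
```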

Training Pair-HMMs

INPUT: a set of training examples D = {(a^(i), x^(i), y^(i))} for i = 1, …, m

OUTPUT: the parameter vector w

METHOD: maximizing the joint log-likelihood of the data and alignments under constraints on w:

    l(w : D) := Σ_{i=1..m} log P(a^(i), x^(i), y^(i); w)

Generating Alignments Using Pair-HMMs

Viterbi Algorithm on a pair-HMM:

INPUT: two sequences x and y

OUTPUT: the alignment a of x and y that maximizes P(a | x, y; w)

RUNNING TIME: O(|x|·|y|)

Pair-CRFs

Directly model the conditional probabilities:

    P(a | x, y; w) = P(a, x, y; w) / Σ_{a'} P(a', x, y; w)
                   = exp(w^T f(a, x, y)) / Σ_{a'} exp(w^T f(a', x, y))

where w is a real-valued parameter vector not necessarily corresponding to log-probabilities.
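This conditional form is a softmax over candidate alignments. A small numeric sketch with made-up feature vectors for three candidates (all names and values are ours, for illustration only):

```python
import math

# Hypothetical feature vectors f(a', x, y) for three candidate alignments
# of a fixed pair (x, y), and a real-valued parameter vector w that, as in
# a pair-CRF, need not correspond to log-probabilities.
candidates = {"a1": [1, 2, 0], "a2": [1, 1, 1], "a3": [0, 3, 1]}
w = [0.5, -0.2, 1.0]

def score(f):
    """w . f(a): the unnormalized log-score of a candidate alignment."""
    return sum(wi * fi for wi, fi in zip(w, f))

# P(a | x, y; w) = exp(w . f(a)) / sum over a' of exp(w . f(a'))
z = sum(math.exp(score(f)) for f in candidates.values())
posterior = {a: math.exp(score(f)) / z for a, f in candidates.items()}

print(sum(posterior.values()))            # normalizes to 1
print(max(posterior, key=posterior.get))  # the highest-scoring alignment
```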

Training Pair-CRFs

INPUT: a set of training examples D = {(a^(i), x^(i), y^(i))} for i = 1, …, m

OUTPUT: a real-valued parameter vector w

METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning):

    l(w : D) := Σ_{i=1..m} log P(a^(i) | x^(i), y^(i); w) + log P(w)

where P(w) ∝ exp(−C Σ_j w_j²) is a Gaussian prior on w, to prevent over-fitting.

Properties of Pair-CRFs

Far weaker independence assumptions than pair-HMMs

Capable of utilizing complex, non-independent feature sets

Directly optimizes predictive ability, ignoring P(x, y), the model that generates the input sequences

Choice of Model Topology in CONTRAlign

Some possible model topologies:

CONTRAlign_Double-Affine

CONTRAlign_Local

Choice of Feature Sets in CONTRAlign

Some possible feature sets to utilize:

1. Hydropathy-based gap context features (CONTRAlign_HYDROPATHY)

2. External information:

2.1. Secondary structure (CONTRAlign_DSSP)

2.2. Solvent accessibility (CONTRAlign_ACCESSIBILITY)

Results: Comparison of Model Topologies and Feature Sets

Results: Comparison to Modern Sequence Alignment Tools

Results: Alignment Accuracy in the "Twilight Zone"

For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali), in that order, and the colored bar indicates the accuracy for CONTRAlign.

Questions?

Thank You!