Post on 21-Dec-2015
Multiple Sequence Alignment: NP-Hardness and How to Deal with It
Jens Stoye
Bielefeld University, Germany
Preliminaries: Pairwise Alignment
>pdb|1KSW|A Chain A, Structure Of Human C-Src Tyrosine Kinase (Thr338gly Mutant) In Complex With N6-Benzyl Adp Length=452
Score = 161 bits (408), Expect = 5e-47, Method: Compositional matrix adjust.
Identities = 81/85 (95%), Positives = 81/85 (95%), Gaps = 1/85 (1%)

Query  1    PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  60
            PRESLRLE KLGQGCFGEVWMGTWN TTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
Sbjct  182  PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  241

Query  61   VQLYAVVS-EPIYIVIEYMSKGSLL  84
            VQLYAVVS EPIYIV EYMSKGSLL
Sbjct  242  VQLYAVVSEEPIYIVGEYMSKGSLL  266
PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL

VQLYAVVS-EPIYIVIEYMSKGSLL
VQLYAVVSEEPIYIVGEYMSKGSLL
Preliminaries: Pairwise Alignment
Find the best alignment of two sequences: highest score / lowest cost
Analysis: O(n^2) time
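The quadratic running time comes from filling an (n+1) x (m+1) dynamic programming table, one cell per pair of prefixes. Below is a minimal sketch of global pairwise alignment in the Needleman-Wunsch style. The function name `global_align` and the unit scores (match 1, mismatch -1, linear gap -2) are illustrative assumptions, not taken from the slides; real protein alignments would typically use a substitution matrix and affine gap costs.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions (simple match/mismatch, linear gaps).
    """
    n, m = len(a), len(b)
    # S[i][j] = best score of aligning the prefixes a[:i] and b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          S[i - 1][j] + gap,       # a[i-1] against a gap
                          S[i][j - 1] + gap)       # b[j-1] against a gap

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return S[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))


if __name__ == "__main__":
    score, x, y = global_align("VQLYAVVSEPIYIV", "VQLYAVVSEEPIYIV")
    print(score)  # 12: fourteen matches and one gap
    print(x)
    print(y)
```

Both nested loops touch each table cell once with constant work, which gives the O(n^2) bound for two sequences of similar length.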
Multiple Alignment
k sequences, not just 2
sp|P00526|SRC   ---GLAK--DAWEIPRESLRLEAKLGQGCFGEVWMGTWND-TTRVAIKTLKPGT--MSPE 52
sp|P00527|YES   ---GLAK--DAWEIPRESLRLEVKLGQGCFGEVWMGTWNG-TTKVAIKTLKLGT--MMPE 52
sp|P00521|ABL   TIYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDT--MEVE 58
sp|P00542|FES   -VLNRAVPKDKWVLNHEDLVLGEQIGRGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKA 59
sp|P00530|FPS   -VLTRAVLKDKWVLNHEDVLLGERIGRGNFGEVFSGRLRADNTPVAVKSCRETLPPELKA 59
sp|P00532|KRAF  -------SSYYWKMEASEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTP--EQLQ 51

sp|P00526|SRC   AFLQEAQVMKKLRHEKLVQLYAVVSEEP-IYIVIEYMSKGSLLDFLKGEMGKYLRLPQLV 111
sp|P00527|YES   AFLQEAQIMKKLRHDKLVPLYAVVSEEP-IYIVTEFMTKGSLLDFLKEGEGKFLKLPQLV 111
sp|P00521|ABL   EFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVSAVVLL 118
sp|P00542|FES   KFLQEAKILKQYSHPNIVRLIGVCTQKQPIYIVMELVQGGDFLTFLRT-EGARLRMKTLL 118
sp|P00530|FPS   KFLQEARILKQCNHPNIVRLIGVCTQKQPIYIVMELVQGGDFLSFLRS-KGPRLKMKKLI 118
sp|P00532|KRAF  AFRNEVAVLRKTRHVNILLFMGYMTKDN-LAIVTQWCEGSSLYKHLHV-QETKFQMFQLI 109

sp|P00526|SRC   DMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAK- 170
sp|P00527|YES   DMAAQIADGMAYIERMNYIHRDLRAANILVGDNLVCKIADFGLARLIEDNEYTARQGAK- 170
sp|P00521|ABL   YMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAK- 177
sp|P00542|FES   QMVGDAAAGMEYLESKCCIHRDLAARNCLVTEKNVLKISDFGMSREAADGIYAASGGLRQ 178
sp|P00530|FPS   KMMENAAAGMEYLESKHCIHRDLAARNCLVTEKNTLKISDFGMSRQEEDGVYASTGGMKQ 178
sp|P00532|KRAF  DIARQTAQGMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPT 169

sp|P00526|SRC   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELTTKGRVPYPGMVNR-EVLDQVERGY 226
sp|P00527|YES   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELVTKGRVPYPGMVNR-EVLEQVERGY 226
sp|P00521|ABL   FPIKWTAPESLAYN---KFSIKSDVWAFGVLLWEIATYGMSPYPGIDLS-QVYELLEKDY 233
sp|P00542|FES   VPVKWTAPEALNYG---RYSSESDVWSFGILLWETFSLGASPYPNLSNQ-QTREFVEKGG 234
sp|P00530|FPS   IPVKWTAPEALNYG---WYSSESDVWSFGILLWEAFSLGAVPYANLSNQ-QTREAIEQGV 234
sp|P00532|KRAF  GSVLWMAPEVIRMQDDNPFSFQSDVYSYGIVLYELMAG-ELPYAHINNRDQIIFMVGRGY 228

sp|P00526|SRC   RMPCP----PECPESLHDLMCQCWRKDPEERPTFKYLQAQLLPACVLEVAE- 273
sp|P00527|YES   RMPCP----QGCPESLHELMKLCWKKDPDERPTFEYIQSFLEDYFTAAEPSG 274
sp|P00521|ABL   RMERP----EGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSIS- 280
sp|P00542|FES   RLPCP----ELCPDAVFRLMEQCWAYEPGQRPSFSAIYQELQSIRKRHR--- 279
sp|P00530|FPS   RLEPP----EQCPEDVYRLMQRCWEYDPHRRPSFGAVHQDLIAIRKRHR--- 279
sp|P00532|KRAF  ASPDLSRLYKNCPKAIKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 280
Multiple Alignment – Why?
Highlight similarities of the sequences in a family:
– sequence assembly
– molecular modeling, structure-function conclusions
– database search (sequence families)
– protein domains
– primer design

Highlight dissimilarities between the sequences in a family:
– reconstruction of phylogenetic trees
– analysis of single nucleotide polymorphisms (SNPs)
"One or two homologous sequences whisper ... a full multiple alignment shouts out loud"
(Hubbard et al., 1996)
Multiple Alignment Objective Functions
• Find the best alignment of k sequences: highest score / lowest cost
• Based on pairwise projections:
a) sum of all pairs:
b) tree alignment score:
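As an illustration of the pairwise-projection idea, here is a minimal sketch of a sum-of-pairs (SP) score, assuming the common definition in which the score of a multiple alignment is the sum of the scores of all its pairwise projections, with columns that pair two gaps ignored. The names `pair_score` and `sp_score` and the unit scoring values are assumptions for illustration, not taken from the slides. A tree alignment score would sum only over the pairs joined by an edge of a given tree rather than over all pairs.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one column of a pairwise projection (illustrative values)."""
    if x == '-' and y == '-':
        return 0            # gap-gap columns vanish in the projection
    if x == '-' or y == '-':
        return gap
    return match if x == y else mismatch


def sp_score(rows, **scoring):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(pair_score(x, y, **scoring)
               for r, s in combinations(rows, 2)   # every pair of rows
               for x, y in zip(r, s))              # every column


if __name__ == "__main__":
    print(sp_score(["AC-T", "ACGT", "A--T"]))  # -1
```

With k rows there are k(k-1)/2 projections, so evaluating the SP score of a given alignment is cheap; the hard part is finding the alignment that optimizes it.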
NP Hardness
CS terminology:
The computational problem of SP (sum-of-pairs) multiple sequence alignment is NP-hard.
In practice:
Don't even try it for more than 10 or 12 sequences.
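A rough way to see the blow-up: the straightforward generalization of the pairwise dynamic program to k sequences of length n fills a table with (n+1)^k cells, and each cell has up to 2^k - 1 predecessor cells. The helper below, with the hypothetical name `exact_dp_cost`, only evaluates these two counts; it ignores the extra cost of scoring each column and any pruning that exact tools may apply.

```python
def exact_dp_cost(n, k):
    """Return (cells, transitions) for the naive k-dimensional DP table.

    cells       = (n + 1) ** k       one cell per tuple of prefix lengths
    transitions = cells * (2**k - 1) each cell looks at up to 2^k - 1 neighbours
    """
    cells = (n + 1) ** k
    transitions = cells * (2 ** k - 1)
    return cells, transitions


if __name__ == "__main__":
    for k in (2, 3, 6, 12):
        cells, transitions = exact_dp_cost(100, k)
        print(f"k={k:2d}  cells={cells:.3e}  transitions={transitions:.3e}")
```

For sequences of length 100, two sequences need about ten thousand cells, three need about a million, and the count grows by a factor of roughly 100 with every further sequence.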
What can we do?
– compute anyway
– running time heuristics
– approximation algorithms
– fixed parameter algorithms
– correctness heuristics
Multiple Alignment in Practice
Mostly progressive, e.g. CLUSTAL W
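The progressive idea can be sketched as follows: start from one sequence and add the others one at a time, aligning each new sequence against the columns of the alignment built so far and never reopening earlier gaps. This is a deliberately simplified sketch under stated assumptions: sequences are added in input order, whereas tools such as CLUSTAL W typically derive the merge order from a guide tree; scoring is a plain sum-of-pairs with illustrative unit values and linear gaps. The names `col_score`, `add_sequence`, and `progressive_align` are hypothetical.

```python
def col_score(col, ch, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of symbol ch against every symbol of a column."""
    total = 0
    for x in col:
        if x == '-' and ch == '-':
            continue                      # gap-gap pairs score nothing
        if x == '-' or ch == '-':
            total += gap
        else:
            total += match if x == ch else mismatch
    return total


def add_sequence(rows, seq, **sc):
    """Align seq against the existing alignment rows (columns stay intact)."""
    cols = [''.join(c) for c in zip(*rows)]
    n, m = len(cols), len(seq)
    gap_col = '-' * len(rows)             # a new all-gap column for insertions
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + col_score(cols[i - 1], '-', **sc)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + col_score(gap_col, seq[j - 1], **sc)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1], **sc),
                S[i - 1][j] + col_score(cols[i - 1], '-', **sc),
                S[i][j - 1] + col_score(gap_col, seq[j - 1], **sc))

    # Trace back, emitting extended columns from right to left.
    new_cols = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1], **sc)):
            new_cols.append(cols[i - 1] + seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + col_score(cols[i - 1], '-', **sc):
            new_cols.append(cols[i - 1] + '-'); i -= 1
        else:
            new_cols.append(gap_col + seq[j - 1]); j -= 1
    new_cols.reverse()
    return [''.join(r) for r in zip(*new_cols)]


def progressive_align(seqs, **sc):
    """Build a multiple alignment by adding sequences in input order."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = add_sequence(rows, s, **sc)
    return rows


if __name__ == "__main__":
    for row in progressive_align(["ACGT", "AGT", "ACT"]):
        print(row)
```

Each step is a pairwise-style dynamic program, so the whole procedure stays polynomial, but the result depends on the order of addition and carries no optimality guarantee for the SP score.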
Not covered:
hybrid approaches, e.g. T-COFFEE, MAUVE, Clustal Omega
local multiple alignment, e.g. DIALIGN