A Similar Fragments Merging Approach to Learn Automata on Proteins
Goulven KERBELLEC & François COSTE
IRISA / INRIA Rennes
Outline of the talk
Protein families signatures
Similar Fragment Merging Approach (Protomata-L)
  Characterization
    Similar Fragment Pairs (SFPs)
    Ordering the SFPs
  Generalization
    Merging of SFP in an automaton
    Gap generalization
    Identification of physico-chemical properties
Experiments
Protein families
Amino acid alphabet:
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
Protein sequence:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…
Protein data set:
>AQP1_BOVIN
MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP2_RAT
MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH…
>AQP3_MOUSE
MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT…
Common function & common topology (3D structure)
Characterization of a protein family
[Figure: zinc finger, with two C and two H residues bound to a Zn ion and the remaining positions shown as x]
Zinc Finger Pattern
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H...
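The zinc finger pattern can be checked against the three aligned fragments with a short script. This is a minimal sketch, not part of Protomata-L: the helper name `prosite_to_regex` is hypothetical, it only handles the subset of the PROSITE syntax used in this pattern (single residues, `[...]` groups, `x`, and `(n)` or `(n,m)` repeats), and it assumes the dots in the fragments are alignment gaps or elisions that can be dropped.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex (hypothetical helper)."""
    element_re = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = match.groups()
        regex_symbol = "." if symbol == "x" else symbol  # x means any amino acid
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(regex_symbol + repeat)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"

# Fragments copied from the slide; dots are assumed to be gaps or elisions.
FRAGMENTS = {
    "ZBT11": "...Csi..CgrtLpklyslriHmlk..H...",
    "ZBT10": "...Cdi..CgklFtrrehvkrHslv..H...",
    "ZBT34": "...Ckf..CgkkYtrkdqleyHirg..H...",
}

regex = re.compile(prosite_to_regex(ZINC_FINGER))
for name, fragment in FRAGMENTS.items():
    residues = fragment.replace(".", "").upper()
    print(name, bool(regex.search(residues)))
```

Patterns of the higher classes that use `*` are outside what this helper accepts and raise a `ValueError`.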
Expressivity classes of patterns
Class  Example
A      T-C-T-T-G-A
B      D-R-C-C-x(2)-H-D-x-C
C      G-G-G-T-F-[ILV]-[ST]-[ILV]
D      V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E      G-C-x(1,3)-C-P-x(8,10)-C-C
F      C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G      D-T-A-G-Q-E-*-L-V-G-N-K
H      D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I      D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J      Regular Expression / Automaton
PROSITE, PRATT, TEIRESIAS
PROTOMATA-L
Characterization
Similar Fragment Pairs
Significantly similar fragment pairs (SFPs)
Natural selection
Important area characterization
Data set D:
Ordering the SFPs
Problem:
Solution: ordering the SFPs by scoring each SFP
S(f1,f2) = ?
3 different scoring functions:
  dialign Sd
  support Ss
  implication Si
Dialign Score
Sd(f1, f2) = -log P(L, Sim)
L = |f1| = |f2|
Sim = sum of the individual similarity values (BLOSUM62 similarity)
P = probability that a random SFP of the same length L has the same similarity sum Sim
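The shape of this score can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the Dialign implementation: BLOSUM62 is replaced by a 0/1 identity score, residues are assumed uniformly distributed over the 20 amino acids, and P is taken as the probability that a random pair of length L reaches at least the observed sum (the usual Dialign convention). All function names are hypothetical.

```python
import math

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def toy_similarity(a: str, b: str) -> int:
    """Stand-in for a BLOSUM62 lookup: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

def similarity_sum(f1: str, f2: str) -> int:
    """Sim: sum of the position-wise similarity values of two equal-length fragments."""
    if len(f1) != len(f2):
        raise ValueError("fragments of an SFP must have the same length")
    return sum(toy_similarity(a, b) for a, b in zip(f1, f2))

def random_pair_probability(length: int, sim: int) -> float:
    """P(L, Sim) under the toy model: binomial tail with match probability 1/20."""
    p = 1.0 / len(ALPHABET)
    return sum(
        math.comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(sim, length + 1)
    )

def dialign_like_score(f1: str, f2: str) -> float:
    """Sd(f1, f2) = -log P(L, Sim), with the toy P defined above."""
    return -math.log(random_pair_probability(len(f1), similarity_sum(f1, f2)))

print(dialign_like_score("MASE", "MGYE"))  # two identical positions out of four
print(dialign_like_score("MASE", "MASE"))  # identical fragments score higher
```

With a real substitution matrix the binomial tail no longer applies, and P would have to be computed or estimated from the distribution of similarity sums.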
Support Score
Taking into account the representativeness of the SFP
Ss(f1,f2,D) = number of sequences supporting <f1,f2>
<f1,f2> is supported by a fragment f with respect to the triangular inequality:
Sd(f,f1) + Sd(f,f2) ≥ Sd(f1,f2)
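Counting the supporting sequences can be sketched as follows. This is an illustration under stated assumptions rather than the Protomata-L code: the score is a toy identity count standing in for Sd, the comparison is assumed to be "greater than or equal", candidate fragments f are all windows of the same length as f1, and the function names are hypothetical.

```python
def toy_score(f1: str, f2: str) -> int:
    """Stand-in for Sd: number of identical positions."""
    return sum(a == b for a, b in zip(f1, f2))

def windows(sequence: str, length: int):
    """All contiguous fragments of the given length."""
    return (sequence[i:i + length] for i in range(len(sequence) - length + 1))

def supports(sequence: str, f1: str, f2: str, score=toy_score) -> bool:
    """True if some fragment f of the sequence satisfies the assumed inequality."""
    threshold = score(f1, f2)
    return any(
        score(f, f1) + score(f, f2) >= threshold
        for f in windows(sequence, len(f1))
    )

def support_score(f1: str, f2: str, dataset, score=toy_score) -> int:
    """Ss(f1, f2, D): number of sequences of D supporting the pair."""
    return sum(supports(seq, f1, f2, score) for seq in dataset)

dataset = ["MASEIKLFW", "MGYEVKYRV", "AAAAAAAAA"]
print(support_score("EIK", "EVK", dataset))
```

The toy score is permissive; with a calibrated Sd the inequality selects only fragments that are themselves significantly similar to both members of the pair.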
Implication Score
Taking into account a counter-example set N
Discriminative fragments
Lerman index:
Si(f1,f2,D,N) = ( -P(Ss(f1,f2,N)) + P(Ss(f1,f2,D)) x P(N) ) / ( P(Ss(f1,f2,D)) x |N| )
with P(X) = |X| / (|D| + |N|)
Generalization
From protein data sets to automata
MASEIKLFW
MGYEVKYRV
[Figure: one linear chain of transitions per sequence, M A S E I K L F W and M G Y E V K Y R V]
Merging SFPs
MASEIKLFW
MGYEVKYRV
[Figure: the two chains M G Y E V K Y R V and M A S E I K L F W before merging; after merging, both share the path E [I,V] K, reached from M A S or M G Y and followed by L F W or Y R V]
Sequences newly accepted after merging:
MASEVKLFW MGYEIKYRV
MASEIKYRV MGYEVKLFW
MASEVKYRV MGYEIKLFW
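The effect of merging one SFP can be reproduced on the two example sequences. The sketch below is a simplified illustration, not the Protomata-L implementation: the automaton is a plain set of (state, letter, state) transitions with one chain per sequence, merging identifies the states of the two fragments position by position, and the enumeration assumes the automaton is acyclic. Function names are hypothetical.

```python
def build_chains(sequences):
    """One chain per sequence; state (i, j) means j letters of sequence i were read."""
    return {
        ((i, j), letter, (i, j + 1))
        for i, seq in enumerate(sequences)
        for j, letter in enumerate(seq)
    }

def merge_sfp(transitions, start1, start2, length):
    """Merge the states of two fragments of equal length, position by position."""
    (i1, s1), (i2, s2) = start1, start2
    rename = {(i2, s2 + k): (i1, s1 + k) for k in range(length + 1)}
    return {(rename.get(p, p), a, rename.get(q, q)) for p, a, q in transitions}

def accepted(transitions, initials, finals):
    """Enumerate the accepted sequences (assumes an acyclic automaton)."""
    outgoing = {}
    for p, a, q in transitions:
        outgoing.setdefault(p, []).append((a, q))
    words, stack = set(), [(state, "") for state in initials]
    while stack:
        state, word = stack.pop()
        if state in finals:
            words.add(word)
        for a, q in outgoing.get(state, []):
            stack.append((q, word + a))
    return words

sequences = ["MASEIKLFW", "MGYEVKYRV"]
initials = {(i, 0) for i in range(len(sequences))}
finals = {(i, len(seq)) for i, seq in enumerate(sequences)}

# Merge the fragments EIK (sequence 0, offset 3) and EVK (sequence 1, offset 3).
merged = merge_sfp(build_chains(sequences), (0, 3), (1, 3), 3)
words = accepted(merged, initials, finals)
print(sorted(words - set(sequences)))  # sequences gained by the generalization
```

The merged automaton accepts eight sequences: the two originals plus the six recombinations listed above.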
[Figure: overview linking the protein sequence data set, the list of SFPs, the ordered list of SFPs, the MCA, the MERGING step and the resulting automaton / regular expression]
Gap Generalization
Merging non-representative transitions on themselves
Treating them as "gaps"
Identification of Physico-chemical properties
Similar fragments ~ potential functional area
Amino acids sharing the same position
Physico-chemical property at play => generalization from a group (of amino acids) to a Taylor group
Examples:
[I,V] -> [I,L,V] (aliphatic)
C -> C
[I,Q,W,P] -> x = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} (no information)
Likelihood ratio test
To decide whether the multi-set A has been generated according to a physico-chemical group G or not, we use a likelihood ratio test.
Given a threshold, we test the expansion of A to G and reject it when LRG/A is below the threshold.
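The decision rule can be illustrated with one possible likelihood ratio. The exact statistic is not given here, so the sketch below makes its own assumptions: the group model draws residues uniformly from G, the alternative model uses the frequencies observed in A, and the ratio of the two likelihoods is compared with a threshold. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter

def likelihood_ratio(multiset, group) -> float:
    """Assumed ratio: L(A | uniform over G) / L(A | observed frequencies of A)."""
    counts = Counter(multiset)
    if any(residue not in group for residue in counts):
        return 0.0  # G cannot generate A at all
    total = sum(counts.values())
    log_group = total * math.log(1.0 / len(group))
    log_observed = sum(n * math.log(n / total) for n in counts.values())
    return math.exp(log_group - log_observed)

def accept_expansion(multiset, group, threshold: float) -> bool:
    """Reject the expansion of A to G when the ratio falls below the threshold."""
    return likelihood_ratio(multiset, group) >= threshold

ALIPHATIC = set("ILV")
print(likelihood_ratio("IVIV", ALIPHATIC))
print(accept_expansion("IVIV", ALIPHATIC, threshold=0.1))
print(accept_expansion("IQWP", ALIPHATIC, threshold=0.1))
```

A background distribution restricted to G could replace the uniform group model without changing the structure of the test.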
Experiments
MIP: the Major Intrinsic Protein Family
Family: MIP
Subfamilies: AQP, Glpf, Gla
Data sets
UNIPROT: MIP in SWISS-PROT
Set « T » (159 seq)
Set « M » (44 seq): identity < 90%
Set « W+ » (24 seq)
Set « W- » (16 seq)
Set « C » (49 seq): Blast (1 < e < 100), not MIP
Set « E » (79 seq)
Set « U » (911 seq)
Water-specific
Experiments
First Common Fragment on a Family
  MIP family
  Positive set
  Comparison with pattern discovery tools:
    Teiresias
    Pratt
    Protomata-L (short pattern)
Water-specific Characterization
  MIP sub-families
  Positive and negative sets
  Leave-one-out cross-validation
  Protomata-L (short to long pattern)
First Common Fragment Automaton
Results of 4 patterns scanned on the Swiss-Prot protein database
Learning set: Set « M » (44 seq)
Target set: Set « T » (159 seq)
From short automata to long automata
Previous experiment
  only the first SFPs of the ordered list of SFPs
  short automaton
  first common fragment automaton
Next experiment
  larger cut-offs in the list of SFPs
  Protomata-L is able to create longer automata with more common subparts
  Long patterns are close to the topology (3D structure) of the family
Water-specific characterization
Leave-one-out cross-validation
Learning set
  W+ \ Si: positive learning set
  W- \ Sj: negative learning set
Test set: { Si U Sj }
Control set: Set T
Implication score
Set « W+ » (24 seq)
Set « W- » (16 seq)
Set « C » (49 seq)
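The splits used in this protocol can be generated as follows. This is a minimal sketch: the slide does not say how the left-out sequences Si and Sj are paired, so the code assumes every (Si, Sj) combination is tried, and the function name is hypothetical.

```python
from itertools import product

def leave_one_out_splits(positives, negatives):
    """Yield (W+ \\ Si, W- \\ Sj, {Si, Sj}) for every pair of left-out sequences."""
    for i, j in product(range(len(positives)), range(len(negatives))):
        yield {
            "positive_learning": positives[:i] + positives[i + 1:],
            "negative_learning": negatives[:j] + negatives[j + 1:],
            "test": [positives[i], negatives[j]],
        }

# Placeholder identifiers; real runs would use the W+ and W- sequences.
splits = list(leave_one_out_splits(["p1", "p2", "p3"], ["n1", "n2"]))
print(len(splits))
print(splits[0])
```

Under this assumption, 24 positive and 16 negative sequences give 24 x 16 = 384 learning runs; leaving out one sequence at a time from each set independently would be a cheaper alternative.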
Leave-one-out cross-validation
Error Correcting Cost
The error correcting cost of a sequence S represents the distance (BLOSUM similarity) between S and the closest sequence given by the automaton A.
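A simplified version of this cost can be computed by dynamic programming over the automaton. The sketch below rests on assumptions that are not in the slides: only substitutions are allowed (no insertions or deletions), the cost is a 0/1 mismatch count standing in for a BLOSUM-derived distance, and the automaton is given as a set of (state, letter, state) transitions. All names are hypothetical.

```python
import math
from functools import lru_cache

def mismatch_cost(a: str, b: str) -> float:
    """Stand-in for a BLOSUM-derived substitution cost."""
    return 0.0 if a == b else 1.0

def error_correcting_cost(sequence, transitions, initials, finals, cost=mismatch_cost):
    """Cheapest substitution-only rewriting of the sequence into an accepted one."""
    outgoing = {}
    for p, a, q in transitions:
        outgoing.setdefault(p, []).append((a, q))

    @lru_cache(maxsize=None)
    def best(state, position):
        if position == len(sequence):
            return 0.0 if state in finals else math.inf
        return min(
            (cost(sequence[position], a) + best(q, position + 1)
             for a, q in outgoing.get(state, [])),
            default=math.inf,
        )

    return min(best(state, 0) for state in initials)

# Toy automaton accepting M [I,V] K.
transitions = {(0, "M", 1), (1, "I", 2), (1, "V", 2), (2, "K", 3)}
for candidate in ["MVK", "MLK", "MLR", "MK"]:
    print(candidate, error_correcting_cost(candidate, transitions, {0}, {3}))
```

A sequence whose length matches no accepted path gets an infinite cost here; handling insertions and deletions would need extra moves in the recursion.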
Distribution of sequences with long automata (size approx. 100)
Leave-one-out cross-validation with Error Correcting Cost
Conclusion & Perspective
Good characterization of protein families using automata (-> HMM structure)
No need for a multiple alignment: greedy data-driven algorithm
Localization of important subparts
Physico-chemical identification and generalization
Counter-example sets
Adding knowledge to automata is possible (-> 2D structure)
Questions ?
Demo
Protomata-L's Approach
First Common Fragment
Protomata-L's Approach
To get a more precise automaton
[Figure: pipeline with the steps EXTRACTION, SORT, MERGING, IDENTIFICATION OF « GAPS » and IDENTIFICATION OF PHYSICOCHEMICAL GROUPS, linking the data set (protein sequences), the pairs of fragments and the initial automaton (MCA)]
Structural discrimination
Aromatic
Hydrophobic
Non-informative
Generalization of an Aquaporins automaton
Physico-chemical properties identification
Likelihood ratio test
Aliphatic, Small, x