![Page 1: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/1.jpg)
A Similar Fragments Merging Approach to Learn Automata on
Proteins
Goulven KERBELLEC & François COSTEIRISA / INRIA Rennes
![Page 2: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/2.jpg)
Outline of the talk Protein families signatures Similar Fragment Merging Approach (Protomata-L)
Characterization Similar Fragment Pairs (SFPs) Ordering the SFPs
Generalization Merging of SFP in an automaton Gap generalization Identification of Physico-chemical properties
Experiments
![Page 3: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/3.jpg)
Protein families Amino acid alphabet :
Protein sequence :
Protein data set :
>AQP1_BOVINMASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…
>AQP1_BOVINMASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…>AQP2_RATMWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH…>AQP3_MOUSEMGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT…
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
Common function & Common topology (3D structure)
![Page 4: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/4.jpg)
Characterization of a protein family
x x x x
x x x x x x x x
C H x \ / x
x Zn x x / \ x
C H x x x x
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
ZBT11 ...Csi..CgrtLpklyslriHmlk..H...
ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...
ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern
![Page 5: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/5.jpg)
Expressivity classes of patternsClass Example
A T-C-T-T-G-A
B D-R-C-C-x(2)-H-D-x-C
C G-G-G-T-F-[ILV]-[ST]-[ILV]
D V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]
E G-C-x(1,3)-C-P-x(8,10)-C-C
F C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H
G D-T-A-G-Q-E-*-L-V-G-N-K
H D-T-A-G-[NQ]-*-L-V-G-N-[KEH]
I D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]
J Regular Expression / Automaton
PROSITE PRATTTEIRESIAS
PROTOMATA-L
![Page 6: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/6.jpg)
Characterization
![Page 7: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/7.jpg)
Similar Fragment Pairs Significantly similar fragment pairs (SFPs) Natural selection Important area characterization
Data set D:
![Page 8: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/8.jpg)
Ordering the SFPs Problem :
Solution : ordering the SFPs by scoring each SFP S(f1,f2)= ? 3 different scoring functions :
dialign Sd support Ss implication Si
![Page 9: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/9.jpg)
Dialign Score
Sd ( f1 , f2 ) = - log P ( L , Sim )
L = |f1| = |f2| Sim = Sum of the individual similarity values P = Probability that a random SFP of the same L
has the same S
Blossum62similarity
![Page 10: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/10.jpg)
Support Score
Taking into account the representativeness of SFP
Ss (f1,f2,D) = Number of sequences supporting <f1,f2>
f1f2
f
<f1,f2> is supported by f with respectthe triangular inequality :Sd(f,f1) + Sd(f,f2) Sd(f1,f2)
![Page 11: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/11.jpg)
Implication Score Taking into account a counter-example set N Discriminative fragments Lerman index:
Si(f1,f2,D,N) =
avec P(X) =
-P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N)
P( Ss(f1 ,f2 ,D) ) x |N|
|X|
|D| + |N|
![Page 12: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/12.jpg)
Generalization
![Page 13: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/13.jpg)
From protein data sets to automata
MASEIKLFW
M A S E I K L F W
![Page 14: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/14.jpg)
From protein data sets to automata
MASEIKLFW
MGYEVKYRV
M G Y E V K Y R V
M A S E I K L F W
![Page 15: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/15.jpg)
Merging SFPs
MASEIKLFW
MGYEVKYRV
M G Y E V K Y R V
M A S E I K L F W
![Page 16: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/16.jpg)
Merging SFPs
MASEIKLFW
MGYEVKYRV
M G YE K
Y R V
M AS L
F W[I,V]
![Page 17: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/17.jpg)
Merging SFPsMASEIKLFW
MGYEVKYRV
M G YE [I,V] K
Y R V
M AS L
F W
MASEVKLFM MGYEIKYRV
MASEIKYRV MGYEVKLFW
MASEVKYRV MGYEIKLFW
![Page 18: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/18.jpg)
Protein Sequence Data SetList of SFPs
MCA
Automaton / Regular Expression
Ordered List of SFPs
MERGING
![Page 19: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/19.jpg)
Gap Generalization Merging on themself non-representative transitions Treat them as "gaps"
![Page 20: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/20.jpg)
Identification of Physico-chemical properties
Similar Fragments ~ potential function area Amino acids share out the same position Physicochemical property at play => Generalization from a group (of amino acids) to a Taylor group
I,V I,Q,W,P
aliphatic
xI,L,V
no information
C
C
[I,V] [I,L,V] C C[I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
![Page 21: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/21.jpg)
Likelihood ratio test To decide if the multi-set A has been generated
according to a physico-chemical group G or not by a likelihood ratio test:
Given a threshold , we test the expansion of A to G and reject it when LRG/A <
![Page 22: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/22.jpg)
Experiments
![Page 23: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/23.jpg)
MIP : the Major Intrinsic Protein Family
FamilyMIP
SubfamiliesAQP, Glpf, Gla
![Page 24: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/24.jpg)
Data sets
UNIPROTMIP in SWISS-PROT
Set « T » (159 seq)
Set « M» (44 seq)identity<90%
Set « W+» (24 seq)
Set « W-» (16 seq)
Set « C» (49 seq)Blast(1<e<100) not MIP
Set « E» (79 seq)
Set « U » (911 seq)
Water-specific
![Page 25: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/25.jpg)
Experiments First Common Fragment on a Family
MIP family Positive set Comparison with pattern discovery tools
Teiresias Pratt Protomata-L (short pattern)
Water-specific Characterization MIP sub-families Positive and negative sets Leave-one-out cross-validation
Protomata-L (short to long pattern)
![Page 26: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/26.jpg)
First Common Fragment Automaton
Results of 4 patterns scannedon Swiss-Prot protein Database
Set « M» (44 seq)
Learning Set
Learning set
Set « T » (159 seq)Target set
![Page 27: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/27.jpg)
From short automata to long automata
Previous experiment only the first SFPs of the ordered list of SFPs short automaton first common fragment automaton
Next experiment larger cut-offs in the list of SFPs Protomat-L is able to create longer automata with more
common subparts Long patterns are closed of the topoly (3D-structure) of
the family
![Page 28: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/28.jpg)
Water-specific characterization Leave-one-out cross-validation
Learning set W+ \ Si : Positive learning set W- \ Sj : Negative learning set
Test set { Si U Sj }
Control set Set T
Implication score
Set « W+» (24 seq)
Set « W-» (16 seq)
Set « C» (49 seq)
![Page 29: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/29.jpg)
Leave-one-out cross-validation
![Page 30: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/30.jpg)
Error Correcting Cost The error correcting cost of a sequence S represents the
distance (blossum similarity) between S and the closest sequence given by the automaton A.
Distibution of sequences with long automata (size Approx. 100)
![Page 31: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/31.jpg)
Leave-one-out cross-validationWith Error Correcting Cost
![Page 32: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/32.jpg)
Leave-one-out cross-validation
![Page 33: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/33.jpg)
Conclusion & Perspective Good characterization of protein family using automata
(-> hmm structure) No need of a multiple alignment greedy data-driven algorithm
Important subparts localization Physico-chemical identification and generalization
Counter example sets Bringing of knowledge is possible in automata
(-> 2D structure)
![Page 34: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/34.jpg)
Questions ?
?
?
?
?
?
??
??
?
?
?
?? ?
![Page 35: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/35.jpg)
Demo
![Page 36: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/36.jpg)
Protomata-L ’s Approach
First Common Fragment
![Page 37: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/37.jpg)
Protomata-L ’s Approach
To get a more precise automaton
![Page 38: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/38.jpg)
IDENTIFICATION OF PHYSICOCHEMICAL
GROUPS
Data set (Protein sequences)
Pairs of fragments
SORT
EXTRACTION
Initial Automaton(MCA)
MERGING
IDENTIFICATION OF « GAPS »
![Page 39: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/39.jpg)
Structural discrimination
![Page 40: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/40.jpg)
![Page 41: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/41.jpg)
Aromatique
Hydrophobe
Non Informatif
Generalization of an Aquaporins automaton
![Page 42: A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes](https://reader030.vdocument.in/reader030/viewer/2022032516/56649c7d5503460f94932478/html5/thumbnails/42.jpg)
Physico-chemical properties identification
Ratio likelihood test
AliphaticSmallx