a similar fragments merging approach to learn automata on proteins

A Similar Fragments Merging Approach to Learn Automata on

Proteins

Goulven KERBELLEC & François COSTEIRISA / INRIA Rennes

Outline of the talk Protein families signatures Similar Fragment Merging Approach (Protomata-L)

Characterization Similar Fragment Pairs (SFPs) Ordering the SFPs

Generalization Merging of SFP in an automaton Gap generalization Identification of Physico-chemical properties

Experiments

Protein families Amino acid alphabet :

Protein sequence :

Protein data set :

>AQP1_BOVINMASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…

>AQP1_BOVINMASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI…>AQP2_RATMWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH…>AQP3_MOUSEMGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT…

{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

Common function & Common topology (3D structure)

Characterization of a protein family

x x x x

x x x x x x x x

C H x \ / x

x Zn x x / \ x

C H x x x x

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

ZBT11 ...Csi..CgrtLpklyslriHmlk..H...

ZBT10 ...Cdi..CgklFtrrehvkrHslv..H...

ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern

Expressivity classes of patternsClass Example

A T-C-T-T-G-A

B D-R-C-C-x(2)-H-D-x-C

C G-G-G-T-F-[ILV]-[ST]-[ILV]

D V-x-P-x(2)-[RQ]-x(4)-G-x(2)-L-[LM]

E G-C-x(1,3)-C-P-x(8,10)-C-C

F C-x(2,4)-C-x(3)-[ILVFYC]-x(8)-H-x(3,5)-H

G D-T-A-G-Q-E-*-L-V-G-N-K

H D-T-A-G-[NQ]-*-L-V-G-N-[KEH]

I D-T-A-x(2,5)-G-[NQ]-*-L-V-G-N-[KEH]

J Regular Expression / Automaton

PROSITE PRATTTEIRESIAS

PROTOMATA-L

Characterization

Similar Fragment Pairs Significantly similar fragment pairs (SFPs) Natural selection Important area characterization

Data set D:

Ordering the SFPs Problem :

Solution : ordering the SFPs by scoring each SFP S(f1,f2)= ? 3 different scoring functions :

dialign Sd support Ss implication Si

Dialign Score

Sd ( f1 , f2 ) = - log P ( L , Sim )

L = |f1| = |f2| Sim = Sum of the individual similarity values P = Probability that a random SFP of the same L

has the same S

Blossum62similarity

Support Score

Taking into account the representativeness of SFP

Ss (f1,f2,D) = Number of sequences supporting <f1,f2>

f1f2

f

<f1,f2> is supported by f with respectthe triangular inequality :Sd(f,f1) + Sd(f,f2) Sd(f1,f2)

Implication Score Taking into account a counter-example set N Discriminative fragments Lerman index:

Si(f1,f2,D,N) =

avec P(X) =

-P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N)

P( Ss(f1 ,f2 ,D) ) x |N|

|X|

|D| + |N|

Generalization

From protein data sets to automata

MASEIKLFW

M A S E I K L F W

From protein data sets to automata

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Merging SFPs

MASEIKLFW

MGYEVKYRV

M G Y E V K Y R V

M A S E I K L F W

Merging SFPs

MASEIKLFW

MGYEVKYRV

M G YE K

Y R V

M AS L

F W[I,V]

Merging SFPsMASEIKLFW

MGYEVKYRV

M G YE [I,V] K

Y R V

M AS L

F W

MASEVKLFM MGYEIKYRV

MASEIKYRV MGYEVKLFW

MASEVKYRV MGYEIKLFW

Protein Sequence Data SetList of SFPs

MCA

Automaton / Regular Expression

Ordered List of SFPs

MERGING

Gap Generalization Merging on themself non-representative transitions Treat them as "gaps"

Identification of Physico-chemical properties

Similar Fragments ~ potential function area Amino acids share out the same position Physicochemical property at play => Generalization from a group (of amino acids) to a Taylor group

I,V I,Q,W,P

aliphatic

xI,L,V

no information

C

C

[I,V] [I,L,V] C C[I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

Likelihood ratio test To decide if the multi-set A has been generated

according to a physico-chemical group G or not by a likelihood ratio test:

Given a threshold , we test the expansion of A to G and reject it when LRG/A <

Experiments

MIP : the Major Intrinsic Protein Family

FamilyMIP

SubfamiliesAQP, Glpf, Gla

Data sets

UNIPROTMIP in SWISS-PROT

Set « T » (159 seq)

Set « M» (44 seq)identity<90%

Set « W+» (24 seq)

Set « W-» (16 seq)

Set « C» (49 seq)Blast(1<e<100) not MIP

Set « E» (79 seq)

Set « U » (911 seq)

Water-specific

Experiments First Common Fragment on a Family

MIP family Positive set Comparison with pattern discovery tools

Teiresias Pratt Protomata-L (short pattern)

Water-specific Characterization MIP sub-families Positive and negative sets Leave-one-out cross-validation

Protomata-L (short to long pattern)

First Common Fragment Automaton

Results of 4 patterns scannedon Swiss-Prot protein Database

Set « M» (44 seq)

Learning Set

Learning set

Set « T » (159 seq)Target set

From short automata to long automata

Previous experiment only the first SFPs of the ordered list of SFPs short automaton first common fragment automaton

Next experiment larger cut-offs in the list of SFPs Protomat-L is able to create longer automata with more

common subparts Long patterns are closed of the topoly (3D-structure) of

the family

Water-specific characterization Leave-one-out cross-validation

Learning set W+ \ Si : Positive learning set W- \ Sj : Negative learning set

Test set { Si U Sj }

Control set Set T

Implication score

Set « W+» (24 seq)

Set « W-» (16 seq)

Set « C» (49 seq)

Leave-one-out cross-validation

Error Correcting Cost The error correcting cost of a sequence S represents the

distance (blossum similarity) between S and the closest sequence given by the automaton A.

Distibution of sequences with long automata (size Approx. 100)

Leave-one-out cross-validationWith Error Correcting Cost

Leave-one-out cross-validation

Conclusion & Perspective Good characterization of protein family using automata

(-> hmm structure) No need of a multiple alignment greedy data-driven algorithm

Important subparts localization Physico-chemical identification and generalization

Counter example sets Bringing of knowledge is possible in automata

(-> 2D structure)

Questions ?

?

?

?

?

?

??

??

?

?

?

?? ?

Protomata-L ’s Approach

First Common Fragment

Protomata-L ’s Approach

To get a more precise automaton

IDENTIFICATION OF PHYSICOCHEMICAL

GROUPS

Data set (Protein sequences)

Pairs of fragments

SORT

EXTRACTION

Initial Automaton(MCA)

MERGING

IDENTIFICATION OF « GAPS »

Structural discrimination

Aromatique

Hydrophobe

Non Informatif

Generalization of an Aquaporins automaton

Physico-chemical properties identification

Ratio likelihood test

AliphaticSmallx

a similar fragments merging approach to learn automata on proteins

Documents

v c c i

d x pnp ssf1

cc fcx2

c cgg

stilv dvx

n p ssf1

h gdt

physicochemical group