A mathematical model of the genetic code: structure and applications


DESCRIPTION

A mathematical model of the genetic code: structure and applications. Antonino Sciarrino, Università di Napoli “Federico II”, INFN, Sezione di Napoli. TAG 2006, Annecy-le-Vieux, 9 November 2006.

TRANSCRIPT

  • A mathematical model of the genetic code: structure and applications

    Antonino Sciarrino, Università di Napoli “Federico II”, INFN, Sezione di Napoli

    TAG 2006 Annecy-leVieux, 9 November 2006

  • Mathematical Model of the Genetic Code

    Work in collaboration with Luc FRAPPAT, Paul SORBA, Diego COCURULLO

  • SUMMARY

    Introduction

    Description of the model

    Applications: codon usage frequencies; DNA dimers free energy

    Work in progress

  • It is amazing that the complex biochemical relations between DNA and proteins were very quickly reduced to a mathematical model. Just a few months after the Watson-Crick discovery, G. Gamow proposed the diamond code.

  • Gamow diamond code

    Gamow, Nature (1954)

    Nucleotides are denoted by the numbers 1, 2, 3, 4. Amino acids fit the rhomb-shaped holes formed by the 4 nucleotides, which gives 20 a.a.!

  • Since 1954 many mathematical models of the genetic code have been proposed (based on information, thermodynamic, symmetry, or topology arguments).

    Weak point of the models: often poor explanatory and/or predictive power.

  • The genetic code

  • Crystal basis model of the genetic code

    The 4 bases C, U/T (pyrimidines) and G, A (purines) are identified by a couple of spin labels (±1/2, ±1/2).

    L. Frappat, A. Sciarrino, P. Sorba: Phys. Lett. A (1998)

    Mathematically, C, U/T, G, A transform as the 4 basis vectors of the irrep. (1/2, 1/2) of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$.

  • Crystal basis model of the genetic code

    Dinucleotides are composite states (16 basis vectors of $(1/2, 1/2)^{\otimes 2}$) belonging to sets identified by two integer numbers $J_H$, $J_V$.

    In each set the dinucleotide is identified by two labels: $-J_H \le J_{H,3} \le J_H$ and $-J_V \le J_{V,3} \le J_V$.

    Ex. CU = $(+,+) \otimes (-,+)$ → ($J_H = 0$, $J_{H,3} = 0$; $J_V = 1$, $J_{V,3} = 1$)

    This follows from a property of $U_{q \to 0}(sl(2))$.

  • DINUCLEOTIDE Representation Content

  • Crystal basis model of the genetic code

    Codons are composite states (64 basis vectors of $(1/2, 1/2)^{\otimes 3}$) belonging to sets identified by half-integer $J_H$, $J_V$ (set = irreducible representation = irrep.).

    Ex. CUA = $(+,+) \otimes (-,+) \otimes (-,-)$ → ($J_H = 1/2$, $J_{H,3} = -1/2$; $J_V = 1/2$, $J_{V,3} = 1/2$)

    This follows from a property of $U_{q \to 0}(sl(2))$.

  • Codons in the crystal basis

  • Codon usage frequency

    Synonymous codons are not used uniformly (codon bias).

    Codon bias (not fully understood) is ascribed to evolutionary-selective effects.

    Codon bias depends on: the biological species (b.sp.); the sequence analysed; the amino acid (a.a.) encoded; the structure of the considered multiplet; the nature of the codon XYZ.

  • Codon usage in Homo sap.

  • Our analysis deals with global codon usage, i.e. computed over all the coding sequences (exonic region) for the b.sp. of the considered sample, in order to highlight possible general features of the standard eukaryotic genetic code ascribable to its organisation and its evolution.

  • Let us define the codon usage probability for the codon XZN (X, Z, N ∈ {A, C, G, U}; T in DNA):

    $P(XZN) = \lim_{n \to \infty} \frac{n_{XZN}}{N_{tot}}$

    where $n_{XZN}$ is the number of times the codon XZN is used in the processes and $N_{tot}$ is the total number of codons in the same processes.

    For fixed XZ, normalization: $\sum_N P(XZN) = 1$

    Note: sextets are considered as quartets + doublets, giving 8 quartets.
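    As an illustration of the definition and of the normalization over N for fixed XZ, here is a minimal Python sketch. The function name `quartet_probabilities` and the toy codon list are hypothetical and are not data from the talk; a real estimate would use the codon counts of all coding sequences of a species.

```python
from collections import Counter


def quartet_probabilities(codons, prefix):
    """Estimate P(XZN) for the quartet whose first two bases are `prefix`.

    The counts are normalized inside the quartet, so the four values sum to 1.
    """
    counts = Counter(c for c in codons if len(c) == 3 and c[:2] == prefix)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no codon starting with {prefix!r} in the input")
    return {prefix + n: counts[prefix + n] / total for n in "ACGU"}


# Toy input (made up for illustration only).
toy_codons = ["CCC", "CCA", "CCA", "CCG", "GCU", "CCU", "CCA", "CCC"]
probs = quartet_probabilities(toy_codons, "CC")
print(probs)
```

    With finite counts this is only an estimate of the limit in the definition; the normalization over N holds exactly by construction.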

  • Def. - Correlation coefficient $r_{XY}$ for two variables $X \equiv P(..X)$, $Y \equiv P(..Y)$
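    The formula on this slide did not survive in the transcript. The sketch below assumes the standard Pearson correlation coefficient, computed over a list of species for two codon probabilities of the same quartet; the function name `pearson_r` and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Made-up values of P(..C) and P(..A), one entry per species.
p_c = [0.10, 0.20, 0.30, 0.40]
p_a = [0.40, 0.30, 0.20, 0.10]
r_ca = pearson_r(p_c, p_a)
print(round(r_ca, 6))
```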

  • Sample (GenBank Release 149.0, 09/2005; $N_{codons}$ > 100,000)

    26 VERTEBRATES, 28 INVERTEBRATES, 38 PLANTS

    TOTAL: 92 biological species

  • Correlation coefficient VERTEBRATES

    rXY             r CA    r UG    r UC    r AG    r UA    r CG
    P              -0.89   -0.69   -0.75   -0.55   -0.76   -0.21
    T              -0.92   -0.71   -0.89   -0.68   -0.91   -0.40
    A              -0.88   -0.53   -0.89   -0.60   -0.76   -0.30
    S              -0.92   -0.77   -0.87   -0.60   -0.75   -0.51
    V              -0.84   -0.93   -0.69   -0.74   -0.68   -0.53
    L              -0.83   -0.93   -0.87   -0.91   -0.87   -0.69
    R              -0.90   -0.93   -0.39   -0.27   -0.41   -0.11
    G              -0.94   -0.89   -0.75   -0.74   -0.77   -0.56
    a.a. (mean)    -0.89   -0.80   -0.76   -0.64   -0.74   -0.41

  • Correlation coefficient PLANTS

    rXY             r CA    r UG    r UC    r AG    r UA    r CG
    P              -0.91   -0.81   -0.54   -0.61   -0.41   -0.48
    T              -0.94   -0.87   -0.79   -0.59   -0.75   -0.48
    A              -0.94   -0.93   -0.72   -0.57   -0.63   -0.55
    S              -0.87   -0.86   -0.75   -0.78   -0.71   -0.56
    V              -0.66   -0.72   -0.75   -0.65   -0.71   -0.15
    L              -0.72   -0.85   -0.57   -0.52   -0.54   -0.17
    R              -0.76   -0.66   -0.67   -0.50   -0.16   -0.49
    G              -0.83   -0.48   -0.73   -0.14   -0.36   -0.07
    a.a. (mean)    -0.83   -0.77   -0.69   -0.55   -0.53   -0.37

  • Correlation coefficient INVERTEBRATES

    rXY             r CA    r UG    r UC    r AG    r UA    r CG
    P              -0.78   -0.63   -0.50   -0.74   -0.20   -0.52
    T              -0.85   -0.87   -0.74   -0.76   -0.62   -0.60
    A              -0.82   -0.79   -0.75   -0.68   -0.51   -0.53
    S              -0.91   -0.83   -0.71   -0.86   -0.55   -0.79
    V              -0.78   -0.92   -0.66   -0.78   -0.72   -0.46
    L              -0.49   -0.92   -0.48   -0.66   -0.50   -0.25
    R              -0.55   -0.76   -0.76   -0.27   -0.01   -0.53
    G              -0.73   -0.48   -0.57   -0.14   -0.02   -0.08
    a.a. (mean)    -0.74   -0.78   -0.65   -0.61   -0.38   -0.47

  • Averaged value of P(..N)


  • Averaged value of sum of two correlated P(N)

  • Ratios of $\sigma^2_{obs}(X+Y)$ and $\sigma^2_{th}(X+Y) = \sigma^2_{obs}(X) + \sigma^2_{obs}(Y)$, averaged over the 8 a.a., for the sum of two codon probabilities
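    The ratio compares the observed variance of a sum with the value expected for uncorrelated variables: since Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y), a ratio well below 1 signals anticorrelation. A minimal Python sketch follows, assuming population variances (the slides do not say which estimator was used); the function name `variance_ratio` and the numbers are hypothetical.

```python
from statistics import pvariance


def variance_ratio(xs, ys):
    """Observed variance of X+Y divided by the uncorrelated expectation."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    observed = pvariance([x + y for x, y in zip(xs, ys)])
    expected = pvariance(xs) + pvariance(ys)
    return observed / expected


# Perfectly anticorrelated toy sample: the sum is constant, so the ratio is 0.
ratio_anti = variance_ratio([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
# Uncorrelated toy sample: the covariance vanishes, so the ratio is 1.
ratio_free = variance_ratio([1, 2, 1, 2], [1, 1, 2, 2])
print(ratio_anti, ratio_free)
```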

  • Indication of a correlation between the codon usage probabilities P(A) and P(C) (and between P(U) and P(G)) for quartets.

  • Correlation between codon probabilities for different a.a.

    Correlation coefficients between the 28 couples P(XZN)-P(X'Z'N), where XZ (X'Z') specify the 8 quartets. The following pattern comes out for the whole eukaryote sample (n = 92):

    Eucar.       r XZA-X'Z'A   r XZC-X'Z'C   r XZG-X'Z'G   r XZU-X'Z'U
    Ser-Thr             0.88          0.94          0.90          0.80
    Ser-Pro             0.93          0.90          0.87          0.91
    Ser-Ala             0.86          0.93          0.82          0.81
    Thr-Pro             0.83          0.91          0.93          0.74
    Thr-Ala             0.91          0.93          0.93          0.94
    Pro-Ala             0.86          0.93          0.93          0.94
    Leu-Val             0.85          0.82         -0.70          0.96

  • The set of 8 quartets splits into 3 subsets:

    4 a.a. with correlated codon usage (Ser, Pro, Ala, Thr)

    2 a.a. with correlated codon usage (Leu, Val)

    2 a.a. with generally uncorrelated codon usage (Arg, Gly)

  • Statistical analysis

    Correlation for P(XZA)-P(XZC), XZ quartets

    Correlation for P(N) between {Ser, Pro, Thr, Ala} and {Leu, Val}

    The observed correlations fit well in the mathematical scheme of the crystal basis model of the genetic code.

  • In the crystal basis model P(XYZ) can be written as function of

  • ASSUMPTION

  • SUM RULES

    K independent of the b.sp.; XZ quartets

  • SUM RULES

    Theoretical correlation matrix

    XZ = NC, CG, GG, CU, GU

  • Observed averaged value of the correlation matrix; in red, the theoretical value

  • Irrep. (J_H, J_V)    Codons

    (3/2, 3/2)           Pro CCC, Ser UCC, Ala GCC, Thr ACC
    (1/2, 3/2)^1         Pro CCU, Ser UCU, Ala GCU, Thr ACU
    (3/2, 1/2)^1         Pro CCG, Ser UCG, Ala GCG, Thr ACG
    (1/2, 1/2)^1         Pro CCA, Ser UCA, Ala GCA, Thr ACA
    (1/2, 3/2)^2         Leu CUC, Leu CUU, Val GUC, Val GUU
    (1/2, 1/2)^2         Leu CUG, Leu CUA, Val GUG, Val GUA

  • Shannon Entropy

    Let us define the Shannon entropy for the amino acid specified by the first two nucleotides XZ (8 quartets)
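    The defining formula is missing from the transcript. The sketch below assumes the usual form $S_{XZ} = -\sum_N P(XZN) \log_2 P(XZN)$ over the four codons of a quartet; the base of the logarithm and the function name `shannon_entropy` are assumptions, and the probabilities are made up.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy of a normalized probability list (zero terms skipped)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


# Uniform usage of the four codons of a quartet gives the maximum, 2 bits.
s_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
# A biased quartet (made-up values) has lower entropy.
s_biased = shannon_entropy([0.7, 0.1, 0.1, 0.1])
print(s_uniform, s_biased)
```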

  • Shannon Entropy

    Using the previous expression for P(XZN) we get

    N (XZN), $H_{bs,N} \equiv H_{bs}(XZN)$, $P_N \equiv P(XZN)$

    $S_{XZ}$ largely independent of the b.sp.

  • Shannon Entropy

  • DNA dinucleotide free energy

    Free energy for a pair of nucleotides, e.g. GC, lying on one strand of DNA, coupled with the complementary pair, CG, on the other strand.

    CG read 5′ → 3′ is correlated with GC read 3′ → 5′.
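    The strand pairing can be made concrete with a small Python sketch: each dimer read 5′ → 3′ on one strand faces a fixed dimer on the other strand, so the 16 dimers group into 10 classes that share a double-strand stack. The helper names `partner_dimer` and `dimer_classes` are hypothetical and are not part of the model in the talk.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_dimer(dimer):
    """Dimer found on the opposite strand, read 5'->3' (reverse complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(dimer))


def dimer_classes():
    """Group the 16 dimers into classes {dimer, partner on the other strand}."""
    classes = set()
    for first in "ACGT":
        for second in "ACGT":
            dimer = first + second
            classes.add(frozenset({dimer, partner_dimer(dimer)}))
    return classes


classes = dimer_classes()
print(len(classes))
print(partner_dimer("GA"))
```

    Four dimers (AT, TA, CG, GC) are their own partner, and the remaining twelve form six pairs.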

  • DINUCLEOTIDE Representation Content

  • SUM RULES for FREE ENERGY

  • Comparison with exp. data: ΔG in kcal/mol

  • DINUCLEOTIDE Distribution

  • Comparison with experimental data

  • Work in progress and future perspectives

    From the correspondence {C, U/T, G, A} ↔ irrep. (1/2, 1/2) of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$:

    any ordered sequence of N nucleotides ↔ vector of the representation $(1/2, 1/2)^{\otimes N}$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$

    New parametrization of nucleotide sequences

  • Spin parametrisation

  • Algorithm for the spin parametrisation of an ordered n-nucleotide sequence

  • From this parametrisation:

    Alternative construction of a mutation model, where the mutation intensity depends not on the Hamming distance between the sequences, but on the change of the labels of the sets. C. Minichini, A.S., Biosystems (2006)

    Characterization of particular sequences (exons, introns, promoters, 5′ or 3′ UTR sequences, ...) L. Frappat, P. Sorba, A.S., L. Vuillon, in progress

  • For each gene of Homo sap. (total ~28,000 genes):

    Consider the N-nucleotide coding sequence (CDS)

    Compute the labels $J_H$, $J_{3H}$; $J_V$, $J_{3V}$ for any n-nucleotide subsequence (1 ≤ n ≤ N)

    Plot the labels versus n
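    The algorithm slide itself is not in the transcript, so the following Python sketch rests on stated assumptions rather than on the authors' procedure. It assumes the sign assignment C = (+,+), U/T = (−,+), G = (+,−), A = (−,−) (first sign for sl(2)_H, second for sl(2)_V) and a pairing rule in which a "+" followed by a "−" forms a singlet and drops out. These choices reproduce the irrep. assignments listed earlier for the Pro/Ser/Ala/Thr and Leu/Val codons. The "n-nucleotide subsequence" is taken here to be the prefix of length n, which is also an assumption. All function names are hypothetical.

```python
from fractions import Fraction

# Assumed sign assignment: (sl(2)_H label, sl(2)_V label) in units of 1/2.
SPIN = {"C": (1, 1), "U": (-1, 1), "T": (-1, 1), "G": (1, -1), "A": (-1, -1)}


def reduce_signs(signs):
    """Return (J, J3) for a sequence of +1/-1 spin-1/2 labels.

    Assumed rule: a '-' cancels the nearest unmatched '+' to its left
    (the pair forms a singlet); the survivors fix J and J3.
    """
    unmatched_plus = 0
    unmatched_minus = 0
    for sign in signs:
        if sign > 0:
            unmatched_plus += 1
        elif unmatched_plus > 0:
            unmatched_plus -= 1  # a '+ -' pair drops out
        else:
            unmatched_minus += 1
    j = Fraction(unmatched_plus + unmatched_minus, 2)
    j3 = Fraction(unmatched_plus - unmatched_minus, 2)
    return j, j3


def crystal_labels(sequence):
    """Return (J_H, J3_H, J_V, J3_V) for a nucleotide sequence."""
    j_h, j3_h = reduce_signs([SPIN[base][0] for base in sequence])
    j_v, j3_v = reduce_signs([SPIN[base][1] for base in sequence])
    return j_h, j3_h, j_v, j3_v


def label_profile(sequence):
    """Labels of every prefix of length n = 1..N (assumed subsequence choice)."""
    return [crystal_labels(sequence[:n]) for n in range(1, len(sequence) + 1)]


for n, labels in enumerate(label_profile("CUGACC"), start=1):
    print(n, [str(value) for value in labels])
```

    Plotting the four columns of `label_profile` against n would give curves of the kind described on this slide.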

  • Figure legend (four plots of the labels versus n): red $J_H$, green $J_{3H}$, blue $J_V$, black $J_{3V}$

  • Numerical estimator

    Define for any sequence of length N

    Plot the number of CDS with the same value of Diff (Sum) versus Diff (Sum)

    Compute Diff (Sum) for 28,000 random sequences (300 < N < 4300) with uniform probability for each nucleotide

    Comparison of the number of CDS with the random sequences

  • Conclusions

    Correlations in codon usage frequencies computed over the whole exonic region fit well in the mathematical scheme of the crystal basis model of the genetic code

    Missing explanation for the correlations

    Formalism of the crystal basis model useful to parametrize the free energy for DNA dimers

    More generally, use of the $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$ mathematical structure may be useful to describe sequences of nucleotides.