yaroslav ryabov lognormal pattern of exon size distributions in eukaryotic genomes

Post on 14-Jan-2016

215 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Yaroslav Ryabov

Lognormal Pattern

of Exon size distributions

in Eukaryotic genomes

Outline• Global and local approaches in analysis of genetic

information

• Vocabulary of contemporary genetics: Exons and Introns

• What we can learn from exon size distributions? Lognormal pattern and two clases of exons

• How we can model exon size distributions of real genomes?

• What could be the biological reason for observed pattern of exons size distributions?

• Conclusions

Analyzing Genetic Information

Local approach

analyzing details of nucleotide sequence Basic Local

Alignment Search Tool1990

Watson and Crick1953Marshall Nirenberg1968

Global approach

analyzing properties of entire genome

Georg Johann Mendel1866

Recombination of inherited properties

Frequency of mutations

Size of genome etc.

Analyzing Genetic Information in Post-Genomic Era

More than 60 animal genomes with complete annotations

Analyzing Genetic Information in Post-Genomic Era

Back to Global Approach ?

More than 60 animal genomes with complete annotations

Prokaryote and Eukaryote

pro + karyon

before + nucleos

eu + karyon

good + nucleos

Gene expression

DNA

Transcriptionwith RNA polymerase

mRNATranslationwith ribosome

Protein

DNA

Transcriptionwith RNA polymerase

mRNA

Exons

Introns

Splicing

Splicing in Eukaryotes

Prokaryote vs Eukaryote

Most of DNA code is used to produce some cellular product

Substantial fraction “silent” DNA regions

Exons Introns

Short genomes: ~ 1 000 exons

Long genomes: ~ 100 000 exons

Long exons: ~ 1 000 base pairs

Short exons: ~ 100 base pairs

0.0 0.5 1.0 1.5 2.00

10

20

30

40

50

60

70

80

Num

ber o

f cou

nts

in th

ousa

nds

Exon length in 1000 base pairs

0.0 0.5 1.0 1.5 2.00.00

0.05

0.10

0.15

0.20

Num

ber

of c

ount

s in

tho

usan

ds

Exon length in 1000 base pairs

Prokaryote vs Eukaryote Exon size distributions

Homo SapiensFlavobacteria Bacterium

0.0 0.5 1.0 1.5 2.00

10

20

30

40

50

60

70

80

Num

ber o

f cou

nts

in th

ousa

nds

Exon length in 1000 base pairs

0.0 0.5 1.0 1.5 2.00.00

0.05

0.10

0.15

0.20

Num

ber

of c

ount

s in

tho

usan

ds

Exon length in 1000 base pairs

Homo SapiensFlavobacteria Bacterium

2 592 exons 1 075 b.p. mean exon size

299 298 exons 267 b.p. mean exon size

What we can learn from exonsize distributions ?

Assumption:

Probability to split DNA at any location and any time is Constant

Consequence:

The lengths of intervals between splitting points obey

Exponential distribution

Poisson Process

0.0 0.5 1.0 1.5 2.00

10

20

30

40

50

60

70

80

Num

ber

of c

ount

s in

tho

usan

ds

Exon length, 1000 b.p.

= Const frequency of splitting

0 1 2 3 40

10

20

30

40

Num

ber

of c

ount

s in

thou

sand

s

Log10

(Exon Length)

Assumption:

Probability to split DNA at any location and any time is Constant

or in logarithm scale

Consequence:

The lengths of intervals between splitting points obey

Exponential distribution

Poisson Process

= Const frequency of splitting

Poisson Process Kolmogoroff Process

Lognormal distributionExponential distribution

0 1 2 3 40

10

20

30

40

Num

ber

of c

ount

s in

tho

usan

ds

Log10

(Exon Length)

0 1 2 3 40

10

20

30

40

Num

ber

of c

ount

s in

thou

sand

s

Log10

(Exon Length)

Lognormal distribution is a consequence of Central Limit Theorem

Sumof random variables

Normal distribution

Lognormal distribution

Productof random variables

Kolmogoroff process (1941)

Kolmogoroff process (1941)

Kolmogoroff process (1941)

Assumption:

Probability to split any exon is Independent of exon size

Kolmogoroff process (1941)

Assumption:

Probability to split any exon is Independent of exon size

Kolmogoroff process (1941)

Assumption:

Probability to split any exon is Independent of exon size

Kolmogoroff process (1941)

Assumption:

Probability to split any exon is Independent of exon size

Kolmogoroff process (1941)

Log (Exon Size)

Num

ber

of C

ount

s

Assumption:

Probability to split any exon is Independent of exon size

Consequence:

The lengths of exons obey

Lognormal distributionM

2s

Real Genomes: Two Lognormal Peaks

1 2 3 4 5 6 7 8 9 10

5

10

15

20

25

30

Ln(Exon Size)

peak A M = 4.80 ± 0.01 = 0.85 ± 0.01

peak B M = 5.44 ± 0.09 = 3.10 ± 0.20

Homo Sapiens

Data: SatDatHS_BModel: TwoGauss Chi^2/DoF = 182733.65479R^2 = 0.99722 y0 0 ±0A 28378.88917 ±679.3328w 0.85204 ±0.01438xc 4.80305 ±0.00554A1 15303.67377 ±838.0008w1 3.09794 ±0.15891xc1 5.43988 ±0.0879

Numb

er of

Cou

nts (t

hous

ands

)

1 2 3 4 5 6 7 8 9 10

1

2

3

4

5

6peak A M = 5.07 ± 0.01 = 0.53 ± 0.02

peak B M = 5.78 ± 0.01 = 2.10 ± 0.02

Drosophila MelanogasterData: StaDR_BModel: TwoGauss Chi^2/DoF = 6036.76567R^2 = 0.99806 y0 0 ±0A 10674.50108 ±120.44604w 2.09918 ±0.02005xc 5.77986 ±0.01405A1 1461.13835 ±83.49117w1 0.53408 ±0.02349xc1 5.07437 ±0.00955

Numb

er of

Cou

nts (t

hous

ands

)

Ln(Exon Size)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.81 ± 0.01 = 0.82 ± 0.02

peak B M = 5.08 ± 0.07 = 3.13 ± 0.18

Num

ber

of C

oun

ts (

tho

usan

ds)

Ln(Exon Size)

Pan troglodytes (chimpanzee)

2 4 6 8 100

1

2

3

4

5

6

7

peak A M = 5.07 ± 0.01 = 0.55 ± 0.03

peak B M = 5.53 ± 0.01 = 1.89 ± 0.02

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Anopheles gambiae (mosquito)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.59 ± 0.05 = 2.06 ± 0.12

peak B M = 4.83 ± 0.01 = 0.77 ± 0.03

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Bos taurus (cow)

2 4 6 8 100

2

4

6

8

10

12

14

16

peak A M = 4.75 ± 0.02 = 0.71 ± 0.05

peak B M = 5.22 ± 0.03 = 1.46 ± 0.03

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Caenorhabditis elegans (worm)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.62 ± 0.05 = 2.10 ± 0.12

peak B M = 4.84 ± 0.01 = 0.77 ± 0.02

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Canis familiaris (dog)

2 4 6 8 100

5

10

15

20

25

30

35

40

peak A M = 4.50 ± 0.05 = 2.49 ± 0.11

peak B M = 4.83 ± 0.01 = 0.77 ± 0.02

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Danio rerio (zebrafish)

2 4 6 8 100

5

10

15

20

25

30

peak A M = 4.56 ± 0.06 = 2.21 ± 0.13

peak B M = 4.83 ± 0.01 = 0.76 ± 0.03

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Gallus gallus (chiken)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.81 ± 0.01 = 0.82 ± 0.02

peak B M = 4.86 ± 0.05 = 2.73 ± 0.16

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Macaca mulatta (rhesus macaque)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.82 ± 0.01 = 0.84 ± 0.02

peak B M = 5.38 ± 0.10 = 3.09 ± 0.20

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Mus musculus (house mouse)

2 4 6 8 100

5

10

15

20

25

30

35

peak A M = 4.85 ± 0.01 = 0.76 ± 0.02

peak B M = 5.00 ± 0.03 = 1.95 ± 0.10

Nu

mb

er

of

Co

un

ts (

tho

usa

nd

s)

Ln(Exon Size)

Tetraodon nigroviridis (spotted green pufferfish)

Parameters of Two Lognormal peaks model: M location of peak maximum

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Parameters of Two Lognormal peaks model: s peak width

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)

Parameters of Two Lognormal peaks model: peak area Narrow and Wide, %

Summary of Observed Facts

• Exon lengths distributions of studied eukaryotic genomes can be fitted by Two Lognormal Distributions

• Parameters of those two peaks follow two distinctive patterns: changes of peak width and relative peak area correlate with complexity of species.

• This may indicate presence of two different classes of exons with different evolutionary histories

How we can model exon size distributions of real genomes ?

Growth of total genome length

Duplications in genome code

Merging exons together (intron loss)

Exon loss

any exon of length has probability to be modified during time interval

If

Parameters of Kolmogoroff Processas model parameters for elementary “exon modification” event

average number of exons with sizes

and

with

peak position peak width

Then the process converges to a lognormal distribution with

distribution

Exons splitting

(decreasing Exon length)

Increasing Exon length

Parameters of Q(k) function

0 1 2

d Q(k)

k

0 1 2

d Q(k)

k

0 1 2

d Q(k)

k

Exons duplicating

Initial exon size distribution A mockup of a bacterial genome with 500 exons

Having 1000 bp mean exons length

Modeling exon size distributions in real genomes

10000

1x103

2x103

Dis

trib

utio

n de

nsity

Log (Exon Length)

2 4 6 8 100

1x104

2x104

3x104

4x104

Dis

trib

utio

n de

nsity

Log (Exon Length)

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.200

k

15 000 time steps for p=0.001

Modeling exon size distributions in real genomes

2 4 6 8 100

1x104

2x104

3x104

4x104

Dis

trib

utio

n de

nsity

Log (Exon Length)

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.200

k

0.0 0.5 1.0 1.5 2.0

dQ (k) Narrow peakQ()=1.263

k

15 000 time steps for p=0.001

Modeling exon size distributions in real genomes

2 4 6 8 100

1x104

2x104

3x104

4x104

Dis

trib

utio

n de

nsity

Log (Exon Length)0.0 0.5 1.0 1.5 2.0

dQ (k) Narrow peakQ()=1.263

k

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.200

k

15 000 time steps for p=0.001

Modeling exon size distributions in real genomes

50 000 time steps for p=0.001

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.049

k

2 4 6 8 100

1x103

2x103

3x103

4x103

5x103

6x103

Dis

trib

utio

n de

nsity

Log (Exon Length)

Modeling exon size distributions in real genomes

2 4 6 8 100

1x103

2x103

3x103

4x103

5x103

6x103

Dis

trib

utio

n de

nsity

Log (Exon Length)

50 000 time steps for p=0.001

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.049

k

0.0 0.5 1.0 1.5 2.0

dQ (k) Narrow peakQ()=1.010

k

Modeling exon size distributions in real genomes

2 4 6 8 100

1x103

2x103

3x103

4x103

5x103

6x103

Dis

trib

utio

n de

nsity

Log (Exon Length)

50 000 time steps for p=0.001

0.0 0.5 1.0 1.5 2.0

dQ (k) Wide peakQ()=1.049

k

0.0 0.5 1.0 1.5 2.0

dQ (k) Narrow peakQ()=1.010

k

Modeling exon size distributions in real genomes

What cloud be the biological reason for two exons peaks?

Narrow peak

Wide peak

Num

ber

of C

ount

s

Log (Exon Size)

Narrow peak

Holds approximately Constant position

Has approximately Constant width

Has greater relative occupation for Complex organisms

Wide peak

Changes position

Width increases with increasing complexity of and organism

Has greater relative occupation for Simple organisms

What could be the biological reason for two exons peaks?

DNA

mRNA

Protein

Alterative SplicingUntranslated

regions of mRNA

Introns

Two ways of Gene Expression Regulation

DNA

mRNA

Protein

Untranslated

regions of mRNA

Introns

Two ways of Gene Expression Regulation

Conclusions• Analysis of global properties of eukaryotic genomes reveals two

distinct peaks in statistical distribution of exon sizes

• The observed peak could be fitted by a sum of two lognormal distributions which may imply that they originated by two different exon splitting pathways described in the general frameworks of Kolmogoroff splitting process

• Two observed peaks of exons could be correlated with the phenomenon of alternative splicing and with exons contributing into untranslated regions of mRNA. This suggests that the observed separation of exons in two different classes may be originated from two different ways of protein expression regulation.

Acknowledgments

Michael Gribskov

Alexander Berezhkovskii

top related