
Page 1: users.fred.netusers.fred.net/tds/lab/images/kanpur2016/Lecture_7_Thomas_Schne… · Rsequence and Rfrequency for Splice Acceptors R sequence GAT C GAT C GAC T GAT C GA C T G A C T

Information Theory in Biology

Thomas D. Schneider, Ph.D.

Molecular Information Theory Group
Center for Cancer Research
Gene Regulation and Chromosome Biology Laboratory
National Cancer Institute
Frederick, MD 21702-1201

Information Theory: One-Minute Lesson

number of symbols (M)   number of bits (B)   example
2                       1                    H T
4                       2                    00 01 10 11; A T G C
8                       3

$M = 2^B$, $B = \log_2 M$
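The relation between the number of symbols M and the number of bits B can be checked numerically. The following is a minimal Python sketch; the function name `bits_for_symbols` is an illustrative assumption and does not come from the slides.

```python
from math import log2


def bits_for_symbols(m):
    """Bits needed to pick one of m equally likely symbols: B = log2(M)."""
    if m < 1:
        raise ValueError("the number of symbols must be at least 1")
    return log2(m)


# A coin (H, T), the four DNA bases, and eight symbols.
for m in (2, 4, 8):
    print(m, "symbols:", bits_for_symbols(m), "bits")
```

With equally likely symbols, doubling M adds one bit, which is why one DNA base carries at most 2 bits.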


El Duomo, Florence, Italy


T7 RNA polymerase + DNA

http://www.ebi.ac.uk/pdbe/entry/pdb/1qln/portfolio/?view=entry_index#ad-image-0

Sequence Logo and Sequence Walker and Rsequence

Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

17 Bacteriophage T7 RNA polymerase binding sites (6 of 17 sites shown)

site  sequence                      individual information (bits)
1     ttattaatacaactcactataaggagag  33.3
2     aaatcaatacgactcactatagagggac  37.4
3     cggttaatacgactcactataggagaac  34.4
4     gaagtaatacgactcagtatagggacaa  33.1
5     taattaattgaactcactaaagggagac  30.1
6     cgcttaatacgactcactaaaggagaca  29.1

[Figure: sequence logo of the sites, positions -21 to +6, 5′ to 3′, vertical axis 0 to 2 bits; annotations mark stacks at 2 bits/base, 1 bit/base, and 0 bits/base]

[Figure: sequence walker, 29.1 bits. Sequence Walker Patent 5,867,402]

Rsequence is the average: 35.0 ± 0.6 bits = “area under the logo”
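The “area under the logo” can be illustrated with a minimal Python sketch that sums, over the aligned positions, 2 bits minus the uncertainty of the bases observed at that position. Several things here are assumptions rather than the author's method: the function names are hypothetical, the four bases are treated as equally likely in the background, and no small-sample correction is applied. Because only the 6 sites listed on the slide are used, the result is not expected to reproduce the 35.0 bits reported for all 17 sites.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size=4):
    """Information at one aligned position: log2(alphabet) minus the entropy.

    Assumption: no small-sample correction is applied.
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return log2(alphabet_size) - entropy


def rsequence(sites):
    """Sum the per-position information over equal-length aligned sites."""
    return sum(column_information(col) for col in zip(*sites))


# The 6 of 17 T7 RNA polymerase binding sites listed on the slide.
t7_sites = [
    "ttattaatacaactcactataaggagag",
    "aaatcaatacgactcactatagagggac",
    "cggttaatacgactcactataggagaac",
    "gaagtaatacgactcagtatagggacaa",
    "taattaattgaactcactaaagggagac",
    "cgcttaatacgactcactaaaggagaca",
]

print("uncorrected Rsequence from 6 sites:", rsequence(t7_sites), "bits")
```

A fully conserved position contributes 2 bits, a position split evenly between two bases contributes 1 bit, and a position with all four bases equally represented contributes 0 bits, matching the three annotations on the logo.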

Rfrequency

Information required to find a set of binding sites in a genome

G = # of potential binding sites = genome size in some cases

γ = number of binding sites on genome

$R_{frequency} = H_{before} - H_{after} = \log_2 G - \log_2 \gamma = -\log_2 (\gamma / G)$

16 positions, 1 site: $\log_2 (16/1) = 4$ bits

16 positions, 2 sites: $\log_2 (16/2) = 3$ bits
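The two 16-position examples can be reproduced with a short Python sketch of the Rfrequency formula. The function name `rfrequency` and its argument names are illustrative assumptions.

```python
from math import log2


def rfrequency(potential_sites, actual_sites):
    """Bits needed to pick out gamma sites among G positions: log2(G / gamma)."""
    if actual_sites <= 0 or potential_sites < actual_sites:
        raise ValueError("need 0 < gamma <= G")
    return log2(potential_sites / actual_sites)


print(rfrequency(16, 1))  # 16 positions, 1 site
print(rfrequency(16, 2))  # 16 positions, 2 sites
```

Doubling the number of sites in the same genome lowers the requirement by exactly one bit, since each site then has to be distinguished from half as many alternatives.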

RNA Splicing

[Figure: DNA is copied into RNA (transcription); the RNA contains exons and introns, with a donor site and an acceptor site at the two ends of each intron; the spliced RNA contains only the exons, joined together]

Donor and acceptor logos

[Figure: sequence logos, 5′ to 3′, for the donor site at the exon/intron boundary and the acceptor site at the intron/exon boundary]

Rsequence and Rfrequency for Splice Acceptors

Rsequence

[Figure: sequence logo of splice acceptor sites]

• Information at binding site sequences (area under sequence logo)
• from: binding site sequences
• 9.4 bits per site

Rfrequency

[Figure: donor, intron, acceptor, exon]

• Information needed to locate the sites
• from: size of genome and number of sites (length of intron+exon)
• 9.7 bits per site

Rsequence/Rfrequency = 0.97

Rsequence = Rfrequency Hypothesis

Hypothesis: The information in binding site patterns is just sufficient for the sites to be found in the genome.

Rsequence versus Rfrequency

Columns: Rsequence = total pattern information (bits); Rfrequency = information needed to locate site in genome (bits); Rsequence/Rfrequency = pattern info / location info.

Binding Site Recognizer (1)   Rsequence (bits)   Rfrequency (bits)   Rsequence/Rfrequency
Spliceosome acceptor (2)      9.35 ± 0.12        9.66                0.97 ± 0.01
Spliceosome donor             7.92 ± 0.09        9.66                0.82 ± 0.01
Ribosome                      11.0               10.6                1.0
λ cI/cro                      17.7 ± 1.6         19.3                0.9 ± 0.1
LexA                          21.5 ± 1.7         18.4                1.2 ± 0.1
TrpR                          23.4 ± 1.9         20.3                1.2 ± 0.1
LacI                          19.2 ± 2.8         21.9                0.9 ± 0.1
ArgR                          16.4               18.4                0.9
O (λ Origin)                  20.9               19.9                1.0
Ara C                         19.3               19.3                1.0
Transcription at TATA (3)     3.3                ∼3                  ∼1
T7 Promoter                   35.4               16.5                2.1

(1) T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. J. Mol. Biol., 188:415-431, 1986.
(2) R. M. Stephens and T. D. Schneider. J. Mol. Biol., 228:1124-1136, 1992.
(3) F. E. Penotti. J. Mol. Biol., 213:37-52, 1990.

Rsequence versus Rfrequency - meaning

The information in the binding site pattern (Rsequence) is close to the information needed to find the binding sites (Rfrequency).

But for a species in a stable environment:

• size of genome (G) is fixed (e.g. E. coli has 4.7 × 10^6 bp)
• number of binding sites (γ) is fixed (e.g. there are ∼50 E. coli LexA sites)

so $R_{frequency} = \log_2 (G/\gamma)$ is fixed

Rsequence must evolve towards Rfrequency!

Evolution of Binding Sites

• Rfrequency is fixed relative to Rsequence
• Does Rsequence evolve toward Rfrequency?

Set up a Computer Model, ‘Ev’: A population of “creatures” with

• genomes containing 4 bases (A, C, G, T)
• a defined genome size (G)
• predetermined binding site locations (γ) (to fix the frequency of sites)
• a recognizer gene encoded in the sequence: use a weight matrix

With the genome size (G) and the binding site locations (γ) both set in advance, Rfrequency is fixed.

How A Weight Matrix Works

Sequence matrix, s(b, l, j) for sequence j

position l   −3   −2   −1    0    1    2    3    4    5    6
sequence      C    A    G    G    T    C    T    G    C    A
base b
A             0    1    0    0    0    0    0    0    0    1
C             1    0    0    0    0    1    0    0    1    0
G             0    0    1    1    0    0    0    1    0    0
T             0    0    0    0    1    0    1    0    0    0

Individual information weight matrix, Riw(b, l)

position l    −3     −2     −1      0      1      2      3      4      5      6
base b
A           +0.4   +1.3   −1.4   −8.8   −5.8   +1.1   +1.5   −1.8   −0.7   +0.0
C           +0.6   −0.8   −2.4   −7.8   −5.5   −3.7   −1.6   −2.2   −0.5   −0.2
G           −0.6   −1.0   +1.6   +2.0   −6.2   +0.7   −1.1   +1.7   −0.3   +0.4
T           −1.0   −0.9   −1.7   −5.8   +2.0   −3.4   −1.6   −2.2   +0.9   −0.5

Sequence Walker

[Figure: sequence walker for 5′ c a g g t c t g c a 3′]
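The weight matrix can be applied to the example sequence by picking, at each position, the Riw(b, l) entry for the base that is present and adding the entries up, which is equivalent to summing s(b, l, j) · Riw(b, l) over all bases and positions. The following Python sketch uses the matrix values from the slide; the names `RIW`, `POSITIONS`, and `individual_information` are illustrative assumptions.

```python
# Positions -3 .. 6, as on the slide.
POSITIONS = list(range(-3, 7))

# Individual information weight matrix Riw(b, l), in bits, copied from the slide.
RIW = {
    "A": [+0.4, +1.3, -1.4, -8.8, -5.8, +1.1, +1.5, -1.8, -0.7, +0.0],
    "C": [+0.6, -0.8, -2.4, -7.8, -5.5, -3.7, -1.6, -2.2, -0.5, -0.2],
    "G": [-0.6, -1.0, +1.6, +2.0, -6.2, +0.7, -1.1, +1.7, -0.3, +0.4],
    "T": [-1.0, -0.9, -1.7, -5.8, +2.0, -3.4, -1.6, -2.2, +0.9, -0.5],
}


def individual_information(sequence):
    """Sum the matrix entry selected by the base found at each position."""
    if len(sequence) != len(POSITIONS):
        raise ValueError("sequence must cover positions -3 to 6 (10 bases)")
    return sum(RIW[base][i] for i, base in enumerate(sequence.upper()))


# The example sequence from the slide.
print(round(individual_information("caggtctgca"), 1), "bits")  # prints 3.4 bits
```

Each term of the sum corresponds to one letter of the sequence walker: positive entries add to the total and negative entries subtract from it.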

Unevolved Ev Creature

[Figure: genome of an unevolved Ev creature, with the annotations listed below]

• “blue” gene weight matrix: 6 bp wide
• Genome positions available: G = 256 bases
• γ = 16 binding sites
• $R_{frequency} = \log_2 (256/16) = 4$ bits
• markers: found real site, missed real site, found wrong site

Evolution Cycle

[Figure: cycle diagram with the labels evaluate, sort, selection (kill, replicate), mutate]

• EVALUATE each creature
  • translate the recognizer gene into a weight matrix
  • scan the weight matrix across the genome
  • count the number of mistakes:
    missing a site at a right place
    finding a site at a wrong place
• SORT the creatures by their mistakes
• REPLICATE: the best creatures are duplicated and replace the worst ones
• MUTATE all genomes randomly
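The cycle can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Ev program itself. The genome size (256), the number of sites (16), and the matrix width (6 bp) come from the slides. The encoding of each weight as 5 bases read as a two's-complement integer, the threshold stored after the weights, the even spacing of the sites, the population size, one point mutation per creature per generation, and all function names are assumptions made for the sketch.

```python
import random

WIDTH = 6                # weight matrix width in bp (from the slides)
GENOME_SIZE = 256        # G (from the slides)
N_SITES = 16             # gamma (from the slides)
BASES_PER_INT = 5        # assumption: bases used to encode one integer
N_INTS = 4 * WIDTH + 1   # 24 weights plus one threshold
GENE_LEN = N_INTS * BASES_PER_INT
# Assumption: sites are evenly spaced after the recognizer gene.
SITES = frozenset(GENE_LEN + 8 * k for k in range(N_SITES))


def decode_int(bases):
    """Read base-4 digits (0..3) as a two's-complement integer."""
    value = 0
    for b in bases:
        value = value * 4 + b
    full = 4 ** len(bases)
    return value - full if value >= full // 2 else value


def decode_recognizer(genome):
    """Translate the recognizer gene into a weight matrix and a threshold."""
    ints = [decode_int(genome[i:i + BASES_PER_INT])
            for i in range(0, GENE_LEN, BASES_PER_INT)]
    weights = [ints[4 * l:4 * l + 4] for l in range(WIDTH)]
    return weights, ints[-1]


def count_mistakes(genome):
    """Scan the matrix across the genome; count missed and wrongly found sites."""
    weights, threshold = decode_recognizer(genome)
    mistakes = 0
    for pos in range(GENOME_SIZE - WIDTH + 1):
        score = sum(weights[l][genome[pos + l]] for l in range(WIDTH))
        if (score >= threshold) != (pos in SITES):
            mistakes += 1
    return mistakes


def evolve(generations, pop_size=16, seed=0):
    """Run the cycle; return the fewest mistakes seen in each generation.

    Assumption: pop_size is even, so the best half replaces the worst half.
    """
    rng = random.Random(seed)
    population = [[rng.randrange(4) for _ in range(GENOME_SIZE)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=count_mistakes)                    # evaluate, sort
        history.append(count_mistakes(population[0]))
        half = pop_size // 2
        population[half:] = [g[:] for g in population[:half]]  # replicate
        for genome in population:                              # mutate
            # One random point mutation; it may redraw the same base.
            genome[rng.randrange(GENOME_SIZE)] = rng.randrange(4)
    return history


print(evolve(20))
```

Because the recognizer gene lives in the same genome as the binding sites, a mutation can change either the matrix or a site, so both sides of the recognition can adapt.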


Evolved Ev Creature

Evolution of Binding Sites

16 evolving binding sites

[Figure: one sequence logo per generation, positions -5 to +10, 5′ to 3′, vertical axis 0 to 2 bits]

Generation   Rs (bits)
100          -0.1 ± 0.5
200          -0.0 ± 0.5
300          1.6 ± 0.5
400          2.6 ± 0.5
500          2.7 ± 0.5
600          2.7 ± 0.5
700          3.7 ± 0.5
800          3.2 ± 0.5
900          4.5 ± 0.5
1000         4.9 ± 0.5
1100         5.0 ± 0.5
1200         4.5 ± 0.5
1300         4.2 ± 0.5
1400         4.4 ± 0.5
1500         3.9 ± 0.5
1600         4.9 ± 0.5
1700         4.7 ± 0.5
1800         3.7 ± 0.5
1900         4.7 ± 0.5
2000         5.2 ± 0.5

Page 62:

Evolution of Binding Sites

[Figure: sequence logos of the 16 evolving binding sites, one panel every 100 generations; horizontal axis: position -5 to +10 (5′ to 3′); vertical axis: 0 to 2 bits. Total information per panel:]

Generation 100: Rs = -0.1 +/- 0.5 bits
Generation 200: Rs = -0.0 +/- 0.5 bits
Generation 300: Rs = 1.6 +/- 0.5 bits
Generation 400: Rs = 2.6 +/- 0.5 bits
Generation 500: Rs = 2.7 +/- 0.5 bits
Generation 600: Rs = 2.7 +/- 0.5 bits
Generation 700: Rs = 3.7 +/- 0.5 bits
Generation 800: Rs = 3.2 +/- 0.5 bits
Generation 900: Rs = 4.5 +/- 0.5 bits
Generation 1000: Rs = 4.9 +/- 0.5 bits
Generation 1100: Rs = 5.0 +/- 0.5 bits
Generation 1200: Rs = 4.5 +/- 0.5 bits
Generation 1300: Rs = 4.2 +/- 0.5 bits
Generation 1400: Rs = 4.4 +/- 0.5 bits
Generation 1500: Rs = 3.9 +/- 0.5 bits
Generation 1600: Rs = 4.9 +/- 0.5 bits
Generation 1700: Rs = 4.7 +/- 0.5 bits
Generation 1800: Rs = 3.7 +/- 0.5 bits
Generation 1900: Rs = 4.7 +/- 0.5 bits
Generation 2000: Rs = 5.2 +/- 0.5 bits

[Plot: Mistakes of Best Organism (0 to 20) versus Generation (0 to 2000); regions labeled "selection" and "no selection"]

[Plot: Information (bits per site, -1.0 to 6.0) versus Generation (0 to 2000); curves: Rfrequency and Rsequence; regions labeled "selection" and "no selection"]
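The Rfrequency curve in the information plot is the number of bits needed to locate the sites in the genome. In Schneider's published definition this is log2(G/γ) for γ sites among G possible positions. The sketch below is only an illustration: the slide states 16 sites, but the genome size of 256 positions is an assumption made here, and the function name `rfrequency` is hypothetical.

```python
import math

def rfrequency(genome_positions: int, num_sites: int) -> float:
    """Bits needed to pick num_sites locations out of genome_positions."""
    if genome_positions <= 0 or num_sites <= 0:
        raise ValueError("both arguments must be positive")
    return math.log2(genome_positions / num_sites)

# 16 sites as stated on the slide; 256 positions is an assumed genome size.
print(f"Rfrequency = {rfrequency(256, 16):.1f} bits")
```

Under these assumptions the value can be compared directly with the Rs values listed for each generation.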

Page 63:

Donor and acceptor logos

[Figure: schematic of a transcript running 5′ to 3′ (exon, intron, exon), with the donor logo at the exon-intron junction and the acceptor logo at the intron-exon junction]

Page 64:

Human Splice Junction Information Curves

[Figure: information curves for donor and acceptor sites; horizontal axis: Position L (in bases); vertical axis: Rs(l) = Rsequence(l), information in bits, 0.0 to 2.0; annotation: sequence conservation, in bits per base]

Donor

   l     a     c     g     t  Rs(l)
  -9   481   444   497   371   0.01
  -8   461   461   453   422  -0.00
  -7   434   457   530   378   0.01
  -6   498   433   452   416   0.00
  -5   474   489   398   438   0.00
  -4   492   527   446   334   0.02
  -3   603   673   301   222   0.14
  -2  1071   263   219   246   0.39
  -1   169    85  1404   141   0.90
   0     0     2  1789     8   1.95
   1     8    10     6  1775   1.88
   2   974    35   738    43   0.75
   3  1274   152   215   149   0.68
   4   128    95  1468    97   1.04
   5   270   304   348   812   0.16
   6   428   369   564   296   0.04
   7   319   492   408   424   0.02
   8   332   494   413   384   0.01
   9   302   437   463   408   0.02
  10   281   419   424   342   0.02
  11   274   382   436   369   0.02
  12   273   414   400   355   0.02

Acceptor

   l     a     c     g     t  Rs(l)
 -25   309   392   208   386   0.04
 -24   282   420   216   388   0.04
 -23   273   392   220   424   0.05
 -22   290   411   214   397   0.04
 -21   309   400   194   416   0.05
 -20   248   412   203   461   0.08
 -19   256   388   225   482   0.07
 -18   213   442   246   455   0.08
 -17   201   453   183   535   0.15
 -16   193   471   199   517   0.14
 -15   182   459   212   533   0.14
 -14   158   476   193   619   0.21
 -13   135   486   197   656   0.25
 -12   122   511   174   673   0.29
 -11   121   500   152   732   0.34
 -10   110   483   149   775   0.37
  -9   131   534   175   800   0.33
  -8   156   583   192   727   0.27
  -7   172   656   172   674   0.27
  -6   165   653   158   726   0.30
  -5   123   691   110   788   0.43
  -4   127   590   103   914   0.46
  -3   394   499   432   411   0.00
  -2    67  1253    20   397   0.92
  -1  1717    10     6    10   1.86
   0     9    11  1720     4   1.87
   1   449   231   871   187   0.26
   2   426   308   413   590   0.04
   3   424   426   419   468   0.00
   4   400   487   440   410   0.00
   5   424   502   347   464   0.01
   6   394   527   424   392   0.01
   7   441   431   485   380   0.00

https://alum.mit.edu/www/toms/papers/splice/

• The consensus sequences match . . .
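The Rs(l) column in the donor and acceptor tables can be reproduced from the a, c, g, t counts. The following is a minimal sketch, assuming the standard definition Rsequence(l) = 2 - (H(l) + e(n)), where H(l) is the Shannon uncertainty of the base frequencies and e(n) is a small-sample correction. The closed-form approximation used for e(n) and the function name `rsequence` are choices made here for illustration, not taken from the slides.

```python
import math

def rsequence(counts, correct_small_sample=True):
    """Information (bits) at one position from [a, c, g, t] counts."""
    n = sum(counts)
    if n == 0:
        raise ValueError("no sequences at this position")
    # Shannon uncertainty H(l); bases with zero count contribute nothing.
    h = -sum((k / n) * math.log2(k / n) for k in counts if k > 0)
    # Approximate small-sample correction e(n) = (s - 1) / (2 ln(2) n), with s = 4 bases.
    e = 3 / (2 * math.log(2) * n) if correct_small_sample else 0.0
    return 2.0 - (h + e)

# Three donor rows copied from the table: position -> [a, c, g, t] counts.
donor = {-1: [169, 85, 1404, 141], 0: [0, 2, 1789, 8], 1: [8, 10, 6, 1775]}
for position, counts in donor.items():
    print(position, round(rsequence(counts), 2))
```

With about 1800 sequences the correction is roughly 0.001 bits, so the result is dominated by 2 - H(l); an almost invariant position such as the donor G at position 0 approaches the 2-bit maximum.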

Page 72:

Human Splice Junction Information Curves

[Figure: the donor and acceptor information curves and count tables shown again, now with the letters a, c, g, t stacked at each position of the curves]

C A G — G T

https://alum.mit.edu/www/toms/papers/splice/

• The consensus sequences match . . .

• BUT the information curves (sequence conservation) differ!

• Put letters into the graph proportional to their frequency!
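The first two bullets can be checked against the count tables. The sketch below takes the most frequent base at five positions around each junction (donor -3 to 1, acceptor -2 to 2) and lists the tabulated Rs(l) values for the same positions. The helper name `consensus` and the choice of which five positions to compare are assumptions made for illustration.

```python
BASES = "acgt"

def consensus(rows):
    """Most frequent base at each position, given rows of [a, c, g, t] counts."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in rows)

# Donor positions -3..1 and acceptor positions -2..2, copied from the tables.
donor = [[603, 673, 301, 222], [1071, 263, 219, 246], [169, 85, 1404, 141],
         [0, 2, 1789, 8], [8, 10, 6, 1775]]
acceptor = [[67, 1253, 20, 397], [1717, 10, 6, 10], [9, 11, 1720, 4],
            [449, 231, 871, 187], [426, 308, 413, 590]]

# Rs(l) in bits for the same positions, also copied from the tables.
donor_rs = [0.14, 0.39, 0.90, 1.95, 1.88]
acceptor_rs = [0.92, 1.86, 1.87, 0.26, 0.04]

print("consensus:", consensus(donor), consensus(acceptor))
print("donor Rs:   ", donor_rs)
print("acceptor Rs:", acceptor_rs)
```

Both junctions reduce to the same five-letter consensus, yet the per-position information is distributed very differently, which a consensus string alone cannot show.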

Page 73:

Human Splice Junction Information Curves

[Figure: sequence logo, "1799 Human donor sites"; horizontal axis: position -9 to +12 (5′ to 3′); vertical axis: 0 to 2 bits]

[Figure: sequence logo, "1744 Human acceptor sites"; horizontal axis: position -25 to +7 (5′ to 3′); vertical axis: 0 to 2 bits]

https://alum.mit.edu/www/toms/papers/logo/

• That’s how and why we invented sequence logos!
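As a rough illustration of how one logo column is built: the stack height at a position is Rs(l), and each letter takes a share of that height proportional to its frequency. The sketch below uses donor position -1 from the count table; the function name `logo_heights` is hypothetical, and the bottom-to-top ordering follows the usual sequence logo convention of putting the most frequent letter on top.

```python
def logo_heights(counts, info_bits):
    """Letter heights (bits) for one logo column: frequency times Rs(l)."""
    n = sum(counts)
    heights = {base: info_bits * k / n for base, k in zip("acgt", counts)}
    # Sorted bottom to top, so the most frequent letter ends up last (on top).
    return sorted(heights.items(), key=lambda item: item[1])

# Donor position -1 from the table: counts [a, c, g, t] and Rs(l) = 0.90 bits.
for base, height in logo_heights([169, 85, 1404, 141], 0.90):
    print(f"{base}: {height:.2f} bits")
```

The letter heights add up to the information at that position, so a logo shows both the conservation and the base preferences in one picture.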

Page 76:

Splice Junction Sequence Logos

[Figure: schematic of a transcript running 5′ to 3′ (exon, intron, exon) with the donor and acceptor logos at the two junctions; below it, the donor and acceptor logos shown together with a logo labeled "proto"]

• 90% of the splice junction information is on the intron side

• Hypothesis: donor and acceptor sites had a common ancestor that duplicated

• They evolved to put the information into the intron. This avoids affecting the proteins.

https://alum.mit.edu/www/toms/papers/splice/

Page 80:

3D Sequence Logos for tRNA Correlations

• tRNA reads RNA to make protein

• Correlations can be measured in bits!

• 3D Sequence logo

• OBSERVED: tRNA stems

https://alum.mit.edu/www/toms/papers/correlogo/ Nucleic Acids Res. 2006 34:W405-11
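To make "correlations can be measured in bits" concrete, here is a minimal sketch that computes the mutual information between two columns of an alignment. The function name and the toy columns are illustrative assumptions, not taken from the cited paper, and no small-sample correction is applied.

```python
import math
from collections import Counter


def mutual_information_bits(col_i, col_j):
    """Mutual information, in bits, between two aligned columns of equal length."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        # Compare the observed pair frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi


# Made-up columns: a perfectly base-paired stem position (G-C, C-G, A-U, U-A)
# and a pair of statistically independent positions.
stem_5p = "GCAUGCAU"
stem_3p = "CGUACGUA"
print(mutual_information_bits(stem_5p, stem_3p))          # 2.0
print(mutual_information_bits("AAAACCCC", "AACCAACC"))    # 0.0
```

A fully covarying pair of positions with four equally likely base pairs carries 2 bits of correlation, while independent positions carry none.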

Page 81:

Sequence Walker example: rrnB P1

[Figure: sequence walkers on the rrnB P1 promoter region, coordinate labels 4164240 to 4164380. The slide shows both strands; only the top strand is transcribed here.]

5′ ggagctgaacaattattgcccgttttacagcgttacggcttcga 3′ (labels 4164240 to 4164270)

Fis 12.0 bits

5′ aacgctcgaaaaactggcagttttaggctgatttggttgaatgttgcgcggtca 3′ (labels 4164280 to 4164330)

Fis 5.3 bits, Fis 10.4 bits

5′ gaaaattattttaaatttcctcttgtcaggccggaataactccctataatg 3′ (labels 4164340 to 4164380)

• distalUP 6.6 bits
• proximalUP 4.4 bits
• p35 5.5 bits
• p10 8.4 bits
• p35-(22)-p10 4164377, Gap 2.3 bits
• proximalUP-(27)-p10 4164377, Gap 3.4 bits
• distalUP-(45)-p10 4164377, Gap 5.4 bits
• distalUP-proximalUP-p35-p10 4164377, total 13.7 bits

Page 85:

Complex Sequence Walker Example

• σ70 promoters have a −35 and a −10

• Using information theory we discovered that stress-response σ38 promoters do not have a −35

• Instead, they have a −10 and two UP elements

• σ38 promoter talA P1 is complex!

Page 90:

Important Discovery

[Figure: sequence logo of 1744 human acceptor sites, positions -20 to 2 (5′ to 3′), vertical axis 0 to 2 bits.]

[Figure: sequence logo of 1799 human donor sites, positions -3 to 5 (5′ to 3′), vertical axis 0 to 2 bits.]

• Area under a sequence logo is the total information.

• How is that related to the binding energy?

• Information gained for energy dissipated

• Isothermal efficiency

• My most important discovery: Molecules are often 70% efficient
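As a rough illustration of "area under a sequence logo", the sketch below sums the per-position information of a set of aligned sites. It assumes equiprobable background bases (2 bits maximum per DNA position) and omits any small-sample correction; the function name and toy sites are hypothetical.

```python
import math
from collections import Counter


def rsequence_bits(aligned_sites, alphabet="ACGT"):
    """Return (per-position information, total information) in bits."""
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    h_max = math.log2(len(alphabet))  # 2 bits for four equiprobable bases
    per_position = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        # Shannon uncertainty of the bases observed at this position.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        per_position.append(h_max - h)
    return per_position, sum(per_position)


# Toy data: four identical sites give 2 bits at each of 6 positions.
heights, total = rsequence_bits(["GAATTC"] * 4)
print(total)  # 12.0
```

The stack height at each logo position is the per-position value, and the total is their sum.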

Page 93:

Information of EcoRI DNA Binding

[Figure: sequence logo of EcoRI sites, positions 0 to 5 reading G A A T T C (5′ to 3′), vertical axis 0 to 2 bits.]

• EcoRI - restriction enzyme

• EcoRI binds DNA at 5′ GAATTC 3′

• Information required: 6 bases × 2 bits per base = 12 bits

Page 97:

Energy Dissipation by EcoRI

• Measured specific binding constant:

$K_{spec} = 1.6 \times 10^{5}$

• Average energy dissipated by one molecule as it binds:

$\Delta G^{\circ}_{spec} = -k_B T \ln K_{spec}$ (joules per binding)

• The Second Law of Thermodynamics as a conversion factor:

$E_{min} = k_B T \ln 2$ (joules per bit)

• Number of bits that could have been selected:

$R_{energy} = -\Delta G^{\circ}_{spec} / E_{min}$

$= k_B T \ln K_{spec} / (k_B T \ln 2)$

$= \log_2 K_{spec}$ ⇐ SO SIMPLE!

$= 17.3$ bits per binding
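The arithmetic above can be checked numerically. This is a minimal sketch: the temperature is an assumed value (the slide does not give one), and it cancels out of the final ratio, which is why the result reduces to a base-2 logarithm.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def binding_energy_joules(k_spec, temperature_k):
    """Delta G for one binding event: -k_B T ln K_spec."""
    return -K_B * temperature_k * math.log(k_spec)


def energy_per_bit_joules(temperature_k):
    """Minimum dissipation per bit: k_B T ln 2."""
    return K_B * temperature_k * math.log(2)


def r_energy_bits(k_spec):
    """Bits that could have been selected: log2 K_spec."""
    return math.log2(k_spec)


T_ASSUMED = 298.15  # kelvin; an assumption for illustration only
k_spec = 1.6e5      # value quoted on the slide

ratio = -binding_energy_joules(k_spec, T_ASSUMED) / energy_per_bit_joules(T_ASSUMED)
print(round(ratio, 1))                  # 17.3
print(round(r_energy_bits(k_spec), 1))  # 17.3
```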

Page 103:

Information/Energy = Efficiency of EcoRI = 70%

[Figure: sequence logo of EcoRI sites (GAATTC, positions 0 to 5), vertical axis 0 to 2 bits.]

EcoRI could have made 17.3 binary choices ... but it only made 12 choices.

Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED:

$\frac{12\ \text{bits per binding}}{17.3\ \text{bits per binding}} = 0.7$

The efficiency is 70%.

18 out of 19 DNA binding proteins give ∼70% efficiency.
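The EcoRI efficiency figure follows directly from the two numbers on the slides. A minimal sketch, with a hypothetical helper name:

```python
import math


def binding_efficiency(r_sequence_bits, k_spec):
    """Information used (bits) divided by bits available from the binding energy."""
    return r_sequence_bits / math.log2(k_spec)


r_seq = 6 * 2  # 6 bases x 2 bits per base for GAATTC
print(round(binding_efficiency(r_seq, 1.6e5), 2))  # 0.69
```

The unrounded ratio is slightly below 0.7, which the slide reports as 70%.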

Page 109:

Rhodopsin Shape Change

[Figure: rhodopsin structure in the Dark State and, after a photon (hν), in the Light State. The two outcomes are labeled 70% and 30%.]

P. Scheerer et al. Nature 455, 497-502 (2008) doi:10.1038/nature07330

Page 110:

Muscle Structure

https://en.wikipedia.org/wiki/Muscle

Page 116:

Tom’s Model of Muscle Mechanism

[Figure: state diagram of the muscle model. Labels: ATP, prime, before state, operation ("power stroke"), after state (forward), after state (degenerate), heat + ADP + Pi, 70% success, 30% failure.]

Page 121:

Efficiency of Muscle

• Experiments by Kushmerick’s lab since (at least) 1969

• New work: 2008, 2011

• Weight lifting gives work done

• NMR coil gives ATP = energy used

• Efficiency: 0.68 ± 0.09

http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985

Page 128:

Why are molecular machines 70% efficient?

[Figure: sequence logo of EcoRI sites (GAATTC, positions 0 to 5), vertical axis 0 to 2 bits.]

70% efficiency appears widely in biology:

• DNA - protein binding

• rhodopsin

• muscle

• other systems

Why 70% efficiency?

Information theory explanation

Page 129:

Lock and Key

Like a key in a lock which has many independent pins, it takes many numbers to describe the vibrational state of a molecular machine.

G A A T T C

Page 132:

Gaussians

[Figure: plot of a Gaussian curve, p(x) against x, labeled $e^{-x^2}$.]

• Pin motion x has a Gaussian distribution:

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

µ = mean, σ = standard deviation

• Gaussian distributions are generated by the sum of many small random variables

• Drunkard’s walk: Galton’s quincunx device!

http://www.youtube.com/watch?v=xDIyAOBa_yU
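The quincunx idea can be sketched in a few lines: each ball takes a fixed number of independent left or right steps, and the bin counts pile up into a bell shape. The function name, the number of balls, and the number of pin rows are arbitrary choices for illustration.

```python
import random
from collections import Counter


def quincunx(n_balls, n_rows, seed=0):
    """Drop balls through n_rows of pins; return a Counter of final bin indices."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(n_balls):
        # Each pin row knocks the ball left (0) or right (1) with equal chance;
        # the final bin is the number of rightward steps.
        position = sum(rng.randint(0, 1) for _ in range(n_rows))
        bins[position] += 1
    return bins


bins = quincunx(n_balls=2000, n_rows=12)
for k in range(13):
    # Crude text histogram: one '#' per 10 balls in the bin.
    print(f"{k:2d} {'#' * (bins[k] // 10)}")
```

With more rows and more balls, the histogram approaches the Gaussian curve given above.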

Page 141:

Two Gaussians

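$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2}$ (1)

$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2}$ (2)

[Figure: a point p(x, y) at distance r from the origin, with x and y axes.]

credit: http://en.wikipedia.org/wiki/Pythagoras

$p(x, y) = p(x) \times p(y)$ (3)

$\propto e^{-x^2} \times e^{-y^2}$ (4)

$\propto e^{-(x^2+y^2)}$ (5)

$\propto e^{-r^2}$, since $r^2 = x^2 + y^2$ (6)

If p(x, y) is a constant, then r is a constant.

Circular distribution!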

Page 142:

1 Dimension

[Figure: Energy plotted against States in one dimension.]

1 dimension is too simple!

Page 143:

Bowls in 2 Dimensions

Page 144:

Spheres in 3 Dimensions

Page 145:

N Dimensional Sphere

Page 146:

Spheres tighten in high dimensions

[Figure: Normalized Density (0.00 to 1.00) against Normalized Radius (0.00 to 3.00) for D = 1, 2, 4, and 8.]
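One way to see the tightening numerically is to sample points whose coordinates are independent Gaussians and look at how the radius is distributed. This is a sketch under assumptions: unit variance per coordinate, and "relative spread" (standard deviation of the radius divided by its mean) as a stand-in for the width of the curves in the figure. It does not reproduce the figure's exact normalization.

```python
import math
import random
import statistics


def radius_samples(dimension, n_samples=5000, seed=0):
    """Radii of points whose coordinates are independent unit Gaussians."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dimension)))
        for _ in range(n_samples)
    ]


def relative_spread(dimension):
    """Standard deviation of the radius divided by its mean."""
    radii = radius_samples(dimension)
    return statistics.stdev(radii) / statistics.mean(radii)


for d in (1, 2, 4, 8, 64):
    print(d, round(relative_spread(d), 3))
```

The printed ratio falls as the dimension grows, so in high dimensions nearly all points sit in a thin shell at a well-defined radius.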

Page 147:

Good Sphere Packing

Good packing of spheres gives a molecule the capacity to make selections efficiently.

Page 153:

N Dimensional Sphere Separation

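[Figure: a Degenerate Sphere and a Forward Sphere, with lengths labeled $\sqrt{\text{Noise}}$ and $\sqrt{\text{Power}}$.]

Energy dissipated to escape the Degenerate Sphere must exceed the Noise:

$\sqrt{\text{Power}} > \sqrt{\text{Noise}}$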

Page 157:

Theoretical Isothermal Efficiency

T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006

• For molecular states of molecules with $d_{space}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and

$C_y = d_{space} \log_2(P_y/N_y + 1)$ ← machine capacity

$\epsilon_t = \dfrac{\ln(P_y/N_y + 1)}{P_y/N_y}$ ← isothermal efficiency

[Plot: $\epsilon_t$ versus $P_y/N_y$ for $P_y/N_y$ from 0 to 4; vertical-axis marks at 1.00, 0.69, 0.55, 0.46, 0.40; labels: "Second Law upper bound", "Isothermal Efficiency upper bound". The curve is an upper bound.]

• If $P_y/N_y = 1$ the efficiency is 70%!
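The two formulas on this slide can be checked numerically. The following is a minimal sketch, not code from the talk: the function names `machine_capacity` and `isothermal_efficiency` and the argument name `snr` (standing for Py/Ny) are chosen here for illustration only.

```python
import math


def machine_capacity(d_space, snr):
    """Capacity in bits: C_y = d_space * log2(Py/Ny + 1)."""
    return d_space * math.log2(snr + 1)


def isothermal_efficiency(snr):
    """Efficiency: epsilon_t = ln(Py/Ny + 1) / (Py/Ny), for Py/Ny > 0."""
    if snr <= 0:
        raise ValueError("Py/Ny must be positive")
    return math.log1p(snr) / snr


# Evaluate the efficiency at the integer values of Py/Ny shown on the plot axis.
for snr in (1, 2, 3, 4):
    print(f"Py/Ny = {snr}: efficiency = {isothermal_efficiency(snr):.2f}")
```

At Py/Ny = 1, 2, 3 and 4 this gives 0.69, 0.55, 0.46 and 0.40, the values marked on the plot's vertical axis. The value at Py/Ny = 1 is ln 2 ≈ 0.69, which the slide rounds to 70%.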

Page 158: users.fred.netusers.fred.net/tds/lab/images/kanpur2016/Lecture_7_Thomas_Schne… · Rsequence and Rfrequency for Splice Acceptors R sequence GAT C GAT C GAC T GAT C GA C T G A C T

Acknowledgements

• Mentors:
  • Larry Gold (graduate school mentor)
  • Andrej Ehrenfeucht (information theory)
• Students:
  • Mike Stephens (logos, splicing)
  • John Spouge (mathematics)
  • Paul C. Anagnostopoulos (Evolution model)
  • Bruce Shapiro and Eckart Bindewald (3D logo)
  • Kevin Franco (σ38)
  • Ding Jin (σ38)
• Useful discussions with
  • Jeff Strathern
  • Amar Klar
  • Kemi Abolude
  • Susan Lauffer
  • Cedric Cagliero
  • Zhi-Ming Zheng
  • Mark Lewandoski
• This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.

Page 159: users.fred.netusers.fred.net/tds/lab/images/kanpur2016/Lecture_7_Thomas_Schne… · Rsequence and Rfrequency for Splice Acceptors R sequence GAT C GAT C GAC T GAT C GA C T G A C T

Information theory: the mathematics of biology

[Sequence logo: 1744 Human acceptor sites; positions -20 to +2, 5′ to 3′; vertical axis 0 to 2 bits]

[Sequence logo: 1799 Human donor sites; positions -3 to +5, 5′ to 3′; vertical axis 0 to 2 bits]

[Sequence logo: EcoRI sites; positions 0 to 5 reading G A A T T C, 5′ to 3′; vertical axis 0 to 2 bits]
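The 0 to 2 bit scale on these logos can be illustrated with a short calculation. The sketch below computes, for each position of a set of aligned sites, 2 bits minus the Shannon entropy of the base frequencies. It is a simplified illustration under stated assumptions: it omits any small-sample correction, the function name `information_per_position` is hypothetical, and `ecori_sites` is made-up illustrative data (ten identical copies of GAATTC), not the data behind the slide.

```python
import math
from collections import Counter


def information_per_position(sites):
    """Return 2 - H (in bits) for each column of equal-length DNA sites.

    H is the Shannon entropy of the observed base frequencies in the column.
    No small-sample correction is applied in this sketch.
    """
    length = len(sites[0])
    bits = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Hypothetical alignment: every site is exactly GAATTC.
ecori_sites = ["GAATTC"] * 10
per_position = information_per_position(ecori_sites)
print(per_position)
print(sum(per_position))
```

A fully conserved position scores 2 bits, so the six invariant positions of this toy alignment total 12 bits. A position where all four bases are equally frequent scores 0 bits.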

https://alum.mit.edu/www/toms/

Page 161: users.fred.netusers.fred.net/tds/lab/images/kanpur2016/Lecture_7_Thomas_Schne… · Rsequence and Rfrequency for Splice Acceptors R sequence GAT C GAT C GAC T GAT C GA C T G A C T

Version

version = 1.89 of kanpurtalk.tex 2016 Oct 16