Molecular Information Theory
Niru Chennagiri
Probability and Statistics
Fall 2004
Dr. Michael Partensky
Overview
Why do we study molecular information theory?
What are molecular machines?
The power of the logarithm
Components of a communication system
The discrete noiseless system
Channel capacity
Molecular machine capacity
Motivation
A needle-in-a-haystack situation.
How will you go about looking for the needle?
How much energy do you need to spend?
How fast can you find the needle?
Haystack = DNA, Needle = binding site, You = ribosome
What is a Molecular Machine?
One or more molecules or a molecular complex: not a macroscopic reaction.
Performs a specific function.
Is energized before the reaction.
Dissipates energy during the reaction.
Gains information.
An isothermal engine.
Where is the candy?
Is it in the left four boxes? Is it in the bottom four boxes? Is it in the front four boxes?
You need answers to three questions to find the candy.
Box labels: 000, 001, 010, 011, 100, 101, 110, 111
Need log 8 = 3 bits of information.
More candies…
Box labels: 00, 01, 10, 11, 00, 01, 10, 11. Candy in both boxes labeled 01.
Need only log 8 - log 2 = 2 bits of information.
In general,
m boxes with n candies need
log m - log n bits of information
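The counting argument above can be sketched in a few lines of Python (the function name is mine, not the slides'):

```python
import math

def bits_needed(boxes, candies=1):
    """Bits of information needed to locate one of `candies`
    equally likely targets among `boxes` boxes: log2(m) - log2(n)."""
    return math.log2(boxes) - math.log2(candies)

print(bits_needed(8))      # one candy in 8 boxes -> 3.0 bits
print(bits_needed(8, 2))   # two candies in 8 boxes -> 2.0 bits
```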
Ribosomes
2600 binding sites from
4.7 million base pairs
Need
log(4.7 million) - log(2600)
= 10.8 bits of information.
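A quick check of this arithmetic:

```python
import math

# E. coli genome: ~4.7 million base pairs, ~2600 ribosome binding sites.
bits = math.log2(4.7e6) - math.log2(2600)
print(round(bits, 1))  # -> 10.8
```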
Communication System
Information Source
Represented by a stochastic process.
Mathematically, a Markov chain.
We are interested in ergodic sources: every sequence is statistically the same as every other sequence.
How much information is produced?
A measure of uncertainty H should be:
Continuous in the probabilities.
A monotonically increasing function of the number of events.
Decomposable: when a choice is broken down into two successive choices, the total H is the weighted sum of the individual H values.
Enter Entropy
H = -K Σ_{i=1}^{n} p_i log p_i
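A minimal sketch of this formula (with K = 1 and base-2 logarithms, giving bits):

```python
import math

def entropy(probs, K=1.0):
    """Shannon entropy H = -K * sum(p_i * log2(p_i)).
    Terms with p = 0 contribute nothing, so they are skipped."""
    return K * sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin -> 1.0 bit
print(entropy([0.25] * 4))        # four equally likely events -> 2.0 bits
print(entropy([1.0, 0.0, 0.0]))   # a certain outcome -> 0.0 bits
```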
[Figure: entropy H of a two-outcome source plotted against probability p; both axes run from 0 to 1, with H peaking at 1 bit when p = 1/2]
Properties of Entropy
H is zero iff all but one of the p_i are zero.
H is never negative.
H is maximum when all the events are equally probable.
If x and y are two events, H(x, y) ≤ H(x) + H(y).
Conditional entropy: H_x(y) = -Σ_{i,j} p(i, j) log p_i(j)
H_x(y) ≤ H(y): knowing x never increases the uncertainty about y.
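These properties can be checked numerically. The joint distribution below is a made-up example, not from the slides; it uses the chain rule H(x, y) = H(x) + H_x(y) to get the conditional entropy:

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Toy correlated joint distribution p(x, y): rows are x, columns are y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]            # marginal p(x)
py = [sum(col) for col in zip(*joint)]      # marginal p(y)
Hxy = H([p for row in joint for p in row])  # joint entropy H(x, y)
Hx, Hy = H(px), H(py)
Hx_y = Hxy - Hx                             # conditional entropy H_x(y), by the chain rule

print(Hxy <= Hx + Hy)   # True: joint entropy never exceeds the sum
print(Hx_y <= Hy)       # True: conditioning never increases entropy
```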
Why is entropy important?
Entropy is a measure of uncertainty.
Change in uncertainty: ΔH = H_after - H_before.
Entropy relation from thermodynamics: ΔS = (k_B ln 2) ΔH per molecule, or (R ln 2) ΔH per mole.
Also from thermodynamics: ΔS ≥ q / T.
For every bit of information gained, the machine dissipates at least k_B T ln 2 joules.
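Plugging in numbers for the dissipation bound (room temperature is my assumption):

```python
import math

kB = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0           # room temperature, K (assumed)

# Minimum dissipation per bit of information gained: k_B * T * ln 2.
q_per_bit = kB * T * math.log(2)
print(f"{q_per_bit:.2e} J per bit")  # ~2.87e-21 J
```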
Ribosome binding sites
Information in sequence
Position  Base frequencies   H before  H after  Change in H
1         A: 1/2, G: 1/2     2         1        1
2         U: 1               2         0        2
3         G: 1               2         0        2
Information curve
Uncertainty at position l: H(l) = -Σ_{b ∈ {A,C,G,T}} f(b, l) log2 f(b, l)
Information gain for site position l: R_sequence(l) = 2 - H(l)
A plot of this across the sites gives the information curve.
For E. coli, the total information is about 11 bits: the same as what the ribosome needs.
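A sketch of the per-position calculation; the function name is mine, and the base frequencies come from the three positions in the table above:

```python
import math

def position_info(freqs):
    """R_sequence(l) = 2 - H(l), where H(l) = -sum_b f(b,l) * log2 f(b,l)."""
    H = sum(-f * math.log2(f) for f in freqs.values() if f > 0)
    return 2.0 - H

# Base frequencies at three aligned positions: A/G split, then U, then G.
sites = [
    {"A": 0.5, "G": 0.5},   # position 1: H = 1, gain 1 bit
    {"U": 1.0},             # position 2: H = 0, gain 2 bits
    {"G": 1.0},             # position 3: H = 0, gain 2 bits
]
print([position_info(f) for f in sites])  # -> [1.0, 2.0, 2.0]
```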
Sequence Logo
Channel capacity
A source transmits 0s and 1s at 1000 symbols/sec.
1 in 100 symbols has an error. What is the rate of transmission?
We need to apply a correction: the uncertainty in x for a given value of y, which is the conditional entropy
H_y(x) = -(0.99 log 0.99 + 0.01 log 0.01) ≈ 0.081 bits/symbol
= 81 bits/sec of correction, leaving an effective rate of 919 bits/sec.
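The arithmetic of this example, checked in Python:

```python
import math

rate_symbols = 1000  # binary symbols per second
# Equivocation H_y(x): uncertainty about x given the received y,
# with a 1-in-100 error rate.
Hy_x = -(0.99 * math.log2(0.99) + 0.01 * math.log2(0.01))

print(round(Hy_x * rate_symbols))        # -> 81 bits/sec of correction
print(round((1 - Hy_x) * rate_symbols))  # -> 919 bits/sec effective rate
```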
Channel capacity contd.
C = max { H(x) - H_y(x) }
Shannon's theorem: as long as the rate of transmission is below C, the number of errors can be made as small as needed.
For a continuous channel with white noise,
C = W log2(1 + P / N)
where W is the bandwidth and P / N is the signal-to-noise ratio.
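A sketch of this formula with assumed example numbers (a 3 kHz bandwidth and a signal-to-noise ratio of 1000, i.e. 30 dB; neither value is from the slides):

```python
import math

def channel_capacity(W, snr):
    """Shannon-Hartley capacity: C = W * log2(1 + P/N) bits/sec."""
    return W * math.log2(1 + snr)

# Roughly 30 kbit/s for a telephone-like channel.
print(channel_capacity(3000, 1000))
```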
Molecular Machine Capacity
Lock-and-key mechanism.
Each pin on the ribosome is a simple harmonic oscillator in a thermal bath.
The velocity of the pins is represented by points in a 2-d velocity space.
More pins -> more dimensions.
The distribution of points is spherical.
Machine capacity
For large dimensions:
All points lie in a thin spherical shell.
The radius of the shell is the velocity, and hence the square root of the energy.
Before binding: r_before = √(P_y + N_y)
After binding: r_after = √N_y
Number of choices = number of 'after' spheres that can fit in the 'before' sphere
= (volume of the before sphere) / (volume of the after sphere).
Machine capacity = logarithm of the number of choices:
C = d log2(1 + P_y / N_y)
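A toy evaluation of the machine-capacity formula; the dimension count and power-to-noise ratio below are hypothetical values, not from the slides:

```python
import math

def machine_capacity(d, P, N):
    """Machine capacity C = d * log2(1 + P/N) bits per operation,
    where d is the number of independent pin dimensions."""
    return d * math.log2(1 + P / N)

# Hypothetical machine: 10 dimensions, power-to-noise ratio of 3.
print(machine_capacity(10, 3, 1))  # -> 20.0 bits per operation
```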
References
C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1949.
T. D. Schneider, "Sequence Logos, Machine/Channel Capacity, Maxwell's Demon, and Molecular Computers: a Review of the Theory of Molecular Machines," Nanotechnology, 5: 1-18, 1994.
T. D. Schneider, "Theory of Molecular Machines. I. Channel Capacity of Molecular Machines," J. Theor. Biol., 148: 83-123, 1991.
"How (and why) to find a needle in a haystack," The Economist, April 5th-11th, 1997 (British version: pp. 105-107; American version: pp. 73-75; Asian version: pp. 79-81).
http://www.math.tamu.edu/~rahe/Math664/gene1.html
http://www.lecb.ncifcrf.gov/~toms/