biotech 4490 bioinformatics i fall 2006 j.c. salerno 1 biological information

Biotech 4490 Bioinformatics I Fall 2006 J.C. Salerno

1

Biological Information


2

The basics:

• DNA (deoxyribonucleic acid) stores information; it codes for more DNA and for

• RNA (ribonucleic acid), which is the intermediate between long-term storage in the nucleus and

• Proteins, which do most of the work in living cells


3

Alphabets and translation

• DNA and RNA use four-letter alphabets (ACGT or ACGU); base pairing (A-T and G-C) in the DNA double helix is the key to replication, and base pairing in a DNA-RNA duplex (with U in the RNA taking the place of T) is the key to transcription

• Proteins have a basic 20-letter alphabet corresponding to the amino acids. Since strands of DNA, RNA, and polypeptide are linear, unbranched polymers, they can be treated as character strings.


4

Alphabets and translation

• Transcription of DNA to RNA is a simple 1:1 read: a strand of DNA produces its complement

• Translation of RNA to protein amino acid sequence is complex
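The 1:1 read can be sketched as a character-by-character lookup on a string. This is a minimal illustration, not a model of the biology: the names `DNA_TO_RNA` and `transcribe` are made up here, and the sketch ignores strand direction (5' to 3') and everything else about how transcription actually proceeds.

```python
# Pairing rules for reading a DNA template strand into RNA
# (A pairs with U in RNA, T with A, G with C, C with G).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Return the RNA complement of a DNA template, one base per base."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


print(transcribe("TACGGT"))  # AUGCCA
```

Each input character maps to exactly one output character, which is why the slide calls transcription a simple read.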


5

Alphabets and translation

• One base alone could only code for 4 different amino acids (AA)

• Two bases together could code for 4x4=16 different AA: close, but no cigar

• Three bases could code for 4x4x4=64 different AA; we only need 21, for the 20 AA used in proteins and a stop signal
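The counting argument above can be checked by enumerating every word of a given length over the four-letter alphabet. This is only a sketch; `count_words` and `NEEDED` are illustrative names, not anything from the course.

```python
from itertools import product

BASES = "ACGU"
NEEDED = 21  # 20 amino acids plus a stop signal


def count_words(length: int) -> int:
    """Count the distinct base strings of the given length."""
    return sum(1 for _ in product(BASES, repeat=length))


for length in (1, 2, 3):
    n = count_words(length)
    verdict = "enough" if n >= NEEDED else "too few"
    print(f"{length} base(s): {n} words, {verdict}")
```

Three is the smallest word length for which the number of words reaches 21.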


6

Alphabets and translation

• In translation, groups of three bases (codons) are translated into amino acids

• Since there are 64 (4x4x4) codons, most AAs have multiple codons (serine has 6!). We say that the genetic code is degenerate. This isn’t a comment on its character.
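Codon-by-codon translation can be sketched with a lookup table. The sketch assumes the standard genetic code, written compactly as a 64-character string with the bases ordered U, C, A, G. That layout and the names `CODON_TABLE` and `translate` are choices made for this example, not part of the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate RNA three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


serine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "S")
print(translate("AUGUCUUAA"))  # MS
print(serine_codons)           # the six codons that all mean serine
```

Several keys in the table share the same value. That sharing is exactly the degeneracy the slide describes.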


7

Alphabets and translation

• One consequence of the degeneracy of the genetic code is that you can translate nucleic acid sequences to AA sequences, but you can’t reverse translate to a unique nucleic acid sequence.
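To see how far from unique reverse translation is, one can count the RNA sequences that encode a given peptide: the count is the product of the number of codons for each amino acid. The sketch below assumes the standard genetic code in a compact string layout; `count_encodings` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# How many codons map to each amino acid (or to stop).
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_encodings(peptide: str) -> int:
    """Number of distinct RNA sequences that translate to this peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)


print(count_encodings("MW"))   # 1: Met and Trp each have a single codon
print(count_encodings("SLR"))  # 216: Ser, Leu and Arg have six codons each
```

Only peptides built entirely from single-codon amino acids reverse translate uniquely. For anything else the number of candidates grows multiplicatively with length.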


8

Information content

• How much information can you put into a character string?

• The computer age has provided the current generation of students with valuable intuition in this area

• If I can put 10,000 songs on one iPod, how many songs can I put on two iPods?


9


• In general, we expect the amount of information to increase linearly with the amount of space available to store it: songs with iPods, phone numbers with pages in the phone book, digital photos with memory cards.


10

Information content

• More precisely, we express information content in terms of bits (or bytes) of information. The information content of a string of binary characters is just the number of characters.

• 1010101 = 7 bits

• 0001000 = 7 bits

• 10 = 2 bits (no shave or haircut)

• This assumes 1 and 0 are equally likely


11


• It should be obvious that the information content of a number is independent of how we express it: 999 should have the same significance written in binary as it does in base 10.


12


• In general, if the characters in the alphabet are equally probable we can express the information content of a character string as $N \log_2 M$, where N is the number of characters in a sequence and M is the number of letters in the alphabet. For binary strings, there are only two characters, so $N \log_2 M = N \log_2 2 = N$.
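As a quick check of the formula, the sketch below applies it to the number 999 written in binary and in base 10. The function name `info_bits` is an illustrative choice, and the calculation assumes every digit is equally likely.

```python
from math import log2


def info_bits(n_chars: int, alphabet_size: int) -> float:
    """Information content N * log2(M) for equally probable characters."""
    return n_chars * log2(alphabet_size)


binary = format(999, "b")  # '1111100111', ten binary digits
decimal = str(999)         # three decimal digits

print(binary, info_bits(len(binary), 2))     # 10 bits
print(decimal, info_bits(len(decimal), 10))  # just under 10 bits
```

The two figures nearly agree. The small gap arises because ten binary digits can distinguish 1024 values while three decimal digits distinguish 1000.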


13


• For nucleic acids, M = 4 (ACGT), so $N \log_2 M = 2N$

• For proteins, M = 20 (ACDEFGHIKLMNPQRSTVWY), so $N \log_2 M \approx 4.3N$


14


• A protein sequence has more than twice the information content of a nucleic acid sequence of the same length (about 4.3 bits per amino acid versus 2 bits per base).

• But since it takes 3 bases (6 bits) to code for a single AA, a protein sequence has only about 0.7 times the information content of the DNA sequence that originally coded for it.


15

Information content

• Suppose we translate a 15 base pair sequence into a five AA sequence. The information content of the nucleic acid sequence is just $2N = 30$ bits.

• The information content of the protein sequence is $5 \log_2 20$ (this is an upper bound assuming all AAs equally probable), or about 21.6 bits

• Almost 8 1/2 bits ($30 - 21.6 = 8.4$) are lost to degeneracy.
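The arithmetic in this example can be reproduced in a few lines. The sketch assumes equally probable bases and amino acids, as the slide does; `translation_loss` is a name invented for the illustration.

```python
from math import log2

BITS_PER_BASE = log2(4)  # 2 bits
BITS_PER_AA = log2(20)   # about 4.32 bits


def translation_loss(n_bases: int):
    """Return (DNA bits, protein bits, bits lost) for a coding sequence."""
    n_codons = n_bases // 3
    dna_bits = n_bases * BITS_PER_BASE
    protein_bits = n_codons * BITS_PER_AA
    return dna_bits, protein_bits, dna_bits - protein_bits


dna_bits, protein_bits, lost = translation_loss(15)
print(f"DNA {dna_bits:.1f} bits, protein {protein_bits:.1f} bits, lost {lost:.1f} bits")
print(f"protein/DNA ratio: {protein_bits / dna_bits:.2f}")
```

The ratio does not depend on sequence length: it is always $\log_2 20 / 6$, roughly 0.72.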


16

Information and Entropy

Entropy is a measure of the number of ways a system can exist.

Example: the oversimplified 2 state molecule

______ B

_______ A

The molecule has two states, A and B.

In a large ensemble (sample) of molecules the populations of the states are $N_a$ and $N_b$.


17

Information and Entropy

The oversimplified 2 state molecule

______ B

_______ A

If a photon with energy $h\nu$ can induce transitions between the states, the energy difference between them is just $\Delta = h\nu$, and at temperature T the population ratio $N_b/N_a$ is $e^{-\Delta/kT}$, where k is the Boltzmann constant.
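A numerical sketch of this population ratio follows. It reads the slide's relation as $\Delta = h\nu$ and $N_b/N_a = e^{-\Delta/kT}$. The function names and the 10^13 Hz frequency are arbitrary choices for illustration, and the constants are the SI values in joules.

```python
from math import exp, log

BOLTZMANN_K = 1.380649e-23  # J/K
PLANCK_H = 6.62607015e-34   # J*s


def photon_energy(frequency_hz: float) -> float:
    """Energy gap Delta = h * nu, in joules."""
    return PLANCK_H * frequency_hz


def population_ratio(delta: float, temperature: float) -> float:
    """Boltzmann ratio N_b / N_a for two single states separated by delta."""
    return exp(-delta / (BOLTZMANN_K * temperature))


# Hypothetical example: a 1e13 Hz transition at 300 K.
print(population_ratio(photon_energy(1e13), 300.0))

# When delta equals kT * ln 2, the upper state holds half as many molecules.
print(population_ratio(BOLTZMANN_K * 300.0 * log(2), 300.0))
```

The ratio approaches 1 as the gap shrinks relative to kT and falls toward 0 as the gap grows.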


18


The oversimplified 2 state molecule: multiplicity

______ B

_______ A

Now suppose that A consists of n substates and B of m substates. The ratio of the populations of any substate of B to any substate of A is $e^{-\Delta/kT}$, so the ratio of the populations of all the B states to all the A states is just $(m/n)\,e^{-\Delta/kT}$.


19


The oversimplified 2 state molecule: free energy and entropy

We can rearrange the expression $(m/n)\,e^{-\Delta/kT}$ using simple algebra to obtain the equivalent expression $e^{-(\Delta + kT\ln(n/m))/kT}$. In the exponent, the term $\Delta + kT\ln(n/m)$ has units of energy and is a free energy. Free energies in general determine equilibria. The $\ln(n/m)$ term is an entropy term: $k\ln(m/n) = -k\ln(n/m)$ is the difference in entropy between B and A ($\Delta S = S_b - S_a$), so the free energy is $\Delta - T\Delta S$.
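The "simple algebra" can be written out step by step. The derivation below assumes, as stated on the multiplicity slide, that A has n substates and B has m substates, so the B-to-A population ratio carries the factor m/n. The symbols $\Delta S$ and $\Delta G$ are labels introduced here for the entropy and free energy differences.

```latex
\begin{align*}
\frac{N_b}{N_a}
  &= \frac{m}{n}\, e^{-\Delta/kT} \\
  &= e^{\ln(m/n)}\, e^{-\Delta/kT} \\
  &= e^{-\left(\Delta - kT\ln(m/n)\right)/kT} \\
  &= e^{-\left(\Delta + kT\ln(n/m)\right)/kT}
\end{align*}

% Identifying the entropy difference between B and A:
\begin{align*}
\Delta S &= S_b - S_a = k\ln m - k\ln n = k\ln(m/n), \\
\Delta G &= \Delta - T\,\Delta S = \Delta + kT\ln(n/m),
\qquad \frac{N_b}{N_a} = e^{-\Delta G/kT}.
\end{align*}
```

If B has more substates than A (m > n), the entropy term lowers the free energy of B and shifts the equilibrium toward it.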


20

Information and Entropy

Question: What has entropy got to do with information?

Answer: Everything, because entropy is just a measure of the number of possible states.

The entropy of a state is just the natural logarithm of the number of ways that state can exist (times the Boltzmann constant k, in thermodynamic units). That’s why it’s related to the degree of order: there are more ways of making a mess than of keeping things neat.


21

Information and Entropy

Half a century ago Claude Shannon’s seminal work on information theory showed that the information content in a message could be expressed as a function we call the Shannon entropy. The basic idea is that the information content is the difference between the logarithm of the number of ways the message might read before we see it and the logarithm of the number of ways it might read after we read it. (Shannon was interested in errors as well as perfect reads.) Other people had similar ideas (e.g., Norbert Wiener, who coined the term cybernetics), but Shannon got the details right.


22


The information content (in bits) of a string of N characters with M ‘letters’ in the alphabet is $N \log_2 M$ if characters are equally probable.

More generally, information content can be written in terms of probabilities as $-\log_2 P_i$ per character, which looks worse than it is. Suppose that in an organism the GC content is 60%. The $P_i$ are 0.3 for C and G and 0.2 for A and T. Each C or G contributes $-\log_2(0.3) \approx 1.74$ bits, and each A or T contributes $-\log_2(0.2) \approx 2.32$ bits. The average information per position is $-\sum_i P_i \log_2 P_i \approx 1.97$ bits.
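The average information per position can be computed directly from the base frequencies. This is a minimal sketch using the frequencies from the example above; `shannon_entropy` is an illustrative function name.

```python
from math import log2


def shannon_entropy(probabilities) -> float:
    """Average information per character, in bits: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


gc_rich = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(shannon_entropy(gc_rich.values()))  # slightly below 2 bits
print(shannon_entropy(uniform.values()))  # exactly 2 bits
```

Any departure from equal base frequencies lowers the average below the $\log_2 4 = 2$ bits per position of the uniform case.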
