
Biotech 4490 Bioinformatics I Fall 2006 J.C. Salerno

Biological Information


The basics:

• DNA (deoxyribonucleic acid) stores information and codes for more DNA and for

• RNA (ribonucleic acid), which is the intermediate between long-term storage in the nucleus and

• Proteins, which do most of the work in living cells


Alphabets and translation

• DNA and RNA use four-letter alphabets (ACGT or ACGU); base pairing (A-T and G-C) in the DNA double helix is the key to replication, and base pairing in the DNA-RNA duplex (where U takes the place of T) is the key to transcription

• Proteins have a basic 20-letter alphabet corresponding to the amino acids. Since strands of DNA, RNA, and polypeptide are linear, unbranched polymers, they can be treated as character strings.


Alphabets and translation

• Transcription of DNA to RNA is a simple 1:1 read: a strand of DNA produces its complementary RNA strand

• Translation of RNA to protein amino acid sequence is complex
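The 1:1 read can be sketched in a few lines of Python. This is only an illustration: the function name `transcribe` and the convention that the template strand is written 3'->5' (so the RNA comes out 5'->3' without reversal) are assumptions made here, not part of the slides.

```python
# Pairing rule for a DNA template base -> RNA base (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA strand complementary to a DNA template strand.

    Assumption: the template is written 3'->5', so the RNA is produced
    5'->3' position by position and no reversal is needed.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


print(transcribe("TACGGCATT"))  # AUGCCGUAA
```

Each input character maps to exactly one output character, which is why no information is lost at this step.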


Alphabets and translation

• One base alone could only code for 4 different amino acids (AA)

• Two bases together could code for 4x4=16 different AA: close, but no cigar

• Three bases could code for 4x4x4=64 different AA; we only need 21, for the 20 AA used in proteins and a stop signal
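The counting argument can be checked directly. The sketch below uses a hypothetical helper, `min_codon_length`, to find the shortest word length over a 4-letter alphabet that gives at least 21 distinct words.

```python
def min_codon_length(n_symbols: int, n_bases: int = 4) -> int:
    """Smallest k such that n_bases**k >= n_symbols."""
    k = 1
    while n_bases ** k < n_symbols:
        k += 1
    return k


# Number of distinct words of length 1, 2 and 3 over a 4-letter alphabet.
for k in (1, 2, 3):
    print(k, 4 ** k)

# 20 amino acids plus one stop signal need 21 distinct code words.
print(min_codon_length(21))  # 3
```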


Alphabets and translation

• In translation, groups of three bases (codons) are translated into amino acids

• Since there are 64 (4x4x4) codons, most AAs have multiple codons (serine has 6!). We say that the genetic code is degenerate. This isn’t a comment on its character.
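A minimal sketch of translation and of the degeneracy count, assuming the standard genetic code written in the compact TCAG ordering used by NCBI translation table 1. Codons are written with DNA letters (T instead of U) for convenience; the names `CODON_TABLE`, `translate` and `codons_per_aa` are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code; "*" marks a stop codon. Codons are enumerated
# with the third base varying fastest: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA string codon by codon (trailing bases ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# How many codons map to each amino acid (or to stop).
codons_per_aa = Counter(CODON_TABLE.values())

print(translate("ATGTCTAGC"))  # MSS
print(codons_per_aa["S"])      # 6
print(codons_per_aa["M"])      # 1
```

Only methionine (M) and tryptophan (W) have a single codon in this table; every other amino acid has at least two.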


Alphabets and translation

• One consequence of the degeneracy of the genetic code is that you can translate nucleic acid sequences to AA sequences, but you can’t reverse translate to a unique nucleic acid sequence.
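To see how non-unique reverse translation is, one can count the DNA sequences that encode a given peptide: it is the product of the codon counts of its residues. The sketch below again assumes the standard genetic code; `count_back_translations` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code in TCAG order ("*" = stop), as in NCBI table 1.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_back_translations(protein: str) -> int:
    """Number of distinct coding sequences that translate to `protein`."""
    return prod(CODONS_PER_AA[aa] for aa in protein)


print(count_back_translations("MW"))   # 1  (both residues have one codon)
print(count_back_translations("MLS"))  # 36 (1 * 6 * 6)
```

The count grows multiplicatively with peptide length, so even short peptides usually have many possible coding sequences.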


Information content

• How much information can you put into a character string?

• The computer age has provided the current generation of students with valuable intuition in this area

• If I can put 10,000 songs on one iPod, how many songs can I put on two iPods?


Information content

• In general, we expect the amount of information to increase linearly with the amount of space available to store it: songs with iPods, phone numbers with pages in the phone book, digital photos with memory cards.


Information content

• More precisely, we express information content in terms of bits (or bytes) of information. The information content of a string of binary characters is just the number of characters.

• 1010101 = 7 bits

• 0001000 = 7 bits

• 10 = 2 bits (no shave or haircut)

• This assumes 1 and 0 are equally likely


Information content

• It should be obvious that the information content of a number is independent of how we express it: 999 should have the same significance written in binary as it does in base 10.


Information content

• In general, if the characters in the alphabet are equally probable, we can express the information content of a character string as $N \log_2 M$, where N is the number of characters in the sequence and M is the number of letters in the alphabet. For binary strings there are only two characters, so $N \log_2 M = N \log_2 2 = N$.
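The formula is easy to evaluate for any alphabet. A minimal sketch, assuming equally probable characters as the bullet does (the function name `information_bits` is illustrative):

```python
import math


def information_bits(n_chars: int, alphabet_size: int) -> float:
    """Information content N * log2(M) of a string of N characters drawn
    from an alphabet of M equally probable letters."""
    return n_chars * math.log2(alphabet_size)


print(information_bits(7, 2))             # 7.0  (binary string 1010101)
print(information_bits(10, 4))            # 20.0 (10 nucleotides)
print(round(information_bits(10, 20), 1)) # 43.2 (10 amino acids)
```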


Information content

• For nucleic acids, M = 4 (ACGT), so $N \log_2 M = 2N$

• For proteins, M = 20 (ACDEFGHIKLMNPQRSTVWY), so $N \log_2 M \approx 4.3N$


Information content

• A protein sequence has more than twice the information content of a nucleic acid sequence of the same length (about 4.3 bits per residue versus 2 bits per base).

• But since it takes 3 bases (6 bits) to code for a single AA, a protein sequence has only about 0.7 times the information content of the DNA sequence that originally coded for it.
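Both ratios follow from the per-character values. A short check, assuming equally probable letters in each alphabet (the variable names are illustrative):

```python
import math

BITS_PER_BASE = math.log2(4)   # 2 bits per nucleotide
BITS_PER_AA = math.log2(20)    # about 4.32 bits per amino acid

# Protein versus nucleic acid of the SAME length.
ratio_same_length = BITS_PER_AA / BITS_PER_BASE

# Protein versus the coding DNA, which is three times as long.
ratio_vs_coding_dna = BITS_PER_AA / (3 * BITS_PER_BASE)

print(round(ratio_same_length, 2))    # 2.16
print(round(ratio_vs_coding_dna, 2))  # 0.72
```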


Information content

• Suppose we translate a 15 base pair sequence into a five AA sequence. The information content of the nucleic acid sequence is just 2N = 30 bits.

• The information content of the protein sequence is $5 \log_2 20$ (this is an upper bound, assuming all AAs are equally probable), or about 21.6 bits

• Almost 8½ bits are lost to degeneracy.
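Written out as arithmetic, with the symbols $I_{\text{DNA}}$ and $I_{\text{protein}}$ introduced here only as labels for the two information contents:

```latex
\begin{align*}
I_{\text{DNA}} &= 15 \log_2 4 = 15 \times 2 = 30 \text{ bits} \\
I_{\text{protein}} &\le 5 \log_2 20 \approx 5 \times 4.32 = 21.6 \text{ bits} \\
I_{\text{DNA}} - I_{\text{protein}} &\approx 30 - 21.6 = 8.4 \text{ bits}
\end{align*}
```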


Information and Entropy

Entropy is a measure of the number of ways a system can exist.

Example: the oversimplified 2 state molecule

______ B

_______ A

Molecule has two states, A and B

In a large ensemble (sample) of molecules the populations of the states are $N_a$ and $N_b$


Information and Entropy

The oversimplified 2 state molecule

______ B

_______ A

If a photon with energy $h\nu$ can induce transitions between the states, the energy difference between them is just $\Delta = h\nu$, and at temperature T the population ratio $N_b/N_a$ is $e^{-\Delta/kT}$, where k is the Boltzmann constant.
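A rough numerical sketch of the Boltzmann population ratio, writing the energy gap between the states as Δ. The constants are the SI values; the function names and the example frequency and temperature are hypothetical choices made for illustration.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K
PLANCK = 6.62607015e-34     # Planck constant, J*s


def photon_energy(frequency_hz: float) -> float:
    """Energy h*nu of a photon, in joules."""
    return PLANCK * frequency_hz


def population_ratio(delta_joules: float, temperature_kelvin: float) -> float:
    """Boltzmann ratio N_b / N_a = exp(-delta / kT) for two single states."""
    return math.exp(-delta_joules / (K_BOLTZMANN * temperature_kelvin))


# Hypothetical example: a transition driven by 1e13 Hz (infrared) light
# at 300 K. The upper state holds roughly one fifth of the lower state's
# population under these assumed numbers.
delta = photon_energy(1.0e13)
print(population_ratio(delta, 300.0))
```

When the gap is zero the ratio is 1, and as the gap grows relative to kT the upper state empties exponentially.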


Information and Entropy

The oversimplified 2 state molecule: multiplicity

______ B

_______ A

Now suppose that A consists of n substates and B of m substates. The ratio of the populations of any substate of B to any substate of A is $e^{-\Delta/kT}$ (Δ being the energy difference between A and B), so the ratio of the populations of all the B states to all the A states is just $(m/n)\,e^{-\Delta/kT}$.


Information and Entropy

The oversimplified 2 state molecule: free energy and entropy

We can rearrange the expression $(m/n)\,e^{-\Delta/kT}$ using simple algebra to obtain the equivalent expression $e^{-(\Delta + kT\ln(n/m))/kT}$. In the exponent, the term $\Delta + kT\ln(n/m) = \Delta - kT\ln(m/n)$ has units of energy and is a free energy. Free energies in general determine equilibria. $k\ln(m/n)$ is an entropy term representing the difference in entropy between B and A ($\Delta S = S_b - S_a$).
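The rearrangement can be spelled out step by step. This sketch takes A to have n substates and B to have m, so the total population of B carries a factor m/n; the symbol $\Delta G$ for the free energy difference is introduced here and is not used in the slides.

```latex
\begin{align*}
\frac{N_b}{N_a}
  &= \frac{m}{n}\, e^{-\Delta/kT}
   = e^{\ln(m/n)}\, e^{-\Delta/kT} \\
  &= e^{-(\Delta - kT\ln(m/n))/kT}
   = e^{-(\Delta + kT\ln(n/m))/kT} \\
\Delta G &= \Delta - T\,\Delta S,
  \qquad \Delta S = S_b - S_a = k\ln m - k\ln n = k\ln\frac{m}{n}
\end{align*}
```

With this labeling the population ratio is simply $e^{-\Delta G/kT}$: more substates in B (larger m) lower the free energy of B and shift the equilibrium toward it.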


Information and Entropy

Question: What has entropy got to do with information?

Answer: Everything, because entropy is just a measure of the number of possible states.

The entropy of a state is just the natural logarithm of the number of ways that state can exist (times the Boltzmann constant k, in thermodynamic units). (That’s why it’s related to the degree of order: there are more ways of making a mess than of keeping things neat.)


Information and Entropy

Half a century ago Claude Shannon’s seminal work on information theory showed that the information content of a message could be expressed as a function we call the Shannon entropy. The basic idea is that the information content is the difference between the logarithm of the number of ways the message might read before we see it and the logarithm of the number of ways it might read after we read it. (Shannon was interested in errors as well as perfect reads.) Other people had similar ideas (e.g., Norbert Wiener, who coined the term cybernetics), but Shannon got the details right.


Information and Entropy

The information content (in bits) of a string of N characters with M ‘letters’ in the alphabet is $N \log_2 M$ if the characters are equally probable.

More generally, information content can be written in terms of probabilities as $-\log_2 P_i$ for a character of probability $P_i$, which looks worse than it is. Suppose that in an organism the CG content is 60%. The $P_i$ are 0.3 for C and G and 0.2 for A and T. Each C or G contributes $-\log_2(0.3) \approx 1.74$ bits, and each A or T contributes $-\log_2(0.2) \approx 2.32$ bits. The average information per position is $-\sum_i P_i \log_2 P_i \approx 1.97$ bits.
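The average can be computed directly from the base frequencies. A minimal sketch for the 60% CG example (the function name `shannon_entropy_bits` is illustrative):

```python
import math


def shannon_entropy_bits(probabilities) -> float:
    """Average information per character, -sum(p * log2(p)), in bits.

    Zero-probability letters are skipped, since they contribute nothing.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# 60% CG content: C and G at 0.3 each, A and T at 0.2 each.
base_probs = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

print(round(shannon_entropy_bits(base_probs.values()), 2))  # 1.97
print(shannon_entropy_bits([0.25] * 4))                     # 2.0
```

With equal base frequencies the result is the full 2 bits per position; any bias in composition lowers it.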