TRANSCRIPT
Introduction to Information Theory
Gurinder Singh “Mickey” Atwal [email protected]
Center for Quantitative Biology
Summary
• Shannon’s coding theorems
• Entropy
• Mutual Information
• Multi-information
• Kullback-Leibler Divergence
Role of Information Theory in Biology
i) Mathematical modeling of biological phenomena, e.g. optimization of early neural processing in the brain; bacterial population strategies
ii) Extraction of biological information from large data-sets, e.g. gene expression analyses; GWAS (genome-wide association studies)
Mathematical Theory of Communication
• Claude Shannon (1948), Bell Sys. Tech. J. Vol. 27, 379-423, 623-656
• How to encode information?
• How to transmit messages reliably?
Model of General Communication System
[Diagram: Information source → message → Channel → Destination]
Examples (source / channel / destination):
• Visual Image / Retina / Visual Cortex
• Morphogen Concentration / Gene Pathway / Differentiation Genes
• Computer File / Fiber Optic Cable / Another Computer
Model of General Communication System
[Diagram: Information source → message → Transmitter → signal (+ noise) → Channel → Receiver → message → Destination. The message is ENCODED by the transmitter and DECODED by the receiver.]
Model of General Communication System
1) Shannon's source coding theorem
There exists a fundamental lower bound on the size of the compressed message without losing information.
Model of General Communication System
2) Shannon's channel coding theorem
Information can be transmitted, with negligible error, at rates no faster than the channel capacity.
Information Theory
What is the information content of a message (random variable)? How much uncertainty is there in the outcome of an event?
E.g. nucleotide frequency distributions:
[Bar charts of p(A), p(T), p(G), p(C):
• p(A)=p(T)=p(G)=p(C)=0.25 (uniform, e.g. Homo sapiens): high information content
• p(A)=p(T)=0.4, p(G)=p(C)=0.1 (skewed, e.g. Plasmodium falciparum): low information content]
Measure of Uncertainty H({pi})
Suppose we have a set of N possible events with probabilities p1, p2, …, pN.
General requirements of H:
• Continuous in pi
• If all pi are equal then H should be monotonically increasing with N
• H should be consistent: if a choice is broken down into successive choices, the entropies should add up, e.g.

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
Entropy as a measure of uncertainty
Unique answer provided by Shannon, for a random variable B with N elements b:

H[B] = -\sum_{b \in B} p(b) \log_2 p(b)   (discrete states)

H[B] = -\int db \, p(b) \log_2 p(b)   (continuous states)

• Similar to Gibbs entropy in statistical mechanics
• Maximum when all probabilities are equal, p(b) = 1/N: H_max[B] = \log_2 N (Boltzmann entropy)
• Units are measured in bits (binary digits), from the base-2 logarithm
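As a quick illustration (an editorial sketch, not part of the original slides), here is a minimal Python implementation of the discrete entropy, applied to the two nucleotide distributions above:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log(p_i); zero-probability states contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# Uniform nucleotide frequencies (maximum entropy): H = log2(4) = 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# Skewed frequencies p(A)=p(T)=0.4, p(G)=p(C)=0.1: H ~ 1.72 bits
print(entropy([0.4, 0.4, 0.1, 0.1]))
```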
Interpretations of entropy H
• Average length of the shortest code to transmit a message (Shannon's source coding theorem)
• Captures the variability of a variable without making any model assumptions
• Average number of yes/no questions needed to determine the outcome of a random event
[Bar charts of nucleotide frequencies:
• p(A)=p(T)=p(G)=p(C)=0.25 (e.g. Homo sapiens): H = 2 bits
• p(A)=p(T)=0.4, p(G)=p(C)=0.1 (e.g. Plasmodium falciparum): H ≈ 1.7 bits]
Entropy as average length of shortest code

Symbol | Probability of symbol, P(x) | Optimal code length = -log2 P(x) | Optimal code
A      | 1/2                         | 1                                | 0
C      | 1/4                         | 2                                | 10
T      | 1/8                         | 3                                | 110
G      | 1/8                         | 3                                | 111

Average length = \langle -\log_2 P(x) \rangle_{P(x)} \equiv -\sum_x P(x) \log_2 P(x) \equiv H[x] = 1.75 bits

Note that the average length of the optimal code is equal to the entropy of the distribution.
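A short check, assuming the table above, that the average optimal code length equals the entropy:

```python
import math

# Symbol probabilities and the optimal code lengths -log2 P(x) from the table above
P       = {'A': 0.5, 'C': 0.25, 'T': 0.125, 'G': 0.125}
lengths = {'A': 1,   'C': 2,    'T': 3,     'G': 3}   # codes 0, 10, 110, 111

avg_length = sum(p * lengths[s] for s, p in P.items())
H = -sum(p * math.log2(p) for p in P.values())
print(avg_length, H)  # 1.75 1.75 -- average code length equals the entropy
```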
Example: Binding sequence conservation
• Sequence conservation:

R_{seq} = H_{max} - H_{obs} = \log_2 N - \left( -\sum_{n=1}^{N} p_n \log_2 p_n \right)

CAP (Catabolite Activator Protein) acts as a transcription activator at more than 100 sites within the E. coli genome. Sequence conservation reveals the CAP binding site.
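A minimal sketch of this conservation score for a single alignment column; the nucleotide counts below are invented for illustration, not taken from CAP data:

```python
import numpy as np

def r_seq(column_counts):
    """Sequence conservation R_seq = log2(N) - H_obs for one alignment column."""
    p = np.asarray(column_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return np.log2(len(column_counts)) + np.sum(p * np.log2(p))

# Hypothetical counts of A, C, G, T at two positions of a binding-site alignment
print(r_seq([90, 4, 3, 3]))     # strongly conserved position: ~1.4 bits
print(r_seq([25, 25, 25, 25]))  # unconserved position: 0 bits
```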
Two random variables?
• Joint entropy:

H[X,Y] = -\sum_{x \in X, y \in Y} p(x,y) \log_2 p(x,y)

• If the variables are independent, p(x,y) = p(x)p(y), then H[X,Y] = H[X] + H[Y]
• The difference measures the total amount of correlation between the two variables:

Mutual Information, I(X;Y)

I[X;Y] = H[X] + H[Y] - H[X,Y] = \sum_{x \in X, y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}
Mutual Information, I(X;Y)
[Venn diagram: H[X] and H[Y] overlap in I[X;Y]; the non-overlapping parts are H[X|Y] and H[Y|X]; the union is H[X,Y]]

I(X;Y) = H(X) - H(X|Y)

• I(X;Y) quantifies how much the uncertainty of X is reduced if we know Y
• If X and Y are independent, then I(X;Y) = 0
• Model independent
• Captures all non-linear correlations (c.f. Pearson's correlation)
• Independent of measurement scale
• Units (bits) have physical meaning
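A minimal Python sketch computing I(X;Y) from a joint probability table, using I = H[X] + H[Y] - H[X,Y]:

```python
import numpy as np

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = H[X] + H[Y] - H[X,Y] from a joint probability table p(x,y)."""
    joint = np.asarray(joint, dtype=float)
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # perfectly correlated: 1 bit
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent: 0 bits
```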
Mutual information captures non-linear relationships
[Two scatter plots of y vs x with similar mutual information but very different R²:
A) R² = 0.487 ± 0.019, I = 0.72 ± 0.08, MIC = 0.48 ± 0.02
B) R² = 0.001 ± 0.002, I = 0.70 ± 0.09, MIC = 0.40 ± 0.02]
Kinney and Atwal, PNAS 2014
Responsiveness to "complicated" relations
[Two scatter plots of gene-B expression level vs gene-A expression level:
one with MI ~ 1 bit, Corr. ~ 0.9;
one with MI ~ 1.3 bits, Corr. ~ 0]
Data processing inequality
• Suppose we have a sequence of processes, e.g. a signal transduction pathway (Markov process):

A → B → C

Physical statement: in any physical process, the information about A gets continually degraded along the sequence of processes.
Mathematical statement:

I(A;C) ≤ I(A;B) and I(A;C) ≤ I(B;C)
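A small numerical check of the inequality, assuming each step of A → B → C is a binary symmetric channel with a 10% flip probability (an invented example, not from the lecture):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(joint):
    joint = np.asarray(joint, dtype=float)
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

def bsc_joint(p_in, flip):
    """Joint p(x,y) when a binary symmetric channel flips input x with probability `flip`."""
    return np.array([[p_in[0] * (1 - flip), p_in[0] * flip],
                     [p_in[1] * flip,       p_in[1] * (1 - flip)]])

# A -> B -> C, each arrow a 10%-flip channel; the composite A -> C channel
# has an effective flip probability of 0.1*0.9 + 0.9*0.1 = 0.18
pA = [0.5, 0.5]
print(mi(bsc_joint(pA, 0.10)))  # I(A;B) ~ 0.53 bits
print(mi(bsc_joint(pA, 0.18)))  # I(A;C) ~ 0.32 bits: information is degraded
```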
Multi-Entropy, H(X1X2…Xn)

H[X_1 X_2 \ldots X_n] = -\sum_{x_1 x_2 \ldots x_n} p(x_1 x_2 \ldots x_n) \log_2 p(x_1 x_2 \ldots x_n)

Multi-Information, I(X1X2…Xn)
Measures the total correlation in n variables: a generalised correlation between more than two elements.

I[X_1 X_2 \ldots X_n] = \sum_{x_1 x_2 \ldots x_n} p(x_1 x_2 \ldots x_n) \log_2 \frac{p(x_1 x_2 \ldots x_n)}{p(x_1) p(x_2) \ldots p(x_n)}

• Multi-information is a natural extension of Shannon's mutual information to an arbitrary number of random variables
• Provides a general measure of non-independence among multiple variables in a network
• Captures higher-order interactions beyond simple pairwise interactions
• Equivalently, in terms of entropies:

I[\{X_1, X_2, \ldots, X_N\}] = \sum_{i=1}^{N} H(X_i) - H[\{X_1, X_2, \ldots, X_N\}]
Capturing more than pairwise relations
[Two plots of expression vs experiment index:
gene-A/gene-B expression: MI ~ 0 bits, Corr. ~ 0;
gene-A/gene-B/gene-C expression: multi-information ~ 1.0 bits]
Multi-allelic associations
Phenotype P determined by two alleles, A and B, via XOR:

A | B | P
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

I(A;B) = I(A;P) = I(B;P) = 0
I(A;B;P) = 1 bit
Multi-locus associations can be completely masked by single-locus studies! (See the sketch below.)
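A minimal sketch verifying the XOR numbers above:

```python
import numpy as np
from itertools import product

def H(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# XOR: A and B are independent fair alleles and P = A xor B,
# so the four triples (A, B, P) are equiprobable.
triples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def marginal_entropy(idx):
    counts = [sum(1 for t in triples if t[idx] == v) for v in (0, 1)]
    return H(counts)

joint_entropy = H([1] * len(triples))  # four equiprobable states: 2 bits
multi_info = sum(marginal_entropy(i) for i in range(3)) - joint_entropy
print(multi_info)  # I(A;B;P) = 1.0 bit, although all pairwise MIs are 0
```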
Synergy and Redundancy

S(X;Y;Z) = I(X;Y;Z) - I(X;Y) - I(X;Z) - I(Y;Z)
         = I(\{X,Y\};Z) - [I(X;Z) + I(Y;Z)]

(where I(X;Y;Z) is the multi-information of the three variables)
S compares the information that X and Y together provide about Z with the information that these two variables provide separately.
• If S < 0 then X and Y are redundant in providing information about Z
• If S > 0 then there is synergy between X and Y
Motivating example:
• X : SNP 1
• Y : SNP 2
• Z : phenotype (apoptosis level)
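Continuing the XOR example from the previous slide, a sketch of the synergy S = I({X,Y};Z) - [I(X;Z) + I(Y;Z)] for two SNPs and a phenotype:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(joint):
    joint = np.asarray(joint, dtype=float)
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

# Joint p(x, y, z) for the XOR example: SNPs X, Y and phenotype Z = X xor Y
p = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p[x, y, x ^ y] = 0.25

I_xy_z = mi(p.reshape(4, 2))  # I({X,Y};Z), treating (X,Y) as a single variable
I_x_z  = mi(p.sum(axis=1))    # I(X;Z)
I_y_z  = mi(p.sum(axis=0))    # I(Y;Z)
print(I_xy_z - (I_x_z + I_y_z))  # S = 1 bit > 0: pure synergy
```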
How do we quantify distance between distributions?

Kullback-Leibler Divergence (DKL)
• Also known as relative entropy
• Quantifies the difference between two distributions, P(x) and Q(x)
• Non-symmetric measure
• DKL(P||Q) ≥ 0, with DKL(P||Q) = 0 if and only if P = Q
• Invariant to reparameterization of x

D_{KL}(P||Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}   (discrete)

D_{KL}(P||Q) = \int dx \, P(x) \ln \frac{P(x)}{Q(x)}   (continuous)
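A minimal Python sketch of the discrete form:

```python
import numpy as np

def dkl(p, q, base=np.e):
    """D_KL(P||Q) for discrete distributions (natural log by default)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

print(dkl([0.54, 0.46], [0.5, 0.5]))         # ~0.0032: nearly identical
print(dkl([0.9, 0.1],  [0.5, 0.5]))          # ~0.37: clearly different
print(dkl([0.5, 0.5],  [0.9, 0.1], base=2))  # != dkl(q, p): non-symmetric
```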
Kullback-Leibler Divergence: DKL ≥ 0
Proof: use Jensen's inequality. For a concave function f(x), such as ln(x), every chord lies below the function, so the average of the function is bounded by the function of the average:

\langle f(x) \rangle \le f(\langle x \rangle), \quad \text{e.g. } \langle \ln x \rangle \le \ln \langle x \rangle

Applying this to the (negative) divergence:

-D_{KL}(P||Q) = -\sum_x P(x) \ln \frac{P(x)}{Q(x)}
             = \left\langle \ln \frac{Q(x)}{P(x)} \right\rangle_{P(x)}
             \le \ln \left\langle \frac{Q(x)}{P(x)} \right\rangle_{P(x)}
             = \ln \sum_x P(x) \frac{Q(x)}{P(x)}
             = \ln \sum_x Q(x) = \ln 1 = 0

\therefore D_{KL}(P||Q) \ge 0
Kullback-Leibler Divergence
Motivation 1: Counting Statistics
• Flip a fair coin N times, i.e., qH = qT = 0.5
• E.g. N = 50: observe 27 heads and 23 tails
• What is the probability of observing this?
[Bar charts: observed distribution P(x) = {pH, pT} = {0.54, 0.46} vs actual distribution Q(x) = {qH, qT} = {0.50, 0.50}]
Kullback-Leibler Divergence
Motivation 1: Counting Statistics

P(n_H, n_T) = \frac{N!}{n_H! \, n_T!} q_H^{n_H} q_T^{n_T}   (binomial distribution)
            \approx \exp(-N p_H \ln(p_H/q_H) - N p_T \ln(p_T/q_T))   (for large N)
            = \exp(-N D_{KL}[P||Q])

• The probability of observing the counts depends on i) N and ii) how much the observed distribution differs from the true distribution.
• DKL emerges from the large-N limit of the binomial (multinomial) distribution.
• DKL quantifies how much the observed distribution diverges from the true underlying distribution.
• If DKL > 1/N then the distributions are "very" different.
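A numerical check of the large-N statement for the N = 50 coin example; exp(-N·DKL) captures the dominant exponential factor, up to a Gaussian prefactor of order 1/sqrt(N):

```python
from math import lgamma, log, exp, pi, sqrt

N, n_H = 50, 27
n_T = N - n_H
p_H, p_T = n_H / N, n_T / N        # observed distribution P
q_H = q_T = 0.5                    # true distribution Q

# Exact binomial probability of this count, computed in log space
log_exact = (lgamma(N + 1) - lgamma(n_H + 1) - lgamma(n_T + 1)
             + n_H * log(q_H) + n_T * log(q_T))
dkl = p_H * log(p_H / q_H) + p_T * log(p_T / q_T)

print(exp(log_exact))                                # exact: ~0.096
print(exp(-N * dkl))                                 # ~0.85: exponential factor only
print(exp(-N * dkl) / sqrt(2 * pi * N * p_H * p_T))  # ~0.096 with the prefactor
```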
Kullback-Leibler Divergence Motivation 2: Information Theory
• How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?
D_{KL}(P||Q) = avg no. of bits using bad code - avg no. of bits using optimal code
             = \left[ -\sum_x P(x) \log_2 Q(x) \right] - \left[ -\sum_x P(x) \log_2 P(x) \right]
             = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}
Kullback-Leibler Divergence
Motivation 2: Information Theory

Symbol | Probability of symbol, P(x) | Bad code (optimal for Q(x)) | Optimal code for P(x)
A      | 1/2                         | 00                          | 0
C      | 1/4                         | 01                          | 10
T      | 1/8                         | 10                          | 110
G      | 1/8                         | 11                          | 111

P(x) = {1/2, 1/4, 1/8, 1/8}; Q(x) = {1/4, 1/4, 1/4, 1/4}
Entropy of the symbol distribution = -\sum_x P(x) \log_2 P(x) = 1.75 bits
Avg length of the bad code = 2 bits; avg length of the optimal code = 1.75 bits, equal to the entropy and thus optimal.
DKL(P||Q) = 2 - 1.75 = 0.25 bits, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00; C=01; T=10; G=11} instead of the optimal code.
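A short check that the 0.25-bit overhead equals DKL(P||Q):

```python
import math

P = {'A': 0.5, 'C': 0.25, 'T': 0.125, 'G': 0.125}
bad_len = {'A': 2, 'C': 2, 'T': 2, 'G': 2}  # code optimal for uniform Q
opt_len = {'A': 1, 'C': 2, 'T': 3, 'G': 3}  # lengths -log2 P(x)

avg_bad = sum(p * bad_len[s] for s, p in P.items())   # 2.0 bits
avg_opt = sum(p * opt_len[s] for s, p in P.items())   # 1.75 bits
dkl = sum(p * math.log2(p / 0.25) for p in P.values())
print(avg_bad - avg_opt, dkl)  # 0.25 0.25 -- the per-symbol overhead equals DKL
```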