algorithmic information theory and computational biology

Algorithmic Information Theory and Computational Biology Hector Zenil Unit of Computational Medicine Karolinska Institutet Sweden Hector Zenil AIT Tools for Biology and Medicine

Upload: hector-zenil

Post on 05-Dec-2014


Category:

Education



DESCRIPTION

I present concepts and tools drawn from algorithmic information theory (AIT) for new-generation genetic sequencing, network biology and bioinformatics in general. AIT is a mathematical theory of information that formally characterises the concepts of simplicity, randomness and structure, and the differences between them. Measures from AIT can help computational medicine and systems biology deal with big data and sophisticated analytics within a new framework for understanding.

TRANSCRIPT

Page 1: Algorithmic Information Theory and Computational Biology

Algorithmic Information Theory and Computational Biology

Hector Zenil

Unit of Computational Medicine, Karolinska Institutet

Sweden

Hector Zenil AIT Tools for Biology and Medicine

Page 2: Algorithmic Information Theory and Computational Biology

Complex Adaptive Systems (CAS)


Page 3: Algorithmic Information Theory and Computational Biology

Complexity is hard to quantify in biology

Mapping quantitative stimuli to qualitative behaviour


Page 4: Algorithmic Information Theory and Computational Biology

Information Theory in Biology

Sequence alignment

Pattern recognition

Sequence logos

Binding site detection

Motif detection

Consensus sequences

Biological significance

[based on Claude Shannon’s Information Theory, 1948]

Page 5: Algorithmic Information Theory and Computational Biology

Algorithmic Information Theory

Which sequence looks more random?

(a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

(b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC

(c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC

Classical probability theory vs. Kolmogorov Complexity

Definition

$K_U(s) = \min\{|p| : U(p) = s\}$ (1)

Compressibility

A sequence s is c-compressible if there is a program p producing s with $|p| + c \le |s|$; such a sequence has low Kolmogorov complexity. A sequence is random if $K(s) \approx |s|$.

[Kolmogorov (1965); Chaitin (1966)]

Page 6: Algorithmic Information Theory and Computational Biology

Examples

Example 1

Sequences like (a) have low algorithmic complexity because they allow a short description. For example, “20 times A”. No matter how long (a) grows in length, the description increases only by about $\log_2(k)$ (k times A).

Example 2

The sequence (b) is algorithmically random because it does not seem to allow a description (much) shorter than the length of (b) itself.

For example, for sequence (a), a proof of non-randomness implies the exhibition of a short program. Compressibility is therefore a sufficient test of non-randomness.
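As a rough illustration, a general-purpose lossless compressor can play the role of the short program: if it shrinks a sequence well below its original length, the sequence is not random. The sketch below assumes Python's `zlib` (Deflate) as the compressor and uses longer analogues of the three example sequences, because compressor overhead dominates for very short strings. The function name `compressed_size` and the sequence lengths are illustrative choices, not part of the talk.

```python
import random
import zlib


def compressed_size(sequence: str) -> int:
    """Size in bytes of the zlib (Deflate) encoding: an upper bound on K, up to a constant."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


# Longer analogues of sequences (a), (b) and (c); the length 1000 is an arbitrary choice.
rng = random.Random(0)  # fixed seed so the run is repeatable
seq_a = "A" * 1000
seq_b = "".join(rng.choice("ACGT") for _ in range(1000))
seq_c = "GC" * 500

for name, seq in (("a", seq_a), ("b", seq_b), ("c", seq_c)):
    print(name, len(seq), compressed_size(seq))
```

The test is one-sided. The pseudo-random sequence here is itself generated by a short program, yet the compressor fails to find it, which is why compressibility is a sufficient but not a necessary test of non-randomness.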


Page 7: Algorithmic Information Theory and Computational Biology

Example of an evaluation of K

The sequence (c) GCGCGC...GC is not algorithmically random (it has low K complexity) because it can be produced by the following program (take G=0 and C=1):

Program A(i):
1: n := 0
2: Print n mod 2
3: n := n + 1
4: If n = i Goto 6
5: Goto 2
6: End

The length of A (in bits) is an upper bound of K(GCGCGC...GC).
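A minimal Python rendering of Program A, assuming i ≥ 1 and the mapping 0 → G, 1 → C given above (the function name `program_a` is illustrative). The size of this source text, like the length of A, only bounds K from above, since a shorter program may exist.

```python
def program_a(i: int) -> str:
    """Emit n mod 2 for n = 0 .. i-1, mapped to G (0) and C (1), as Program A does."""
    symbols = "GC"
    return "".join(symbols[n % 2] for n in range(i))


print(program_a(28))
```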


Page 8: Algorithmic Information Theory and Computational Biology

The ultimate measure of pattern detection and optimal prediction

Kolmogorov and Chaitin, Schnorr, and Martin-Löf independently provided three different approaches to randomness (compression, predictability and typicality).

They proved (for infinite sequences):

incompressibility ⇐⇒ unpredictability ⇐⇒ typicality

When this happens in mathematics, a concept (here, randomness) has been objectively captured.

This is why prediction in biology is hard. AIT tells us that no effective statistical test will succeed in recognising all patterns and no computable technique can fully predict all outcomes. The problem is deeply connected to computability and algorithmic information theory.

[Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]

Page 9: Algorithmic Information Theory and Computational Biology

Information distances and similarity metrics

Measures waiting to be introduced in bioinformatics

Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$

Universal Similarity Metric: $USM(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}$

Normalised Information Distance (NID) and Normalised Compression Distance (NCD): $NCD(x, y) = \frac{K(xy) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}$

Normalized Compression Measure (NCM): $NC(s) = K(s)/|s|$ (asymptotic behaviour)

Bennett’s Logical Depth: $LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$

(For an example of an application, see Zenil, Complexity, 2011.)
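Because K is uncomputable, these distances are evaluated in practice by substituting a real compressor for K. The sketch below assumes `zlib` as that stand-in and computes the NCD for synthetic DNA-like strings; the helper names `c` and `ncd`, the sequence length and the mutation spacing are all illustrative assumptions.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used here in place of the uncomputable K."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance with zlib substituted for K."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)


rng = random.Random(1)  # fixed seed so the run is repeatable
x = "".join(rng.choice("ACGT") for _ in range(2000))

# y is x with one substitution every 200 positions (a crude "related" sequence).
y_chars = list(x)
for pos in range(0, len(y_chars), 200):
    y_chars[pos] = "A" if y_chars[pos] != "A" else "C"
y = "".join(y_chars)

# z is generated independently of x (an "unrelated" sequence).
z = "".join(rng.choice("ACGT") for _ in range(2000))

print("related  :", ncd(x.encode(), y.encode()))
print("unrelated:", ncd(x.encode(), z.encode()))
```

The related pair should score closer to 0 and the unrelated pair closer to 1. With a real compressor the values are only approximate and can fall slightly outside the interval [0, 1].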


Page 10: Algorithmic Information Theory and Computational Biology

Non-systematic but successful attempts in biology

GenCompress is a compression algorithm for DNA sequences: $d(x, y) = 1 - \frac{K(x) - K(x|y)}{K(xy)}$

NCD applied to genetic similarity:

AIT looks at the genome as information, not as data (letters). Counting: traditional Shannon-entropy style sequencing. Interpreting: AIT. The full power of the theory hasn’t yet been unleashed.


Page 11: Algorithmic Information Theory and Computational Biology

To be or not to be...

Borel’s “Infinite Monkey” theorem

[Figure: random input and example outputs: 0, “Syntax error”, 1, ∞, 1024, “To be or not to be, that is the question.”, CH3, √2, π]


Page 12: Algorithmic Information Theory and Computational Biology

Algorithmic probability


Page 13: Algorithmic Information Theory and Computational Biology

Producing π

This C-language code produces the first 1000 digits of π (Gjerrit Meinsma):

long k=4e3,p,a[337],q,t=1e3;
main(j){for(;a[j=q=0]+=2,--k;)
for(p=1+2*k;j<337;q=a[j]*k+q%p*t,a[j++]=q/p)
k!=j>2?:printf("%.3d",a[j-2]%t+q/p/t);}

Producing non-random sequences:

If an object has low Kolmogorov complexity, then it has a short description and a greater probability of being produced by a random program. The less random a string, the more likely it is to be produced by a short program.
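This relation is made precise by the algorithmic coding theorem (Levin, 1974), given here as a standard statement rather than as it appears on the slide. The symbol m(s) for the algorithmic probability of s, and the assumption that U is a prefix-free universal machine fed with uniformly random bits, follow the usual conventions:

```latex
m(s) = \sum_{p \,:\, U(p) = s} 2^{-|p|},
\qquad
K(s) = -\log_2 m(s) + O(1)
```

A string that many short programs produce has a large m(s) and therefore a small K(s).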


Page 14: Algorithmic Information Theory and Computational Biology

Biological Big Data Analysis

The information bottleneck:

Small Data matters: Local measurements of information content are a good indication of the global information content of an object. Evidence: BDM image classification. Compression works at large scales, looking for long regularities, while BDM is very local. Yet both yield astonishingly similar results for these object sizes.


Page 15: Algorithmic Information Theory and Computational Biology

Complementary methods for different sequence lengths

The methods to approximate K coexist and complement each other for different sequence lengths.

Method                      | Short strings (< 100 bits) | Long strings (> 100 bits) | Scalability
Lossless compression method | ×                          | √                         | √
Coding Theorem method       | √                          | ×                         | ×
Block Decomposition method  | √                          | √                         | √

[Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility (2012)]


Page 16: Algorithmic Information Theory and Computational Biology

Coding Theorem method and lossless compression

The transition between one method and the other. What is complex for the Coding Theorem method is less compressible.

[Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures (2012)]


Page 17: Algorithmic Information Theory and Computational Biology

Online Algorithmic Complexity Calculator

Provides: Shannon’s entropy, lossless compression (Deflate) values, Kolmogorov complexity approximations and relative frequency order (algorithmic probability).

A Mathematica API and an R module.

Datasets available online at the Dataverse Network.

Basic data analysis tool for short sequence comparison.

[http://www.complexitycalculator.com]
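Two of the quantities listed above can be reproduced locally with standard tools. The sketch below is not the calculator's API; it assumes Python's `zlib` for the Deflate value and computes Shannon entropy from symbol frequencies, with illustrative function names.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def deflate_size(s: str) -> int:
    """Size in bytes of the Deflate (zlib) encoding of s."""
    return len(zlib.compress(s.encode("ascii"), 9))


for s in ("GC" * 50, "GGCC" * 25):
    print(s[:12], shannon_entropy(s), deflate_size(s))
```

Both test strings contain G and C in equal proportion, so frequency-based entropy assigns each 1 bit per symbol and cannot register that they are perfectly regular, whereas a compressor exploits the repetition.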


Page 18: Algorithmic Information Theory and Computational Biology

Online Algorithmic Complexity Calculator 2

[http://www.complexitycalculator.com]


Page 19: Algorithmic Information Theory and Computational Biology

Simulation of natural systems w/complex symbolic systems

An elementary cellular automaton (ECA) is defined by a local function $f : \{0, 1\}^3 \to \{0, 1\}$. f maps the state of a cell and its two immediate neighbours (range = 1) to a new cell state: $f_t : (r_{-1}, r_0, r_{+1}) \to r_0$. Cells are updated synchronously according to f over all cells in a row.

[Wolfram, (1994)]
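A minimal sketch of the synchronous update, assuming Wolfram's standard rule numbering (bit k of the rule number gives the output for the neighbourhood whose three cells read as the binary number k) and periodic boundary conditions. The boundary choice and the function names are assumptions made for illustration.

```python
def eca_step(cells: list, rule: int) -> list:
    """Apply the local function f to every cell at once (periodic boundary assumed)."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood read as a 3-bit number
        new_cells.append((rule >> index) & 1)        # look up f in the rule number's bits
    return new_cells


def eca_run(rule: int, width: int, steps: int) -> list:
    """Evolve from a single black cell in the middle; returns the list of rows."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = eca_step(row, rule)
        rows.append(row)
    return rows


for r in eca_run(110, 31, 15):
    print("".join("#" if cell else "." for cell in r))
```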


Page 20: Algorithmic Information Theory and Computational Biology

Behavioural classes of CA

Wolfram’s classes of behaviour:

Class I: Systems evolve into a stable state.

Class II: Systems evolve in a periodic (e.g. fractal) state.

Class III: Systems evolve into random-looking states.

Class IV: Systems evolve into localised complex structures, e.g. Rule 110 or the Game of Life.

[Wolfram, (1994)]


Page 21: Algorithmic Information Theory and Computational Biology

Block Decomposition method (BDM)

The Block Decomposition method uses the Coding Theorem method. Formally, we will say that an object c has complexity:

$K_{\log m, 2D_{d \times d}}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} \left[ (n_u - 1) \log_2(K_{m,2D}(r_u)) + K_{m,2D}(r_u) \right]$ (2)

where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained from decomposing the object into blocks of d × d with boundary conditions. In each $(r_u, n_u)$ pair, $r_u$ is one of such squares and $n_u$ its multiplicity.

[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
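The decomposition can be sketched for one-dimensional strings as follows. This is a simplified illustration under several assumptions: it uses 1D blocks of length d rather than d × d squares, it drops a shorter trailing block as its boundary condition, and it replaces the Coding Theorem method values with a hypothetical placeholder, `toy_block_complexity`, which is not a real CTM estimate. The sum applies the per-block term of equation (2).

```python
import math
from collections import Counter
from typing import Callable


def toy_block_complexity(block: str) -> float:
    """Hypothetical placeholder for a CTM lookup: 1 + number of symbol changes. NOT a CTM value."""
    return 1.0 + sum(a != b for a, b in zip(block, block[1:]))


def bdm_1d(s: str, d: int, block_k: Callable[[str], float] = toy_block_complexity) -> float:
    """Sum (n_u - 1) * log2(K(r_u)) + K(r_u) over the distinct blocks r_u with multiplicity n_u."""
    # Non-overlapping blocks of length d; a shorter trailing block is dropped (assumed boundary condition).
    blocks = [s[i:i + d] for i in range(0, len(s) - d + 1, d)]
    total = 0.0
    for block, n_u in Counter(blocks).items():
        k = block_k(block)
        total += (n_u - 1) * math.log2(k) + k
    return total


print(bdm_1d("AAAAAAAA", 4))
print(bdm_1d("ACGTACGT", 4))
print(bdm_1d("AAAAACGT", 4))
```

Each distinct block is evaluated once and repeats add only a logarithmic term, so the cost grows with the number of blocks rather than with any search over programs.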


Page 22: Algorithmic Information Theory and Computational Biology

Classification of ECA by BDM versus lossless compression

Compressors have limitations (small sequences, time complexity)

Applications to machine learning

Problems of classification and clustering

BDM is computationally efficient (runs in $O(n^d)$ time, hence linear (d = 1) time for sequences)

[H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]


Page 23: Algorithmic Information Theory and Computational Biology

Asymptotic behaviour of complex systems

[Zenil, Complex Systems (2010)]


Page 24: Algorithmic Information Theory and Computational Biology

Rule space of 3-symbol 1D CA

[Zenil, Complex Systems (2011)]

Page 25: Algorithmic Information Theory and Computational Biology

Phase transition detection

Definition

$c^n_t = \frac{|C(M_t(i_1)) - C(M_t(i_2))| + \dots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$

[Zenil, Complex Systems (2011)]
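One way to evaluate this coefficient numerically is sketched below, assuming that M is an elementary cellular automaton, that $M_t(i)$ is its space-time diagram after t steps from initial configuration i, and that C is approximated by zlib-compressed length. The periodic boundary, the choice of initial configurations and the function names are illustrative assumptions rather than the procedure of the cited paper.

```python
import zlib


def eca_evolve(rule: int, init: list, t: int) -> bytes:
    """Space-time diagram M_t(i) of an ECA (periodic boundary assumed), as rows of '0'/'1'."""
    n = len(init)
    row, rows = list(init), []
    for _ in range(t + 1):
        rows.append("".join(map(str, row)))
        row = [(rule >> ((row[(i - 1) % n] << 2) | (row[i] << 1) | row[(i + 1) % n])) & 1
               for i in range(n)]
    return "\n".join(rows).encode("ascii")


def transition_coefficient(rule: int, inits: list, t: int) -> float:
    """Sum of |C(M_t(i_j)) - C(M_t(i_{j+1}))| over consecutive inputs, divided by t(n-1)."""
    sizes = [len(zlib.compress(eca_evolve(rule, init, t), 9)) for init in inits]
    total = sum(abs(a - b) for a, b in zip(sizes, sizes[1:]))
    return total / (t * (len(inits) - 1))


# Hypothetical inputs: the binary expansions of 1..8, zero-padded to 41 cells.
inits = [[int(b) for b in format(k, "041b")] for k in range(1, 9)]
for rule in (4, 110):
    print(rule, transition_coefficient(rule, inits, 40))
```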

Page 26: Algorithmic Information Theory and Computational Biology

A measure of programmability

$C^n_t(M) = \frac{\partial f(c^n_t)}{\partial t}$ (3)

[Zenil, Complex Systems (2011)]


Page 27: Algorithmic Information Theory and Computational Biology

Examples

Figure: ECA Rule 4 has a low $C^n_t$ for randomly chosen n and t (it doesn’t react much to external stimuli). $\lim_{n,t \to \infty} C^n_t(R4) = 0$

[H. Zenil, Philosophy & Technology, (2013)]

Page 28: Algorithmic Information Theory and Computational Biology

Examples (cont.)

Figure: ECA R110 has a large coefficient $C^n_t$ for sensible choices of t and n, which is compatible with the fact that it has been proven to be capable of universal computation (for particular semi-periodic initial configurations). $\lim_{n,t \to \infty} C^n_t(R110) = 1$


Page 29: Algorithmic Information Theory and Computational Biology

Classification of graphs

[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]


Page 30: Algorithmic Information Theory and Computational Biology

Characterisation of complex networks

Complex networks with preferential attachment algorithms preserve properties invariant under network size (connectedness, robustness) at a low cost (unlike random networks, which are costly in the number of links).

[Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]


Page 31: Algorithmic Information Theory and Computational Biology

Biological case study: Programmable Porphyrin molecules

Much is known about the dynamics of these molecules, so one can perform Monte Carlo simulations based on these mathematical models and establish a correspondence between Wang tiles and simple molecules.

[joint work with ICOS, U. of Nottingham] [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]


Page 32: Algorithmic Information Theory and Computational Biology

Quantitative dynamics of living systems

Aggregations with similar Kolmogorov complexity cluster in similar configurations.

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]


Page 33: Algorithmic Information Theory and Computational Biology

Mapping output behaviour to external stimuli: Parameter discovery

Parameter Space P → Target Space T

Target space T: Set a configuration from P that triggers the desired behaviour in T.

To investigate:

Reduction of the parameter space

Characterisation of the target space

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]


Page 34: Algorithmic Information Theory and Computational Biology

Robustness and pervasiveness

Concentration changes preserving behaviour:

Output parameters that have the highest impact can be tested in silico before experiments in materio.

[G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]


Page 35: Algorithmic Information Theory and Computational Biology

Orthogonality

Specific concentrations that produce a given behaviour according to the mathematical model, to be tested against empirical data.


Page 36: Algorithmic Information Theory and Computational Biology

Highlights and goals

Ultimate goal (a few years’ time): An information-theoretical toolbox for systems and synthetic biology

[Complex3D Proteins Database (graph representation) & Z Chen et al. Lung cancer pathways in response to treatments.]

Pushing boundaries.

A cutting-edge mathematical approach

Tools from Complexity theory.


Page 37: Algorithmic Information Theory and Computational Biology

New Generation Sequence data analysis

Heavily driven by:

Explosion of experimental data

Difficulties in data interpretation

New paradigms for knowledge extraction

Data mining the behaviour of natural systems

Towards an AIT tool-kit for systems biology, a functional library of programmable biological modules with an SBML interface.


Page 38: Algorithmic Information Theory and Computational Biology

J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in Cristian Calude (ed.), Complexity and Randomness: From Leibniz to Chaitin, World Scientific, 2007.

J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings, Applied Mathematics and Computation, 2011.

H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC]

F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Numerical Algorithms (in 2nd revision)

F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Frequency Output Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT]

H. Zenil, Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems, Complex Systems, Vol. 19, No. 1, pages 1-28, 2010.


Page 39: Algorithmic Information Theory and Computational Biology

H. Zenil and J.A.R. Marshall, Some Aspects of Computation Essential to Evolution and Life, Ubiquity, 2012.

H. Zenil, What is Nature-like Computation? A Behavioural Approach and a Notion of Programmability, Philosophy & Technology (special issue on History and Philosophy of Computing), 2013.

H. Zenil, On the Dynamic Qualitative Behavior of Universal Computation, Complex Systems, vol. 20, No. 3, pp. 265-278, 2012.

H. Zenil, A Turing Test-Inspired Approach to Natural Computation, in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels, 10-12 October 2012), Historical and Contemporary Research in Logic, Computing Machinery and Artificial Intelligence, Proceedings published by the Royal Flemish Academy of Belgium for Science and Arts, 2013.

G.J. Chaitin, A Theory of Program Size Formally Identical to Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.

A. N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.


Page 40: Algorithmic Information Theory and Computational Biology

L. Levin, Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems of Information Transmission, 10(3):206–210, 1974.

M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd ed., 2008.

R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.

S. Wolfram, A New Kind of Science, Wolfram Media, 2002.
