Why simple organisms can cope with complex environments?

NIPS 2009 Workshop: The curse of dimensionality – how can the brain solve it?

Naftali Tishby

Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel

Outline

• Is the RL “curse of dimensionality” real?
  – …the “number of parameters” debate revisited…?

• The brain’s primary task: making valuable predictions
  – The perception-action cycle of information
  – Optimal solution: the Past-Future Information Bottleneck

• Predictive information is rare
  – Only a tiny fraction of the world’s complexity is relevant
  – How difficult is it to extract it?

• The brain’s complexity reflects behavior (not the world)
  – New bounds on predictive representation complexity
  – Information-bounded Reinforcement Learning
  – Robustness and generalization theorems

The brain’s primary task is making valuable predictions

Perception is goal-oriented, directed by active predictions

Hierarchies and reverse hierarchies

Tsotsos 1990; Hochstein and Ahissar 2002

The auditory pathways

[Figure: the auditory pathways, showing the feed-forward hierarchy and the feedback (reverse) hierarchy.]

• Low-level representations are sensitive to fine temporal cues, at μs resolution.

• Phonological/semantic level: … (example words in the figure: day, bay, night, dream)

• Initial perception is based on high-level, phonological representations.

Nelken et al, 2005

Perception-Action Cycles

Multiple cycles with Multiple time scales!

The Perception-Action Cycle

The circular flow of information that takes place between the organism and its environment in the course of a sensory-guided sequence of behavior towards a goal. (JM Fuster)

Why Predictability? Life is all about making good predictions…

The essence of the cycle

[Figure: the cycle at the present moment (NOW). The environment is a stationary stochastic process with a PAST (X) and a FUTURE (Y); the internal representation (T) sits between them. Labels: sensing costs, prediction value, internal representations.]

(Optimal) Internal Representations: we like to think probabilistically

[Diagram relating X, T and Y, annotated with P(X,Y), P(T|X) and P(Y|T).]

• Environment: P(X,Y)

• Internal representation: P(T|X), P(Y|T)

[Diagram relating X, T and Y, annotated with I(X;Y), I(T;X) and I(T;Y).]

• Environment: I(X;Y) – predictive information

• Internal representation: I(T;X), I(T;Y) – compression & prediction

(Optimal) Internal Representations: and we want a computational principle…

[Diagram relating X (past), T (model) and Y (future), annotated with I(X;Y), I(T;X) and I(T;Y).]

Model Quantifiers:

• Complexity (“cost”): I(T;X)

• Predictive Info (“value”): I(T;Y)

Optimality Trade-off:

• minimize complexity

• maximize predictive-info

(Optimal) Internal Representations: and a computational principle…

• Environment: I(X;Y) – predictive information

• Internal representation: I(T;X) , I(T;Y) - compression & prediction
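
One standard way to write the trade-off between these two quantities is the Information Bottleneck functional (Tishby, Pereira and Bialek, 1999). The trade-off parameter β and the normalization Z(x, β) are notation assumed here; neither survives in the extracted slide text, so this is a sketch of the principle rather than a copy of the slide.

$$\min_{p(t \mid x)} \; \mathcal{L}\left[p(t \mid x)\right] = I(T;X) - \beta\, I(T;Y), \qquad \beta \ge 0$$

$$p(t \mid x) = \frac{p(t)}{Z(x,\beta)} \exp\left(-\beta\, D_{KL}\left[\,p(y \mid x)\,\|\,p(y \mid t)\,\right]\right)$$

Small β favors compression (low I(T;X)); large β favors prediction (high I(T;Y)). Sweeping β traces out an information curve of I(T;Y) against I(T;X).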

A simple illustration

$x \in X = \{1, 2, \ldots, 18\}$, $|X| = 18$; $y \in Y = \{A, B\}$, $|Y| = 2$

[Figure: the joint distribution P(X,Y) over x = 1…18 and y ∈ {A, B}, and the conditional P(Y=B|X) plotted on a 0 to 1 axis.]

[Figure sequence: the information curve, I(T;Y) (0 to 0.2) against I(T;X) (0 to 4), shown together with the encoder P(T|X) and the resulting predictions P(‘B’|X) at successive points along the curve:

• T = X, I(T;X) = H(X): most complex, perfect copy, perfect predictions
• I(T;X) ≈ 3 bits
• I(T;X) ≈ 2 bits
• I(T;X) ≈ 1 bit
• I(T;X) ≈ 0.5 bit
• I(T;X) ≈ 0 bits]
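
The illustration can be approximated numerically. The sketch below iterates the usual self-consistent Information Bottleneck updates, p(t|x) ∝ p(t) exp(−β KL[p(y|x) ‖ p(y|t)]), on an assumed toy environment: uniform p(x) over 18 values and a sigmoid-shaped P(B|x). The distribution, the function names and the β values are illustrative assumptions, not the numbers behind the slides.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (bits) of a joint distribution stored as a 2-D array."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])))

def information_bottleneck(p_xy, n_t, beta, n_iter=300, seed=0):
    """Iterate the self-consistent IB equations; returns p(t|x), shape (|X|, |T|)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    p_t_given_x = rng.dirichlet(np.ones(n_t), size=len(p_x))  # random soft start
    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                                # p(t)
        p_xt = p_t_given_x * p_x[:, None]                      # p(x, t)
        p_y_given_t = (p_xt.T @ p_y_given_x) / (p_t[:, None] + eps)
        # KL[p(y|x) || p(y|t)] in nats for every pair (x, t)
        log_ratio = np.log((p_y_given_x[:, None, :] + eps) /
                           (p_y_given_t[None, :, :] + eps))
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        logits = np.log(p_t + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        p_t_given_x = np.exp(logits)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x

def info_plane_point(p_xy, p_t_given_x):
    """Return (I(T;X), I(T;Y)) in bits for an encoder p(t|x)."""
    p_x = p_xy.sum(axis=1)
    p_xt = p_t_given_x * p_x[:, None]
    p_ty = p_xt.T @ (p_xy / p_x[:, None])
    return mutual_information(p_xt), mutual_information(p_ty)

# Assumed toy environment: 18 values of x, two values of y (A, B).
x = np.arange(1, 19)
p_b_given_x = 1.0 / (1.0 + np.exp(-(x - 9.5) / 2.0))
p_xy = np.stack([1.0 - p_b_given_x, p_b_given_x], axis=1) / len(x)

for beta in (0.0, 2.0, 10.0, 100.0):
    encoder = information_bottleneck(p_xy, n_t=18, beta=beta)
    i_tx, i_ty = info_plane_point(p_xy, encoder)
    print(f"beta={beta:6.1f}  I(T;X)={i_tx:.3f} bits  I(T;Y)={i_ty:.3f} bits")
```

Each β gives one point in the (I(T;X), I(T;Y)) plane. At β = 0 the encoder ignores x entirely, and for any β the predictive information I(T;Y) cannot exceed I(X;Y).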

How much of the past does the brain really need?

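Predictive Information: The Capacity of the Future-Past Channel
(with Bialek and Nemenman, 2001)

– Estimate $P_T(W(-), W(+))$: the past-future distribution for windows of duration T

[Figure: a signal W(t) split at t = 0 into a past window W(-) and a future window W(+), each of duration T.]

$$I_{pred}[T] = \left\langle \log \frac{p_T(W_{future} \mid W_{past})}{p_T(W_{future})} \right\rangle_{p_T(W_{past},\, W_{future})}$$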

Entropy of words in a Spin Chain

$$S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k)$$
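
To make the word entropy concrete, here is a minimal sketch that evaluates S(N) exactly by enumerating all 2^N words. It uses a two-state Markov chain as an assumed stand-in for the spin chain; the transition matrix and the function names are hypothetical. It also evaluates the predictive information between two adjacent windows of length T through the identity I(past; future) = 2 S(T) − S(2T).

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a 2-state chain, P[i, j] = p(next = j | current = i)."""
    a, b = P[0, 1], P[1, 0]          # assumes a + b > 0
    return np.array([b, a]) / (a + b)

def word_entropy(P, n):
    """Exact S(N) in bits: entropy over all 2^n words of length n."""
    pi = stationary_distribution(P)
    entropy = 0.0
    for word in itertools.product((0, 1), repeat=n):
        prob = pi[word[0]]
        for prev, nxt in zip(word, word[1:]):
            prob *= P[prev, nxt]
        if prob > 0:
            entropy -= prob * np.log2(prob)
    return entropy

def predictive_information(P, t):
    """I(past; future) in bits for two adjacent windows of length t."""
    return 2 * word_entropy(P, t) - word_entropy(P, 2 * t)

# Hypothetical transition probabilities, chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

for n in range(1, 5):
    print(f"N={n}  S(N)={word_entropy(P, n):.4f} bits  "
          f"I_pred(N)={predictive_information(P, n):.4f} bits")
```

For this chain, whose transition probabilities are fixed and known, S(N) grows linearly in N while the predictive information stays constant: the extensive part of the entropy cancels and only the subextensive part is left.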

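Entropy of spin Chains

• $J_{ij} = J_0\, \delta_{i,j+1}$, with $J_0$ fixed

• $J_{ij} = J\, \delta_{i,j+1}$, where J is taken at random from N(0, 1) every 400000 spins

• $J_{ij}$ is taken at random from $N\left(0, \frac{1}{i-j}\right)$ every 400000 spins

$1 \cdot 10^9$ spins total.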

Entropy is extensive: it shows no distinction between the cases!

Predictive Information – Subextensive Component of the Entropy

shows a qualitative distinction between the cases!

The growth of the subextensive component reflects the underlying complexity!

Logarithmic growth for finite dimensional processes

• Finite parameter processes (e.g. Markov chains)

• Similar to stochastic complexity (MDL)

$$I_{pred}(T) \approx \frac{\dim(\theta)}{2} \log T$$

Power law growth

• Fast growth is a signature of infinite dimensional processes (e.g. speech)

• Power laws appear in cases where the interactions/correlations have long range.

$$I_{pred}(T) \propto T^{\alpha}, \qquad \alpha < 1$$

Efficient predictors: Prediction Suffix Trees

Deep sparse trees do better than full trees [Ron, Singer, Tishby, 1994, 1995]

– Most of the past is irrelevant for the future!

– The “relevant components” can (typically) be extracted efficiently from small samples, much smaller than required for reliable entropy estimation!
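
The idea of predicting from a variable-length suffix of the past can be sketched in a few lines. This is a simplified stand-in, not the learning algorithm of Ron, Singer and Tishby: it counts next symbols after every context up to a maximum depth and predicts from the longest context seen often enough. The names `count_contexts`, `predict_next` and the `min_count` threshold are hypothetical.

```python
from collections import Counter, defaultdict

def count_contexts(sequence, max_depth):
    """Count next-symbol occurrences after every context of length 0..max_depth."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        for depth in range(max_depth + 1):
            if i - depth < 0:
                break
            context = tuple(sequence[i - depth:i])   # the `depth` symbols before i
            counts[context][symbol] += 1
    return counts

def predict_next(counts, history, max_depth, min_count=2):
    """Next-symbol distribution from the longest suffix of `history`
    that was observed at least `min_count` times."""
    for depth in range(min(max_depth, len(history)), -1, -1):
        context = tuple(history[len(history) - depth:])
        seen = counts.get(context)
        if seen and sum(seen.values()) >= min_count:
            total = sum(seen.values())
            return {s: c / total for s, c in seen.items()}
    return {}

training = "abracadabra" * 20
counts = count_contexts(training, max_depth=3)
print(predict_next(counts, "abrac", max_depth=3))
```

A real prediction suffix tree also prunes: a deeper context is kept only when it changes the prediction compared with its shorter suffix, which is what makes the tree sparse.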

But WHAT, in the past, is predictive?

[Figure: the signal W(t) split at t = 0 into a past window W(-) and a future window W(+), each of duration T.]

How much information is needed for valuable behavior?

Bellman meets Shannon

Perception-Action-Cycles © 2009 Naftali Tishby

Richard Ernest Bellman (August 26, 1920 – March 19, 1984)

Claude Elwood Shannon (April 30, 1916 – February 24, 2001)

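Value and Information parallels …

Agent and environment interact at time steps: t = 0, 1, 2, …

Agent observes the TRUE state at step t: $s_t \in S$

produces action at step t: $a_t \in A(s_t)$ with $\pi(a_t \mid s_t)$

gets resulting reward: $r_{t+1}$

resulting next state: $s_{t+1}$ with $p(s_{t+1} \mid s_t, a_t) = P^{a}_{ss'}$

Bellman equation for value $V^{\pi}$:

$$V^{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + V^{\pi}(s') \right]$$

solved for $V^{\pi}(s)$ by DP, given $P^{a}_{ss'}$ and $R^{a}_{ss'}$.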
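
As a sketch of what “solved by DP” means here, the code below evaluates a fixed policy by repeatedly applying the right-hand side of the Bellman equation until V stops changing. The three-state corridor, its rewards and the function name `evaluate_policy` are hypothetical. No discount factor is used, so the example assumes an absorbing, zero-reward goal state.

```python
import numpy as np

def evaluate_policy(P, R, policy, n_sweeps=1000, tol=1e-10):
    """Iterate V(s) = sum_a pi(a|s) sum_s' P[s,a,s'] * (R[s,a,s'] + V(s'))."""
    V = np.zeros(P.shape[0])
    for _ in range(n_sweeps):
        # Q[s, a] = expected reward plus value of the next state
        Q = np.einsum("sat,sat->sa", P, R + V[None, None, :])
        V_new = np.sum(policy * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical corridor: states 0, 1, 2; state 2 is the absorbing goal.
# Action 0 stays in place, action 1 moves one step to the right.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0
R = -np.ones((n_states, n_actions, n_states))  # each step costs 1 ...
R[2] = 0.0                                     # ... except at the goal

always_right = np.array([[0.0, 1.0]] * n_states)
uniform = np.full((n_states, n_actions), 0.5)
print("always right:", evaluate_policy(P, R, always_right))
print("uniform     :", evaluate_policy(P, R, uniform))
```

The deterministic policy reaches the goal in the fewest steps. The uniform policy wastes half of its steps on average, so its value in every non-goal state is twice as negative.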

[Figure: the agent (internal knowledge) and the environment (complex external states), coupled by action/sensing $a_t$ and information gain $I_t$. The agent’s “state of knowledge” on the goal g is a simplex estimation of $p(g \mid s_t)$.]

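Agent has a goal variable $g \in G$

interacts with environment at time steps: t = 0, 1, 2, …

estimates/infers an internal state: $\hat{s}_t \in \hat{S}$, characterized by $p(\hat{s} \mid s)$, $p(g \mid \hat{s})$

produces action at step t: $a_t \in A(s_t)$ with $\pi(a_t \mid \hat{s}_t)$

gets/estimates information gain: $\Delta I^{a}_{ss'}$

resulting world next state: $s_{t+1}$ with $P^{a}_{ss'}$

Bellman equation for I:

$$I(\hat{s}; g) = \sum_a \pi(a \mid \hat{s}) \sum_{s'} P^{a}_{ss'} \left[ \Delta I^{a}_{ss'} + I(\hat{s}'; g) \right]$$

solved for I using DP and probabilistic inference.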


Combining (future) Value and Information

In cases where information is free, we can maximize value irrespective of its information cost.

In general, however, we want to

(1) reduce decision complexity (get home in the simplest way)

(2) maximize the environment information gain (e.g. with the coins)

(3) increase robustness to model fluctuations

All three can be obtained by combining the Information and Value equations.


Trading Value and (future) Information

For the future trajectory $s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \ldots$ the (future) information is

$$I(s_t, a_t) = E_{p(s_{t+1}, a_{t+1}, \ldots \mid s_t, a_t)} \log \frac{p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \ldots \mid s_t, a_t)}{p(s_{t+1})\, \pi(a_{t+1})\, p(s_{t+2})\, \pi(a_{t+2}) \cdots}$$

then:

$$I(s_t, a_t) = E_{p(s_{t+1} \mid s_t, a_t)\, \pi(a_{t+1} \mid s_{t+1})} \left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})} + I(s_{t+1}, a_{t+1}) \right]$$

With:

$$Q(s_t, a_t) = E_{p(s_{t+1} \mid s_t, a_t)} \left[ R^{a_t}_{s_t s_{t+1}} + V(s_{t+1}) \right]$$

We want:

$$\arg\min_{\pi} F(s_t, a_t, \beta) = \arg\min_{\pi} \left[ I(s_t, a_t) - \beta\, Q(s_t, a_t) \right]$$

with

$$F(s_t, a_t, \beta) = E_{p(s_{t+1} \mid s_t, a_t)\, \pi(a_{t+1} \mid s_{t+1})} \left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})} - \beta\, R^{a_t}_{s_t s_{t+1}} + F(s_{t+1}, a_{t+1}, \beta) \right]$$


Information bounded RL

Given the state prior $p(s')$, define the “optimal” (reward as sufficient statistic) transition probabilities

$$q^{a}_{s,s'} := q(s' \mid s, a) = \frac{p(s')}{Z(s, a, \beta)} \exp\left(\beta R^{a}_{s,s'}\right)$$

where $Z(s, a, \beta) = \sum_{s'} p(s') \exp\left(\beta R^{a}_{s,s'}\right)$ is the local partition function.

Then the state-action free energy Bellman equation is

$$F(s_t, a_t, \beta) = E_{p(s_{t+1} \mid s_t, a_t)} \left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{q(s_{t+1} \mid s_t, a_t)} \right] - \log Z(s_t, a_t, \beta) + E_{p(s_{t+1} \mid s_t, a_t)\, \pi(a_{t+1} \mid s_{t+1})} \left[ \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})} + F(s_{t+1}, a_{t+1}, \beta) \right]$$

and the desired optimal policy is (somewhat surprisingly):

$$\pi(a \mid s) = \frac{\pi(a)}{Z(s, \beta)} \exp\left(-F(s, a, \beta)\right)$$

$$Z(s, \beta) = \sum_a \pi(a) \exp\left(-F(s, a, \beta)\right)$$

$$\pi(a) = \sum_s \pi(a \mid s)\, p(s)$$

These 3 equations should be iterated till convergence for every state (like Blahut-Arimoto).
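
The Blahut-Arimoto flavor of this iteration is easiest to see in a one-step special case. The sketch below assumes a single decision with a known reward table, so the state-action free energy reduces to F(s, a) = −β·reward[s, a], and then alternates the policy update with the update of the action marginal. The reward table, the state distribution and the name `bounded_policy` are hypothetical; the full multi-step recursion is not implemented here.

```python
import numpy as np

def bounded_policy(reward, p_s, beta, n_iter=500, tol=1e-12):
    """Alternate pi(a|s) = pi_hat(a) exp(-F(s,a)) / Z(s) with
    pi_hat(a) = sum_s p_s[s] pi(a|s), using F(s,a) = -beta * reward[s,a]."""
    n_states, n_actions = reward.shape
    free_energy = -beta * reward
    pi_hat = np.full(n_actions, 1.0 / n_actions)     # start from a uniform marginal
    for _ in range(n_iter):
        logits = np.log(np.maximum(pi_hat, 1e-300))[None, :] - free_energy
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)          # division by Z(s)
        new_hat = p_s @ pi                           # marginal over actions
        converged = np.max(np.abs(new_hat - pi_hat)) < tol
        pi_hat = new_hat
        if converged:
            break
    return pi, pi_hat

# Hypothetical problem: two states, two actions, action i is rewarded in state i.
reward = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
p_s = np.array([0.5, 0.5])

for beta in (0.0, 1.0, 5.0, 50.0):
    pi, pi_hat = bounded_policy(reward, p_s, beta)
    print(f"beta={beta:5.1f}  pi(a=0|s=0)={pi[0, 0]:.3f}  pi_hat={pi_hat}")
```

At β = 0 the policy ignores the state and costs no decision information; as β grows it approaches the greedy, reward-maximizing policy.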

Biological evidence?

Auditory cortex encodes surprise

(with Eli Nelken and Jonathan Rubin)

The predictive bottleneck

[Figure: information curve, Predictive Power (bits, 0 to 0.2) against Model Complexity (bits, 0 to 2.33), with inset panels for models of decreasing complexity.]

Information curve showing the optimal predictive information (surprise) as a function of the complexity of the internal model (memory bits) for the next-tone prediction of oddball sequences using a memory duration of 5 tones back.

Left: scatter plots of the neural responses to either ‘A’ (blue) or ‘B’ (red) against the surprise values calculated for a specific model. Dots mark the mean response at a given surprise level, and the error bars represent the 25th and 75th percentiles of the data. Right: (1) PSTH for stimulus ‘A’; each row is the averaged PSTH corresponding to a single point in the scatter plot, sorted from low to high surprise level. (2) PSTH for stimulus ‘B’. (3) Correlations for ‘A’ (as explained before). (4) Correlations for ‘B’.

The PSTH plots help to see which part of the signal is correlated with the surprise. For instance, the onset seems pretty constant (and absent in the responses to ‘B’), whereas the sustained part seems to be very correlated with the surprise.

Conclusions

- Prediction complexity is governed by the “predictive information” of the environment, NOT by its complexity (entropy). The predictive information is a tiny (exponentially small) fraction of the full entropy of the environment.

- The brain can extract/learn efficient (good enough) predictors from small samples. No need to capture the full complexity of the world.

- There is accumulating experimental evidence that the brain represents predictive information (surprises).

- This view is in full agreement with the top-down (reverse hierarchy) models of perception and attention.

- Bellman’s “curse of dimensionality” is avoided (not solved) by the brain because the brain’s main task is making predictions, not modeling the world.
