
The dynamics of iterated learning

Tom Griffiths, UC Berkeley

with Mike Kalish, Steve Lewandowsky, Simon Kirby, and Mike Dowman

Learning


Iterated learning (Kirby, 2001)

What are the consequences of learners learning from other learners?

Objects of iterated learning

• Knowledge communicated across generations through provision of data by learners

• Examples:
  – religious concepts
  – social norms
  – myths and legends
  – causal theories
  – language

Language

• The languages spoken by humans are typically viewed as the result of two factors:
  – individual learning
  – innate constraints (biological evolution)

• This limits the possible explanations for different kinds of linguistic phenomena

Linguistic universals

• Human languages possess universal properties
  – e.g., compositionality

(Comrie, 1981; Greenberg, 1963; Hawkins, 1988)

• Traditional explanation:
  – linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965)

Cultural evolution

• Languages are also subject to change via cultural evolution (through iterated learning)

• Alternative explanation:
  – linguistic universals emerge as the result of the fact that language is learned anew by each generation (using general-purpose learning mechanisms)

(e.g., Briscoe, 1998; Kirby, 2001)

Analyzing iterated learning

[Diagram: a chain of learners, each inferring a hypothesis from data via PL(h|d) and producing data for the next learner via PP(d|h)]

PL(h|d): probability of inferring hypothesis h from data d

PP(d|h): probability of generating data d from hypothesis h

Markov chains

[Diagram: a chain of variables x(t) → x(t+1) → …]

• Transition matrix T = P(x(t+1) | x(t))

• Variables x(t+1) independent of history given x(t)

• Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)

Stationary distributions

• Stationary distribution:

  π_i = Σ_j P(x(t+1) = i | x(t) = j) π_j = Σ_j T_ij π_j

• In matrix form: π = Tπ
  – π is the first eigenvector of the matrix T
  – easy to compute for finite Markov chains
  – can sometimes be found analytically
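To make the last two slides concrete, here is a minimal Python sketch (the 3-state transition matrix is a made-up toy, not from the talk) that recovers the stationary distribution as the eigenvector of T with eigenvalue 1:

```python
import numpy as np

# Toy column-stochastic transition matrix: T[i, j] = P(x(t+1) = i | x(t) = j)
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

# The stationary distribution satisfies pi = T pi, so it is the (right)
# eigenvector of T associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()                      # normalize to a probability distribution

print(pi)                           # ≈ [0.62, 0.24, 0.14] for this toy matrix
```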

Analyzing iterated learning

[Diagram: iterated learning chain d0 → h1 → d1 → h2 → d2 → h3 → …, with PL(h|d) mapping data to hypotheses and PP(d|h) mapping hypotheses to data]

A Markov chain on hypotheses
  h1 → h2 → h3 → …  (each transition combines PP(d|h) and PL(h|d), summing over the intervening d)

A Markov chain on data
  d0 → d1 → d2 → …  (each transition combines PL(h|d) and PP(d|h), summing over the intervening h)

Bayesian inference

[Portrait: Reverend Thomas Bayes]

• Rational procedure for updating beliefs

• Foundation of many learning algorithms

• Lets us make the inductive biases of learners precise

Bayes’ theorem

P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)

  (posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses)

h: hypothesis
d: data
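A one-line numerical illustration (the two hypotheses and their numbers are invented for this sketch):

```python
import numpy as np

prior = np.array([0.7, 0.3])        # P(h) for two hypothetical hypotheses
likelihood = np.array([0.1, 0.6])   # P(d | h): how well each predicts the observed data

# Bayes' theorem: posterior ∝ likelihood × prior, normalized over the hypothesis space
posterior = likelihood * prior / np.sum(likelihood * prior)

print(posterior)                    # ≈ [0.28, 0.72]: the data overturn the prior
```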

Iterated Bayesian learning

[Diagram: iterated learning chain, with learning step PL(h|d) and production step PP(d|h)]

Assume learners sample from their posterior distribution:

PL(h | d) = PP(d | h) P(h) / Σ_{h′ ∈ H} PP(d | h′) P(h′)

Stationary distributions

• Markov chain on h converges to the prior, P(h)

• Markov chain on d converges to the “prior predictive distribution”

  P(d) = Σ_h P(d | h) P(h)

(Griffiths & Kalish, 2005)
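The convergence result can be checked with a small simulation. The sketch below (toy numbers of my own, not from the paper) runs a long chain of Bayesian learners who sample from their posterior; the fraction of time the chain spends on each hypothesis approaches the prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 hypotheses ("languages"), 4 possible data points.
prior = np.array([0.6, 0.3, 0.1])             # P(h)
production = np.array([[0.7, 0.1, 0.1, 0.1],  # P(d | h), one row per hypothesis
                       [0.1, 0.7, 0.1, 0.1],
                       [0.1, 0.1, 0.4, 0.4]])

def learn(d):
    """Sample a hypothesis from the posterior PL(h | d) ∝ PP(d | h) P(h)."""
    post = production[:, d] * prior
    return rng.choice(len(prior), p=post / post.sum())

def produce(h):
    """Sample data from PP(d | h)."""
    return rng.choice(production.shape[1], p=production[h])

h, counts = 0, np.zeros(len(prior))
for _ in range(200_000):
    d = produce(h)      # the current learner generates data...
    h = learn(d)        # ...and the next learner infers a hypothesis from it
    counts[h] += 1

print(counts / counts.sum())   # ≈ [0.6, 0.3, 0.1], i.e., the prior
```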

Explaining convergence to the prior


• Intuitively: data acts once, prior many times

• Formally: iterated learning with Bayesian agents is a Gibbs sampler for P(d,h)

(Griffiths & Kalish, submitted)

Gibbs sampling

For variables x = x1, x2, …, xn

Draw xi(t+1) from P(xi|x-i)

  where x-i = x1(t+1), x2(t+1), …, xi-1(t+1), xi+1(t), …, xn(t)

Converges to P(x1, x2, …, xn)

(a.k.a. the heat bath algorithm in statistical physics)

(Geman & Geman, 1984)
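For readers who have not met the algorithm, here is a standard two-variable sketch (a bivariate normal with correlation 0.8; the example is generic, not from the talk), alternating draws from each conditional distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

rho = 0.8                     # target: standard bivariate normal with correlation rho
x1, x2 = 0.0, 0.0
samples = np.empty((50_000, 2))

for t in range(len(samples)):
    # Each conditional of the target is itself normal:
    # x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print(np.corrcoef(samples.T)[0, 1])   # ≈ 0.8: the chain has converged to the target
```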

Gibbs sampling

[Figure illustrating Gibbs sampling (MacKay, 2003)]

Explaining convergence to the prior


When the target distribution is P(d, h) = PP(d | h) P(h), its conditional distributions are exactly PL(h | d) and PP(d | h); alternating the learning and production steps is therefore a Gibbs sampler for this joint, whose marginals are the prior P(h) and the prior predictive distribution P(d)

Implications for linguistic universals

• When learners sample from P(h|d), the distribution over languages converges to the prior
  – identifies a one-to-one correspondence between inductive biases and linguistic universals

Iterated Bayesian learning

Assume learners choose a hypothesis with probability proportional to their posterior raised to a power r:

PL(h | d) ∝ [ PP(d | h) P(h) / Σ_{h′ ∈ H} PP(d | h′) P(h′) ]^r

(r = 1: sampling from the posterior;  r = 2;  r = ∞: maximizing, i.e., choosing the most probable hypothesis)

From sampling to maximizing

• General analytic results are hard to obtain
  – (r = ∞ is a stochastic version of the EM algorithm)

• For certain classes of languages, it is possible to show that the stationary distribution gives each hypothesis h probability proportional to P(h)^r
  – the ordering identified by the prior is preserved, but not the corresponding probabilities

(Kirby, Dowman, & Griffiths, submitted)
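A rough simulation sketch of this effect, reusing the toy setup from the earlier convergence example (my own numbers; the exact P(h)^r result holds only for the special cases mentioned above, but the exaggeration of the prior is visible even here):

```python
import numpy as np

rng = np.random.default_rng(2)

prior = np.array([0.6, 0.3, 0.1])             # same toy P(h) as before
production = np.array([[0.7, 0.1, 0.1, 0.1],  # same toy P(d | h)
                       [0.1, 0.7, 0.1, 0.1],
                       [0.1, 0.1, 0.4, 0.4]])

def stationary(r, steps=200_000):
    """Fraction of time the chain spends on each h when learners use posterior ** r."""
    h, counts = 0, np.zeros(len(prior))
    for _ in range(steps):
        d = rng.choice(production.shape[1], p=production[h])
        post = (production[:, d] * prior) ** r
        h = rng.choice(len(prior), p=post / post.sum())
        counts[h] += 1
    return counts / counts.sum()

# r = 1 recovers the prior; larger r pushes more probability onto the
# hypotheses the prior already favors (weak biases -> strong universals).
for r in (1, 2, 10):
    print(r, np.round(stationary(r), 2))
```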


Implications for linguistic universals

• When learners sample from P(h|d), the distribution over languages converges to the prior
  – identifies a one-to-one correspondence between inductive biases and linguistic universals

• As learners move towards maximizing, the influence of the prior is exaggerated
  – weak biases can produce strong universals
  – cultural evolution is a viable alternative to traditional explanations for linguistic universals

Analyzing iterated learning

• The outcome of iterated learning is determined by the inductive biases of the learners
  – hypotheses with high prior probability ultimately appear with high probability in the population

• Clarifies the connection between constraints on language learning and linguistic universals…

• …and provides formal justification for the idea that culture reflects the structure of the mind

Conclusions

• Iterated learning provides a lens for magnifying the inductive biases of learners
  – small effects for individuals are big effects for groups

• When cognition affects culture, studying groups can give us better insight into individuals
