Part III: Hierarchical Bayesian Models
[Figure: the language hierarchy]
Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
Grammar: S → NP VP; VP → VP NP; VP → Verb; NP → Det Adj Noun RelClause; RelClause → Rel NP V
Phrase structure
Utterance
Speech signal

[Figure: the parallel hierarchy for word learning]
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
Structure
Data
Hierarchical Bayesian models
• Can represent and reason about knowledge at multiple levels of abstraction.
• Have been used by statisticians for many years.
• Have been applied to many cognitive problems:
  – causal reasoning (Mansinghka et al, 06)
  – language (Chater and Manning, 06)
  – vision (Fei-Fei, Fergus, Perona, 03)
  – word learning (Kemp, Perfors, Tenenbaum, 06)
  – decision making (Lee, 06)
[Figure: the language hierarchy again, now annotated with conditional distributions]
Universal Grammar
P(grammar | UG)
Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
P(phrase structure | grammar)
Phrase structure
P(utterance | phrase structure)
Utterance
P(speech | utterance)
Speech signal
Hierarchical Bayesian model

[Figure: U (Universal Grammar) → G (Grammar) → s1 … s6 (phrase structures) → u1 … u6 (utterances), with links P(G|U), P(s|G), P(u|s)]
Hierarchical Bayesian model

[Figure: U → G → s1 … s6 → u1 … u6, with links P(G|U), P(s|G), P(u|s)]

A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:

P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G|U)
Knowledge at multiple levels
1. Top-down inferences: how does abstract knowledge guide inferences at lower levels?
2. Bottom-up inferences: how can abstract knowledge be acquired?
3. Simultaneous learning at multiple levels of abstraction.
Top-down inferences

Given grammar G and a collection of utterances, construct a phrase structure for each utterance.

Infer {si} given {ui} and G:

P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)
Bottom-up inferences

Given a collection of phrase structures, learn a grammar G.

Infer G given {si} and U:

P(G | {si}, U) ∝ P({si} | G) P(G | U)
Simultaneous learning at multiple levels

Given a set of utterances {ui} and innate knowledge U, construct a grammar G and a phrase structure for each utterance.

• A chicken-or-egg problem:
  – Given a grammar, phrase structures can be constructed.
  – Given a set of phrase structures, a grammar can be learned.

Infer G and {si} given {ui} and U:

P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
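These inference patterns can be sketched concretely by brute-force enumeration over a toy discrete hierarchy. Everything below (two grammars, two phrase-structure types, two utterance types, and all probability tables) is an illustrative stand-in, not part of the original model:

```python
from itertools import product

# Illustrative toy tables (hypothetical numbers, for demonstration only).
P_G = {"G1": 0.7, "G2": 0.3}                      # P(G | U)
P_s = {"G1": {"s_a": 0.8, "s_b": 0.2},
       "G2": {"s_a": 0.4, "s_b": 0.6}}            # P(s | G)
P_u = {"s_a": {"u_x": 0.9, "u_y": 0.1},
       "s_b": {"u_x": 0.5, "u_y": 0.5}}           # P(u | s)

def posterior(us):
    """P(G, {s_i} | {u_i}, U), by scoring every (G, {s_i}) and normalizing."""
    scores = {}
    for G in P_G:
        for ss in product(P_s[G], repeat=len(us)):
            p = P_G[G]                            # P(G | U)
            for u, s in zip(us, ss):
                p *= P_u[s][u] * P_s[G][s]        # P(u | s) P(s | G)
            scores[(G, ss)] = p
    Z = sum(scores.values())                      # P({u_i} | U)
    return {k: v / Z for k, v in scores.items()}

post = posterior(["u_x", "u_x"])
G_best, ss_best = max(post, key=post.get)         # jointly best grammar + parses
```

Marginalizing the same table over {si} gives the bottom-up posterior P(G | {ui}, U), and fixing G gives the top-down posterior over parses.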
Knowledge at multiple levels (recap)
1. Top-down inferences: how does abstract knowledge guide inferences at lower levels?
2. Bottom-up inferences: how can abstract knowledge be acquired?
3. Simultaneous learning at multiple levels of abstraction.
Folk Biology

R: principles
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data

The relationships between living kinds are well described by tree-structured representations.

"Gorillas have hands"
Folk Biology

R: principles (structural form: tree)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data:

          F1 F2 F3 F4
mouse      ●  ○  ○  ●
squirrel   ●  ○  ○  ?
chimp      ○  ●  ●  ?
gorilla    ○  ●  ●  ?
Outline
• A high-level view of HBMs
• A case study: Semantic knowledge
  – Property induction
  – Learning structured representations
  – Learning the abstract organizing principles of a domain
Property induction

R: principles (structural form: tree)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data:

          F1 F2 F3 F4
mouse      ●  ○  ○  ●
squirrel   ●  ○  ○  ?
chimp      ○  ●  ●  ?
gorilla    ○  ●  ●  ?
Property Induction

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data:

mouse    ●
squirrel ?
chimp    ?
gorilla  ?

Approach: work with the distribution P(D|S,R)
Property Induction

Horses have T4 cells. Elephants have T4 cells.
Therefore, all mammals have T4 cells.

Horses have T4 cells. Seals have T4 cells.
Therefore, all mammals have T4 cells.

Previous approaches: Rips (75), Osherson et al (90), Sloman (93), Heit (98)
Bayesian Property Induction

[Figure: hypotheses as candidate extensions of the property over ten species (horse, cow, chimp, gorilla, mouse, squirrel, dolphin, seal, rhino, elephant), each weighted by its prior (e.g., 0.04, 0.01, 0.02, 0.001, 0.04)]

Observed data D: Horses have T4 cells. Elephants have T4 cells.
Conclusion C: Cows have T4 cells.
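The computation this figure describes is hypothesis averaging: the probability of the conclusion C given data D is the total prior weight of hypotheses consistent with D in which C also holds, divided by the total weight of all hypotheses consistent with D. A minimal sketch; the hypotheses and weights below are hypothetical placeholders, not the slide's actual prior:

```python
species = ["horse", "cow", "chimp", "gorilla", "mouse",
           "squirrel", "dolphin", "seal", "rhino", "elephant"]

# Each hypothesis: the set of species that have the property, paired with
# an (unnormalized) prior weight. Hypothetical values for illustration.
hypotheses = [
    ({"horse", "cow", "rhino", "elephant"}, 0.04),
    ({"horse", "elephant"},                 0.01),
    ({"chimp", "gorilla"},                  0.02),
    ({"dolphin", "seal"},                   0.001),
    (set(species),                          0.04),   # "all mammals"
]

def p_conclusion(observed, conclusion):
    """P(conclusion | observed) by averaging over consistent hypotheses."""
    consistent = sum(w for h, w in hypotheses if observed <= h)
    supporting = sum(w for h, w in hypotheses
                     if observed <= h and conclusion in h)
    return supporting / consistent

# D: horses and elephants have T4 cells;  C: cows have T4 cells.
p = p_conclusion({"horse", "elephant"}, "cow")
```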
Choosing a prior

Chimps have T4 cells. Therefore, gorillas have T4 cells. (taxonomic similarity)

Poodles can bite through wire. Therefore, Dobermans can bite through wire. (jaw strength)

Salmon carry E. Spirus bacteria. Therefore, grizzly bears carry E. Spirus bacteria. (food web relations)
Bayesian Property Induction
• A challenge: we have to specify the prior, which typically includes many numbers.
• An opportunity: the prior can capture knowledge about the problem.
Property Induction

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (mouse ●; squirrel, chimp, gorilla ?)
Biological properties
• Structure: living kinds are organized into a tree.
• Stochastic process: nearby species in the tree tend to share properties.

[Figure: example properties over a tree, one smooth and one not smooth]

Stochastic process
• Nearby species in the tree tend to share properties.
• In other words, properties tend to be smooth over the tree.
Stochastic process

[Figure: the weighted hypotheses again; the prior weights on hypotheses are what the stochastic process supplies]
Generating a property

Sample a continuous y that tends to be smooth over the tree, then threshold it:

Horse     0.5  ●
Cow       0.4  ●
Chimp    -0.7  ○
Gorilla  -0.1  ○
Mouse    -0.4  ○
Squirrel -0.4  ○
Dolphin  -1.5  ○
Seal     -1.5  ○
Rhino    -0.1  ○
Elephant -0.1  ○

p(y|S,R): generating a property (the diffusion process)

Let yi be the feature value at node i. Draw y from a Gaussian whose covariance K encourages y to be smooth over the graph S, then set each observed feature to Θ(yi), where Θ(yi) is 1 if yi ≥ 0 and 0 otherwise. (Zhu, Lafferty, Ghahramani 03)
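Following Zhu, Lafferty, and Ghahramani (2003), one concrete choice is to build K from the graph Laplacian, which penalizes y for varying across edges. A sketch under stated assumptions (the six-node tree and the σ parameter below are illustrative):

```python
import numpy as np

# Leaves 0-3 = mouse, squirrel, chimp, gorilla; nodes 4 and 5 are internal
# nodes of a small illustrative tree (mouse/squirrel and chimp/gorilla clades).
n = 6
edges = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 5)]

# Graph Laplacian L = D - A: the quadratic form y'Ly sums squared
# differences across edges, so small y'Ly means y is smooth over S.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

sigma = 1.0                                   # free smoothness parameter
K = np.linalg.inv(L + np.eye(n) / sigma**2)   # covariance over tree nodes

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), K)   # smooth continuous values
f = (y >= 0).astype(int)                      # thresholded binary property
```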
Biological properties

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (mouse ●; squirrel, chimp, gorilla ?)

Approach: work with the distribution P(D|S,R)
[Figure: the weighted hypotheses, with observed data D (horses and elephants have T4 cells) and conclusion C (cows have T4 cells)]
Results

Dolphins have property P. Seals have property P. Therefore, horses have property P. (Osherson et al)
Cows have property P. Elephants have property P. Therefore, horses have property P.

[Figure: model predictions vs. human judgments]

Results

Cows, elephants, and horses have property P. Therefore, all mammals have property P.
Gorillas, mice, and seals have property P. Therefore, all mammals have property P.

[Figure: model predictions vs. human judgments]
Spatial model

R: principles (structural form: 2D space; stochastic process: diffusion)
S: structure (mouse, squirrel, chimp, gorilla in a 2D space)
D: data (mouse ●; squirrel, chimp, gorilla ?)
Biological Properties

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (mouse ●; squirrel, chimp, gorilla ?)
Three inductive contexts

R: tree + diffusion process ("has T4 cells"); chain + drift process ("can bite through wire"); network + causal transmission ("carries E. Spirus bacteria")
S: [Figure: three example structures over classes A–G: a tree, a chain, and a directed network]
Threshold properties
• "can bite through wire"
• "has skin that is more resistant to penetration than most synthetic fibers"

[Figure: animals ordered along a single dimension — Hippo, Cat, Lion, Camel, Elephant; Poodle, Collie, Doberman] (Osherson et al; Blok et al)

Threshold properties
• Structure: the categories can be organized along a single dimension.
• Stochastic process: categories towards one end of the dimension are more likely to have the novel property.

Results: "has skin that is more resistant to penetration than most synthetic fibers" (Blok et al, Smith et al)

[Figure: model fits for 1D + drift vs. 1D + diffusion]
Three inductive contexts (recap): tree + diffusion ("has T4 cells"); chain + drift ("can bite through wire"); network + causal transmission ("carries E. Spirus bacteria")
Causally transmitted properties (Medin et al; Shafto and Coley)

Salmon carry E. Spirus bacteria. Therefore, grizzly bears carry E. Spirus bacteria.

[Figure: food web link from salmon to grizzly bear]

Causally transmitted properties
• Structure: the categories can be organized into a directed network.
• Stochastic process: properties are generated by a noisy transmission process.
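A noisy-transmission process of this kind can be sketched as: each species acquires the property on its own with a small background probability, and a property carried by a prey species passes to each of its predators with some transmission probability. The two-species web and both rates below are illustrative assumptions:

```python
import random

# Directed food web: web[a] lists the predators that eat species a.
# Illustrative fragment of a larger web.
web = {"salmon": ["grizzly bear"], "grizzly bear": []}
feeding_order = ["salmon", "grizzly bear"]   # prey before predators
background = 0.1    # chance of acquiring the property spontaneously
transmit = 0.8      # chance the property passes along a food-web link

def sample_property(rng):
    has = {s: rng.random() < background for s in web}
    for prey in feeding_order:               # one roll per food-web link
        if has[prey]:
            for predator in web[prey]:
                if rng.random() < transmit:
                    has[predator] = True
    return has

rng = random.Random(0)
samples = [sample_property(rng) for _ in range(20000)]
with_salmon = [s for s in samples if s["salmon"]]
p_bear_given_salmon = (sum(s["grizzly bear"] for s in with_salmon)
                       / len(with_salmon))
# Analytically: 1 - (1 - background)(1 - transmit) = 0.82 here, well
# above the base rate, matching the food-web induction pattern.
```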
Three inductive contexts (recap): tree + diffusion ("has T4 cells"); chain + drift ("can bite through wire"); network + causal transmission ("carries E. Spirus bacteria")
Property Induction

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data (mouse ●; squirrel, chimp, gorilla ?)

Approach: work with the distribution P(D|S,R)
Conclusions: property induction
• Hierarchical Bayesian models help to explain how abstract knowledge can be used for induction.
Outline
• A high-level view of HBMs
• A case study: Semantic knowledge
  – Property induction
  – Learning structured representations
  – Learning the abstract organizing principles of a domain
Structure learning

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data:

          F1 F2 F3 F4
mouse      ●  ○  ○  ●
squirrel   ●  ○  ○  ?
chimp      ○  ●  ●  ?
gorilla    ○  ●  ●  ?
Structure learning

R: principles (structural form: tree; stochastic process: diffusion)
S: structure — unknown (?)
D: data (the feature matrix)

Goal: find S that maximizes P(S|D,R)
Structure learning

Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R)
Structure learning

Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R)

Here P(D|S,R) is the distribution previously used for property induction: features are generated over the tree by the diffusion process.
Structure learning

Goal: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R); the remaining ingredient is the prior P(S|R).
P(S|R): Generating structures

[Figure: a tree over mouse, squirrel, chimp, gorilla (consistent with R) vs. a graph with cycles (inconsistent with R)]

[Figure: a tree with many internal nodes (complex) vs. a tree with few nodes (simple)]
P(S|R): Generating structures

P(S|R) = 0 if S is inconsistent with R; otherwise each structure is weighted by the number of nodes it contains:

P(S|R) ∝ θ^|S|

where |S| is the number of nodes in S and θ < 1, so structures with fewer nodes are preferred.
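Writing the node-count weighting as P(S|R) ∝ θ^|S| with θ < 1 (the θ value and the candidate structures below are illustrative), the prior over a handful of candidates can be computed directly:

```python
theta = 0.5   # illustrative; any theta < 1 penalizes extra nodes

# Candidates summarized only by consistency with R (here: being a tree)
# and node count. Hypothetical examples.
candidates = {
    "tree, 4 leaves":              {"tree": True,  "nodes": 4},
    "tree, 4 leaves + 2 internal": {"tree": True,  "nodes": 6},
    "graph with a cycle":          {"tree": False, "nodes": 5},
}

weights = {name: (theta ** c["nodes"] if c["tree"] else 0.0)
           for name, c in candidates.items()}
Z = sum(weights.values())
prior = {name: w / Z for name, w in weights.items()}
# Inconsistent structures get probability 0; among consistent ones,
# fewer nodes means higher prior.
```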
Structure Learning

Aim: find S that maximizes P(S|D,R) ∝ P(D|S,R) P(S|R)

P(S|D,R) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).
Structure learning example (Osherson et al)
• Participants rated the goodness of 85 features for 48 animals.
• E.g., elephant: gray, hairless, toughskin, big, bulbous, longleg, tail, chewteeth, tusks, smelly, walks, slow, strong, muscle, quadrapedal, inactive, vegetation, grazer, oldworld, bush, jungle, ground, timid, smart, group.
Spatial model

R: principles (structural form: 2D space; stochastic process: diffusion)
S: structure (mouse, squirrel, chimp, gorilla in a 2D space)
D: data (the feature matrix F1–F4)
Conclusions: structure learning
• Hierarchical Bayesian models provide a unified framework for the acquisition and use of structured representations
Outline
• A high-level view of HBMs
• A case study: Semantic knowledge
  – Property induction
  – Learning structured representations
  – Learning the abstract organizing principles of a domain
Learning structural form

R: principles (structural form: tree; stochastic process: diffusion)
S: structure (a tree over mouse, squirrel, chimp, gorilla)
D: data:

          F1 F2 F3 F4
mouse      ●  ○  ○  ●
squirrel   ●  ○  ○  ?
chimp      ○  ●  ●  ?
gorilla    ○  ●  ●  ?
Learning structural form

R: principles (structural form: F, which could be a tree, 2D space, ring, …; stochastic process: diffusion)
S: structure — unknown (?)
D: data (the feature matrix)

Goal: find S, F that maximize P(S,F|D)
Learning structural form

Aim: find S, F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F)

Here P(D|S) is the distribution used for property induction, P(S|F) is the distribution used for structure learning, and P(F) is a uniform distribution on the set of forms.
P(S|F): Generating structures from forms

P(S|F) = 0 if S is inconsistent with F; otherwise each structure is weighted by the number of nodes it contains:

P(S|F) ∝ θ^|S|

where |S| is the number of nodes in S.

• Simpler forms are preferred: a form that can generate only a few structures (e.g., a chain) concentrates P(S|F) on those structures, while a flexible form (e.g., a grid) spreads it over many.

[Figure: P(S|F) over all possible graph structures S, for the chain form vs. the grid form]
Learning structural form

F: form — unknown (?)
S: structure — unknown (?)
D: data (the feature matrix)

Goal: find S, F that maximize P(S,F|D)
Learning structural form

Aim: find S, F that maximize P(S,F|D) ∝ P(D|S) P(S|F) P(F)

P(S,F|D) will be high when:
– The features in D vary smoothly over S.
– S is a simple graph (a graph with few nodes).
– F is a simple form (a form that can generate only a few structures).
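Putting the three factors together, form selection can be sketched by scoring, for each form, its best structure. The per-form data fits and node counts below are hypothetical placeholders standing in for real P(D|S) computations over a feature matrix:

```python
import math

theta = 0.5                              # node-count penalty, as before
log_p_form = math.log(1.0 / 3.0)         # uniform P(F) over three forms

# Hypothetical best structure per form: log P(D|S) and node count.
forms = {
    "tree":     {"log_p_data": -40.0, "nodes": 12},
    "chain":    {"log_p_data": -55.0, "nodes": 8},
    "2D space": {"log_p_data": -48.0, "nodes": 8},
}

def log_score(name):
    """log [ P(D|S) P(S|F) P(F) ] for the best S under a given form."""
    c = forms[name]
    return c["log_p_data"] + c["nodes"] * math.log(theta) + log_p_form

best_form = max(forms, key=log_score)
# With these placeholder numbers, the tree's better data fit outweighs
# its larger structure.
```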
Outline
• A high-level view of HBMs
• A case study: Semantic knowledge
  – Property induction
  – Learning structured representations
  – Learning the abstract organizing principles of a domain
[Figure: the full hierarchy for folk biology — structural form: tree; stochastic process: diffusion; structure: a tree over mouse, squirrel, chimp, gorilla; data: the feature matrix F1–F4]
When can we stop adding levels?
• When the knowledge at the top level is simple or general enough that it can be plausibly assumed to be innate.
Conclusions
• Hierarchical Bayesian models provide a unified framework which can:
  – Explain how abstract knowledge is used for induction.
  – Explain how abstract knowledge can be acquired.