Probabilistic Context-Free Grammars
Context free grammars – a reminder
A CFG G consists of:
- A set of terminals {w_k}, k = 1, …, V
- A set of nonterminals {N_i}, i = 1, …, n
- A designated start symbol, N_1
- A set of rules, {N_i → π_j} (where π_j is a sequence of terminals and nonterminals)
A very simple example
G's rewrite rules:
- S → aSb
- S → ab

Possible derivations:
- S → aSb → aabb
- S → aSb → aaSbb → aaabbb

In general, G generates the language aⁿbⁿ.
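As a quick illustration (not from the original slides), a few lines of Python can sample derivations from these two rules; the function name and the depth cutoff are my own:

```python
import random

# Rules of G: S -> aSb | ab
RULES = {"S": [["a", "S", "b"], ["a", "b"]]}

def derive(symbol="S", max_depth=5):
    """Expand a nonterminal by repeatedly applying a randomly chosen rule."""
    if symbol not in RULES:          # terminal symbol: emit it as-is
        return [symbol]
    # Force the terminating rule S -> ab once the depth budget is spent
    rule = RULES[symbol][1] if max_depth == 0 else random.choice(RULES[symbol])
    out = []
    for sym in rule:
        out.extend(derive(sym, max_depth - 1))
    return out

print("".join(derive()))  # e.g. 'aaabbb' -- always of the form a^n b^n
```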
Modeling natural language
G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man | boy | dog
- VP → V NP
- V → saw | heard | sensed | sniffed
Recursion can be included
G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man CP | boy CP | dog CP
- VP → V NP
- V → saw | heard | sensed | sniffed
- CP → that VP | ε
Probabilistic Context Free Grammars
A PCFG G consists of:
- A set of terminals {w_k}, k = 1, …, V
- A set of nonterminals {N_i}, i = 1, …, n
- A designated start symbol, N_1
- A set of rules, {N_i → π_j} (where π_j is a sequence of terminals and nonterminals)
- A corresponding set of probabilities on rules, such that for every i: Σ_j P(N_i → π_j) = 1
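For concreteness, here is one way the toy grammar above could be written down with rule probabilities; the numbers are illustrative assumptions, chosen only so that each nonterminal's rules sum to 1:

```python
# A sketch of a PCFG: each nonterminal maps to (expansion, probability) pairs.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.6), (("a", "N"), 0.4)],
    "N":  [(("man",), 0.4), (("boy",), 0.4), (("dog",), 0.2)],
    "VP": [(("V", "NP"), 1.0)],
    "V":  [(("saw",), 0.4), (("heard",), 0.3), (("sensed",), 0.2), (("sniffed",), 0.1)],
}

# Sanity check: the rule probabilities for each nonterminal sum to 1
for nt, rules in PCFG.items():
    assert abs(sum(p for _, p in rules) - 1.0) < 1e-9, nt
```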
Training PCFGs
Given a corpus, it is possible to estimate rule probabilities that maximize its likelihood. This is regarded as a form of 'grammar induction'; however, the rules of the grammar must be given in advance.
Questions for PCFGs

- What is the probability of a sentence w_{1m} given a grammar G, i.e. P(w_{1m} | G)? Calculated using dynamic programming.
- What is the most likely parse for a given sentence, argmax_t P(t | w_{1m}, G)? Likewise calculated using dynamic programming.
- How can we choose rule probabilities for the grammar G that maximize the probability of a given corpus? The inside-outside algorithm.
Chomsky Normal Form
We will deal only with PCFGs in Chomsky Normal Form. This means there are exactly two types of rules:
- N_i → N_j N_k
- N_i → w_j
Estimating string probability

Define 'inside probabilities':

    β_j(p, q) = P(w_{pq} | N^j_{pq}, G)

We would like to calculate

    P(w_{1m} | G) = P(w_{1m} | N^1_{1m}, G) = β_1(1, m)

A dynamic programming algorithm. Base step:

    β_j(k, k) = P(w_k | N^j_{kk}, G) = P(N^j → w_k | G)

Induction step:

    β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q)
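A sketch of the inside algorithm as code, assuming a CNF grammar stored in two hypothetical dictionaries `unary` and `binary`; indices are 0-based where the slides' notation is 1-based:

```python
from collections import defaultdict

def inside_probability(words, unary, binary, start="S"):
    """Inside algorithm for a CNF PCFG.

    unary:  dict mapping (A, word)  -> P(A -> word)
    binary: dict mapping (A, B, C)  -> P(A -> B C)
    Returns P(words | G) = beta_start(1, m) in the slides' notation.
    """
    m = len(words)
    beta = defaultdict(float)  # beta[(A, p, q)] = P(w_pq | A_pq, G)

    # Base step: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words):
        for (A, word), prob in unary.items():
            if word == w:
                beta[(A, k, k)] = prob

    # Induction: beta_j(p, q) = sum over rules A -> B C and split points d of
    #            P(A -> B C) * beta_B(p, d) * beta_C(d+1, q)
    for span in range(2, m + 1):
        for p in range(m - span + 1):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for d in range(p, q):
                    beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]

    return beta[(start, 0, m - 1)]

# A tiny CNF version of the a^n b^n grammar: S -> A X | A B, X -> S B, A -> a, B -> b
unary = {("A", "a"): 1.0, ("B", "b"): 1.0}
binary = {("S", "A", "X"): 0.5, ("S", "A", "B"): 0.5, ("X", "S", "B"): 1.0}
print(inside_probability(list("aabb"), unary, binary))  # 0.25
```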
Drawbacks of PCFGs

- Do not factor in lexical co-occurrence.
- Rewrite rules must be given in advance, according to human intuitions (the ATIS-CFG fiasco).
- The capacity of a PCFG to determine the most likely parse is very limited: as grammars grow larger, they become increasingly ambiguous.
- The following sentences look the same to a PCFG, although they suggest different parses:
  - I saw the boat with the telescope
  - I saw the man with the scar
PCFGs – some more drawbacks

- They have some inappropriate biases: in general, the probability of a smaller tree will be larger than that of a larger one, even though the most frequent length for Wall Street Journal sentences is around 23 words.
- Training is slow and problematic, and converges only to a local optimum.
- Non-terminals do not always resemble true syntactic classes.
PCFGs and language models

Because they ignore lexical co-occurrence, PCFGs are not good language models. However, some work has been done on combining PCFGs with n-gram models: the PCFG modeled long-range syntactic constraints, and performance generally improved.
Is natural language a CFG?
There is an ongoing debate about whether English is context-free. Some languages can be shown to be more complex than context-free. For example, Dutch:
Dutch oddities
Dat Jan Marie Pieter Arabisch laat zien schrijven
THAT JAN MARIE PIETER ARABIC LET SEE WRITE
"that Jan let Marie see Pieter write Arabic"

However, from a purely syntactic viewpoint, this is just: dat PⁿVⁿ
Other languages
Bambara (a language of Mali) has non-context-free features, of the form AⁿBᵐCⁿDᵐ. So does Swiss German. However, CFGs seem to be a good approximation for most phenomena in most languages.
Previous work
- Probabilistic Context Free Grammars
- 'Supervised' induction methods
- Little work on raw data; mostly work on artificial CFGs
- Clustering
Our goal
Given a corpus of raw text separated into sentences, we want to derive a specification of the underlying grammar. This means we want to be able to:
- Create new unseen grammatically correct sentences
- Accept new unseen grammatically correct sentences and reject ungrammatical ones
What do we need to do?
G is given by the rewrite rules:
- S → NP VP
- NP → the N | a N
- N → man | boy | dog
- VP → V NP
- V → saw | heard | sensed | sniffed
ADIOS in outline
Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

We will consider each of these in turn.
Is that a dog?
[Figure: a directed pseudograph; the vertices are words (BEGIN, is, that, a, the, and, where, dog, cat, horse, ?, END), and each of the sentences "Is that a dog?", "Is that a cat?", "Where is the dog?", "And is that a horse?" is a path through the graph.]
The Model: Graph representation with words as vertices and sentences as paths.
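A minimal sketch of this representation, using networkx as the graph library (my choice; the slides do not specify an implementation):

```python
import networkx as nx  # assumption: networkx stands in for the pseudograph

def build_graph(sentences):
    """Words become vertices; each sentence becomes a path from BEGIN to END."""
    g = nx.MultiDiGraph()
    for i, sentence in enumerate(sentences):
        path = ["BEGIN"] + sentence.lower().rstrip("?.!").split() + ["END"]
        for a, b in zip(path, path[1:]):
            g.add_edge(a, b, sentence=i)  # remember which sentence used this edge
    return g

g = build_graph(["Is that a dog?", "Is that a cat?",
                 "Where is the dog?", "And is that a horse?"])
```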
ADIOS in outline
Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

Next: the segmentation criterion, MEX.
Toy problem – Alice in Wonderland
The input is the raw character stream of the text, with spaces and punctuation removed, so that each letter is a vertex:

a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n
Detecting significant patterns
Identifying patterns becomes easier on a graph: sub-paths are automatically aligned.
[Figure: motif extraction along a search path begin → e1 → e2 → e3 → e4 → e5 → end. The rightward probabilities are P_R(e1) = 4/41, P_R(e2|e1) = 3/4, P_R(e3|e1 e2) = 1, P_R(e4|e1 e2 e3) = 1, P_R(e5|e1 e2 e3 e4) = 1/3; the leftward probabilities are P_L(e4) = 6/41, P_L(e3|e4) = 5/6, P_L(e2|e3 e4) = 1, P_L(e1|e2 e3 e4) = 3/5. The sharp drops in P_R and P_L (marked S_R and S_L) flank the significant pattern.]
Motif EXtraction (MEX)
The Markov Matrix
The top-right triangle of the matrix holds the P_L probabilities; the bottom-left triangle holds the P_R probabilities. The matrix is path-dependent.
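The slides' examples (e.g. P_R(e2|e1) = 3/4) suggest these probabilities are ratios of sub-path counts; here is a sketch under that reading, with helper names of my own:

```python
from collections import Counter

def subpath_counts(paths):
    """Count every contiguous sub-path occurring in the corpus of paths."""
    counts = Counter()
    for path in paths:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                counts[tuple(path[i:j])] += 1
    return counts

def p_right(counts, prefix, nxt, total):
    """P_R(nxt | prefix) = count(prefix + nxt) / count(prefix);
    with an empty prefix, P_R(nxt) = count(nxt) / total (e.g. 4/41)."""
    denom = counts[tuple(prefix)] if prefix else total
    return counts[tuple(prefix) + (nxt,)] / denom

def p_left(counts, suffix, prev, total):
    """P_L(prev | suffix) = count(prev + suffix) / count(suffix)."""
    denom = counts[tuple(suffix)] if suffix else total
    return counts[(prev,) + tuple(suffix)] / denom
```

MEX then flags a candidate sub-path as a significant pattern when P_R drops sharply just past its right edge and P_L drops sharply just past its left edge, as in the figure above.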
[Figure: the Markov matrix over the vertices BEGIN, e1, …, e6, END for the given search path; the lower-left triangle holds the P_R entries (e.g. P_R(e1) = 4/41) and the upper-right triangle the P_L entries (e.g. P_L(e4) = 6/41).]
[Figure: rewiring; the significant pattern P1 = e2 e3 e4 is collapsed into a single new vertex (9), and the paths that traversed the sub-path are rewired through the new vertex.]
Rewiring the graph
Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
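A sketch of this merge-and-rewire step, simplified to paths stored as plain lists of vertices (the real graph also keeps edge identities):

```python
def rewire(paths, pattern):
    """Replace every occurrence of `pattern` (a tuple of vertices) in every
    path with a single new vertex, mimicking the merge-and-rewire step."""
    new_vertex = "P_" + "_".join(pattern)
    k = len(pattern)
    rewired = []
    for path in paths:
        out, i = [], 0
        while i < len(path):
            if tuple(path[i:i + k]) == pattern:
                out.append(new_vertex)   # the whole sub-path collapses to one node
                i += k
            else:
                out.append(path[i])
                i += 1
        rewired.append(out)
    return rewired
```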
ALICE motifs

Motif         Weight  Occurrences  Length
conversation  0.98    11           11
whiterabbit   1.00    22           10
caterpillar   1.00    28           10
interrupted   0.94    7            10
procession    0.93    6            9
mockturtle    0.91    56           9
beautiful     1.00    16           8
important     0.99    11           8
continued     0.98    9            8
different     0.98    9            8
atanyrate     0.94    7            8
difficult     0.94    7            8
surprise      0.99    10           7
appeared      0.97    10           7
mushroom      0.97    8            7
thistime      0.95    19           7
suddenly      0.94    13           7
business      0.94    7            7
nonsense      0.94    7            7
morethan      0.94    6            7
remember      0.92    20           7
consider      0.91    10           7
curious       1.00    19           6
hadbeen       1.00    17           6
however       1.00    20           6
perhaps       1.00    16           6
hastily       1.00    16           6
herself       1.00    78           6
footman       1.00    14           6
suppose       1.00    12           6
silence       0.99    14           6
witness       0.99    10           6
gryphon       0.97    54           6
serpent       0.97    11           6
angrily       0.97    8            6
croquet       0.97    8            6
venture       0.95    12           6
forsome       0.95    12           6
timidly       0.95    9            6
whisper       0.95    9            6
rabbit        1.00    27           5
course        1.00    25           5
eplied        1.00    22           5
seemed        1.00    26           5
remark        1.00    28           5
ADIOS in outline
Composed of three main elements:
- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

Next: the generalization ability.
Generalization
[Figure: identification of a candidate equivalence class; within a sliding context window of length L, the paths "took the chair to", "took the bed to", and "took the table to" are aligned, yielding the new equivalence class E1 = {chair, bed, table}.]
Bootstrapping
[Figure: bootstrapping; the stored equivalence class E1 = {chair, bed, table} is reused when aligning further paths, allowing a new equivalence class E2 = {red, green, blue} to be identified within the same context window.]
Determining L
Involves a tradeoff:
- A larger L demands more context sensitivity in the inference, which hampers generalization.
- A smaller L detects more patterns, but many of them might be spurious.
The ADIOS algorithm
- Initialization: load all data into a pseudograph.
- Until no more patterns are found, for each path P:
  - Create generalized search paths from P
  - Detect significant patterns using MEX
  - If found, add the best new pattern and its equivalence classes, and rewire the graph

A sketch of this loop appears below.
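Putting the pieces together, a skeleton of the loop might look as follows; `extract_best_pattern` is a hypothetical stand-in for MEX plus generalization, supplied by the caller, and the eta (decline-ratio cutoff) and alpha (significance level) defaults are illustrative assumptions:

```python
def adios(paths, extract_best_pattern, eta=0.65, alpha=0.01, L=4):
    """Skeleton of the ADIOS loop over paths stored as lists of vertices.

    extract_best_pattern(path, paths, eta, alpha, L) is expected to return
    either None or an object with .score and .pattern (a tuple of vertices).
    """
    patterns = []
    while True:
        best = None
        for path in paths:
            cand = extract_best_pattern(path, paths, eta, alpha, L)
            if cand is not None and (best is None or cand.score > best.score):
                best = cand
        if best is None:
            break                                # no more significant patterns
        patterns.append(best)
        paths = rewire(paths, best.pattern)      # see the rewiring sketch above
    return patterns
```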
The Model: the training process

[Figure: snapshots of the graph during training; significant sub-paths (e.g. 321 234 987) are successively collapsed into new pattern vertices (1203, 1204, 1205, …) and the paths through them are rewired.]
Example
[Figure: part of the pattern hierarchy learned from CHILDES. One pattern (1405) covers phrases such as "the <noun>" over an equivalence class of nouns (back, bedroom, bench, box, car, chair, circle, closet, cup, garage, house, ladder, oven, refrigerator, snow, square, truck, …); a larger pattern combines it with "to go <back | home | outside | potty | up | …>"; another pattern (1679) covers "do you <have | like | want>". Edge labels give pattern weights, e.g. 0.56 and 0.15.]
(a) Original sentences from CHILDES:
- I'll play with the toys and you play with your bib.
- there's another bar+b+que.
- there's a chicken!
- play with the dolls and the roof?
- oh ; the peanut butter can go up there.
- you better finish it.
- we better hold that ; then.
- uh ; that's another little girl!
- should we put this stuff in in another chick?

(b) Sentences rephrased by ADIOS:
- I'll play with the eggs and you play with your Mom.
- there's another chicken.
- there's a square!
- play with the cars and the people?
- should we put this chair back in the bedroom?
- oh ; the peanut butter can sit right there.
- you better eat it.
- we better finish it ; then.
- yeah ; that's a good one!
Evaluating performance
In principle, we would like to compare ADIOS-generated parse trees with the true parse trees for given sentences. Alas, the 'true parse trees' are a matter of opinion, and some approaches don't even posit parse trees.
Evaluating performance

Define:
- Recall: the probability of ADIOS recognizing an unseen grammatical sentence.
- Precision: the proportion of ADIOS productions that are grammatical.

Recall can be assessed by leaving out some of the training corpus. Precision is trickier, unless we're learning a known CFG.
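A sketch of how these two measures could be estimated; `learner.accepts`, `learner.produce`, and `is_grammatical` (a stand-in for the human judges) are hypothetical interfaces, not the slides' code:

```python
def evaluate(learner, held_out, is_grammatical, n_productions=1000):
    """Recall: fraction of held-out grammatical sentences the learner accepts.
    Precision: fraction of the learner's own productions judged grammatical."""
    recall = sum(learner.accepts(s) for s in held_out) / len(held_out)
    productions = [learner.produce() for _ in range(n_productions)]
    precision = sum(is_grammatical(s) for s in productions) / n_productions
    return recall, precision
```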
The ATIS experiments

ATIS-NL is a 13,043-sentence corpus of natural language: transcribed phone calls to an airline reservation service. ADIOS was trained on 12,700 sentences of ATIS-NL; the remaining 343 sentences were used to assess recall. Precision was determined with the help of 8 graduate students from Cornell University.
The ATIS experiments
ADIOS' performance scores: recall 40%, precision 70%.
For comparison, ATIS-CFG reached: recall 45%, precision <1% (!).
ADIOS/ATIS-N comparison
[Figure: bar chart comparing the precision of ADIOS and ATIS-N.]
An ADIOS drawback
ADIOS is inherently a heuristic, greedy algorithm: once a pattern is created it remains forever, so errors compound. Sentence ordering also affects the outcome: running ADIOS with different orderings yields patterns that 'cover' different parts of the grammar.
An ad-hoc solution

- Train multiple learners on the corpus, each on a different sentence ordering, creating a 'forest' of learners.
- To create a new sentence: pick one learner at random and use it to produce the sentence.
- To check the grammaticality of a given sentence: if any learner accepts it, declare it grammatical (see the sketch below).
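A sketch of the forest idea; the per-learner interface (`produce`, `accepts`) is hypothetical:

```python
import random

class Forest:
    """An ensemble of ADIOS learners, each trained on its own sentence order."""

    def __init__(self, learners):
        self.learners = learners

    def produce(self):
        # Generation: delegate to a single randomly chosen learner
        return random.choice(self.learners).produce()

    def accepts(self, sentence):
        # Grammaticality: accept if ANY learner accepts (union of coverage)
        return any(l.accepts(sentence) for l in self.learners)
```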
The effects of context window width
[Figure: precision vs. recall for context window widths L = 3, 4, 5, 6 and corpus sizes of 10,000, 40,000, and 120,000 sentences; one corner of the plot is labeled "over-generalization" and the other "low productivity".]
Meta-analysis of ADIOS results

Define a pattern spectrum as the histogram of pattern types for an individual learner. A pattern type is determined by the pattern's contents, e.g. TT, TET, EE, PE, … (T = terminal, E = equivalence class, P = pattern). A single ADIOS learner was trained on each of six translations of the Bible.
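A sketch of computing a pattern spectrum, with each pattern encoded simply as the sequence of its member kinds (an assumed representation):

```python
from collections import Counter

def pattern_spectrum(patterns):
    """Histogram of pattern types. Each pattern is given as a sequence of
    member kinds: 'T' (terminal), 'E' (equivalence class), or 'P' (pattern);
    e.g. a pattern of terminal, eq-class, terminal has type 'TET'."""
    types = Counter("".join(p) for p in patterns)
    total = sum(types.values())
    return {t: n / total for t, n in types.items()}  # relative frequencies

# e.g. pattern_spectrum([["T","T"], ["T","E","T"], ["T","T"], ["P","E"]])
# -> {'TT': 0.5, 'TET': 0.25, 'PE': 0.25}
```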
Pattern spectra

[Figure: pattern spectra for learners trained on six Bible translations (English, Spanish, Swedish, Chinese, Danish, French); the x-axis lists pattern types (TT, TE, TP, ET, EE, EP, PT, PE, PP, TTT, TTE, TTP, TET, TEE, TEP, TPT, …) and the y-axis their relative frequency (0 to 0.35).]
Language dendrogram

[Figure: a dendrogram grouping the six languages (Chinese, Spanish, French, English, Swedish, Danish) by the similarity of their pattern spectra.]