A Memory-Based Model of Syntactic Analysis: Data Oriented Parsing
Remko Scha, Rens Bod, Khalil Sima’an
Institute for Logic, Language and Computation
University of Amsterdam
Outline of the lecture
Introduction
Disambiguation
Data Oriented Parsing
DOP1: computational aspects and experiments
Memory Based Learning framework
Conclusions
Introduction
Human language cognition: analogy-based processes on a store of past experiences
Modern linguistics: a set of rules
Language processing algorithms: a performance model of human language processing
Competence grammar as a broad framework for performance models
Memory / analogy-based language processing
The Problem of Ambiguity Resolution
Every input string has an unmanageably large number of analyses
Uncertain input – generate guesses and choose one
Syntactic disambiguation might be a side effect of semantic disambiguation
Frequency of occurrence of lexical items and syntactic structures: people register frequencies
People prefer analyses they have already experienced over constructing new ones
More frequent analyses are preferred to less frequent ones
From Probabilistic Competence-Grammars to Data-Oriented Parsing
Probabilistic information derived from past experience
Characterization of the possible sentence-analyses of the language
Stochastic grammar – Define: all sentences and all analyses. Assign: a probability to each. Achieve: the preferences that people display when they choose a sentence or analysis.
Stochastic Grammar
These predictions are limited
Platitudes and conventional phrases
Allow redundancy
Use Tree Substitution Grammar
Stochastic Tree Substitution Grammar
Set of elementary trees
Tree rewrite process
Redundant model
Statistically relevant phrases
Memory based processing model
Data oriented parsing approach: a corpus of utterances as past experience; an STSG to analyze new input
To describe a specific DOP model, we need:
A formalism for representing utterance-analyses
An extraction function
Combination operations
A probability model
A Simple Data Oriented Parsing Model: DOP1
Our corpus: DOP1 – an imaginary corpus of two trees
Possible subtrees t of a corpus tree T:
t consists of more than one node
t is connected
except for the leaf nodes of t, each node in t has the same daughter nodes as the corresponding node in T
Stochastic Tree Substitution Grammar – the set of subtrees
Generation process – composition: A ∘ B substitutes B on the leftmost nonterminal leaf node of A
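The composition operation can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes trees are nested Python lists [label, child, ...], and that a bare string leaf whose label appears in the (hypothetical) set NONTERMINALS marks an open substitution site.

```python
import copy

# Minimal sketch of DOP composition (left-most substitution). Assumed
# representation: nested lists [label, child1, ...]; a string leaf whose
# label is in NONTERMINALS is an open substitution site.
NONTERMINALS = {"S", "NP", "VP", "PP", "V", "P"}  # illustrative label set

def leftmost_site(tree):
    """Return (parent, index) of the left-most open nonterminal leaf, or None."""
    for i, child in enumerate(tree[1:], start=1):
        if isinstance(child, str):
            if child in NONTERMINALS:
                return tree, i
        else:
            found = leftmost_site(child)
            if found is not None:
                return found
    return None

def compose(a, b):
    """A ∘ B: substitute subtree b on the left-most nonterminal leaf of a."""
    a = copy.deepcopy(a)
    site = leftmost_site(a)
    if site is None:
        raise ValueError("no open substitution site in A")
    parent, i = site
    if b[0] != parent[i]:
        raise ValueError("root(B) must match the substitution-site label")
    parent[i] = copy.deepcopy(b)
    return a

t = compose(["S", ["NP", "she"], ["VP", ["V", "saw"], "NP"]],
            ["NP", "the", "dress"])
```

Here the open NP leaf under VP is the leftmost substitution site, so composing with an NP subtree completes the tree.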
Example of sub trees
Corpus tree 1 (bracket notation): [S [NP she] [VP [V wanted] [NP [NP the dress] [PP [P on] [NP the rack]]]]]
DOP1 - Imaginary corpus of two trees
Corpus tree 2 (bracket notation): [S [NP she] [VP [VP [V saw] [NP the dress]] [PP [P with] [NP the telescope]]]]
Derivation and parse #1
[S [NP she] [VP [VP [V saw] NP] PP]] ∘ [NP the dress] ∘ [PP [P with] [NP the telescope]]
= [S [NP she] [VP [VP [V saw] [NP the dress]] [PP [P with] [NP the telescope]]]]
She saw the dress with the telescope.
Derivation and parse #2
[S [NP she] [VP VP PP]] ∘ [VP [V saw] NP] ∘ [NP the dress] ∘ [PP [P with] [NP the telescope]]
= [S [NP she] [VP [VP [V saw] [NP the dress]] [PP [P with] [NP the telescope]]]]
She saw the dress with the telescope. (A different derivation yields the same parse.)
Probability Computations
Probability of substituting a subtree t on a specific node:
  P(t) = r(t) / Σ_{t' : root(t') = root(t)} r(t')
Probability of a derivation D = t1 ∘ ... ∘ tn:
  P(D) = Π_i P(t_i)
Probability of a parse tree T:
  P(T) = Σ_{D derives T} P(D)
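The three formulas can be sketched in code as follows. This is a toy illustration: the subtrees are opaque (root, id) pairs and the counts r(t) are invented, not taken from any real corpus.

```python
from math import prod

# Toy sketch of the DOP1 probability model with made-up subtree counts r(t).
freq = {("NP", "t1"): 2, ("NP", "t2"): 1, ("S", "t3"): 1}

def p_subtree(t):
    """P(t) = r(t) / sum of r(t') over subtrees t' with root(t') = root(t)."""
    total = sum(r for (root, _), r in freq.items() if root == t[0])
    return freq[t] / total

def p_derivation(subtrees):
    """P(t1 ∘ ... ∘ tn) = product of P(ti)."""
    return prod(p_subtree(t) for t in subtrees)

def p_parse(derivations):
    """P(T) = sum of P(D) over all derivations D that derive T."""
    return sum(p_derivation(d) for d in derivations)
```

Note that P(t) is normalized only over subtrees sharing the same root label, which is what makes each substitution step a proper probability distribution.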
Computational Aspects of DOP1
Parsing and disambiguation
Most Probable Derivation (MPD) vs. Most Probable Parse (MPP)
Optimizations
Parsing
Chart-like parse forest
Derivation forest: treat each elementary tree t as a context-free rule root(t) —> yield(t)
Label each phrase with its syntactic category and its full elementary tree
Elementary trees of an example STSG
Derivation forest for the string abcd (chart positions 0–4)
Derivations and parse trees for the string abcd
Disambiguation
The derivation forest defines all derivations and parses
The most likely parse must be chosen: the MPP in DOP1; MPP vs. MPD
Most Probable Derivation
Viterbi algorithm: eliminate low-probability subderivations in bottom-up fashion
Select the most probable subderivation at each chart entry; eliminate the other subderivations with that root node
Viterbi algorithm
Two derivations for abc: if P(d1) > P(d2), eliminate derivation d2
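The elimination step can be sketched as follows. The chart layout is a hypothetical stand-in: chart[(start, end)] maps a root label to the list of candidate (probability, subderivation) pairs built bottom-up for that span.

```python
# Sketch of the Viterbi elimination step over an assumed chart layout:
# chart[(start, end)][root] = list of (probability, subderivation) pairs.

def viterbi_prune(chart):
    """Keep only the most probable subderivation per chart entry and root node."""
    for entries in chart.values():
        for root, candidates in entries.items():
            entries[root] = [max(candidates, key=lambda c: c[0])]
    return chart

# Two derivations of abc over span (0, 3); the less probable one is eliminated.
chart = {(0, 3): {"S": [(0.02, "d1"), (0.01, "d2")]}}
viterbi_prune(chart)
```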
Algorithm 1 – Computing the probability of most probable derivation
Input: an STSG ⟨S, R, P⟩; elementary trees in R are written in CNF-like form A —> t_H: tree t with root A and a frontier given by the sequence of labels H
⟨A, i, j⟩ – nonterminal A in chart entry (i, j) after parsing the input w1,...,wn
P_MPD – the probability of the MPD of the input string w1,...,wn
The Most Probable Parse
Computing the MPP in an STSG is NP-hard, so a Monte Carlo method is used:
Sample derivations
Observe the most frequently generated parse tree
Estimate the parse-tree probabilities
Random-first search
The algorithm relies on the Law of Large Numbers
Algorithm 2: Sampling a random derivation
for length := 1 to n do
  for start := 0 to n - length do
    for each root node X in chart-entry (start, start + length) do:
      1. select at random a tree from the distribution of elementary trees with root node X
      2. eliminate the other elementary trees with root node X from this chart-entry
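The loop above can be sketched as runnable code. The chart layout is again a hypothetical stand-in: chart[(start, length)][X] holds (probability, elementary tree) candidates with root X for that span, taken from the derivation forest.

```python
import random

# Runnable sketch of Algorithm 2 (sampling a random derivation), over an
# assumed chart layout: chart[(start, length)][X] = list of
# (probability, elementary tree) candidates with root X.

def sample_random_derivation(chart, n):
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            for x, candidates in chart.get((start, length), {}).items():
                weights = [p for p, _ in candidates]
                # 1. select at random a tree from the distribution of
                #    elementary trees with root node X
                chosen = random.choices(candidates, weights=weights, k=1)[0]
                # 2. eliminate the other elementary trees with root node X
                #    from this chart-entry
                chart[(start, length)][x] = [chosen]
    return chart
```

After the loops finish, every chart entry holds exactly one elementary tree per root, which pins down a single random derivation for the whole sentence.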
Results of Algorithm 2
A random derivation for the whole sentence
A first guess for the MPP
Compute the size of the sampling set:
The probability of error has an upper bound (0 – index of the MPP, i – index of parse i, N – number of sampled derivations)
No unique MPP – ambiguity
Reminder
Var[X] = E[X²] − (E[X])²
0 ≤ P[X] ≤ 1
σ(X) = √Var[X]
Conclusions – lower bound for N
Lower bound for N:
P_i – the probability of parse i; B – the estimate of P_i from frequencies in N samples
Var(B) = P_i(1 − P_i)/N; since P_i(1 − P_i) ≤ 1/4, Var(B) ≤ 1/(4N)
σ = √Var(B) ≤ 1/(2√N), hence N ≥ 1/(4σ²)
e.g. N ≥ 100 gives σ ≤ 0.05
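The sample-size bound works out as a short calculation. B estimates P_i by relative frequency over N samples; since p(1 − p) ≤ 1/4, Var(B) ≤ 1/(4N), so the standard error is at most 1/(2√N), and N ≥ 1/(4σ²) samples suffice for a target error σ:

```python
from math import ceil, sqrt

def max_std_error(n):
    # standard error bound for N samples: sigma <= 1 / (2 * sqrt(N))
    return 1 / (2 * sqrt(n))

def samples_needed(sigma):
    # invert the bound: N >= 1 / (4 * sigma^2)
    return ceil(1 / (4 * sigma ** 2))
```

For example, 100 samples bound the standard error by 0.05, matching the figures on this slide.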
Algorithm 3: Estimating the parse probabilities
Given a derivation forest of a sentence and a threshold σ_m for the standard error:
N := the smallest integer larger than 1/(4σ_m²)
repeat N times:
  sample a random derivation from the derivation forest
  store the parse generated by this derivation
for each parse i:
  estimate the conditional probability given the sentence by p_i := #(i) / N
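The procedure above can be sketched as follows. Here sample_derivation and parse_of are hypothetical stand-ins for the derivation-forest machinery: the former draws one random derivation (as Algorithm 2 does), the latter reads off the parse it generates.

```python
import math
from collections import Counter

# Sketch of Algorithm 3: estimate parse probabilities by Monte Carlo sampling.
# sample_derivation() and parse_of() are assumed callables standing in for the
# derivation-forest machinery.

def estimate_parse_probs(sample_derivation, parse_of, sigma_max):
    # N := the smallest integer larger than 1 / (4 * sigma_max^2)
    n = math.floor(1 / (4 * sigma_max ** 2)) + 1
    counts = Counter(parse_of(sample_derivation()) for _ in range(n))
    # conditional probability of each parse given the sentence: #(i) / N
    return {parse: c / n for parse, c in counts.items()}
```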
Complexity of Algorithm 3
Assumes a value for the maximum allowed standard error
Samples a number of derivations that is guaranteed to achieve that error
The number of samples needed grows quadratically as the allowed error shrinks (N ∝ 1/σ_m²)
Optimizations
Sima’an: the MPD in time linear in the STSG size
Bod: the MPP on a small random corpus of subtrees
Sekine and Grishman: use only subtrees rooted in S or NP
Goodman: a different polynomial-time algorithm
Experimental Properties of DOP1
Experiments on the ATIS corpus: MPP vs. MPD; the impact of fragment size, fragment lexicalization, and fragment frequency
Experiments on SRI-ATIS and OVIS: the impact of subtree depth
Experiments on ATIS corpus
ATIS = Air Travel Information System
750 annotated sentence analyses
Annotated in the Penn Treebank scheme
Purpose: compare the accuracy obtained by undiluted DOP1 with that obtained by restricted STSGs
Divide into training and test sets: 90% = 675 sentences in the training set, 10% = 75 in the test set
Convert the training set into fragments and enrich them with probabilities
Test-set sentences are parsed with subtrees from the training set
The MPP was estimated from 100 sampled derivations
Parse accuracy = the percentage of MPPs that are identical to the test-set parses
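The accuracy metric is simple enough to state in code; this trivial sketch compares parses as whole objects, with placeholder parse values.

```python
# Parse accuracy as defined above: the percentage of most probable parses
# that exactly match the corresponding test-set parses.

def parse_accuracy(predicted_mpps, gold_parses):
    matches = sum(p == g for p, g in zip(predicted_mpps, gold_parses))
    return 100.0 * matches / len(gold_parses)
```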
Results
On 10 random training / test splits of ATIS:
Average parse accuracy = 84.2%
Standard deviation = 2.9 %
Impact of overlapping fragments: MPP vs. MPD
Can the MPD achieve parse accuracies similar to the MPP?
Can the MPD do better than the MPP? Overlapping fragments
The MPD achieves 69% accuracy on the test set, compared with 85% for the MPP
Conclusion: overlapping fragments play an important role in predicting the appropriate analysis of a sentence
The impact of fragment size
Large fragments capture more lexical/syntactic dependencies than small ones.
The experiment: use DOP1 with a restricted maximum depth (max depth 1 → DOP1 = SCFG) and compute the accuracies for both the MPD and the MPP at each maximum depth
Impact of fragment lexicalization
A lexicalized fragment with more words captures more lexical dependencies
Experiment: a version of DOP1 that restricts the maximum number of words per fragment; check the accuracy for the MPP and the MPD
Impact of fragment frequency
Frequent fragments contribute more; large fragments are less frequent than small ones but might contribute more
Experiment:
Restrict fragments to a minimum number of occurrences
No other restrictions
Check the accuracy for the MPP
Experiments on SRI-ATIS and OVIS
Employ the MPD because the corpora are bigger
Tests performed on DOP1 and SDOP
Use a set of heuristic criteria for selecting the fragments: constraints on the form of subtrees:
d – upper bound on the depth
n – upper bound on the number of substitution sites
l – upper bound on the number of terminals
L – upper bound on the number of consecutive terminals
Apply the constraints to all subtrees except those of depth 1
Example setting: d ≤ 4, n ≤ 2, l ≤ 7, L ≤ 3
DOP(i) – DOP1 restricted to subtrees of depth at most i
Evaluation metrics: Recognized, Tree Language Coverage (TLC), exact match, labeled bracketing recall and precision
Experiments on SRI-ATIS
13,335 syntactically annotated utterances
The annotation scheme originates from the Core Language Engine system
Fixed parameters except the subtree depth bound: n ≤ 2, l ≤ 4, L ≤ 3
Training set – 12,335 trees; test set – 1,000 trees
Experiment: train and test with different upper bounds on subtree depth (takes more than 10 days for DOP(4)!)
Impact of subtree depth – SRI-ATIS
Experiments on OVIS corpus
10,000 syntactically and semantically annotated trees
Both annotations are treated as one → more nonterminal symbols
Utterances are answers to questions in a dialog → short utterances (avg. 3.43 words)
Sima’an’s results – sentences with at least 2 words, avg. 4.57; n ≤ 2, l ≤ 7, L ≤ 3
Experiment:
Check different subtree depth bounds: 1, 3, 4, 5
Test set of 1,000 trees; training set of 9,000 trees
Impact of subtree depth – OVIS
Summary of results
ATIS: parsing accuracy is 85%
Overlapping fragments have an impact on accuracy
Accuracy increases as fragment depth increases, for both the MPP and the MPD
The optimal lexicalization maximum for ATIS is 8 words
Accuracy decreases as the lower bound on fragment frequency increases (for the MPP)
Summary of results
SRI-ATIS: the availability of more data is more crucial to the accuracy of the MPD; depth has an impact
Accuracy improves when using memory-based parsing (DOP(2)) rather than an SCFG (DOP(1))
Summary of results
OVIS: recognition power isn’t affected by depth
No big difference in exact-match mean and standard deviation between DOP1(1) and DOP1(4)
DOP: probabilistic recursive MBL
The relationship between the present DOP framework and the Memory Based Learning (MBL) framework
DOP extends MBL to deal with disambiguation
MBL vs. DOP: flat or intermediate descriptions vs. hierarchical ones
Case Based Reasoning - CBR
Case-based learning: lazy learning that doesn’t generalize – lazy generalization
Classify by means of a similarity function
We refer to this paradigm as MBL
CBR vs. other variants of MBL: the task concept, the similarity function, the learning task
The DOP framework and CBR
The CBR method:
A formalism for representing utterance-analyses – the case description language
An extraction function – retrieving units
Combination operations – reuse and revision
Missing in DOP: a similarity function
Extending CBR: a probability model
A DOP model defines a CBR system for natural language analysis
DOP1 and CBR methods
DOP1 as an extension of a CBR system:
⟨string, tree⟩ = a classified instance
Retrieve subtrees and construct a tree
Sentence = instance; tree = class
Set of sentences = instance space; set of trees = class space
Frontier, SSF, ⟨str, st⟩
An infinite runtime case-base containing instance-class-weight triples: ⟨SSF, subtree, probability⟩
Task and similarity function:
Task = disambiguation
Similarity function: parsing → a recursive string-matching procedure; ambiguity → computing the probabilities and selecting the highest
Conclusion: DOP1 is a lazy probabilistic recursive CBR classifier
DOP vs. other MBL approaches in NLP
K-NN vs. DOP; Memory Based Sequence Learning (MBSL)
DOP – a stochastic model for computing probabilities; MBSL – ad hoc heuristics for computing scores
DOP – a globally based ranking strategy over alternative analyses; MBSL – a locally based one
Different generalization power
Conclusions
Memory-based aspects of the DOP model
Disambiguation
Probabilities to account for frequencies
DOP as a probabilistic recursive memory-based model
DOP1 – properties, computational aspects, and experiments
DOP and MBL – differences