November 2003 CSA4050: Computational Morphology IV
CSA4050: Advanced Topics in NLP
Computational Morphology IV:
xfst
What is xfst?
• xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
• xfst and other Xerox tools employ a notation very close to the notation we have been using so far.
• For full documentation on the syntax and semantics of Xerox REs, see http://www.fsmbook.com
Simple Commands
• command line (via babe)> xfst
• define: give a name to an RE
• print: print information
• read: read information
• various stack operations
• file interaction
define command
• define name regexp ;
xfst[0]: define foo [d o g] | [c a t];
xfst[0]: define R1 [a | b | c | d];
xfst[0]: define R2 [d | e | f | g];
xfst[0]: define R3 [f | g | h | i | j];
xfst[0]: define baz R1 & R2;
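The definitions above each denote a finite language, so their effect can be previewed outside xfst. The following is a minimal Python sketch, not xfst itself: it assumes each language is modelled as a plain set of strings, with `&` and `|` standing in for the RE intersection and union operators.

```python
# Model each defined language as a Python set of strings (an assumption
# made for illustration; xfst stores them as finite-state networks).
foo = {"dog", "cat"}                # define foo [d o g] | [c a t];
R1 = {"a", "b", "c", "d"}           # define R1 [a | b | c | d];
R2 = {"d", "e", "f", "g"}           # define R2 [d | e | f | g];
R3 = {"f", "g", "h", "i", "j"}      # define R3 [f | g | h | i | j];

baz = R1 & R2                       # define baz R1 & R2;

print(sorted(baz))                  # ['d']
print(sorted(R2 & R3))              # ['f', 'g']
print(sorted(R1 & R3))              # [] : the empty language
```

Only d is shared by R1 and R2, so baz is the one-word language containing "d".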
print words
print words name - see the words in the language called name
xfst[0]: print words R1
d
c
b
a
xfst[0]:
print net
print net name - see detailed information about the network name.
xfst[0]: define z R1 & R2;
xfst[0]: define baz R1 & R2;
xfst[0]: print net z
Sigma: a b c d e f g
Size: 7
Net: FC370
Flags: deterministic, pruned, minimized, epsilon_free, loop_free
Arity: 1
s0: d -> fs1.
fs1: (no arcs)
xfst[0]:
Some Properties of Networks
• epsilon free: there are no arcs labeled with the epsilon symbol
• deterministic: no state has more than one outgoing arc with the same label
• minimised: there is no other network with exactly the same paths that has fewer states.
• These make sense for FSAs – not necessarily for FSTs.
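Two of these properties can be checked mechanically on a simple automaton. The sketch below is an illustration only: it assumes a network is a list of (source, label, target) arcs, uses the empty string as a stand-in for epsilon, and reads "deterministic" as "no state has two outgoing arcs with the same label". The names `is_epsilon_free` and `is_deterministic` are hypothetical helpers, not xfst commands.

```python
EPSILON = ""  # assumption: the empty string stands for the epsilon label

def is_epsilon_free(arcs):
    """True if no arc is labelled with epsilon."""
    return all(label != EPSILON for _, label, _ in arcs)

def is_deterministic(arcs):
    """True if no state has two outgoing arcs carrying the same label."""
    seen = set()
    for src, label, _ in arcs:
        if (src, label) in seen:
            return False
        seen.add((src, label))
    return True

# A one-arc network s0 -d-> fs1, as in a print net listing for R1 & R2.
net_z = [("s0", "d", "fs1")]
# A hypothetical network with two arcs labelled a leaving the same state.
nondet = [("s0", "a", "s1"), ("s0", "a", "s2")]
# A hypothetical network containing an epsilon arc.
with_eps = [("s0", EPSILON, "s1"), ("s1", "a", "s2")]

print(is_deterministic(net_z), is_epsilon_free(net_z))
print(is_deterministic(nondet))
print(is_epsilon_free(with_eps))
```

Minimality is harder to test this way, since it is a claim about all other networks accepting the same language.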
Equivalent?
[Figure: two transducers, A and B. Arc labels in A: a:0, a, b, a. Arc labels in B: a, a:0, b.]
• no. states?
• no. paths?
• relation encoded?
Remarks
• A and B encode the same relation: {<"aa","a">, <"ab","ab">}
• They are both deterministic and minimal.
• They have different numbers of states.
• Arcs labeled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.
• This situation does not occur with FSAs.
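The claim that two differently shaped transducers can encode the same relation can be checked by enumerating paths. The Python sketch below rests on an assumption: the arc layouts of A and B are a plausible reconstruction (the figure itself is not reproduced here), chosen so that both yield the relation stated above. Arcs are (source, upper, lower, target) tuples and the empty string stands for 0 (epsilon).

```python
def relation(arcs, start, finals):
    """Enumerate all <upper, lower> pairs of an acyclic transducer."""
    pairs = set()
    agenda = [(start, "", "")]
    while agenda:
        state, upper, lower = agenda.pop()
        if state in finals:
            pairs.add((upper, lower))
        for src, up, low, dst in arcs:
            if src == state:
                agenda.append((dst, upper + up, lower + low))
    return pairs

def states(arcs):
    """Collect every state mentioned by some arc."""
    return {s for src, _, _, dst in arcs for s in (src, dst)}

# Assumed layout of A: two branches leave the start state.
A = [(0, "a", "", 1), (1, "a", "a", 3),    # a:0 then a
     (0, "a", "a", 2), (2, "b", "b", 3)]   # a then b
# Assumed layout of B: a shared first arc, then a:0 or b.
B = [(0, "a", "a", 1), (1, "a", "", 2), (1, "b", "b", 2)]

rel_A = relation(A, 0, {3})
rel_B = relation(B, 0, {2})
print(sorted(rel_A))                   # [('aa', 'a'), ('ab', 'ab')]
print(rel_A == rel_B)                  # True
print(len(states(A)), len(states(B)))  # 4 3
```

Under this reconstruction the epsilon has simply moved from the first arc to the second, which is what lets B share a state that A must duplicate.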
FST Determinism: Sequential vs. Unambiguous
• Unambiguous: for any input there is at most one output.
– Transducer A is unambiguous in either direction.
• Sequential: no state has more than one arc with the same symbol on the input side.
– Transducer A is not sequential in one direction.
• A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.
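Both notions can be tested on a small example. The sketch below is hedged in two ways: the arc layout used for A is an assumed reconstruction consistent with the relation {<"aa","a">, <"ab","ab">}, and the sequentiality test applies the definition above literally, without any special treatment of epsilon on the input side. Arcs are (source, upper, lower, target) tuples; "down" means the upper side is the input.

```python
def is_sequential(arcs, direction):
    """No state has two arcs with the same symbol on the input side."""
    side = 1 if direction == "down" else 2   # index of the input symbol
    seen = set()
    for arc in arcs:
        key = (arc[0], arc[side])
        if key in seen:
            return False
        seen.add(key)
    return True

def is_unambiguous(pairs, direction):
    """Every input string is paired with at most one output string."""
    outputs = {}
    for upper, lower in pairs:
        inp, out = (upper, lower) if direction == "down" else (lower, upper)
        outputs.setdefault(inp, set()).add(out)
    return all(len(outs) == 1 for outs in outputs.values())

# Assumed layout of transducer A ("" stands for 0, the epsilon symbol).
A = [(0, "a", "", 1), (1, "a", "a", 3),
     (0, "a", "a", 2), (2, "b", "b", 3)]
rel = {("aa", "a"), ("ab", "ab")}

print(is_unambiguous(rel, "down"), is_unambiguous(rel, "up"))  # True True
print(is_sequential(A, "down"))   # False: two arcs read a at state 0
print(is_sequential(A, "up"))     # True under the literal definition
```

The local ambiguity at state 0 disappears after one more input symbol, which is the sense in which A is sequentiable.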
Basic Stack Operations
• read regex: push network onto stack
• print stack: list items on stack
• print net: detailed info on top stack item
• pop stack: remove top item from stack
• define name: set name to value of top stack item
Stack Operations: intersect net, union net, etc.
• Load the stack with N suitable arguments.
• Ensure that the arguments are pushed onto the stack in the correct (reverse) order.
• Issue the command, e.g. intersect net.
• The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack.
November 2003 CSA4050: Computational Morphology IV
13
Stack Example 1
xfst[0]: clear stack;
xfst[0]: read regex [d | c | e | b | w];
xfst[1]: read regex [b | s | h | w];
xfst[2]: read regex [s | d | c | f | w];
xfst[3]: print stack
xfst[3]: intersect net
xfst[1]: print stack
xfst[1]: print net
xfst[1]: print words
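The session above can be mimicked with an ordinary list used as a stack. This is a rough Python sketch under stated assumptions: networks are modelled as sets of words, and `read_regex` and `intersect_net` are hypothetical helper names that imitate the xfst commands rather than call them.

```python
stack = []

def read_regex(words):
    """Push a new 'network' (here: a set of words) onto the stack."""
    stack.append(set(words))

def intersect_net():
    """Pop every network, intersect them, push the single result."""
    result = stack.pop()
    while stack:
        result &= stack.pop()
    stack.append(result)

read_regex("dcebw")   # [d | c | e | b | w]
read_regex("bshw")    # [b | s | h | w]
read_regex("sdcfw")   # [s | d | c | f | w]
print(len(stack))     # 3, matching the xfst[3] prompt

intersect_net()
print(len(stack))     # 1, matching the xfst[1] prompt
print(stack[-1])      # {'w'}
```

Only w occurs in all three languages, so print words on the result would list that single word.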
Stack Example 2
xfst[0]: clear stack;
xfst[0]: read regex [e d | i n g | s | []];
xfst[1]: read regex [t a l k | k i c k];
xfst[2]: print stack
xfst[2]: print net
xfst[2]: print words
xfst[2]: concatenate net
xfst[1]: print words
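The reason the suffixes are pushed first becomes clearer in a small simulation. The Python sketch below assumes networks are sets of words and that concatenation proceeds from the top of the stack downwards; `concatenate_net` is a hypothetical helper imitating the xfst command.

```python
def concatenate_net(stack):
    """Pop all networks and concatenate them, top of stack first."""
    result = stack.pop()
    while stack:
        nxt = stack.pop()
        result = {x + y for x in result for y in nxt}
    stack.append(result)

stack = []
stack.append({"ed", "ing", "s", ""})   # [e d | i n g | s | []]
stack.append({"talk", "kick"})         # [t a l k | k i c k], now on top

concatenate_net(stack)
words = sorted(stack[-1])
print(words)
# ['kick', 'kicked', 'kicking', 'kicks', 'talk', 'talked', 'talking', 'talks']
```

Because the stems sit on top, they come first in each word; pushing in the opposite order would produce strings such as "edtalk".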
Creating Relations
• A simple example of a transducer can be shown using the crossproduct operator:
xfst[0]: clear stack
xfst[0]: define Y [d o g | c a t];
xfst[0]: define Z [c h i e n | c h a t];
xfst[0]: read regex Y .x. Z;
• We can now use apply up and apply down to test the transducer’s behaviour.
apply up; apply down
• applyup(arg,R) = {x | <x,arg> in R}
• applydown(arg,R) = {x | <arg,x> in R}
xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];
xfst[1]: apply up chien
dog
cat
xfst[1]: apply down cat
chien
chat
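The two set definitions translate almost directly into code. The following Python sketch assumes a relation is a finite set of (upper, lower) pairs; `crossproduct`, `apply_up` and `apply_down` are illustrative names mirroring the definitions above, not xfst calls.

```python
def crossproduct(upper, lower):
    """Relate every string of upper to every string of lower (like .x.)."""
    return {(u, l) for u in upper for l in lower}

def apply_up(arg, R):
    """applyup(arg, R) = {x | <x, arg> in R}"""
    return {x for x, y in R if y == arg}

def apply_down(arg, R):
    """applydown(arg, R) = {x | <arg, x> in R}"""
    return {y for x, y in R if x == arg}

R = crossproduct({"dog", "cat"}, {"chien", "chat"})

print(sorted(apply_up("chien", R)))   # ['cat', 'dog']
print(sorted(apply_down("cat", R)))   # ['chat', 'chien']
print(sorted(apply_down("cow", R)))   # [] : no pair has upper "cow"
```

The crossproduct pairs every upper word with every lower word, which is why "cat" maps to both "chien" and "chat".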
Exercise for .x.
• What RE would perform the correct translations?
• Define it in xfst.
• Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".
Replace Rules
• Xerox RE notation includes replace rules.
• Replace rules do not increase the descriptive power of REs; however, they do provide a powerful abbreviated rule-like notation.
• There are two main types of replace rules: unconditional and conditional.
Unconditional Replace Rules
• The most straightforward kind of unconditional replace rule is:
a -> b
• This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.
• Checkpoint: how does this differ from a:b? What is the FST that computes this relation?
Unconditional Replace e.g.
xfst[0]: read regex c -> r;
xfst[1]: apply down cat
xfst[1]: apply down dog
• Where there is no match, the string is identity mapped.
• The general pattern for simple Replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
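For the simplest case, where A and B are single symbols, the downward behaviour of A -> B can be imitated with plain string replacement. This Python sketch is a deliberate simplification: it assumes single-symbol A and B and returns one output string, whereas a real replace rule denotes a relation and may yield several outputs when A is a more complex language. `replace_down` is a hypothetical helper name.

```python
def replace_down(a, b, s):
    """Apply the simplified rule a -> b downward to the string s."""
    return s.replace(a, b)

print(replace_down("c", "r", "cat"))   # rat
print(replace_down("c", "r", "dog"))   # dog : no match, identity mapped
print(replace_down("c", "r", "cacti")) # rarti : every c is replaced
```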
Definition of A -> B
• A -> B = [no_A [A .x. B]]* no_A
where no_A = ~$[A - 0]
• N.B. if the upper language A does not contain the empty string, ~$[A - 0] = ~$[A]; otherwise ~$[A] is null, whereas ~$[A - 0] contains at least the empty string.
Conditional Replace Rules
• More complex replace rules can also specify left and right context, as in
A -> B || L _ R
• Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.
• A, B, L and R are REs denoting languages, not relations.
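The effect of a context condition can be approximated with lookaround assertions in an ordinary regular-expression library. The Python sketch below assumes that A, B, L and R are literal strings (not arbitrary languages) and that both contexts are checked against the lexical string; `conditional_replace` is a hypothetical helper, not part of xfst.

```python
import re

def conditional_replace(a, b, left, right, s):
    """Apply the simplified rule a -> b || left _ right downward to s."""
    pattern = ""
    if left:
        pattern += "(?<=" + re.escape(left) + ")"   # left context, not consumed
    pattern += re.escape(a)
    if right:
        pattern += "(?=" + re.escape(right) + ")"   # right context, not consumed
    # A function is used as replacement so that b is inserted literally.
    return re.sub(pattern, lambda m: b, s)

# e -> i || p _ r
print(conditional_replace("e", "i", "p", "r", "per"))     # pir
print(conditional_replace("e", "i", "p", "r", "ber"))     # ber
print(conditional_replace("e", "i", "p", "r", "pet"))     # pet
print(conditional_replace("e", "i", "p", "r", "superb"))  # supirb
```

Lookarounds inspect the input without consuming it, which matches the idea that the contexts are conditions on the lexical side rather than material to be replaced.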
Special Cases
• The symbol .#. refers to the absolute beginning or end of string in left and right contexts. For example:
e -> i || .#. p _ r
• Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".
The kaNpat exercise
• Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.
• The nasal N occurring just before p gets realised as m.
• p occurring just after an m is realised as m.
kaNpat rules
• We can write the following two rules to account for this behaviour:
Rule 1. [N -> m || _ p]
• Notice that the left-hand context is empty, meaning that any context will do.
Rule 2. [p -> m || m _]
• Note that the linguist must keep track of the order in which rules are applied.
Derivation of kammat
Lexical: kaNpat
apply [N -> m || _ p]
Intermediate: kampat
apply [p -> m || m _]
Surface: kammat
• The first rule feeds the second
• Checkpoint: what happens if rules are applied in reverse order?
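The derivation can be replayed step by step, which also gives a way to experiment with rule order. The Python sketch below is an approximation under stated assumptions: each rule is limited to literal symbols and contexts, and is simulated with lookaround-based substitution rather than compiled into an FST. `rule` and `derive` are hypothetical helper names.

```python
import re

def rule(a, b, left="", right=""):
    """Return a function applying the simplified rule a -> b || left _ right."""
    pattern = ""
    if left:
        pattern += "(?<=" + re.escape(left) + ")"
    pattern += re.escape(a)
    if right:
        pattern += "(?=" + re.escape(right) + ")"
    return lambda s: re.sub(pattern, lambda m: b, s)

rule1 = rule("N", "m", right="p")   # N -> m || _ p
rule2 = rule("p", "m", left="m")    # p -> m || m _

def derive(lexical, rules):
    """Apply the rules in order, recording every intermediate string."""
    steps = [lexical]
    for r in rules:
        steps.append(r(steps[-1]))
    return steps

print(derive("kaNpat", [rule1, rule2]))  # ['kaNpat', 'kampat', 'kammat']
print(derive("kaNpat", [rule2, rule1]))  # reverse order: run it and compare
```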
Composing the Relations
• Each rule describes a certain relation: call these R1 and R2
• If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.
• Mathematically, that relation is the composition of R1 and R2.
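For finite relations, composition can be written down in one line. The Python sketch below assumes each relation is a set of (input, output) pairs; the sample pairs are small illustrative fragments of what the two kaNpat rules relate, not the full (infinite) relations they denote.

```python
def compose(R1, R2):
    """Pairs <x, z> such that <x, y> is in R1 and <y, z> is in R2 for some y."""
    return {(x, z) for x, y in R1 for y2, z in R2 if y == y2}

# Illustrative fragments only (assumed for the example).
R1 = {("kaNpat", "kampat"), ("kampat", "kampat")}
R2 = {("kampat", "kammat"), ("kammat", "kammat")}

R3 = compose(R1, R2)
print(sorted(R3))   # [('kaNpat', 'kammat'), ('kampat', 'kammat')]
```

The intermediate string kampat no longer appears in R3: the composed relation maps lexical strings directly to surface strings.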
Composing the Rules
• Each rule is compiled into an FST.
• If Rule1 compiles to F1, and Rule2 to F2, then there must be an F3 which computes the composition of F1 and F2.
• Checkpoint: write the RE corresponding to the composition of the original 2 rules.
Testing the kaNpat grammar
• First get rules onto stack
xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];
• Try the following and explain
– apply down (kaNpat; kampat; kammat)
– apply up kammat
– Try the above but with rules in reverse order
Practical use of xfst
• Regular expression files (text)
xfst[0]: read regex < regexpfile
• Binary files (compiled networks)
xfst[1]: save stack binfile
xfst[0]: load stack binfile
• Scripts (xfst commands)
xfst[0]: source scriptfile
% xfst -f myscript
% xfst -l myscript
A’ is the sequentiable
[Figure: two transducers, A and A’. Arc labels recovered from the figure: a:0, a, b, a, a, a:0, 0:b, b:a.]
• no. states?
• no. paths?
• relation encoded?