November 2003 CSA4050: Computational Morphology IV

CSA405: Advanced Topics in NLP

Computational Morphology IV:

xfst

What is xfst?

• xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.

• xfst and other Xerox tools employ a notation very close to the notation we have been using so far.

• For full documentation on the syntax and semantics of Xerox REs, see http://www.fsmbook.com

Simple Commands

• command line (via babe)> xfst

• define: give a name to an RE

• print: print information

• read: read information

• various stack operations

• file interaction

define command

• define name regexp ;

xfst[0]: define foo [d o g] | [c a t];

xfst[0]: define R1 [a | b | c | d];

xfst[0]: define R2 [d | e | f | g];

xfst[0]: define R3 [f | g | h | i | j];

xfst[0]: define baz R1 & R2;
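
The definitions above can be mirrored with ordinary sets to see what each name denotes. This is only a sketch in Python of the languages involved, under the assumption that a finite language is modelled as a set of strings; it says nothing about how xfst stores networks internally.

```python
# Each finite language is modelled as a Python set of strings.
# This illustrates what the networks denote, not how xfst builds them.
foo = {"dog", "cat"}            # [d o g] | [c a t]
R1 = {"a", "b", "c", "d"}       # [a | b | c | d]
R2 = {"d", "e", "f", "g"}       # [d | e | f | g]
R3 = {"f", "g", "h", "i", "j"}  # [f | g | h | i | j]

# & is intersection in xfst regular expressions and on Python sets alike.
baz = R1 & R2

print(sorted(baz))
```

The only symbol shared by R1 and R2 is d, so baz denotes a one-word language.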

print words

print words name - see the words in the language called name

xfst[0]: print words R1

d

c

b

a

xfst[0]:

print net

print net name - see detailed information about the network name.

xfst[0]: define z R1 & R2;

xfst[0]: define baz R1 & R2;

xfst[0]: print net z

Sigma: a b c d e f g

Size: 7

Net: FC370

Flags: deterministic, pruned, minimized, epsilon_free, loop_free

Arity: 1

s0: d -> fs1.

fs1: (no arcs)

xfst[0]:

Some Properties of Networks

• epsilon free: there are no arcs labeled with the epsilon symbol

• deterministic: no state has more than one outgoing arc with the same label

• minimised: there is no other network with exactly the same paths that has fewer states.

• These make sense for FSAs – not necessarily for FSTs.
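
Two of these properties can be checked mechanically on a list of arcs. The sketch below is in Python and rests on assumptions: arcs are written as (source, label, destination) triples, epsilon is spelled "0", "deterministic" is read as "no two arcs with the same label leave the same state", and the example network is hypothetical.

```python
from collections import Counter

EPSILON = "0"  # assumed spelling of the epsilon symbol in this sketch


def is_epsilon_free(arcs):
    """True if no arc carries the epsilon label."""
    return all(label != EPSILON for _, label, _ in arcs)


def is_deterministic(arcs):
    """True if no state has two outgoing arcs with the same label."""
    counts = Counter((src, label) for src, label, _ in arcs)
    return all(n == 1 for n in counts.values())


# Hypothetical network for [d o g] | [c a t]; state 3 is final.
arcs = [
    (0, "d", 1), (1, "o", 2), (2, "g", 3),
    (0, "c", 4), (4, "a", 5), (5, "t", 3),
]

# Adding a second d-arc out of state 0 breaks determinism.
nondet = arcs + [(0, "d", 4)]

print(is_epsilon_free(arcs), is_deterministic(arcs), is_deterministic(nondet))
```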

Equivalent?

[Figure: two transducers, A and B, with arcs labelled a, b and a:0]

• Number of states?

• Number of paths?

• Relation encoded?

Remarks

• A and B encode the same relation {<"aa","a">, <"ab","ab">}

• They are both deterministic and minimal.

• They have different numbers of states.

• Arcs labeled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.

• This situation does not occur with FSAs.
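
The drawings of A and B are not available in this transcript, so the two networks in the following Python sketch are an assumed reconstruction, chosen only so that they encode the relation stated above. Arcs are (source, upper, lower, destination) tuples, and the empty string stands for epsilon.

```python
def relation(arcs, start, finals):
    """Enumerate the <upper, lower> pairs of an acyclic transducer."""
    pairs = set()
    todo = [(start, "", "")]
    while todo:
        state, upper, lower = todo.pop()
        if state in finals:
            pairs.add((upper, lower))
        for src, up, low, dst in arcs:
            if src == state:
                todo.append((dst, upper + up, lower + low))
    return pairs


def states(arcs):
    """Collect every state mentioned by an arc."""
    return {s for src, _, _, dst in arcs for s in (src, dst)}


# "" stands for epsilon (written 0 in xfst), so a:0 is ("a", "").
# Both networks are hypothetical reconstructions, not the original figures.
A = [(0, "a", "a", 1), (1, "a", "", 2), (1, "b", "b", 2)]
B = [(0, "a", "", 1), (1, "a", "a", 3), (0, "a", "a", 2), (2, "b", "b", 3)]

print(relation(A, 0, {2}) == relation(B, 0, {3}))
print(len(states(A)), len(states(B)))
```

If the symbol pairs are treated as atomic labels, neither network has two identical labels leaving one state, yet the state counts differ because the epsilon sits in a different position.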

FST Determinism: Sequential vs. Unambiguous

• Unambiguous: for any input there is at most one output.

– Transducer A is unambiguous in either direction.

• Sequential: no state has more than one arc with the same symbol on the input side.

– Transducer A is not sequential in one direction.

• A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.
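
The sequentiality criterion can be tested per direction. The Python sketch below applies only the criterion stated above (no state with two arcs sharing an input symbol) to a hypothetical transducer T that is not taken from the slides. Arcs are (source, upper, lower, destination) tuples.

```python
from collections import Counter


def is_sequential(arcs, side):
    """side=0 treats upper symbols as input, side=1 the lower symbols."""
    seen = Counter()
    for src, up, low, _ in arcs:
        seen[(src, (up, low)[side])] += 1
    return all(n == 1 for n in seen.values())


# Hypothetical transducer relating "ab" to "xb" and "ac" to "yc".
T = [(0, "a", "x", 1), (0, "a", "y", 2), (1, "b", "b", 3), (2, "c", "c", 3)]

print(is_sequential(T, side=0))  # upper side as input
print(is_sequential(T, side=1))  # lower side as input
```

Reading downward, state 0 has two arcs with input a, so T is not sequential in that direction, although the relation is unambiguous and the choice is settled one symbol later.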

Basic Stack Operations

• read regex: push network onto stack

• print stack: list items on stack

• print net: detailed info on top stack item

• pop stack: remove top item from stack

• define name: set name to value of top stack item

Stack Operations: intersect net; union net, etc.

• Load the stack with N suitable arguments.

• Ensure that the arguments are pushed onto the stack in the correct (reverse) order.

• Issue the command, e.g. intersect net.

• The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack.
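
The pop, operate and push cycle can be imitated with a list of sets. This Python sketch is an illustration under assumptions: languages are finite sets of one-symbol strings, the function names merely echo the xfst commands, and the three example languages are made up.

```python
from functools import reduce

stack = []


def read_regex(language):
    """Push a language (a set of strings) onto the stack."""
    stack.append(set(language))


def intersect_net():
    """Pop every network, intersect them, and push the single result."""
    nets = [stack.pop() for _ in range(len(stack))]
    stack.append(reduce(set.intersection, nets))


read_regex({"d", "c", "e", "b", "w"})
read_regex({"b", "s", "h", "w"})
read_regex({"s", "d", "c", "f", "w"})
intersect_net()

print(stack)
```

Intersection is commutative, so the push order does not matter here; it does matter for operations such as concatenation.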

Stack Example 1

xfst[0]: clear stack;

xfst[0]: read regex [d | c | e | b | w];

xfst[1]: read regex [b | s | h | w];

xfst[2]: read regex [s | d | c | f | w];

xfst[3]: print stack

xfst[3]: intersect net

xfst[1]: print stack

xfst[1]: print net

xfst[1]: print words

x1

Stack Example 2

xfst[0]: clear stack;

xfst[0]: read regex [e d | i n g | s | []];

xfst[1]: read regex [t a l k | k i c k];

xfst[2]: print stack

xfst[2]: print net

xfst[2]: print words

xfst[2]: concatenate net

xfst[1]: print words
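
The effect of the concatenation in this example can be previewed with sets of strings. The Python sketch assumes that concatenate net combines the networks starting from the top of the stack, which is why the stems, pushed last, end up in front of the suffixes.

```python
suffixes = {"ed", "ing", "s", ""}   # [e d | i n g | s | []]
stems = {"talk", "kick"}            # [t a l k | k i c k]

# The stems were pushed last, so they sit on top (end of the list).
stack = [suffixes, stems]


def concatenate_net(stack):
    """Concatenate all networks, top of stack first, and push the result."""
    result = stack.pop()
    while stack:
        below = stack.pop()
        result = {x + y for x in result for y in below}
    stack.append(result)


concatenate_net(stack)
print(sorted(stack[0]))
```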

x2/a

Creating Relations

• A simple example of a transducer can be shown using the crossproduct operator:

xfst[0]: clear stack

xfst[0]: define Y [d o g | c a t];

xfst[0]: define Z [c h i e n | c h a t];

xfst[0]: read regex Y .x. Z;

• We can now use apply up and apply down to test the transducer’s behaviour.

x3ab

apply up; apply down

• applyup(arg,R) = {x | <x,arg> in R}

• applydown(arg,R) = {x | <arg,x> in R}

xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];

xfst[1]: apply up chien

dog

cat

xfst[1]: apply down cat

chien

chat
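
The two definitions translate directly into set comprehensions. In this Python sketch a relation is assumed to be a finite set of (upper, lower) pairs; the function names are illustrative and are not xfst API.

```python
from itertools import product


def crossproduct(upper, lower):
    """Relate every upper string to every lower string, like .x."""
    return set(product(upper, lower))


def apply_up(arg, relation):
    """{x | <x, arg> in R}"""
    return {x for x, y in relation if y == arg}


def apply_down(arg, relation):
    """{x | <arg, x> in R}"""
    return {y for x, y in relation if x == arg}


R = crossproduct({"dog", "cat"}, {"chien", "chat"})

print(sorted(apply_up("chien", R)))
print(sorted(apply_down("cat", R)))
```

Because the crossproduct pairs every upper word with every lower word, both queries return two answers, which is why this relation is not yet a correct translator.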

Exercise for .x.

• What RE would perform the correct translations?

• Define it in xfst.

• Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".

x3c

Replace Rules

• Xerox RE notation includes replace rules.

• Replace rules do not increase the descriptive power of REs; however, they do provide a powerful abbreviated rule-like notation.

• There are two main types of replace rules: unconditional and conditional.

Unconditional Replace Rules

• The most straightforward kind of unconditional replace rule is: a -> b

• This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.

• Checkpoint: how does this differ from a:b? What is the FST that computes this relation?

Unconditional Replace e.g.

xfst[0]: read regex c -> r;

xfst[1]: apply down cat

xfst[1]: apply down dog

• Where there is no match, the string is identity mapped.

• The general pattern for simple Replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
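
For the single-symbol case, the downward direction of an unconditional rule behaves like a plain string substitution. The Python sketch below covers only that special case (one literal symbol replaced by another) and uses a hypothetical helper name.

```python
def replace_down(upper, a, b):
    """Lower-side string for the single-symbol rule a -> b."""
    return upper.replace(a, b)


print(replace_down("cat", "c", "r"))
print(replace_down("dog", "c", "r"))
```

A string with no c passes through unchanged, which is the identity mapping mentioned above; the pair a:b on its own relates only the one-symbol string a to b.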

x4ab

Definition of A → B

• A → B = [no_A [A .x. B]]* no_A

where no_A = ~$[A - 0]

• N.B. if upper (here A) does not contain the empty string, ~$[upper - 0] = ~$[upper]; otherwise ~$[upper] is null, whereas ~$[upper - 0] contains at least the empty string.
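
The reason for subtracting the empty string can be checked on a small universe of strings. This Python sketch makes several assumptions: the alphabet is just a and b, only strings up to length 3 are considered, and the upper language A is a made-up example that contains the empty string.

```python
from itertools import product

SIGMA = "ab"
# All strings over SIGMA of length 0 to 3 stand in for the universal language.
UNIVERSE = {"".join(p) for n in range(4) for p in product(SIGMA, repeat=n)}


def contains(language):
    """$[L] restricted to UNIVERSE: strings with some substring in L."""
    return {s for s in UNIVERSE if any(w in s for w in language)}


def no(language):
    """~$[L] restricted to UNIVERSE."""
    return UNIVERSE - contains(language)


A = {"", "ab"}  # hypothetical upper language containing the empty string

print(no(A))                    # every string contains "", so nothing is left
print(sorted(no(A - {""})))     # strings with no occurrence of "ab"
```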

Conditional Replace Rules

• More complex replace rules can also specify left and right context, as in

A -> B || L _ R

• Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.

• A, B, L and R are REs denoting languages, not relations.
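
When A, B, L and R are single literal strings, the downward direction of a conditional rule can be approximated with lookaround assertions. The following Python sketch is limited to that literal case; the helper name and the example strings are hypothetical.

```python
import re


def conditional_replace(upper, a, b, left="", right=""):
    """Lower-side string for a -> b || left _ right (literal strings only)."""
    pattern = ""
    if left:
        pattern += "(?<=" + re.escape(left) + ")"   # left context, not consumed
    pattern += re.escape(a)
    if right:
        pattern += "(?=" + re.escape(right) + ")"   # right context, not consumed
    # A function replacement keeps b literal even if it contains backslashes.
    return re.sub(pattern, lambda m: b, upper)


print(conditional_replace("cad", "a", "b", left="c", right="d"))
print(conditional_replace("dac", "a", "b", left="c", right="d"))
```

The lookarounds test the contexts on the input string without consuming them, which corresponds to checking both contexts on the upper side.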

x4c

Special Cases

• The symbol .#. refers to the absolute beginning or end of string in left and right contexts. For example

e -> i || .#. p _ r

• Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".
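
The boundary symbol in the example rule plays the same role as a start-of-string anchor in a conventional regular expression. This Python sketch models only the downward direction of that one rule, with test words chosen for illustration.

```python
import re


def e_to_i(upper):
    """e -> i || .#. p _ r : only an e between a string-initial p and an r."""
    # \A anchors at the absolute start; the r is checked but not consumed.
    return re.sub(r"\Ape(?=r)", "pi", upper)


for word in ["per", "super", "pet", "perper"]:
    print(word, "->", e_to_i(word))
```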

The kaNpat exercise

• Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.

• The nasal N just before p gets realised as m.

• p occurring just after an m is realised as m.

kaNpat rules

• We can write the following two rules to account for this behaviour:

Rule 1. [N -> m || _ p]

• Notice that the left-hand context is empty, meaning that any context will do.

Rule 2. [p -> m || m _]

• Note that the linguist must keep track of the order in which rules are applied.

Derivation of kammat

Lexical: kaNpat

apply [N -> m || _ p]

Intermediate: kampat

apply [p -> m || m _]

Surface: kammat

• The first rule feeds the second

• Checkpoint: what happens if rules are applied in reverse order?
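
The derivation can be replayed step by step, in both orders, with two small substitution functions. This Python sketch models only the downward direction of the two rules, using lookahead and lookbehind for the contexts.

```python
import re


def rule1(s):
    """N -> m || _ p"""
    return re.sub(r"N(?=p)", "m", s)


def rule2(s):
    """p -> m || m _"""
    return re.sub(r"(?<=m)p", "m", s)


intermediate = rule1("kaNpat")
surface = rule2(intermediate)
reverse_order = rule1(rule2("kaNpat"))

print(intermediate, surface, reverse_order)
```

In reverse order the second rule finds no m before the p in kaNpat, so it has nothing to do and the feeding relationship is lost.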

Composing the Relations

• Each rule describes a certain relation: call these R1 and R2

• If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.

• Mathematically, that relation is the composition of R1 and R2.
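
For finite relations, composition is a join on the shared middle string. In the Python sketch below, R1 and R2 are small hand-picked fragments of the two kaNpat rule relations (the full relations are infinite), and compose is an illustrative helper.

```python
def compose(r1, r2):
    """Pairs <x, z> such that <x, y> is in r1 and <y, z> is in r2."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}


# Hand-picked fragments: R1 for N -> m || _ p, R2 for p -> m || m _.
R1 = {("kaNpat", "kampat"), ("kampat", "kampat"), ("kammat", "kammat")}
R2 = {("kampat", "kammat"), ("kammat", "kammat"), ("kaNpat", "kaNpat")}

R3 = compose(R1, R2)
print(sorted(R3))
```

R3 relates each lexical string directly to its surface form; the intermediate string no longer appears anywhere in the result.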

Composing the Rules

• Each rule is compiled into an FST.

• If Rule1 compiles to F1, and Rule2 to F2, then there must be an F3 which computes the composition of F1 and F2.

• Checkpoint: write the RE corresponding to the composition of the original 2 rules.

Testing the kaNpat grammar

• First get rules onto stack

xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];

• Try the following and explain

– apply down (kaNpat; kampat; kammat)

– apply up kammat

– Try the above but with rules in reverse order
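
The expected behaviour of the composed grammar can be previewed without xfst. This Python sketch assumes that both rules preserve string length, so apply up can be imitated by brute force over all strings of the same length; the alphabet is limited to the six symbols that occur in the exercise.

```python
import re
from itertools import product


def apply_down(s):
    """[N -> m || _ p] .o. [p -> m || m _], downward direction only."""
    s = re.sub(r"N(?=p)", "m", s)
    return re.sub(r"(?<=m)p", "m", s)


def apply_up(target, alphabet="kaNmpt"):
    """Every same-length string over the alphabet that maps down to target."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(target)))
    return {c for c in candidates if apply_down(c) == target}


for word in ["kaNpat", "kampat", "kammat"]:
    print(word, "->", apply_down(word))
print(sorted(apply_up("kammat")))
```

All three lexical strings collapse onto the same surface form, so applying the grammar upward from kammat returns three candidates.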

X5ab

Practical use of xfst

• Regular expression files (text)

xfst[0]: read regex < regexpfile

• Binary files (compiled networks)

xfst[1]: save stack binfile

xfst[0]: load stack binfile

• Scripts (xfst commands)

xfst[0]: source scriptfile

% xfst -f myscript

% xfst -l myscript

A’ is the sequentiable

[Figure: transducers A and A’, with arcs labelled a, b, a:0, 0:b and b:a]

• Number of states?

• Number of paths?

• Relation encoded?