finite-state methods in natural-language processing: 1—motivation · 2005. 1. 16. · motivation...

1 Motivation — Finite-State Methods in Natural-Language Processing: 1—Motivation Ronald M. Kaplan and Martin Kay


Page 1: Finite-State Methods in Natural-Language Processing: 1—Motivation · 2005. 1. 16. · Motivation — 2 Finite-State Methods in Language Processing The Application of a branch of

1Motivation —

Finite-State Methods in Natural-Language Processing:

1—Motivation

Ronald M. Kaplan

and

Martin Kay


Finite-State Methods in Language Processing

The Application of a branch of mathematics

—The regular branch of automata theory

to a branch of computational linguistics in which what is crucial is (or can be reduced to)

—Properties of string sets and string relations with

—A notion of bounded dependency


Applications

• Finite Languages
—Dictionaries
—Compression

• Phenomena involving bounded dependency
—Morphology
    • Spelling
    • Hyphenation
    • Tokenization
    • Morphological Analysis
—Phonology

• Approximations to phenomena involving mostly bounded dependency
—Syntax

• Phenomena that can be translated into the realm of strings with bounded dependency
—Syntax


Correspondences

Computational Device: Finite-State Automaton

Descriptive Notation: Regular Expression

Set of Objects: Regular Language


The Basic Idea

• At any given moment, an automaton is in one of a finite number of states.

• A transition from one state to another is possible when the automaton contains a corresponding transition.

• The process can stop only when the automaton is in one of a subset of the states, called final.

• Transitions are labeled with symbols so that a sequence of transitions corresponds to a sequence of symbols.
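The four points above can be sketched as a short simulation loop. This is a minimal illustration, not code from the slides: the function name `accepts`, the dictionary encoding of transitions, and the example automaton (strings over a and b that end in "ab") are all assumptions made for the example.

```python
def accepts(transitions, start, finals, symbols):
    """Run the automaton over `symbols`; True if it can stop in a final state."""
    state = start
    for symbol in symbols:
        # A move is possible only if the automaton contains the transition.
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    # The process may stop only in a final state.
    return state in finals


# Hypothetical example automaton: strings over {a, b} that end in "ab".
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

print(accepts(TRANSITIONS, 0, {2}, "baab"))  # True
print(accepts(TRANSITIONS, 0, {2}, "aba"))   # False
```

A symbol with no matching transition (for example "c" here) simply blocks the run, which is one common way to treat a partial transition function.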


Bounded Dependency

The choice between γ1 and γ2 depends on a bounded number of preceding symbols.

[Figure: a string followed by the point marked "?", at which either γ1 or γ2 may come next.]


Bounded Dependency (continued)

[Figure: the same diagram with two occurrences of a state si marked along the string; the material between them is labeled "irrelevant".]


Closure Properties and Operations

• By definition
—Union
—Concatenation
—Iteration

• By deduction
—Intersection
—Complementation
—Substitution
—Reversal
— ...


Operations on Languages and Automata

For the set-theoretic operations on languages there are corresponding operations on automata.

M(L) is a machine that characterizes the language L.

We will use the same symbols for corresponding operations.

M(L1 ⊗ L2) = M(L1) ⊕ M(L2)
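One concrete instance of this correspondence is intersection, which can be computed on machines by the standard product construction. The sketch below is illustrative only: the tuple encoding of a machine, the helper names, and the two example machines (`EVEN_A`, `ENDS_B`) are assumptions, and both machines are assumed to be deterministic with total transition functions over the same alphabet.

```python
from itertools import product


def intersect(m1, m2):
    """Product construction: a machine for L1 ∩ L2 from machines for L1 and L2."""
    states1, alphabet, delta1, start1, finals1 = m1
    states2, _, delta2, start2, finals2 = m2
    states = set(product(states1, states2))
    # The product machine moves both component machines in lockstep.
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    return states, alphabet, delta, (start1, start2), finals


def complement(m):
    """Swap final and non-final states (valid because delta is total)."""
    states, alphabet, delta, start, finals = m
    return states, alphabet, delta, start, states - finals


def accepts(m, string):
    _, _, delta, state, finals = m
    for a in string:
        state = delta[(state, a)]
    return state in finals


ALPHABET = {"a", "b"}
# Hypothetical machines: an even number of a's; strings ending in b.
EVEN_A = ({0, 1}, ALPHABET,
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = ({0, 1}, ALPHABET,
          {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})

BOTH = intersect(EVEN_A, ENDS_B)
print(accepts(BOTH, "aab"), accepts(BOTH, "ab"))  # True False
```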


Automata-based Calculus

• Closure gives:
—Complementation → Universal quantification
—Intersection → Combinations of constraints

• Machines give:
—Finite representations for (potentially) infinite sets
—Practical implementation

• Combination gives:
—Coherence
—Robustness
—Reasonable machine transformations


Quantification

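An overbar denotes complementation.

There is an x followed by a y in the string
$\Sigma^* x y \Sigma^*$
∃y.∃x.precedes(x, y)

There is no xy sequence in the string
$\overline{\Sigma^* x y \Sigma^*}$

There is a y preceded by something that is not an x
$\overline{\Sigma^* x}\, y\, \Sigma^*$

Every y is preceded by an x.
$\overline{\overline{\Sigma^* x}\, y\, \Sigma^*}$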
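The step from existential to universal statements through complementation can be checked mechanically. The following sketch assumes a three-letter alphabet {x, y, z} and reads "preceded" as "immediately preceded"; the names `VIOLATION`, `every_y_after_x` and `direct_check` are invented for the example. It uses Python's `re` module to recognize the violating strings and then takes the complement by negating the result.

```python
import re
from itertools import product

# Existential statement: some y is NOT immediately preceded by an x
# (either y starts the string or the previous symbol is not x).
VIOLATION = re.compile(r"(?:^|[^x])y")


def every_y_after_x(s):
    """Universal statement = complement of the existential violation."""
    return VIOLATION.search(s) is None


def direct_check(s):
    """The same property written as an explicit 'for all' over positions."""
    return all(i > 0 and s[i - 1] == "x" for i, ch in enumerate(s) if ch == "y")


# Exhaustive comparison over all strings of length 0 to 5.
for n in range(6):
    for chars in product("xyz", repeat=n):
        s = "".join(chars)
        assert every_y_after_x(s) == direct_check(s), s
```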


Universal Quantification—i before e except after c

[Figure: automaton over the symbols c, e, i and "other" for the constraint below.]

$\overline{\overline{\Sigma^* c}\, e i\, \Sigma^*}$


Universal Quantification—i before e except after c

[Figure: the same automaton, with its states annotated:]

• After e: no i
• After c: anything
• Not after c or e: anything but ei
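The three annotations translate directly into a three-state transition table. The sketch below is a reconstruction from those annotations, not a transcription of the figure: the state names, the choice to make every state final, and the helper names are assumptions.

```python
def classify(ch):
    """Map a letter to one of the four symbol classes used in the diagram."""
    return ch if ch in ("c", "e", "i") else "other"


# States (names are hypothetical):
#   start    not after c or e
#   after_c  the previous letter was c
#   after_e  the previous letter was an e that did not follow c
DELTA = {
    "start":   {"c": "after_c", "e": "after_e", "i": "start", "other": "start"},
    # After c: anything. An e here is harmless, so it returns to "start".
    "after_c": {"c": "after_c", "e": "start", "i": "start", "other": "start"},
    # After e: no i, so there is no "i" entry.
    "after_e": {"c": "after_c", "e": "after_e", "other": "start"},
}


def obeys_rule(word):
    state = "start"
    for ch in word:
        state = DELTA[state].get(classify(ch))
        if state is None:
            return False      # blocked: "ei" not preceded by c
    return True               # assumed: every state is final


print(obeys_rule("receive"), obeys_rule("believe"), obeys_rule("weird"))
# True True False
```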


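Only ei after c

$\overline{\overline{\Sigma^* c}\, e i\, \Sigma^*} \cap \overline{\Sigma^* c\, i e\, \Sigma^*}$

[Figure: automaton for the intersection, over the symbols c, e, i and "other".]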


Only ei after c (continued)

[Figure: the same automaton, with its states annotated:]

• After e: no i
• After c: not ie
• Not after c or e: anything but ei
• After ci: no e
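With the extra "after ci" state, the table has four states. As a hedged sketch, the code below reconstructs such a table from the four annotations and compares it exhaustively with the two constraints stated as substring conditions. The state names are assumptions, every state is assumed final, and the class "other" is represented by the single letter "o" to keep the enumeration small.

```python
from itertools import product

DELTA = {
    "start":    {"c": "after_c", "e": "after_e", "i": "start", "o": "start"},
    # After c: not ie. An i is remembered; an e is harmless.
    "after_c":  {"c": "after_c", "e": "start", "i": "after_ci", "o": "start"},
    # After e: no i.
    "after_e":  {"c": "after_c", "e": "after_e", "o": "start"},
    # After ci: no e.
    "after_ci": {"c": "after_c", "i": "start", "o": "start"},
}


def accepts(s):
    state = "start"
    for ch in s:
        state = DELTA[state].get(ch)
        if state is None:
            return False
    return True


def by_definition(s):
    """Every 'ei' is immediately preceded by c, and 'cie' never occurs."""
    ei_ok = all(k > 0 and s[k - 1] == "c"
                for k in range(len(s) - 1) if s[k:k + 2] == "ei")
    return ei_ok and "cie" not in s


# Compare the automaton with the definition on all strings up to length 6.
for n in range(7):
    for chars in product("ceio", repeat=n):
        s = "".join(chars)
        assert accepts(s) == by_definition(s), s
```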


Alternative Notations

Closure ⇒ Recursive Formalisms ⇒

Higher-level Constructs

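Choose notation for theoretical significance and practical convenience.

$L_1 \leftarrow L_2 \equiv \overline{\overline{L_1}\, L_2}$

$\overline{\overline{\Sigma^* c}\, e i\, \Sigma^*} \equiv \Sigma^* c \leftarrow e i\, \Sigma^*$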
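Because the basic operations are closed, a higher-level operator such as ← can be defined purely in terms of them. The sketch below models languages as Python predicates rather than automata, which is an assumption made for brevity; the names `concat`, `complement` and `only_after` are hypothetical, and the definition of ← is read as the complement of (complement of L1) followed by L2.

```python
def concat(l1, l2):
    """Concatenation: some split of s has its prefix in L1 and suffix in L2."""
    return lambda s: any(l1(s[:k]) and l2(s[k:]) for k in range(len(s) + 1))


def complement(l):
    return lambda s: not l(s)


def only_after(l1, l2):
    """L1 <- L2, read as: complement(complement(L1) followed by L2)."""
    return complement(concat(complement(l1), l2))


def ends_in_c(s):          # the language Sigma* c
    return s.endswith("c")


def starts_with_ei(s):     # the language ei Sigma*
    return s.startswith("ei")


rule = only_after(ends_in_c, starts_with_ei)
print(rule("receive"), rule("weird"))  # True False
```

The predicate encoding keeps the example short; an automaton-based version would replace each lambda with the corresponding machine construction.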


What is a Finite-State Automaton?

• An alphabet of symbols,

• A finite set of states,

• A transition function from states and symbols to states,

• A distinguished member of the set of states called the start state, and

• A distinguished subset of the set of states called final states.

Pace terminology, same definition as for directed graphs with labeled edges, plus initial and final states.


i to x

[Figure: automaton with states 0 to 5 and transitions labeled i, v and x.]

Unless otherwise marked, the start state is usually the leftmost in the diagram.

We draw final states with a double circle.
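The five components of the definition map naturally onto a small record type. The sketch below is an illustration under assumptions: the class name `FSA`, its field names, and in particular the transition table `ROMAN` are not taken from the diagram. The table is an independently constructed six-state automaton for the Roman numerals i to x, so its state numbering need not match the figure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class FSA:
    alphabet: frozenset   # an alphabet of symbols
    states: frozenset     # a finite set of states
    delta: dict           # (state, symbol) -> state; may be partial
    start: int            # the start state
    finals: frozenset     # the final states

    def accepts(self, string):
        state = self.start
        for symbol in string:
            state = self.delta.get((state, symbol))
            if state is None:
                return False
        return state in self.finals


# Hypothetical numbering: 1 = after "i", 2 = after "v", 3 = after "vi",
# 4 = after "ii" or "vii", 5 = no continuation possible.
ROMAN = FSA(
    alphabet=frozenset("ivx"),
    states=frozenset(range(6)),
    delta={(0, "i"): 1, (0, "v"): 2, (0, "x"): 5,
           (1, "i"): 4, (1, "v"): 5, (1, "x"): 5,
           (2, "i"): 3, (3, "i"): 4, (4, "i"): 5},
    start=0,
    finals=frozenset({1, 2, 3, 4, 5}),
)

ACCEPTED = {"".join(w) for n in range(1, 6)
            for w in product("ivx", repeat=n) if ROMAN.accepts("".join(w))}
print(sorted(ACCEPTED))
```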


Regular Languages

• Languages — sets of strings

• Regular languages — a subset of languages

• Closed under concatenation, union, and iteration

• Every regular language is characterized by (at least) one finite-state automaton

• Languages may contain infinitely many strings but automata are finite


Regular Expressions

• Formulae with operators that denote
—union
—concatenation
—iteration

a* [b | c]

Any number of a’s followed by either b or c.
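For comparison, the same language can be written with Python's `re` module, whose syntax differs slightly from the slide notation: grouping uses parentheses rather than square brackets. The pattern name and the sample strings below are chosen only for illustration.

```python
import re

# a* (b | c): iteration of a, then a union of b and c, joined by concatenation.
PATTERN = re.compile(r"a*(?:b|c)")


def matches(s):
    """True if the whole string belongs to the language."""
    return PATTERN.fullmatch(s) is not None


SAMPLES = ["b", "aac", "ab", "abc", "aa", ""]
print([s for s in SAMPLES if matches(s)])  # ['b', 'aac', 'ab']
```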


Some Motivations

• Word Recognition

• Dictionary Lookup

• Spelling Conventions


Word Recognition

A word recognizer takes a string of characters as input and returns “yes” or “no” according as the word is or is not in a given set.

Solves the membership problem.

e.g. Spell Checking, Scrabble


Approximate Methods

• Has right set of letters (any order).

• Has right sounds (Soundex).

• Random (superimposed) coding (Unix Spell)

[Figure: a word is passed to hash functions hash1, hash2, ..., hashk, each of which addresses a bit table.]
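Superimposed coding with k hash functions over one bit table is the scheme now usually called a Bloom filter. The sketch below is a generic version of that idea and makes no claim about how Unix Spell implemented it: the class name, the table size, the number of hashes, and the use of salted SHA-256 digests as hash functions are all assumptions.

```python
import hashlib


class BitTable:
    """Approximate word set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=4096, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive k bit positions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # A word is reported present only if all k of its bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


table = BitTable()
for w in ["drip", "drips", "drop", "drops"]:
    table.add(w)
print("drop" in table)  # True
```

A word that was never added is usually rejected, but it can be accepted by chance when all of its bits happen to be set by other words, which is why the method counts as approximate.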


Exact Methods

• Hashing

• Search (linear, binary ...)

• Digital search (“Tries”)

[Figure: a trie over a small sample vocabulary; words that begin with the same letters share the initial part of their path.]

Folds together common prefixes
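A trie can be sketched with nested dictionaries. The word list below is invented for the example and is not read off the figure; the `END` marker and the function names are likewise assumptions.

```python
END = "$"  # hypothetical end-of-word marker stored inside a node


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Reuse the existing child when the prefix is already present.
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def count_nodes(node):
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != END)


WORDS = ["back", "backs", "buzz", "buzzed", "buzzing"]
TRIE = build_trie(WORDS)
print(contains(TRIE, "buzzed"), contains(TRIE, "buz"))  # True False
print(sum(len(w) for w in WORDS), count_nodes(TRIE))    # 26 14
```

The 26 letters of the word list occupy only 14 nodes because shared prefixes such as "buzz" are stored once.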


Exact Methods (continued)

• Finite-state automata

Folds together common prefixes and suffixes
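The gain from also folding suffixes can be measured by merging identical subtrees of a trie, which for a finite word list yields the minimal automaton (without a dead state). This is a sketch under assumptions: the node layout, the function names, and the word list (borrowed from the drip/drop example used later in the slides) are illustrative choices, not the authors' implementation.

```python
def build_trie(words):
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(c) for c in node["next"].values())


def count_minimal_states(root):
    """Count distinct subtrees; identical subtrees collapse into one state."""
    registry = {}

    def signature(node):
        key = (node["final"],
               tuple(sorted((ch, signature(child))
                            for ch, child in node["next"].items())))
        # Equal keys mean equal sets of suffixes, hence the same state.
        return registry.setdefault(key, len(registry))

    signature(root)
    return len(registry)


TRIE = build_trie(["drip", "drips", "drop", "drops"])
print(count_trie_states(TRIE), count_minimal_states(TRIE))  # 9 6
```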


Enumeration vs. Description

• Enumeration
—Representation includes an item for each object.
Size = f(Items)

• Description
—Representation provides a characterization of the set of all items.
Size = g(Common properties, Exceptions)
—Adding item can decrease size.
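The last point can be made concrete by counting the states of the minimal automaton for a word list before and after adding a word. The sketch below counts distinct sets of continuations (one per state, by the Myhill-Nerode characterization, ignoring the dead state); the function name and the tiny word lists are assumptions chosen so that the effect is visible.

```python
def minimal_state_count(words):
    """States of the minimal automaton = distinct non-empty suffix sets."""
    words = set(words)
    prefixes = {w[:k] for w in words for k in range(len(w) + 1)}
    suffix_sets = {frozenset(w[len(p):] for w in words if w.startswith(p))
                   for p in prefixes}
    return len(suffix_sets)


# Three of the four two-letter words over {a, b}: "b" must be treated
# differently from "a", because "bb" is an exception.
print(minimal_state_count(["aa", "ab", "ba"]))        # 4

# Adding the missing word removes the exception, and the machine shrinks.
print(minimal_state_count(["aa", "ab", "ba", "bb"]))  # 3
```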


Classification

              Exact            Approximate

Enumeration   Hash table       Soundex
              Binary search

Description   Trie             Unix Spell
              FSM              Right letter


FSM Extends to Infinite Sets

Productive compounding

Kindergartensgeselschaft


Statistics

                   English    Portuguese

Vocabulary
  Words             81,142       206,786
  KBytes               858         2,389
  PKPAK                313           683
  PKZIP                253           602

FSM
  States            29,317        17,267
  Transitions       67,709        45,838
  KBytes               203           124

From Lucchesi and Kowaltowski (1993)


Dictionary Lookup

Dictionary lookup takes a string of characters as input and returns “yes” or “no” according as the word is or is not in a given set and returns information about the word.


Lookup Methods

Approximate — guess the information

If it ends in “ed”, it’s a past-tense verb.

Exact — store the information for finitely many words

Table Lookup

• Hash

• Search

• Trie —store at word-endings.

FSM

• Store at final states?

No suffix collapse — reverts to Trie.


Word Identifiers

Associate a unique, useful identifier with each of n words, e.g. an integer from 1 to n. This can be used to index a vector of dictionary information.

[Figure: word → i, where i selects one of the n entries of a vector labeled Information.]


Pre-order Walk

A pre-order walk of an n-word FSM, counting final states, assigns such integers, even if suffixes are collapsed ⇒ Linear Search.

drip → 1
drips → 2
drop → 3
drops → 4


Suffix Counts

• Store with each state the size of its suffix set

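• Skip irrelevant transitions, incrementing count by destination suffix sizes.

drip → 1
drips → 2
drop → 3
drops → 4

[Figure: the automaton for these four words, with suffix counts 4, 4, 4, 2, 1, 0 on its successive states.]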
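The two bullets can be sketched on the four-word example. Everything below is an illustrative reconstruction: the state numbering in `DELTA`, the function names, and the counting convention (each state stores the number of words that can be completed by a non-empty path from it, which reproduces the counts 4, 4, 4, 2, 1, 0) are assumptions rather than the published algorithm.

```python
# Minimal automaton for drip, drips, drop, drops (hypothetical numbering).
DELTA = {0: {"d": 1}, 1: {"r": 2}, 2: {"i": 3, "o": 3},
         3: {"p": 4}, 4: {"s": 5}, 5: {}}
FINALS = {4, 5}


def suffix_counts(delta, finals):
    """For each state, count the words completed by non-empty paths from it."""
    counts = {}

    def count(state):
        if state not in counts:
            counts[state] = sum((dest in finals) + count(dest)
                                for dest in delta[state].values())
        return counts[state]

    for state in delta:
        count(state)
    return counts


COUNTS = suffix_counts(DELTA, FINALS)


def word_number(word):
    """Return the 1-based alphabetical index of word, or None if absent."""
    state, number = 0, 0
    for ch in word:
        arcs = DELTA[state]
        if ch not in arcs:
            return None
        # Skip transitions that sort before ch, adding the words they cover.
        for label in sorted(arcs):
            if label < ch:
                dest = arcs[label]
                number += (dest in FINALS) + COUNTS[dest]
        state = arcs[ch]
        if state in FINALS:
            number += 1       # a word ends here, at or before the target
    return number if state in FINALS else None


print([COUNTS[s] for s in range(6)])                              # [4, 4, 4, 2, 1, 0]
print([word_number(w) for w in ["drip", "drips", "drop", "drops"]])  # [1, 2, 3, 4]
```

Unlike the pre-order walk, the lookup touches only the states on the word's own path, so its cost depends on the word length and not on the size of the vocabulary.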


• Minimal Perfect Hash (Lucchesi and Kowaltowski)

• Word-number mapping (Kaplan and Kay, 1985)


Spelling Conventions

iN+tractable → intractable

iN+practical → impractical

iN is the common negative prefix
— im before labial
— in otherwise

c.f. input → input


An in/im Transducer

[Figure: the in/im transducer, with two of its states annotated:]

• No exit from this state except over a labial.

• No labials from this state
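The two annotated states suggest a three-state transducer that can be run in either direction. The sketch below is a reconstruction under several assumptions: state 1 is entered on N:m and state 2 on N:n, states 0 and 2 are taken to be final, the labials are taken to be p, b and m, and the morpheme boundary "+" is left out of the lexical forms. The function names are hypothetical.

```python
LABIALS = set("pbm")
FINAL_STATES = {0, 2}   # assumed: a word may not end while a labial is owed


def next_state(state, lex, surf):
    """Destination state for the pair lex:surf, or None if not allowed."""
    if state == 1 and not (lex == surf and lex in LABIALS):
        return None                      # state 1: exit only over a labial
    if state == 2 and surf in LABIALS:
        return None                      # state 2: no labials
    if lex == "N":
        return {"m": 1, "n": 2}.get(surf)
    return 0 if lex == surf else None    # every other symbol maps to itself


def run(word, from_lexical):
    """Follow all paths; collect the other side of each successful path."""
    configs = {(0, "")}
    for ch in word:
        step = set()
        for state, out in configs:
            if from_lexical:
                pairs = [(ch, s) for s in ("mn" if ch == "N" else ch)]
            else:
                pairs = [(l, ch) for l in ch + ("N" if ch in "mn" else "")]
            for lex, surf in pairs:
                dest = next_state(state, lex, surf)
                if dest is not None:
                    step.add((dest, out + (surf if from_lexical else lex)))
        configs = step
    return {out for state, out in configs if state in FINAL_STATES}


def generate(lexical):
    return run(lexical, from_lexical=True)


def analyze(surface):
    return run(surface, from_lexical=False)


print(generate("iNtractable"), generate("iNpractical"), generate("input"))
print(sorted(analyze("intractable")))
```

Generation is deterministic in its result here, while analysis of "intractable" returns both "intractable" and "iNtractable", because a surface n may correspond to either a lexical n or a lexical N.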


Generation — “intractable”

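[Figure: trace of the transducer generating “intractable”. Pairs shown: i:i, then N:m and N:n as alternatives, then t:t, r:r, a:a, with the state number reached after each pair.]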


Generation — “impractical”

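[Figure: trace of the transducer generating “impractical”. Pairs shown: i:i, then N:m and N:n as alternatives, then p:p, r:r, a:a, with the state number reached after each pair.]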


Recognition — “intractable”

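[Figure: trace of the transducer recognizing “intractable”. Pairs shown: i:i, then n:n and N:n as alternatives, each followed by t:t, r:r, a:a, with the state number reached after each pair.]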


Generation — “input”

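[Figure: trace of the transducer for “input”. Pairs shown: i:i, then N:m and N:n as alternatives, then p:p, u:u, t:t, with the state number reached after each pair.]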


A Word Transducer

[Figure: a cascade from Base Forms through Morphology and Spelling Rules to Text Forms, with the labels “Finite-State Transducers” and “Finite-state Machine”.]


Bibliography

Kaplan, Ronald M. and Martin Kay. “Regular models of phonological rule systems.” Computational Linguistics, 20:3, September 1994.

Hopcroft, John E. and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison Wesley, 1979. Chapters 2 and 3.

Partee, Barbara H., Alice ter Meulen and Robert E. Wall. Mathematical Methods in Linguistics. Kluwer, 1990. Chapters 16 and 17.

Lucchesi, Cláudio L. and Tomasz Kowaltowski. “Applications of Finite Automata Representing Large Vocabularies”. Software Practice and Experience. 23:1, January 1993.