computational linguistics introduction

42
Apr 2009 CLINT-LIN: Finite State M achinery Computational Linguistics Introduction Finite State Machinery and Language Description

Upload: levana

Post on 16-Jan-2016

61 views

Category:

Documents


0 download

DESCRIPTION

Computational Linguistics Introduction. Finite State Machinery and Language Description. Acknowledgement. The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001. Today’s Topics. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Computational Linguistics Introduction

Finite State Machinery and Language Description

Page 2: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Acknowledgement

The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001.

Page 3: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Today’s Topics

• Finite State Technology

• Regular Languages and Relations

• Review of Set Theory

• Understand the mathematical operations that can be performed on such Languages.

• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

Page 4: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

What is Finite State Technology?

• Finite State Technology refers to a collection of techniques for application of Finite State Automata (FSA) to a range of linguistically motivated problems.

• Such Techniques include• Design of user languages for specifying FSA• Compilation of such languages into efficient

transition networks.• Development environments and runtime systems

Page 5: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

What is Finite-State Technology Good For?

• Finite-state techniques cannot handle central embedding• the man the dog the cat bit followed ate.

• They are well suited to “lower-level” natural language processing such as • Tokenization – what is the next word?• Spelling error detection: does the next word

belong to a list?• Morphological/phonological analysis/generation• Shallow syntactic parsing and “chunking”

Page 6: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Tokenisation Problems

VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.(example from Mary Dalrymple, University of London)

• VfB Stuttgart, Manchester United• succession• 2-1• Wednesday

• Finite state techniques provide a means to specify the language of words, thus defining what it means to be the next token.

• There are three ways to specify such languages

Page 7: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Languages,Notations and Machines

LANGUAGE(set of strings)

NOTATION MACHINE

Page 8: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Languages,Notations and Machines

FINITE STATELANGUAGE

FINITE STATENOTATION

FINITE STATEAUTOMATON

Page 9: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

FINITE STATE AUTOMATA:preliminary definition

A finite state automaton includes:• A finite set of states• A finite set of labelled transitions between

states

Page 10: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Physical Machines with Finite States

The Lightswitch Machine

OFF ON

UP

DOWN

Page 11: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Physical Machines with Finite States

The Lightswitch Toggle Machine

OFF ON

PUSH

PUSH

Page 12: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

The Five Cent Machine

Problem:

• Assume you have one, two, and five cent pieces

• Design a finite state automaton which accepts exactly 5 cents.

Page 13: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

The Cola Machine

• Need to enter 25 cents (USA) to get a drink

• Accepts the following coins:• Nickel = 5 cents

• Dime = 10 cents

• Quarter = 25 cents

• For simplicity, our machine needs exact change

• We will model only the coin-accepting mechanism

Page 14: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Physical Machines with Finite States

The Cola Machine

0

N

D

Q

N N NN

D D D

5 10 15 20 25

Start State Final State

Page 15: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

The Cola Machine Language

• List of all the sequences of coins accepted:• { Q, DDN, DND, NDD, DNNN, NDNN,

NNDNNNND, NNNNN }

• Think of the coins as SYMBOLS or CHARACTERS

• The set of symbols accepted is the ALPHABET of the machine

• Think of sequences of coins as WORDS or “strings”

• The set of words accepted by the machine is its LANGUAGE

Page 16: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

FINITE STATE AUTOMATA:better definition

A finite state automaton includes:• A finite set of states

• Initial State• Final State (s)

• A finite set of labelled transitions beween states

• Labels are symbols from an alphabet• Recognises a language• Generates a language as well!

Page 17: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

A Network that Accepts aOne Word Language

c a n t o

Page 18: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

A Network that Accepts aThree Word Language

ca n t

o

t i g r e

m e s a

Page 19: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Scaling Up the Network

• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.

• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language

• This is the basis for a Spanish spelling error detector.

Page 20: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Looking Up a Word

ca n t

o

t i g r e

m e s a

m e s a“Apply”

Page 21: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Lookup Failure

Lookup succeeds when all input is consumed and final state is reached. Lookup can fail because:

• Not all input is consumed ("libro", "tigra")• Input is fully consumed but state is not final

("cant")• Final state is reached but there is still

unconsumed output ("mesas")

Page 22: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Shared Structure

c l e a

e

v

r

e

Page 23: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Transducers

m e s a s“Lookup”

“Lookdown”

m e s a +Noun +Fem +Pl

m e s a 0 0 s

mesa+Noun+Fem+Pl

Page 24: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

A Morphological Analyzer

Transducer

dogs

dog +n +pl

Page 25: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

A Morphological Analyzer

Transducer

Surface Language

Lexical Language

Page 26: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

A Quick Review of Set Theory

A set is a collection of objects.

A B

D E

We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.

There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Page 27: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Uniqueness of Elements

You cannot have two or more ‘A’ elements in the same set

A B

D E

{ A, A, D, B, E} is just a redundant specification of the set { A, D, B, E }.

Page 28: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Cardinality of Sets

The Empty Set:

A Finite Set:

An Infinite Set: e.g. The Set of all Positive Integers

Norway Denmark Sweden

Page 29: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Simple Operations on Sets: Union

A B

C

DE

Set 1 Set 2

B C A D E

Union of Set1 and Set 2

Page 30: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Simple Operations on Sets (2): Union

A B

C

CD

Set 1 Set 2

B C A D

Union of Set1 and Set 2

Page 31: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Simple Operations on Sets (3): Intersection

A B

C

CD

Set 1 Set 2

C

Intersection of Set1 and Set 2

Page 32: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Simple Operations on Sets (4): Subtraction

A B

C

CD

Set 1 Set 2

A B

Set 1 minus Set 2

Page 33: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Formal Languages

Very Important Concept in Formal Language Theory:

A Language is just a Set of Words.

• We use the terms “word” and “string” interchangeably.

• A Language can be empty, have finite cardinality, or be infinite in size.

• You can union, intersect and subtract languages, just like any other sets.

Page 34: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Union of Languages (Sets)

dog cat rat elephant mouse

Language 1 Language 2

dog cat rat

elephant mouse

Union of Language 1 and Language 2

Page 35: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Intersection of Languages (Sets)

dog cat rat elephant mouse

Language 1 Language 2

Intersection of Language 1 and Language 2

Page 36: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Intersection of Languages (Sets)

dog cat rat rat mouse

Language 1 Language 2

Intersection of Language 1 and Language 2

rat

Page 37: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Subtraction of Languages (Sets)

dog cat rat rat mouse

Language 1 Language 2

Language 1 minus Language 2

dog cat

Page 38: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Languages

• A language is a set of words (=strings).

• Words (strings) are composed of symbols (letters) that are “concatenated” together.

• At another level, words are composed of “morphemes”.

• In most natural languages, we concatenate morphemes together to form whole words.

For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Page 39: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Concatenation of Languages

work talk walk

Root Language

0 ing ed s

Suffix Language

work working worked works talk talking talked talks walk walking walked walks

The concatenation of the Suffix language after the Root language.

Page 40: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Languages and Networks

w a l k

o r

t

Network/Language 1

Network/Language 2

s

o r

s The concatenation of Network 1 and Network 2

w a l k

t

a

as

ed

i n g

0

s

ed

i n g

0

s

Page 41: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Why is “Finite State” Computing so Interesting?

• Finite-state systems are mathematically elegant, easily manipulated and modifiable.

• Computationally efficient. Usually very compact.• The programming we linguists do is declarative. We describe

the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.

• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.

• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Page 42: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Languages,Notations and Machines

FINITE STATELANGUAGE

FINITE STATENOTATION

FINITE STATEMACHINE