Representation Structures for Computational Linguistics

Gérard Huet

ESSLLI 2002, Trento

Source: gallium.inria.fr/~huet/PUBLIC/Trento.pdf

What the course is about

• A computational platform for Sanskrit
• The ZEN computational morphology toolkit
• Pidgin ML
• The functional programming paradigm for CL
• Concrete programming issues in Objective Caml + Camlp4
• General architecture issues for a CL platform
• Cooperation on free CL resources

Two specific applicative technologies:
• Local processing of focused data
• Sharing

What shall not be discussed

• ML vs C++
• ML vs Java
• ML vs Prolog

What shall not be discussed at length

• Objective CAML vs SML
• ML vs Haskell
• ML vs C
• Pidgin ML vs Objective CAML

Basics: lists vs stacks

    value l5 = [1; 2; 3; 4; 5];
    value s5 = [5; 4; 3; 2; 1];

    value rec unstack l s = match l with
      [ [] -> s
      | [h :: t] -> unstack t [h :: s]
      ];

    value rev l = unstack l [];

    value state3 = ([3; 2; 1], [4; 5]);
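As a minimal sketch, here are the same definitions in standard OCaml syntax rather than Pidgin ML. The helper rebuild is a hypothetical name, not part of the slides; it only illustrates how a state such as state3 (a reversed left part and a right part) denotes the list l5 read up to its third element.

```ocaml
(* unstack l s pushes the elements of l, one by one, onto the stack s. *)
let rec unstack l s =
  match l with
  | [] -> s
  | h :: t -> unstack t (h :: s)

(* Reversing a list is unstacking it onto the empty stack. *)
let rev l = unstack l []

(* The list 1..5 with three elements already read:
   the left part is kept reversed, as a stack. *)
let state3 = ([3; 2; 1], [4; 5])

(* Hypothetical helper: recover the whole list from a (stack, rest) state. *)
let rebuild (left, right) = unstack left right

let () =
  assert (rev [1; 2; 3; 4; 5] = [5; 4; 3; 2; 1]);
  assert (rebuild state3 = [1; 2; 3; 4; 5])
```

Moving the reading head one step costs one cons and one pattern match, which is the locality that zippers generalize from lists to trees.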

Turing machines, Emacs, and Zippers

Zippers. First presentation at FLoC'96. Published as: G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.

Large scale implementations in syntax editors within computational linguistics platforms:
• G. Huet. Lexical morphisms with the Zen platform.
• A. Ranta. Grammatical frameworks.

Contexts as zippers

    type tree = [ Tree of forest ]
    and forest = list tree;

    type tree_zipper =
      [ Top
      | Zip of (forest * tree_zipper * forest)
      ];

    type focused_tree = (tree_zipper * tree);

A focused tree is a tree with a focus point of interest, i.e. a tree and a stacked context.

Operations on focused trees

    value down (z,t) = match t with
      [ Tree(forest) -> match forest with
          [ [] -> raise (Failure "down")
          | [hd :: tl] -> (Zip([],z,tl),hd)
          ]
      ];

    value up (z,t) = match z with
      [ Top -> raise (Failure "up")
      | Zip(l,u,r) -> (u, Tree(unstack l [t :: r]))
      ];

More operations on focused trees

    value left (z,t) = match z with
      [ Top -> raise (Failure "left")
      | Zip(l,u,r) -> match l with
          [ [] -> raise (Failure "left")
          | [elder :: elders] -> (Zip(elders,u,[t :: r]),elder)
          ]
      ];

    value right (z,t) = match z with
      [ Top -> raise (Failure "right")
      | Zip(l,u,r) -> match r with
          [ [] -> raise (Failure "right")
          | [young :: rest] -> (Zip([t :: l],u,rest),young)
          ]
      ];

Applicative updating

    value del_l (z,_) = match z with
      [ Top -> raise (Failure "del_l")
      | Zip(l,u,r) -> match l with
          [ [] -> raise (Failure "del_l")
          | [elder :: elders] -> (Zip(elders,u,r),elder)
          ]
      ];

    value replace (z,_) t = (z,t);
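To see the zipper operations work together, here is a small sketch in standard OCaml syntax (not the Pidgin ML of the slides). The types and functions transcribe the definitions above; the sample tree node and the edit performed on it are illustrative assumptions.

```ocaml
type tree = Tree of tree list

type tree_zipper =
  | Top
  | Zip of tree list * tree_zipper * tree list

let rec unstack l s =
  match l with [] -> s | h :: t -> unstack t (h :: s)

(* Focus on the first child. *)
let down (z, t) =
  match t with
  | Tree [] -> failwith "down"
  | Tree (hd :: tl) -> (Zip ([], z, tl), hd)

(* Rebuild the parent from the elder siblings (stacked), the focus,
   and the younger siblings. *)
let up (z, t) =
  match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (unstack l (t :: r)))

let left (z, t) =
  match z with
  | Top | Zip ([], _, _) -> failwith "left"
  | Zip (elder :: elders, u, r) -> (Zip (elders, u, t :: r), elder)

let right (z, t) =
  match z with
  | Top | Zip (_, _, []) -> failwith "right"
  | Zip (l, u, young :: rest) -> (Zip (t :: l, u, rest), young)

let replace (z, _) t = (z, t)

(* Illustrative example: replace the second child of the root by a leaf. *)
let leaf = Tree []
let node = Tree [leaf; Tree [leaf; leaf]; leaf]

let edited =
  (Top, node) |> down |> right |> (fun ft -> replace ft leaf) |> up |> snd

let () = assert (edited = Tree [leaf; leaf; leaf])
```

Each step allocates a constant number of cells and never touches the rest of the tree, so the original node remains available unchanged after the edit.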

Points of view about focused structures

• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm

Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural.

The Natural Transformation from tree functors to zipper functors is Differentiation; Zippers may also be seen as the linear Functions over Trees.

Back to linguistics

We want to process (parse and generate) natural language sentences, dialogues, and corpora of various kinds (oral, written, news, books, web sites, etc).

We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).

A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) from processing between words and sentences (syntax, parsing). However, the nature of the word is elusive.

What Tesnière has to say

The linguist Tesnière, in his Éléments de Syntaxe Structurale, says (in translation):

"However simple it may seem, the notion of word is one of those whose definition is most delicate for the linguist. This is perhaps because too often one starts from the notion of word to arrive at the notion of sentence, instead of starting from the notion of sentence to arrive at the notion of word. Yet one cannot define the sentence from the word, but only the word from the sentence. For the notion of sentence is logically prior to that of word."

Ontological Problem

What Tesnière really states is self-evident: it is the ontological priority of the Corpus over the Lexicon. The words are found in the Corpus, then copied to the Lexicon; the Language is defined by its Corpus.

The preeminence of the Corpus over the Lexicon is undeniable. Nevertheless, the words are recognized in the corpus relative to the generative devices of morphology; the inversion of these generative relations extends the strict covering of the corpus by the generative capabilities of the grammar; and thus there is a tension between the co-inductive structure of the lexicon as a repository of utterances and the inductive structure of words as generated by morphological devices from stems in the lexicon.

Philosophical considerations

Anecdote. The Thamadas in Georgia.

Puzzles. The 'oui' problem. The 'oiu' problem.

Research topic. Define the functor the fixpoint of which is constructed.

Technology. Chase out hapaxes. Or rather, index properly the diachronic dimension of the language under consideration.

Back to the Lexicon

Words. Words are represented as lists of positive integers.

    type letter = int
    and word = list letter;

We provide coercions encode : string -> word and decode : word -> string.

Here is lexicographic ordering.

    value rec lexico l1 l2 = match l1 with
      [ [] -> True
      | [c1 :: r1] -> match l2 with
          [ [] -> False
          | [c2 :: r2] -> if c2 < c1 then False
                          else if c2 = c1 then lexico r1 r2
                          else True
          ]
      ];
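The slides name the coercions encode and decode without showing them. The sketch below, in standard OCaml syntax, assumes the simplest possible encoding, one letter per character code; this is an assumption made for illustration, and a real platform would more likely use a transliteration table. The function lexico is a direct transcription of the ordering above.

```ocaml
type letter = int
type word = letter list

(* Assumed encoding: a letter is the character code of one byte. *)
let encode (s : string) : word =
  List.init (String.length s) (fun i -> Char.code s.[i])

let decode (w : word) : string =
  String.concat "" (List.map (fun c -> String.make 1 (Char.chr c)) w)

(* lexico l1 l2 holds when l1 precedes or equals l2 in dictionary order. *)
let rec lexico (l1 : word) (l2 : word) : bool =
  match l1, l2 with
  | [], _ -> true
  | _ :: _, [] -> false
  | c1 :: r1, c2 :: r2 ->
      if c2 < c1 then false
      else if c2 = c1 then lexico r1 r2
      else true

let () =
  assert (decode (encode "cat") = "cat");
  assert (lexico (encode "cat") (encode "cats"));
  assert (not (lexico (encode "dog") (encode "cat")))
```

Note that the ordering is reflexive: lexico w w is true for every word w, since a word is a prefix of itself.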

Differential words

    type delta = (int * word);

A differential word is a notation that lets us retrieve a word w from another word w′ sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.

We compute the difference between w and w′ as a differential word diff w w′ = (|w1|, w2) where w = p.w1 and w′ = p.w2, with maximal common prefix p.

The converse of diff : word -> word -> delta is patch : delta -> word -> word: w′ may be retrieved from w and d = diff w w′ as w′ = patch d w.
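The definitions of diff and patch are given above only as equations. A possible implementation, written here in standard OCaml syntax as a sketch rather than as the toolkit's actual code, follows them directly; the local helper drop is a hypothetical name.

```ocaml
type word = int list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2,
   p being the maximal common prefix of w and w'. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* patch (n, u) w goes up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l =
    if k = 0 then l
    else match l with
      | [] -> failwith "patch"
      | _ :: t -> drop (k - 1) t
  in
  List.rev_append (drop n (List.rev w)) u

let () =
  let w = [1; 2; 3; 4] and w' = [1; 2; 5] in
  let d = diff w w' in
  assert (d = (2, [5]));
  assert (patch d w = w')
```

With the common prefix [1; 2], going from [1; 2; 3; 4] to [1; 2; 5] means going up two letters and down along [5], hence the differential word (2, [5]).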

Tries

Tries store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.

    type trie = [ Trie of (bool * forest) ]
    and forest = list (Word.letter * trie);

Tries are managed (search, insertion, etc) using the zipper technology.

Important remarks

Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.

Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

Lexicon

Here is a simplistic lexicon compiler make_lex : list string -> trie:

    value make_lex =
      List.fold_left (fun lex c -> Trie.enter lex (Word.encode c)) Trie.empty;

For instance, with english.lst storing a list of 173528 words, as a text file of size 2 MB, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5 MB.

Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing.
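The functions Trie.enter, Trie.empty and Word.encode are used above but not shown. The following sketch, in standard OCaml syntax, gives one plausible definition using plain recursion instead of the zipper technology mentioned earlier; keeping siblings sorted by letter and encoding letters as character codes are assumptions made here so that equal word sets yield structurally equal tries.

```ocaml
type letter = int
type word = letter list

type trie = Trie of bool * forest
and forest = (letter * trie) list

let empty = Trie (false, [])

(* Assumed encoding: one letter per character code. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

(* mem w t: is w one of the words stored in t? *)
let rec mem (w : word) (Trie (b, forest)) =
  match w with
  | [] -> b
  | c :: rest ->
      (match List.assoc_opt c forest with
       | Some t -> mem rest t
       | None -> false)

(* enter t w returns a new trie containing w; siblings stay sorted. *)
let rec enter (Trie (b, forest)) (w : word) =
  match w with
  | [] -> Trie (true, forest)
  | c :: rest ->
      let rec ins = function
        | [] -> [ (c, enter empty rest) ]
        | (c', t) :: tl when c' = c -> (c', enter t rest) :: tl
        | (c', _) :: _ as l when c' > c -> (c, enter empty rest) :: l
        | arc :: tl -> arc :: ins tl
      in
      Trie (b, ins forest)

let make_lex (words : string list) : trie =
  List.fold_left (fun lex w -> enter lex (encode w)) empty words

let () =
  let lex = make_lex [ "cat"; "cats"; "dog" ] in
  assert (mem (encode "cats") lex);
  assert (not (mem (encode "ca") lex))
```

In this example "cat" and "cats" share the path c-a-t, while the final s of "cats" and of a hypothetical "dogs" would still be stored twice, which is the suffix redundancy that sharing removes.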

The Share Functor

    module Share : functor (Algebra : sig type domain = 'a;
                                          value size : int; end)
      -> sig value share : Algebra.domain -> int -> Algebra.domain; end;

That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.

We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.

Memoizing

    type bucket = list Algebra.domain;

    value memo = Array.create Algebra.size ([] : bucket);

We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.

    value search e = List.find (fun x -> x = e);

The share function

    value share element key =
      let bucket = memo.(key) in
      try search element bucket with
      [ Not_found -> do { memo.(key) := [element :: bucket]; element } ];

Sharing is just recalling!
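Putting the last three slides together, the functor can be sketched in standard OCaml syntax as follows. This is an illustration under assumptions, not the toolkit's source: the signature name ALGEBRA, the instance StringShare, its table size 16 and the key 5 are all hypothetical, and Array.make stands in for the Array.create used on the slide.

```ocaml
module type ALGEBRA = sig
  type domain
  val size : int
end

module Share (Algebra : ALGEBRA) : sig
  val share : Algebra.domain -> int -> Algebra.domain
end = struct
  (* memo.(k) is the bucket of elements already seen with key k. *)
  let memo : Algebra.domain list array = Array.make Algebra.size []

  (* Return the stored representative equal to element if there is one;
     otherwise record element as the representative.
     Precondition: 0 <= key < Algebra.size. *)
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some shared -> shared
    | None ->
        memo.(key) <- element :: bucket;
        element
end

(* Hypothetical instance on strings, with a tiny table. *)
module StringShare = Share (struct
  type domain = string
  let size = 16
end)

let () =
  let a = StringShare.share (String.make 3 'x') 5 in
  let b = StringShare.share (String.make 3 'x') 5 in
  (* Two separately allocated equal strings now have one representative. *)
  assert (a == b)
```

The functor never computes a key itself: the caller supplies it, so the quality of the hashing is entirely the client's concern.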

Compressing trees as dags

We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:

    module Dag = Share (struct type domain = tree;
                               value size = hash_max; end);

And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along by hashing.

For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.

Dynamic programming

Bottom-up traversing with inductive hash-code computation.

    value hash1 key index sum = sum + index*key
    and hash forest = forest mod hash_max;

    value traverse lookup = travel
      where rec travel = fun
      [ Tree(forest) ->
          let f (tries,index,span) t =
            let (t0,k) = travel t
            in ([t0 :: tries], index+1, hash1 k index span)
          in let (forest0,_,span) = List.fold_left f ([],1,0) forest
          in let key = hash span
          in (lookup (Tree(rev forest0)) key, key)
      ];

Compressing a tree as a dag

Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:

    value compress = traverse Dag.share;

    value minimize tree = let (dag,_) = compress tree in dag;
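The sharing traversal can be tried out on tries with the sketch below, in standard OCaml syntax. It is an adaptation under stated assumptions rather than the code of the slides: it works on tries instead of plain trees, inlines a non-functorial share, picks an arbitrary hash_max of 9689, and uses a hash formula (letter times child key plus one, plus one for accepting nodes) chosen here only for illustration.

```ocaml
type trie = Trie of bool * (int * trie) list

let hash_max = 9689 (* assumed table size *)

let memo : trie list array = Array.make hash_max []

(* Return the stored node equal to t, recording t if it is new. *)
let share (t : trie) (key : int) : trie =
  let bucket = memo.(key) in
  match List.find_opt (fun x -> x = t) bucket with
  | Some shared -> shared
  | None ->
      memo.(key) <- t :: bucket;
      t

(* Bottom-up traversal: rebuild each node from its processed children,
   compute its key from theirs, and pass both to lookup.
   Letters are assumed positive, so keys stay within bounds. *)
let traverse lookup =
  let rec travel (Trie (b, arcs)) =
    let f (arcs0, span) (c, t) =
      let (t0, k) = travel t in
      ((c, t0) :: arcs0, span + c * (k + 1))
    in
    let (arcs0, span) = List.fold_left f ([], if b then 1 else 0) arcs in
    let key = span mod hash_max in
    (lookup (Trie (b, List.rev arcs0)) key, key)
  in
  travel

let compress = traverse share

let minimize t = fst (compress t)

let () =
  (* Two words ending with the same suffix, under letters 1 and 2. *)
  let suffix () = Trie (false, [ (19, Trie (true, [])) ]) in
  let t = Trie (false, [ (1, suffix ()); (2, suffix ()) ]) in
  match minimize t with
  | Trie (_, [ (_, x); (_, y) ]) -> assert (x == y)
  | _ -> assert false
```

After minimization the two suffix subtries are one and the same node in memory, while the result remains structurally equal to the input.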

Advantages and extensions

Computing hash keys and choosing the table size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:

• The computation is fully linear
• It is adapted to the statistics of the data

Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.

Dagified lexicons

We may dagify a lexicon a posteriori in one pass:

    value rec dagify () =
      let lexicon = (input_value stdin : Trie.trie)
      in let dag = Mini.minimize lexicon
      in output_value stdout dag;

Or we may maintain a dagified structure by sharing dynamically when inserting words, by appropriate modification of the zipper operations.

And now if we apply this technique to our English lexicon, with command dagify < english.rem > small.rem, we get an optimal representation which only needs 1 MB of storage, half of the original ASCII string representation.

Pub

The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.

In our Sanskrit application, the trie of 11500 entries is shrunk from 219 KB to 103 KB in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63 MB to 140 KB in 0.5s on an 864 MHz PC. Our trie of 173528 English words is shrunk from 4.5 MB to 1 MB in 2.7s. Measurements showed that the time complexity is linear with the size of the lexicon (within comparable sets of words).

Variations

Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.

Variation with ternary trees. Ternary trees are inspired by Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than in tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.

The ternary version of our English lexicon takes 3.6 MB, a saving of 20% over its trie version using 4.5 MB. After dag minimization, it takes 1 MB, a saving of 10% over the trie dag version using 1.1 MB. For our Sanskrit lexicon index, the trie takes 221 KB and the tertree 180 KB. Shared as dags the trie takes 103 KB and the tertree 96 KB.

Decos, Lexmaps, Autos

We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation, in the case of Boolean annotations, shared by prefix arguments (and by common subexpressions when shared).

We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.

The more sharing we get, the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphism decorations.

Decos

    type deco 'a = [ Deco of (list 'a * dforest 'a) ]
    and dforest 'a = list (Word.letter * deco 'a);

We think of the decoration of type list 'a as an information associated with the word stored at that node.

We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.

Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.
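To make the deco type concrete, the sketch below, in standard OCaml syntax, adds two operations that the slides do not show. The names find and add, the gender type, and the sample entries are hypothetical; they only illustrate how a list of annotations sits at the node of each word.

```ocaml
type letter = int
type word = letter list

type 'a deco = Deco of 'a list * 'a dforest
and 'a dforest = (letter * 'a deco) list

let empty = Deco ([], [])

(* find w d: the annotations attached to word w, or [] if w is absent. *)
let rec find (w : word) (Deco (info, forest)) =
  match w with
  | [] -> info
  | c :: rest ->
      (match List.assoc_opt c forest with
       | Some d -> find rest d
       | None -> [])

(* add d w x: attach annotation x to word w, keeping siblings sorted. *)
let rec add (Deco (info, forest)) (w : word) x =
  match w with
  | [] -> Deco (x :: info, forest)
  | c :: rest ->
      let rec ins = function
        | [] -> [ (c, add empty rest x) ]
        | (c', d) :: tl when c' = c -> (c', add d rest x) :: tl
        | (c', _) :: _ as l when c' > c -> (c, add empty rest x) :: l
        | arc :: tl -> arc :: ins tl
      in
      Deco (info, ins forest)

(* Hypothetical use: recording the genders of noun stems. *)
type gender = Mas | Neu | Fem

let genders =
  List.fold_left
    (fun d (w, g) -> add d w g)
    empty
    [ ([1; 2], Mas); ([1; 2], Neu); ([1; 3], Fem) ]

let () =
  assert (find [1; 2] genders = [Neu; Mas]);
  assert (find [1; 3] genders = [Fem]);
  assert (find [9] genders = [])
```

A trie is recovered as the special case where the annotation list is either empty or a single trivial mark, playing the role of the Boolean flag.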

Encoding morphological parameters as decorations

We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information.

Thus if all plurals are obtained by adding 's' to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. This is as opposed to listing explicitly the plural form, which would undo all sharing.

In our Sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.

(Aside) The scoping structure of the lexicon

How to find the stem associated with a gender in the lexicon in one click so that morphology may be displayed - with no need of script or applet.

Simple distributed architecture - all the computation is done on the server side.

Maintaining computational invariants in the lexicon augments its robustness.

Explicit morphology vs implicit morphology

By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.

By implicit morphology I mean just having programs which will generate these flexed forms on demand.

Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.

Compromise

On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be both considered a program and a piece of data; for instance, a trie stores words, but actually the words are "recognized as being in the lexicon" by "running the lexicon over them as input data".

Thus we shall represent "explicitly" flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to "undo morphology" locally. For this purpose, we shall use the notion of differential word above.

We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure. This way we bypass the (hard) problem of internal sandhi fsm axiomatisation.

Lexmaps

    type inverse 'a = (Word.delta * 'a)
    and inverse_map 'a = list (inverse 'a);

    type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
    and mforest 'a = list (Word.letter * lexmap 'a);

Typically, if word w is stored at a node Map([...; (d,r); ...], ...), this represents the fact that w is the image by relation r of w′ = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon.

This representation is invertible, while preserving maximally the sharing of prefixes, and thus being amenable to sharing.

Example: cats and dogs sharing their 's' node while implicitly referring to their respective singular stem.
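The cats and dogs example can be worked out as a sketch in standard OCaml syntax. Everything beyond the lexmap type and the equation w′ = patch d w is an assumption made for illustration: the function analyses, the helper arc, the relation tag "pl", and the character-code encoding are hypothetical names and choices.

```ocaml
type letter = int
type word = letter list
type delta = int * word

type 'a inverse = delta * 'a
type 'a inverse_map = 'a inverse list

type 'a lexmap = Map of 'a inverse_map * 'a mforest
and 'a mforest = (letter * 'a lexmap) list

let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

(* patch (n, u) w: drop the last n letters of w, then append u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l =
    if k = 0 then l
    else match l with [] -> failwith "patch" | _ :: t -> drop (k - 1) t
  in
  List.rev_append (drop n (List.rev w)) u

(* analyses m w: every (relation, stem) pair recorded at the node of w. *)
let analyses lexmap w =
  let rec go rest (Map (inv, forest)) =
    match rest with
    | [] -> List.map (fun (d, r) -> (r, patch d w)) inv
    | c :: tl ->
        (match List.assoc_opt c forest with
         | Some m -> go tl m
         | None -> [])
  in
  go w lexmap

(* One node for the plural ending: "go up one letter, add nothing". *)
let plural = Map ([ ((1, []), "pl") ], [])

(* A node with no annotation and a single arc labelled c. *)
let arc c m = Map ([], [ (Char.code c, m) ])

(* "cats" and "dogs" both end in the very same plural node. *)
let lexmap =
  Map ([], [ (Char.code 'c', arc 'a' (arc 't' (arc 's' plural)));
             (Char.code 'd', arc 'o' (arc 'g' (arc 's' plural))) ])

let () =
  assert (analyses lexmap (encode "cats") = [ ("pl", encode "cat") ]);
  assert (analyses lexmap (encode "dogs") = [ ("pl", encode "dog") ])
```

Because the decoration (1, []) mentions neither "cat" nor "dog", the node carrying it is identical for both words and can be shared; storing the stems explicitly would make the two nodes differ.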

Lexicon repositories using tries and decos

In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.

Example: Sanskrit

The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically.

The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103 KB.

When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268 KB.

We iterate a grammatical engine over this genders structure; the engine generates declined forms. The resulting lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341 KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140 KB.

Finite State Lore

Computational phonology and morphology use finite state technology extensively: rational languages and relations, transducers, bimachines, etc.

• Schützenberger
• Koskenniemi
• Kaplan and Kay

Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machine operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc.

Compiling complex rewrite rules in rational transducers may be subtle. We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.

Finite State Machines as Lexicon Morphisms

We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite language recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.

In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.

Page 42: Represen tation G - Inriagallium.inria.fr/~huet/PUBLIC/Trento.pdfrepresen tation-e ciency The In teraction Com binators P aradigm Remark. Zipp ers are linear con texts. They are sup

Unglueing

We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes).

Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, ‘spell’ has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.

The transducer is defined as a functor, taking the lexicon trie structure as parameter.


Unglue

    module Unglue (Lexicon: sig value lexicon : Trie.trie; end) = struct

    type input = Word.word           (* input sentence as a word *)
    and output = list Word.word;     (* output is sequence of words *)

    type backtrack = (input * output)
    and resumption = list backtrack; (* coroutine resumptions *)

    exception Finished;

We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).


The reactive engine

The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input, output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.


The reactive engine code

    value rec react input output back occ = fun
      [ Trie (b, forest) ->
          if b then
            let pushout = [ occ :: output ] in
            if input = [] then (pushout, back) (* solution found *)
            else let pushback = [ (input, pushout) :: back ] in
                 continue pushback
          else continue back
          where continue cont = match input with
            [ [] -> backtrack cont
            | [ letter :: rest ] ->
                try let next_state = List.assoc letter forest in
                    react rest output cont [ letter :: occ ] next_state
                with [ Not_found -> backtrack cont ]
            ]
      ]


Backtrack

    and backtrack = fun
      [ [] -> raise Finished
      | [ (input, output) :: back ] ->
          react input output back [] Lexicon.lexicon
      ];

Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.

    value unglue sentence = backtrack [ (sentence, []) ];
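For readers more used to standard OCaml syntax than to Pidgin ML, the engine above can be transliterated roughly as follows. This is a sketch under stated assumptions, not the ZEN source: words are char lists, the lexicon is passed as an ordinary argument instead of a functor parameter, and explode, implode, insert, make_lex and this version of unglue_all (which collects solutions in a list instead of printing them) are hypothetical helpers.

```ocaml
type trie = Trie of bool * (char * trie) list
type word = char list

(* Assumed helpers standing in for the course's Word.encode and make_lex. *)
let explode s = List.init (String.length s) (String.get s)
let implode w = String.of_seq (List.to_seq w)

let rec insert (w : word) (Trie (b, arcs)) =
  match w with
  | [] -> Trie (true, arcs)
  | c :: rest ->
    let sub = try List.assoc c arcs with Not_found -> Trie (false, []) in
    let arcs' = (c, insert rest sub) :: List.remove_assoc c arcs in
    Trie (b, List.sort (fun (x, _) (y, _) -> compare x y) arcs')

let make_lex ws =
  List.fold_left (fun t w -> insert (explode w) t) (Trie (false, [])) ws

exception Finished

(* [react] reads the input along the trie; [occ] is the reversed prefix
   read so far, [back] the stack of (input, output) resumption points. *)
let rec react lexicon input output back occ (Trie (b, forest)) =
  let continue cont =
    match input with
    | [] -> backtrack lexicon cont
    | letter :: rest ->
      (match List.assoc_opt letter forest with
       | Some next -> react lexicon rest output cont (letter :: occ) next
       | None -> backtrack lexicon cont)
  in
  if b then
    let pushout = occ :: output in
    if input = [] then (pushout, back) (* solution found *)
    else continue ((input, pushout) :: back) (* favor longer words first *)
  else continue back

and backtrack lexicon = function
  | [] -> raise Finished
  | (input, output) :: back -> react lexicon input output back [] lexicon

let unglue lexicon sentence = backtrack lexicon [ (explode sentence, []) ]

(* Resume until Finished, collecting each segmentation as a string list.
   Outputs hold reversed words, last word first, hence the two reversals. *)
let unglue_all lexicon sentence =
  let decode out = List.rev_map (fun w -> implode (List.rev w)) out in
  let rec loop cont acc =
    match backtrack lexicon cont with
    | (out, resumption) -> loop resumption (decode out :: acc)
    | exception Finished -> List.rev acc
  in
  loop [ (explode sentence, []) ] []
```

Because every state is immutable data, a resumption is just a value: calling backtrack on it again is all that is needed to ask for the next solution.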


Remark

Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG blackbox?

The three golden rules of non-deterministic programming:

• Identify well your search state space

• Represent states as non-mutable data

• Prove termination

The last point is essential for understanding the granularity and enforcing completeness.


More on state space considerations

This non-deterministic process (recognizing L∗) uses the same state space as the lexicon/trie (recognizing L).

This corresponds to the fact that an automaton for L∗ may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node. But such transitions may be kept completely implicit.

All you have to do is to manage the necessary non-determinism (continuing in L, which is not in general a prefix language, i.e. it may happen that both w and w·s are in L, versus iterating) in the backtrack stack, but you do not have to modify at all the state space data structure. It is just a shift in point of view concerning this data.


Still more on state space considerations

Remember that dagified tries define the minimal automaton of a finite language L.

But it is not the case that this automaton, completed with ε transitions, is minimal for L∗. Consider for instance L = {a, aa}.

However, note that we are using it as a transducer computing justifications for a word in L∗ to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.
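The example L = {a, aa} can be checked with a small brute-force sketch in standard OCaml (the function name segmentations is hypothetical, and this is a naive enumeration, not the trie-driven engine). A one-state automaton suffices to recognize L∗ here, yet the word aaa has three distinct justifications, which a transducer must keep apart.

```ocaml
(* All ways to write [s] as a concatenation of words of [lexicon].
   Naive enumeration, for illustration only. *)
let segmentations lexicon s =
  let n = String.length s in
  let rec from i =
    if i = n then [ [] ]
    else
      List.concat_map
        (fun w ->
          let k = String.length w in
          if k > 0 && i + k <= n && String.sub s i k = w
          then List.map (fun rest -> w :: rest) (from (i + k))
          else [])
        lexicon
  in
  from 0
```

With the lexicon ["a"; "aa"], the word "aaa" yields the three segmentations a a a, a aa and aa a.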


Child talk

    module Childtalk = struct
      value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
    end;

    module Childish = Unglue(Childtalk);

    let (sol, _) = Childish.unglue (Word.encode "pipicacaboudin")
    in Childish.print_out sol;

We recover as expected: pipi caca boudin.


Generating several solutions

We resume a resumption with resume : (resumption -> int -> resumption).

    value resume cont n =
      let (output, resumption) = backtrack cont in
      do { print_string "\n Solution "; print_int n; print_string ":\n";
           print_out output; resumption };

    value unglue_all sentence = restore [ (sentence, []) ] 1
      where rec restore cont n =
        try let resumption = resume cont n in
            restore resumption (n+1)
        with [ Finished ->
                 if n=1 then print_string "No solution found\n" else () ];


Solving a charade

    module Short = struct
      value lexicon = Lexicon.make_lex
        ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
    end;

    module Charade = Unglue(Short);

    Charade.unglue_all (Word.encode "amiabletogether");

    Solution 1: amiable together
    Solution 2: amiable to get her
    Solution 3: am i able together
    Solution 4: am i able to get her


Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

    [x] u | v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.


[Figure: juncture diagram; labels u, v, w, x]


[Figure: juncture diagram; labels z, u, v, w, x]


Auto

    type lexicon = trie
    and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u|v → w.

Now for the transducer state space:

    type auto = [ State of (bool * deter * choices) ]
    and deter = list (letter * auto)
    and choices = list rule;

    module Auto = Share (struct type domain = auto;
                                value size = hash_max; end);
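To make the reading of a rule u|v → w concrete, here is a minimal sketch in standard OCaml of applying one rule in the generative direction, at the juncture of two words. It rests on several assumptions: words are plain strings in the transliteration used by the course's examples, u is stored unreversed, the optional left context [x] is ignored, and the record type rule and the functions ends_with, starts_with and apply are hypothetical names.

```ocaml
(* A sandhi rule u|v -> w over strings (assumption: u is NOT reversed
   here, unlike the (rev u, v, w) triples of the transducer). *)
type rule = { u : string; v : string; w : string }

let ends_with suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

let starts_with prefix s =
  let ls = String.length s and lx = String.length prefix in
  ls >= lx && String.sub s 0 lx = prefix

(* Glue [left] and [right] if [left] ends with u and [right] starts
   with v, replacing u followed by v with w at the juncture. *)
let apply r left right =
  if ends_with r.u left && starts_with r.v right then
    let lu = String.length r.u and lv = String.length r.v in
    let keep_left = String.sub left 0 (String.length left - lu) in
    let keep_right = String.sub right lv (String.length right - lv) in
    Some (keep_left ^ r.w ^ keep_right)
  else None
```

With the rule d|"s → cch, gluing tad and "srutvaa gives tacchrutvaa; segmentation runs this relation backwards, recovering the two words from the glued chunk.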


Compiling the lexicon to a minimal transducer

    (* build_auto : word -> lexicon -> (auto * stack * int) *)
    value rec build_auto occ = fun
      [ Trie (b, arcs) ->
          let local_stack = if b then get_sandhi occ else [] in
          let f (deter, stack, span) (n, t) =
            let current = [ n :: occ ] (* current occurrence *) in
            let (auto, st, k) = build_auto current t in
            ([ (n, auto) :: deter ], merge st stack, hash1 n k span) in
          let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
          let (h, l) = match stack with
            [ [] -> ([], [])
            | [ h :: l ] -> (h, l)
            ] in
          let key = hash b span h in
          let s = Auto.share (State (b, deter, h)) key in
          (s, merge local_stack l, key)
      ];
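The call Auto.share above relies on the Share functor, whose definition is not shown on these slides. The following is only a plausible sketch of such a hash-consing functor in standard OCaml, assuming that keys are integers in the range 0 to size-1 computed by the caller, and that equal elements are always presented with equal keys; the bucket layout and the example instance Strings are assumptions.

```ocaml
(* A sketch of sharing by hash-consing: [share elt key] returns the
   unique stored representative structurally equal to [elt].
   Assumption: 0 <= key < D.size, and equal elements get equal keys. *)
module Share (D : sig type domain val size : int end) = struct
  (* One bucket of already-seen elements per key. *)
  let memo : D.domain list array = Array.make D.size []

  let share (elt : D.domain) (key : int) : D.domain =
    let bucket = memo.(key) in
    match List.find_opt (fun e -> e = elt) bucket with
    | Some e -> e (* reuse the existing representative *)
    | None ->
      begin
        memo.(key) <- elt :: bucket;
        elt
      end
end

(* Hypothetical instance, for illustration only. *)
module Strings = Share (struct type domain = string list let size = 16 end)
```

Built bottom-up as in build_auto, every subtree is replaced by its representative before its parent is shared, so equal states end up physically identical and the tree collapses into a dag.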


Segmenting Transducer Data Structures

    type transition =
      [ Euphony of rule (* (rev u, v, w) st u|v -> w *)
      | Id              (* identity or no sandhi *)
      ]
    and output = list (word * transition);

    type backtrack =
      [ Next of (input * output * word * choices)
      | Init of (input * output)
      ]
    and resumption = list backtrack; (* coroutine resumptions *)

    exception Finished;


Running the Segmenting Transducer

    value rec react input output back occ = fun
      [ State (b, det, choices) ->
          (* we try the deterministic space first *)
          let deter cont = match input with
            [ [] -> backtrack cont
            | [ letter :: rest ] ->
                try let next_state = List.assoc letter det in
                    react rest output cont [ letter :: occ ] next_state
                with [ Not_found -> backtrack cont ]
            ] in
          let nondets = if choices = [] then back
                        else [ Next (input, output, occ, choices) :: back ] in
          if b then
            let out = [ (occ, Id) :: output ] (* opt final sandhi *) in
            if input = [] then (out, nondets) (* solution *)
            else let alterns = [ Init (input, out) :: nondets ]
                 (* we first try the longest matching word *) in
                 deter alterns
          else deter nondets
      ]
    and choose input output back occ = fun
      [ [] -> backtrack back
      | [ ((u, v, w) as rule) :: others ] ->
          let alterns = [ Next (input, output, occ, others) :: back ] in
          if prefix w input then
            let tape = advance (length w) input
            and out = [ (u @ occ, Euphony (rule)) :: output ] in
            if v = [] (* final sandhi *) then
              if tape = [] then (out, alterns) else backtrack alterns
            else let next_state = access v in
                 react tape out alterns v next_state
          else backtrack alterns
      ]
    and backtrack = fun
      [ [] -> raise Finished
      | [ resume :: back ] -> match resume with
          [ Next (input, output, occ, choices) ->
              choose input output back occ choices
          | Init (input, output) ->
              react input output back [] automaton
          ]
      ];


Example of Sanskrit Segmentation

    process "tacchrutvaa";

    Chunk: tacchrutvaa
    may be segmented as:

    Solution 1:
    [ tad with sandhi d|"s -> cch]
    [ "srutvaa with no sandhi]


More examples

    process "o.mnama.h\"sivaaya";

    Solution 1:
    [ om with sandhi m|n -> .mn]
    [ namas with sandhi s|"s -> .h"s]
    [ "sivaaya with no sandhi]

    process "sugandhi.mpu.s.tivardhanam";

    Solution 1:
    [ sugandhim with sandhi m|p -> .mp]
    [ pu.s.ti with no sandhi]
    [ vardhanam with no sandhi]


Sanskrit Tagging

    process "sugandhi.mpu.s.tivardhanam";

    Solution 1:
    [ sugandhim < acc. sg. m. [sugandhi] > with sandhi m|p -> .mp]
    [ pu.s.ti < iic. [pu.s.ti] > with no sandhi]
    [ vardhanam < acc. sg. m. | acc. sg. n. | nom. sg. n. | voc. sg. n. [vardhana] > with no sandhi]


Statistics

The complete automaton construction from the flexed forms lexicon takes only 9s on a 864MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However, in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.


Overgeneration Problems

Very short particles have to be treated differently, or otherwise there would be intolerable overgeneration. Probably prosody will have to come to the rescue. The case of Vedic “u”.

Compounds. The bahuvrīhi problem.

Intrinsic overgeneration. a+a = a+ā = ā+a = ā+ā = ā. Most s.m. end with a, many s.f. end with ā, the preverb ā (towards) is frequent, the prefix a is common (negation). So there is often room for interpretation!

E.g.
    na asato vidyate bhāvo na abhāvo vidyate sataḥ
vs
    na asato vidyate abhāvo na abhāvo vidyate sataḥ

Double entendre poetry.


Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.

Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps


Where is the information?

Mel’čuk says “Everything is in the lexicon”.

The key concept is lexicon directed. So most of the information is indeed in the lexicon.

But a lot of phonological information (sandhi rules) and grammatical knowledge is in the code.

If time permits. A tour of the dictionary structures.


Enjoy!

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/

• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps

• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps

• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps

• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar

• Objective Caml: http://caml.inria.fr/ocaml/


What next (on the Sanskrit front)

• Sanskrit 1. Verb morphology, Corpus testing, Lexicon acquisition mode, Segmentation training, Philology assistant (Scharf, Smith)

• Sanskrit 2. Sentinels, Prosody, Valency checking, Dependency synthesis

• Sanskrit 3. Discourse analysis: Reference, Scope, Theme, Focus, Anaphora resolution, Extra-linguistic information

• Sanskrit ∞. Distributed development of multilingual tools, Saving the Pune dictionary project


What next (on the Zen front)

• Zen maintenance. Distribution, Hotline, Users’ club, Coordination of extensions

• Zen immediate extensions. Grafting of regular relations, Rules compiler

• Towards a more comprehensive generic platform for computational linguistics, accommodating the levels of Syntax, Semantics, and Discourse Information Dynamics
