Multi-Layer Filtering algorithm
Bilingual Chunk Alignment In Statistical Machine
Translation
An introduction to the Multi-Layer Filtering (MLF) algorithm
Dawei Hou
LING 575 MT WIN07
What is the “Chunk” here?
In this paper:
The “Chunk” doesn’t rely on information from tagging, parsing, syntax analysis or segmentation
A “Chunk” is simply a contiguous sequence of words
Why do we use “Chunks” in translation?
Leads to more fluent translations, since chunk-based translation captures local reordering phenomena.
Makes long sentences shorter, which benefits the SMT algorithm’s performance.
Obtains accurate one-to-one alignment of each pair of bilingual chunks.
Greatly decreases search space and time complexity during translation.
What about other approaches?
What about word-based translations?
Some background
SMT systems employ word-based alignment models based on the five word-based statistical models proposed by IBM.
Problem:
They still suffer from poor performance on language pairs with great structural differences, since these models fundamentally rely on word-level translation.
Some background
Alignment algorithms based on phrases, chunks or structures, most of which rely on complex syntactic information.
Problems:
They have proven to yield poor performance when dealing with long sentences;
They heavily depend on the performance of associated tools such as parsers, POS taggers ...
How does chunk-based translation improve on these problems?
Multi-Layer Filtering algorithm
To discover one-to-one pairs of bilingual chunks in untagged, well-formed bilingual sentence pairs,
multiple filtering layers are used to extract bilingual chunks according to different features of the
chunks in the bilingual corpus.
Summary of the procedure
Filter the most frequent chunks
Cluster similar words and filter the most frequent structures
Deal with the remnant fragments
Keep one-to-one alignment
Filtering the most frequent chunks -- Step 1
Assumption:
The most frequently co-occurring word sequences are potential chunks.
Applying formula-1 listed below, we filter those word sequences as initial monolingual chunks;
formula-1:  D_k = D(w_1, w_2, …, w_k) = λ · MI(w_1, w_2, …, w_k) + (1 − λ) · P(w_1, w_2, …, w_k)

formula-2:  MI(w_1, w_2, …, w_k) = P(w_1, w_2, …, w_k) · log [ P(w_1, w_2, …, w_k) / ( P(w_1) · P(w_2) · … · P(w_k) ) ]
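As a rough sketch of how formulas 1 and 2 combine, with probabilities estimated from corpus counts (the interpolation weight λ and the count-based estimates are assumptions; the slides do not give them):

```python
from math import log

def cohesion_degree(ngram_count, unigram_counts, total_tokens, lam=0.5):
    """D_k for a candidate chunk w1..wk (formula-1), combining the
    mutual information MI (formula-2) with the sequence probability.
    `lam` is an assumed interpolation weight."""
    p_seq = ngram_count / total_tokens          # P(w1,...,wk)
    p_indep = 1.0
    for c in unigram_counts:                    # P(w1) * ... * P(wk)
        p_indep *= c / total_tokens
    mi = p_seq * log(p_seq / p_indep)           # formula-2
    return lam * mi + (1.0 - lam) * p_seq       # formula-1
```

A sequence that co-occurs far more often than chance gets a large MI term and hence a large cohesion degree D_k.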
The result of Filtering Step 1
An example — cohesion degrees between adjacent words:

What || kind || of || room || do || you || want || to || reserve
   1.36   1.31   0.046   0.063   10.07   0.61   2.11   0.077

你 || 想 || 预 || 定 || 什 || 么 || 样 || 的 || 房 || 间
  0.69  0.17  1.39  0.076  7.80  0.87  0.30  1.27  4.52
Filtering the most frequent chunks -- Step 2
Now we have:
All the cohesion degrees between any two adjacent words in the source and target sentences.
Applying formula-3 listed below, we will find the entire set of initial monolingual chunks;

formula-3:  n = int{ (length of a sentence) / (the maximum length of a chunk) }
The result of Filtering Step 2-1
What || kind || of || room || do || you || want || to || reserve
   1.36   1.31   0.046   0.063   10.07   0.61   2.11   0.077

你 || 想 || 预 || 定 || 什 || 么 || 样 || 的 || 房 || 间
  0.69  0.17  1.39  0.076  7.80  0.87  0.30  1.27  4.52

In this case: n = int{ 10/4 } = 2;
The result of Filtering Step 2-(1)-EN
Now we get a table of the initial monolingual chunks:

Initial Chunks        Dk      Dk*
What kind             1.36    1.36
What kind of          2.10    5.25
Kind of               1.31    1.31
Do you                10.07   10.07
Do you want           0.31    0.77
Do you want to        0.13    0.90
You want              0.61    0.61
You want to           0.33    0.82
You want to reserve   0.086   0.60
Want to               2.11    2.11
Want to reserve       0.056   0.14
To reserve            0.077   0.077

formula-4: D_k* — the cohesion degree D_k rescaled against the maximum cohesion degree Max(D)
The result of Filtering Step 2-(2)-EN
Set threshold Dk* > 1.0, we get:

Kept:       What kind (1.36), What kind of (5.25), Kind of (1.31), Do you (10.07), Want to (2.11)
Discarded:  You want (0.61), You want to (0.82), You want to reserve (0.60), Do you want (0.77), Do you want to (0.90), Want to reserve (0.14), To reserve (0.077)

We still need more steps to do maximum matching and overlap discarding;
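The thresholding step can be sketched in a few lines of Python, using the D_k* values from the English candidate table:

```python
def filter_chunks(dstar, threshold=1.0):
    """Keep candidate chunks whose D_k* exceeds the threshold
    (the slides use D_k* > 1.0)."""
    return {c: v for c, v in dstar.items() if v > threshold}

# The English candidates from the table above.
dstar = {"What kind": 1.36, "What kind of": 5.25, "Kind of": 1.31,
         "Do you": 10.07, "Do you want": 0.77, "Do you want to": 0.90,
         "You want": 0.61, "You want to": 0.82,
         "You want to reserve": 0.60, "Want to": 2.11,
         "Want to reserve": 0.14, "To reserve": 0.077}
kept = filter_chunks(dstar)
```

Note that the comparison is strict, so a chunk sitting exactly at the threshold is discarded.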
The result of Filtering Step 2-(3)-EN
Initial Chunks   Dk      Dk*
What kind        1.36    1.36
What kind of     2.10    5.25
Kind of          1.31    1.31
Do you           10.07   10.07
Want to          2.11    2.11

According to the maximum matching principle, and to prevent the overlapping problem, we need to apply:

formula-4:  D_k* / D_(k−1)*
formula-5:  D_(K_i)* / D_(K_(i−1))*
The result of Filtering Step 2-(4)-EN
Deal with the remnant fragments:
we simply combine the leftover individual or sequential words into chunks.
So we get a much shorter sentence list below:
What & kind & of || room || do & you || want & to || reserve
The result of Filtering Step 2-(1)-CN
What || kind || of || room || do || you || want || to || reserve
   1.36   1.31   0.046   0.063   10.07   0.61   2.11   0.077

你 || 想 || 预 || 定 || 什 || 么 || 样 || 的 || 房 || 间
  0.69  0.17  1.39  0.076  7.80  0.87  0.30  1.27  4.52

In this case: n = int{ 10/4 } = 2;
The result of Filtering Step 2-(2)-CN
Now we get a table of the initial monolingual chunks:

Initial Chunks   Dk     Dk*
你想             0.69   0.69
预定             2.39   2.39
什么             7.80   7.80
什么样           0.44   1.00
什么样的         0.58   2.44
么样             0.87   0.87
么样的           0.37   0.84
么样的房         0.13   0.55
样的             0.30   0.30
样的房           0.13   0.30
样的房间         0.21   0.88
的房             1.27   1.27
的房间           2.45   5.88
房间             4.52   4.52

formula-4: D_k* — the cohesion degree D_k rescaled against the maximum cohesion degree Max(D)
The result of Filtering Step 2-(3)-CN
Set threshold Dk* > 1.0, we get:

Kept:       预定 (2.39), 什么 (7.80), 什么样的 (2.44), 的房 (1.27), 的房间 (5.88), 房间 (4.52)
Discarded:  你想 (0.69), 什么样 (1.00), 么样 (0.87), 么样的 (0.84), 么样的房 (0.55), 样的 (0.30), 样的房 (0.30), 样的房间 (0.88)

We still need more steps to do maximum matching and overlap discarding;
The result of Filtering Step 2-(4)-CN
Initial Chunks   Dk     Dk*
预定             2.39   2.39
什么             7.80   7.80
什么样的         0.58   2.44
的房             1.27   1.27
的房间           2.45   5.88
房间             4.52   4.52

According to the maximum matching principle, the character 的 is claimed by both of the overlapping candidates 什么样的 and 的房间. Applying formula-4 ( D_k* / D_(k−1)* ):

max( D*(什么样的) / D*(什么样), D*(的房间) / D*(房间) ) = max( 2.44/1.00, 5.88/4.52 ) = max(2.44, 1.30) = 2.44

So 的 stays with 什么样的.
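This ratio test can be sketched as follows; the interpretation that each overlapping chunk is compared with its own sub-chunk minus the shared word is an assumption based on the worked example above:

```python
def resolve_overlap(dstar, cand_a, sub_a, cand_b, sub_b):
    """Decide which of two overlapping candidates keeps the shared
    word (formula-4): the chunk whose D* gains more over its
    sub-chunk without that word wins. `dstar` maps chunk -> D_k*."""
    gain_a = dstar[cand_a] / dstar[sub_a]
    gain_b = dstar[cand_b] / dstar[sub_b]
    return cand_a if gain_a >= gain_b else cand_b

# 什么样的 and 的房间 both claim 的:
dstar = {"什么样的": 2.44, "什么样": 1.00, "的房间": 5.88, "房间": 4.52}
winner = resolve_overlap(dstar, "什么样的", "什么样", "的房间", "房间")
# max(2.44, 1.30) = 2.44, so 的 stays with 什么样的
```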
The result of Filtering Step 2-(5)-CN
Deal with the remnant fragments:
we simply combine the leftover individual or sequential words into chunks.
So we get a much shorter sentence list below:
你 || 想 || 预 & 定 || 什 & 么 & 样 & 的 || 房 & 间
Some problems
After the first filtering process, suppose we found an aligned chunk pair:
|| 在 & 五 & 点 ||
|| at & five & o’clock ||
But some potentially good chunks like:
|| at & six & o’clock ||
might have been broken into several fragments like:
|| at || six || o’clock ||
since this structure includes word sequences with a low frequency of occurrence (we suppose “six” is less frequent than “five” here).
Clustering the similar words and filtering the most frequent structures
Many frequent chunks have similar structures but differ in detail.
We can cluster similar words according to the position vectors of their behavior relative to anchor words.
We suppose that all the words in the same class form good chunks, then filter the most frequent structures according to the method introduced before.
Clustering the similar words and filtering the most frequent structures – Step 1
In the corpus resulting from the first filtering process, find the most frequent words as anchor words, for example:

Rank:  1    2   3    4     5    6   7   8   9   10
Word:  the  a   to   this  for  in  on  of  at  room

Why do we use the most frequent words?
As the anchor words are the most common words, a great deal of information can be obtained.
Words with similar position vectors in relation to the anchor words can be assumed to belong to similar word classes.
Clustering the similar words and filtering the most frequent structures – Step 2
Build word vectors and define the size of the observation window (in this case window size = 5).
For instance, we build a word vector whose anchor word is “in” and observe how often a candidate word “the” to be clustered falls within the window:

Size:      5
Position:  w−2   w−1   w    w+1   w+2
Word:      the   the   in   the   the
Value:     16    1     0    415   0

Formula-7,8:
V_ij = Σ_{k=1..N} δ(w_j, w)
δ(w_j, w) = 1 if w_j = w;  0 if w_j ≠ w
Clustering the similar words and filtering the most frequent structures – Step 3
In order to compare vectors fairly, these vectors must be normalized by formula-9 as follows:

formula-9:  V_ij* = V_ij / Σ_{j=1..m} V_ij
Example : “in/that” and “in/this”
Clustering the similar words and filtering the most frequent structures – Step 4
Measure the similarities of the various vectors and cluster the words which have similar distributions relative to the anchor words, using the Euclidean distance:

D(V_x, V_y) = Σ_{j=1..K} ( V_xj − V_yj )²
Example result:

Word classes                                        Anchor words
single double twin standard suite different quiet   (a, room)
the my your this that our                           (in, room)
America all fact Japan English                      (in, )
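Formulas 7–9 and the distance measure above might be sketched like this (how occurrences at each offset are counted is an assumption; the slides only show the resulting values):

```python
def position_vector(tokens, anchor, candidate, half_window=2):
    """Formulas 7-8: count how often `candidate` appears at each
    offset -half_window..+half_window around `anchor` (window size
    2*half_window + 1, i.e. 5 as in the slides)."""
    vec = [0] * (2 * half_window + 1)
    for i, tok in enumerate(tokens):
        if tok != anchor:
            continue
        for off in range(-half_window, half_window + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens) and tokens[j] == candidate:
                vec[off + half_window] += 1
    return vec

def normalize(vec):
    """Formula-9: scale the vector so its components sum to 1
    (an all-zero vector is returned unchanged)."""
    s = sum(vec)
    return [v / s for v in vec] if s else vec

def distance(vx, vy):
    """Squared Euclidean distance between two normalized vectors."""
    return sum((a - b) ** 2 for a, b in zip(vx, vy))
```

Candidate words whose normalized vectors lie close together under this distance end up in the same word class.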
Clustering the similar words and filtering the most frequent structures – Step 5
For all of the words in the same class, replace each with a particular symbol, and then treat this symbol as an ordinary word. Then filter the most frequent structures by the Multi-Layer Filtering algorithm again.
For instance, if we have:
|| 在 & 五 & 点 ||
|| at & five & o’clock ||
and the parallel word classes:
{ one, two, …, five, …, twelve } & { 一 , 二 , …, 五 , …, 十二 }
we will get:
|| 在 & 一 & 点 ||
|| at & one & o’clock ||
|| 在 & 两 & 点 ||
|| at & two & o’clock || ...
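The class-substitution step can be sketched as follows (the class symbol `<NUM>` is made up for illustration):

```python
def generalize(chunks, word_classes):
    """Replace every word that belongs to a word class with the
    class symbol, so chunks that differ only in a class member
    (e.g. 'five' vs 'six') share one structure."""
    to_symbol = {w: sym for sym, members in word_classes.items()
                 for w in members}
    return [[to_symbol.get(w, w) for w in chunk] for chunk in chunks]

classes = {"<NUM>": {"one", "two", "five", "six", "twelve"}}
out = generalize([["at", "five", "o'clock"], ["at", "six", "o'clock"]], classes)
# both chunks collapse to the structure ['at', '<NUM>', "o'clock"]
```

After this substitution, rare variants such as "at six o'clock" become as frequent as the whole class and survive the frequency filter.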
Keeping one-to-one alignment
Next step:
Keeping one-to-one alignment
Now we have a pair of new parallel sentences with chunks:
你 || 想 || 预 & 定 || 什 & 么 & 样 & 的 || 房 & 间
What & kind & of || room || do & you || want & to || reserve
Our purpose is to find a one-to-one chunk alignment, on the assumption that the chunks to be aligned occur almost equally often in the corresponding parallel texts.
Keeping one-to-one alignment
formula-11:  θ = 2 · Num{ Co-occurrence(C_CHK, E_CHK) } / ( Num(C_CHK) + Num(E_CHK) )

By applying formula-11, we can get an alignment table:

θ              你      想      预定    什么样的  房间
What kind of   0.025   0.021   0.053   0.889    0.016
Room           0.021   0.029   0.09    0.014    0.888
Do you         0.460   0.014   0.002   0.012    0.020
Want to        0.007   0.069   0.013   0.002    0.023
reserve        0.002   0.001   0.083   0.034    0.047
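Formula-11 is a Dice-style score over chunk occurrence counts. A minimal sketch (how counts are collected per sentence pair is an assumption):

```python
from collections import Counter

def alignment_table(pairs):
    """Formula-11: theta = 2*Num{co-occurrence(C_CHK, E_CHK)} /
    (Num(C_CHK) + Num(E_CHK)), computed over parallel chunked
    sentence pairs. `pairs` is a list of
    (chinese_chunks, english_chunks) tuples."""
    c_cnt, e_cnt, co_cnt = Counter(), Counter(), Counter()
    for c_chunks, e_chunks in pairs:
        c_cnt.update(set(c_chunks))          # Num(C_CHK)
        e_cnt.update(set(e_chunks))          # Num(E_CHK)
        for c in set(c_chunks):              # co-occurrence counts
            for e in set(e_chunks):
                co_cnt[c, e] += 1
    return {(c, e): 2 * n / (c_cnt[c] + e_cnt[e])
            for (c, e), n in co_cnt.items()}
```

For each Chinese chunk, the English chunk with the highest θ in its row is taken as its one-to-one alignment.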
Experiments
Training data:
55,000 pairs of Chinese-English spoken-language parallel sentences
Test data:
400 pairs of Chinese-English spoken-language parallel sentences, chosen randomly from the same corpus.
These 400 sentence pairs were manually partitioned into monolingual chunks, and the corresponding bilingual chunks were then manually aligned, for computing the chunking and alignment accuracy.
Experiments
Evaluation:
Comparing the automatically obtained monolingual chunks and aligned bilingual chunks to the chunks discovered manually, we compute precision, recall and F-Measure by the following formulas:

precision = N_r / N_p × 100%
recall = N_r / N_a × 100%
F = (β² + 1) · precision · recall / ( β² · precision + recall )

(N_r: number of correct chunks; N_p: number of chunks produced; N_a: number of chunks in the manual answer)
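The three measures can be computed directly (N_r, N_p, N_a as in the formulas above; β = 1 gives the balanced F-Measure):

```python
def evaluate(n_right, n_produced, n_answer, beta=1.0):
    """precision = Nr/Np, recall = Nr/Na, and the beta-weighted
    F-Measure F = (beta^2 + 1)*P*R / (beta^2*P + R)."""
    p = n_right / n_produced
    r = n_right / n_answer
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f
```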
Experiments
Results:

The accuracy of chunking
Precision(%)  Recall(%)  F-Measure
77            65         0.70

The accuracy of alignment
Precision(%)  Recall(%)  F-Measure
89            72         0.80
Experiments
Comparison of chunk-based translation to word-based translation:

Systems      BLEU     NIST
Word-based   0.259    2.661
Chunk-based  0.290    2.921
Improvement  +0.031   +0.260

The improvement is about 10%.
Conclusions
This chunking and alignment algorithm doesn’t rely on the information from tagging, parsing or syntax analysis, and doesn’t even require sentence segmentation.
It obtains accurate one-to-one alignment of chunks.
It greatly decreases search space and time complexity during translation.
Its performance is better than the baseline word-alignment system (on some tasks).
Problem / Weakness
The authors didn’t discuss weaknesses themselves.
Maybe we can make some improvements at:
The maximum matching step
The step of building position vectors