Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms

Mosleh Al-Adhaileh and Tang Enya Kong (Computer Aided Translation Unit, School of Computer Sciences, University Science Malaysia); I. Dan Melamed (Department of Computer Science, Courant Institute, New York University)

Post on 03-Jan-2016


Page 1: Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms

Mosleh Al-Adhaileh and Tang Enya Kong
Computer Aided Translation Unit, School of Computer Sciences, University Science Malaysia

I. Dan Melamed
Department of Computer Science, Courant Institute, New York University

Page 2: Presentation Outline

- Introduction
  - Bitext Mapping and Alignment
  - SIMR and GSA algorithms
- Porting SIMR/GSA to Malay-English Bitext
  - Data Collection
  - Steps to adapt SIMR to the Malay-English language pair
    - Matching Predicate
    - Axis Generator
    - Parameter Optimization
- Results and Evaluation
- Conclusion

Page 3: Bitext Mapping and Alignment

Bitext (parallel text): a text in one language and its translation in another language.

Bitext mapping and alignment: describing the correspondence between the two halves of a bitext.

- Bitext mapping: finding the corresponding points (i.e. words, text units, or segment boundaries) between the two halves.
- Bitext alignment: a segmentation of the two texts such that the nth segment of one text corresponds to the nth segment of the other.
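To make the difference between the two definitions concrete, here is a minimal sketch. The toy sentences and the names `bitext_map`, `alignment`, and `aligned_segments` are assumptions made for illustration, not taken from the slides. A mapping is a set of corresponding points, while an alignment cuts both texts into the same number of segments that pair up in order.

```python
# Toy sentences; the Malay/English pairs are illustrative only.
malay = ["Sistem ini baru.", "Komputer itu rosak.", "Ia mahal."]
english = ["This system is new.", "That computer is broken and it is expensive."]

# A bitext map: corresponding points between the two halves, given here as
# (sentence index in the Malay text, sentence index in the English text).
bitext_map = [(0, 0), (1, 1), (2, 1)]

# An alignment: both texts are cut into the same number of segments, and the
# nth segment of one text corresponds to the nth segment of the other.
alignment = [([0], [0]), ([1, 2], [1])]


def aligned_segments(src, tgt, alignment):
    """Return (source segment text, target segment text) pairs, in order."""
    return [
        (" ".join(src[i] for i in src_ids), " ".join(tgt[j] for j in tgt_ids))
        for src_ids, tgt_ids in alignment
    ]


pairs = aligned_segments(malay, english, alignment)
for src_seg, tgt_seg in pairs:
    print(src_seg, "<->", tgt_seg)
```

In this toy case the second aligned segment joins two Malay sentences against one English sentence, which a point-by-point map can express but a strict one-to-one sentence pairing cannot.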

Page 4

Bitext mapping and alignment are needed in order to compile this data into a useful source of knowledge for:

- Word sense disambiguation
- Bilingual lexicography
- Machine translation
- Multilingual information retrieval

They also serve as a practical tool for assisting translators.

Page 5: Bitext space

[Figure: a rectangular bitext space. X axis: character position in text 1. Y axis: character position in text 2. The main diagonal runs from the origin to the terminus.]

A bitext can form the axes of a rectangular bitext space.

True Points of Correspondence (TPCs) can be plotted as points in the bitext space.

The point (X, Y) is a TPC if the token at position X and the token at position Y are translations of each other.
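The geometry of the bitext space can be sketched in a few lines. The example below assumes hypothetical text lengths and candidate points; the helper names are illustrative and not part of SIMR. It computes the slope of the main diagonal (origin to terminus) and how far each point lies from it.

```python
def main_diagonal_slope(len_text1, len_text2):
    """Slope of the line from the origin (0, 0) to the terminus."""
    return len_text2 / len_text1


def displacement_from_diagonal(point, len_text1, len_text2):
    """Vertical distance of a point from the main diagonal (in text-2 characters)."""
    x, y = point
    return y - main_diagonal_slope(len_text1, len_text2) * x


# Hypothetical text lengths (in characters) and candidate points.
terminus = (500, 450)
points = [(100, 92), (250, 225), (400, 300)]
offsets = [displacement_from_diagonal(p, *terminus) for p in points]
print(offsets)
```

Points with a small displacement lie close to the main diagonal; a large displacement, as for the third point here, means the point is far from where a roughly proportional translation would place it.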

Page 6

Real bitexts are noisy:

- Fertility: a single segment in one half may correspond to zero, one, two, or more segments in the other half.
- Crossed dependencies (distortion): human translators change and rearrange material, so the target text does not follow the order of the source text.

Page 7

[Figure: plot in the bitext space. X axis: source (ticks 0 to 520). Y axis: target (ticks 0 to 450).]

Page 8

[Figure: plot titled "mapping". X axis: source (ticks 0 to 500). Y axis: ticks 0 to 450.]

Page 9: SIMR and GSA algorithms

Mapping: SIMR stands for Smooth Injective Map Recognizer.

[Diagram: Bitext (Malay, English) -> SIMR -> Mapped Bitext]

[Figure: two plots in the bitext space (X axis ticks 0 to 70, Y axis ticks 0 to 80), one labeled TPCs and one labeled TBM.]

Page 10

Alignment: GSA stands for Geometric Segment Alignment.

SIMR output: the correspondence points.

[Figure: correspondence points plotted by position on the X axis and position on the Y axis.]

[Figure: the same space overlaid with a grid, with sentences A to L on the X axis and sentences a to j on the Y axis.]

- Segment boundaries form a grid over the bitext space.
- Each cell represents the intersection of two segments, one from each half of the bitext.
- A point inside cell (X, y) indicates that some token in segment X corresponds with some token in segment y, so segments X and y correspond.

GSA reduces the sets of correspondence points in SIMR's output to segment alignments.
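The grid idea can be illustrated with a small sketch. It only performs the cell lookup described on this slide and does not reproduce any further processing GSA applies. The boundary positions, points, and function names below are hypothetical.

```python
from bisect import bisect_left


def cell_of(point, x_bounds, y_bounds):
    """Return (X segment index, y segment index) of the cell containing the point.

    x_bounds and y_bounds are sorted lists of segment end positions, so
    segment i covers positions up to and including bounds[i].
    """
    x, y = point
    return bisect_left(x_bounds, x), bisect_left(y_bounds, y)


def segment_pairs(points, x_bounds, y_bounds):
    """Distinct cells that contain at least one correspondence point."""
    return sorted({cell_of(p, x_bounds, y_bounds) for p in points})


# Hypothetical segment end positions (in characters) for the two texts.
x_bounds = [30, 60, 100]        # three segments on the X axis
y_bounds = [25, 50, 70, 90]     # four segments on the Y axis

# Hypothetical correspondence points, as a mapper might output them.
points = [(10, 8), (28, 20), (45, 40), (58, 66), (75, 80), (95, 88)]

print(segment_pairs(points, x_bounds, y_bounds))
```

In this toy input, X segment 1 shares points with y segments 1 and 2, which is the kind of one-to-many correspondence that a segment alignment has to accommodate.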

Page 11: Data Collection

Malay-English bitexts from Unit Terjemahan Melalui Komputer (UTMK), USM:

- "The 7 Habits of Highly Effective People" (13 chapters). English version: 101,790 words. Malay version: 107,161 words.
- "Semantics" (8 chapters). English version: 50,170 words. Malay version: 51,802 words.
- "User's Guide: Microsoft Word for Windows" (first 20 pages). English version: 6,974 words. Malay version: 8,281 words.

Page 12: Steps to adapt SIMR to the Malay-English language pair

[Diagram: workflow for porting SIMR. Components shown: Malay-English training data, manual alignment, validation of the manual alignment (ADOMIT), parameter re-optimization, axis generator, lexicon (from the KIMD bilingual dictionary), SIMR (bitext mapping), GSA (segment alignment), and Malay-English test data.]

Page 13: Matching Predicate

The matching predicate is used to find the TPCs between the two halves of the bitext. It is a heuristic for deciding whether two given tokens might be mutual translations. It relies on:

- Cognate words, e.g. Computer / Komputer, Sistem / System
- Punctuation marks
- A lexicon, e.g. Bury: mengebumikan, menanam, kematian, kereta

The matching predicates were fine-tuned with stop-list words for both Malay and English.
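A matching predicate along these lines might look like the sketch below. It makes several assumptions not stated on the slide: cognates are scored with the longest common subsequence ratio (LCSR), the 0.80 threshold is borrowed from the "Min. Cognate length ratio" parameter reported later in the deck, and the stop list and function names are invented for illustration. The lexicon entry is the one shown on the slide.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer length."""
    return lcs_length(a, b) / max(len(a), len(b))


# Hypothetical stop list; the lexicon entry is copied from the slide.
STOP_WORDS = {"the", "a", "of", "yang", "di", "dan"}
LEXICON = {"bury": {"mengebumikan", "menanam", "kematian", "kereta"}}


def might_match(malay_tok, english_tok, min_ratio=0.80):
    """Heuristic: could these two tokens be mutual translations?"""
    m, e = malay_tok.lower(), english_tok.lower()
    if m in STOP_WORDS or e in STOP_WORDS:
        return False                      # stop-list words never match
    if not m.isalnum() or not e.isalnum():
        return m == e                     # punctuation matches only itself
    if m in LEXICON.get(e, ()):
        return True                       # lexicon lookup
    return lcsr(m, e) >= min_ratio        # cognate test


print(might_match("Komputer", "Computer"), might_match("Sistem", "System"))
```

Filtering stop-list words first keeps very frequent function words from producing many spurious candidate points.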

Page 14: Axis Generator

For each language, an axis generator performs the mapping from tokens (the smallest semantic units) to axis positions.

The position of a token (in characters) is the position of its median character. For example, "tujuh tabiat gambaran seluruh." yields:

0     <EOS>
3     tujuh
9.5   tabiat
17.5  gambaran
26    seluruh
31    .
31.5  <EOS>

Data Lemmatization

- English: word stemming with POS tagging (Brill's) and the XTAG lexicon (contains 90,000 roots and 317,000 inflected forms).
- Malay: root construction rules and a lexicon (contains 10,000 popular words).
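The median-character rule can be checked against the slide's own example. The sketch below reproduces those positions under assumptions inferred from the example rather than stated in the slides: tokens are separated by a single space, the final period is split off as its own token, character positions are 1-based, and the `<EOS>` markers sit at 0 and half a position after the last character. The function name is hypothetical.

```python
def axis_positions(tokens, eos="<EOS>"):
    """Map tokens to axis positions using the median-character rule.

    Assumes tokens are separated by one space and positions are 1-based.
    """
    positions = [(0.0, eos)]
    if not tokens:
        return positions
    cursor = 0                                # last position consumed so far
    for tok in tokens:
        start = cursor + 1                    # first character of the token
        end = cursor + len(tok)               # last character of the token
        positions.append(((start + end) / 2, tok))
        cursor = end + 1                      # skip the separating space
    last_char = cursor - 1                    # undo the final separator
    positions.append((last_char + 0.5, eos))
    return positions


for pos, tok in axis_positions(["tujuh", "tabiat", "gambaran", "seluruh", "."]):
    print(pos, tok)
```

Tokens with an even number of characters, such as "tabiat", land on half positions because their median falls between two characters.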

Page 15: Parameter Optimization

We use Chapters 3, 7, and 11 from the 7 Habits book, 1,245 segments altogether, manually aligned at the sentence level.

Alignment validation: ADOMIT (Automatic Detection of OMIssions in Translation). Simply put, any segment whose slope is unusually low is a likely omission.
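The low-slope heuristic can be sketched as follows. This is a simplified illustration, not the ADOMIT implementation: the thresholds `max_slope` and `min_length`, the sample points, and the function name are all hypothetical.

```python
def low_slope_segments(points, max_slope=0.3, min_length=20):
    """Flag stretches of the bitext map whose slope is unusually low.

    points: (x, y) character positions sorted by x. A long stretch of text 1
    that advances very little in text 2 suggests material missing from text 2.
    """
    flagged = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx >= min_length and dy / dx <= max_slope:
            flagged.append(((x0, y0), (x1, y1)))
    return flagged


# Hypothetical map: 80 characters of text 1 (positions 100 to 180)
# correspond to only 5 characters of text 2.
bitext_map = [(0, 0), (50, 48), (100, 95), (180, 100), (230, 152)]
print(low_slope_segments(bitext_map))
```

The minimum-length check avoids flagging very short stretches, where the slope between two nearby points is dominated by noise.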

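[Figure: bitext map with x = character position in text 1 and y = character position in text 2. Points A, O, B are marked on the x axis and a, b on the y axis.]

Parameter values

Parameter                      Value
Chain size                     7
Max. point ambiguity           6
Min. cognate length ratio      0.80
Max. linear regression error   0.14 or 5
Max. angle deviation           0.14 or 5

(The values 0.14 and 5 belong to the last two parameters, but the extracted slide does not show which value goes with which.)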

Page 16: Results and Evaluation

Page 17: Conclusion

This experiment shows that the SIMR/GSA algorithms can map and align Malay-English bitexts with high accuracy, as they have for a variety of other language pairs and text genres.

These results encourage us, as future work, to consider extending text alignment to word alignment, aiming to identify correspondences between linguistic units below the sentence level within a bitext.

Bitexts are becoming plentiful and available, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if they are treated efficiently.

Visit http://utmk-ultra.cs.usm.my/, the URL for Unit Terjemahan Melalui Komputer (UTMK), USM.

Visit http://www.cs.nyu.edu/~melamed/, the URL for important references.

Page 18

Thank you.

Mosleh Al-Adhaileh and Tang Enya Kong
Computer Aided Translation Unit, School of Computer Sciences, University Science Malaysia

I. Dan Melamed
Department of Computer Science, Courant Institute, New York University