malay-english bitext mapping and alignment using simr/gsa algorithms mosleh al-adhaileh tang enya...
TRANSCRIPT
Malay-English Bitext Mapping and Malay-English Bitext Mapping and Alignment Using SIMR/GSA Alignment Using SIMR/GSA
AlgorithmsAlgorithms
Malay-English Bitext Mapping and Malay-English Bitext Mapping and Alignment Using SIMR/GSA Alignment Using SIMR/GSA
AlgorithmsAlgorithms
Mosleh Al-Adhaileh Mosleh Al-Adhaileh and Tang Enya Kong Tang Enya KongComputer Aided Translation UnitComputer Aided Translation Unit
School of Computer Sciences School of Computer Sciences UUniversity niversity SScience cience MMalaysia alaysia
I. Dan MelamedI. Dan MelamedDepartment of Computer ScienceDepartment of Computer Science
Courant InstituteCourant InstituteNNew ew YYork ork UUniversityniversity
Presentation OutlinePresentation Outline
IntroductionIntroduction
SIMR and GSA algorithms.Bitext Mapping and Alignment
Porting SIMR/GSA to Malay-English BitextPorting SIMR/GSA to Malay-English Bitext Data CollectionSteps to adopt SIMR into Malay-English language pair
Matching PredicateAxis GeneratorParameter Optimization
Results and Evaluation Results and Evaluation Conclusion Conclusion
Bitext Mapping and AlignmentBitext Mapping and AlignmentBitext Mapping and AlignmentBitext Mapping and Alignment
Bitext-(parallel text): A text in one language and its translation in another language. Bitext-(parallel text): A text in one language and its translation in another language.
Bitext Mapping and Alignment: to describe the correspondence between the two halves of the bitext.Bitext Mapping and Alignment: to describe the correspondence between the two halves of the bitext.
Bitext Mapping: is to find the corresponding points, i.e. words, text units, or segments boundaries, between its two halves
Bitext Mapping: is to find the corresponding points, i.e. words, text units, or segments boundaries, between its two halves
Bitext Alignment: is a segmentation of the two texts, such that the nth segment of one text corresponds to the nth segment of the other.
Bitext Alignment: is a segmentation of the two texts, such that the nth segment of one text corresponds to the nth segment of the other.
Bitext Mapping and Alignment are needed in order to compile this Data into a useful source of knowledge.
Bitext Mapping and Alignment are needed in order to compile this Data into a useful source of knowledge.
Word sense disambiguationWord sense disambiguation
Bilingual lexicographyBilingual lexicography
Machine TranslationMachine Translation
Multilingual information retrieval Multilingual information retrieval
Also as practical tool for assisting translators Also as practical tool for assisting translators
Bitext spaceBitext spaceBitext spaceBitext space
X: Characters’ position in text 1X: Characters’ position in text 1
Y:
Characters’
position in
text 2
Y:
Characters’
position in
text 2
terminusterminus
originorigin
Main diagonal
Main diagonal
A Bitext can form the axes of a rectangular bitext space.A Bitext can form the axes of a rectangular bitext space.
True Points of Correspondence (TPCs) can be plotted as points in the bitext space.True Points of Correspondence (TPCs) can be plotted as points in the bitext space.
X: Characters’ position in text 1X: Characters’ position in text 1
Y:
Characters’
position in
text 2
Y:
Characters’
position in
text 2
terminusterminus
originorigin
Main diagonal
Main diagonal
The point (XX,YY) is TPCTPC if token at position X and a token at position Y are translation to each other. The point (XX,YY) is TPCTPC if token at position X and a token at position Y are translation to each other.
Real bitexts are noisy:- Fertility = A single segment in one half may correspond to zero, one, two or more segments in the other half.
- crossed dependencies (distortion) = Where human translators change and rearrange material so the target output text will not flow well according to the order of the source text.
0102030405060708090
100110120130140150160170180190200210220230240250260270280290300310320330340350360370380390400410420430440450
0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520
source
targ
et
mapping
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500source
SIMRSIMRSIMRSIMR
SIMRSIMRSIMRSIMRMalay English
Mapped Bitext
SIMR and GSA algorithmsSIMR and GSA algorithms
Bitext
Malay English
MappingMapping SIMRSIMR: stands for : stands for SSmooth mooth IInjective njective MMap ap RRecognizerecognizer
05
101520253035404550556065707580
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70
05
101520253035404550556065707580
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70
TCPs TCPs
05
101520253035404550556065707580
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70
05
101520253035404550556065707580
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70
TBM TBM
AlignmentAlignment
positions on X-axis
posi
tion
s on
Y-a
xis
SIMR Output: the correspondence pointsSIMR Output: the correspondence points
GSAGSA: : stands for stands for GGeometric eometric SSegment egment AAlignment. lignment.
sentences on X-axis
sent
ence
s on
Y-a
xis
A B C D E G JIHF K Labcde
g
jih
f
Segment boundaries form a grid over the Segment boundaries form a grid over the bitext spacebitext space
sentences on X-axis
sent
ence
s on
Y-a
xis
A B C D E G JIHF K Labcde
g
jih
f
Each cell represents the intersection of two Each cell represents the intersection of two segments, one from each half of the bitextsegments, one from each half of the bitext
GSAGSA: reduces the sets of correspondence points : reduces the sets of correspondence points in in SIMRSIMR’s output to segment alignments ’s output to segment alignments
A point inside A point inside (X,y)(X,y) cell indicates that some token in cell indicates that some token in segment segment XX corresponds with some token in segment corresponds with some token in segment yy; ; segments segments XX and and yy correspond. correspond.
Data CollectionData CollectionData CollectionData Collection
“The 7 Habits of Highly Effective PeopleThe 7 Habits of Highly Effective People”“The 7 Habits of Highly Effective PeopleThe 7 Habits of Highly Effective People”
Malay-English Bitexts from UUnit TTerjemahan MMelalui KKomputer (UTMKUTMK) - USMMalay-English Bitexts from UUnit TTerjemahan MMelalui KKomputer (UTMKUTMK) - USM
English Version: 101,790101,790 wordsEnglish Version: 101,790101,790 words13 chapters13 chapters
Malay Version: 107,161107,161 wordsMalay Version: 107,161107,161 words
“SemanticsSemantics”“SemanticsSemantics”
English Version: 50,17050,170 wordsEnglish Version: 50,17050,170 words8 chapters8 chapters Malay Version: 51,80251,802 wordsMalay Version: 51,80251,802 words
“User’s Guide: Microsoft Word for WindowsUser’s Guide: Microsoft Word for Windows” “User’s Guide: Microsoft Word for WindowsUser’s Guide: Microsoft Word for Windows”
English Version: 6,9746,974 wordsEnglish Version: 6,9746,974 wordsFirst 20 pagesFirst 20 pages
Malay Version: 8,2818,281 wordsMalay Version: 8,2818,281 words
Steps to adopt SIMRSIMR into Malay-English language pairSteps to adopt SIMRSIMR into Malay-English language pair
Malay English
Segment AlignmentSegment
AlignmentMalay English
Malay English Test Data
Bitext Mapping
Bitext Mapping
Malay English
GSA
Manual Alignment
ParameterParameter re-optimizationre-optimization
ParameterParameter re-optimizationre-optimization
Validate Manual
Alignment
ADOMIT
Training Data
SIMRAxis Axis generatorgenerator
Axis Axis generatorgenerator
KIMDBilingual
dictionary
KIMDBilingual
dictionaryLexiconLexiconLexiconLexicon
Matching PredicateMatching Predicate
Find the TPCs between the two halves of the bitext
It is a heuristic used to decide whether two given tokens It is a heuristic used to decide whether two given tokens might be mutual translation.might be mutual translation.
Cognate wordsComputer Komputer
Sistem System
Punctuation marks
The matching Predicates were fine-tuned with The matching Predicates were fine-tuned with stop-liststop-list words words for both Malay and English languagesfor both Malay and English languages
Lexicon Bury: mengebumikan, menanam, kematian, keretaBury: mengebumikan, menanam, kematian, kereta
For each language, an axis generator performs the mapping from tokens (the smallest semantic units) to axis position.For each language, an axis generator performs the mapping from tokens (the smallest semantic units) to axis position.
Data LemmatizationData LemmatizationData LemmatizationData Lemmatization
Axis GeneratorAxis GeneratorAxis GeneratorAxis Generator
The position of a token (in character) is the position of its median character.The position of a token (in character) is the position of its median character.
“tujuh tabiat gambaran seluruh.”“tujuh tabiat gambaran seluruh.”
0 <EOS>0 <EOS> 3 tujuh3 tujuh 9.5 tabiat9.5 tabiat 17.5 gambaran17.5 gambaran
26 seluruh26 seluruh 31 .31 . 31.5 <EOS>31.5 <EOS>
EnglishEnglish: word stemming POS tag (Brill’s) and XTAG lexicon (contains 90000 roots, 317000 inflected forms).EnglishEnglish: word stemming POS tag (Brill’s) and XTAG lexicon (contains 90000 roots, 317000 inflected forms).MalayMalay: root construction rules, and lexicon (contains 10000 popular words).MalayMalay: root construction rules, and lexicon (contains 10000 popular words).
ADOMITADOMIT (Automatic Detection of OMIssions in Translation) Alignment ValidationADOMITADOMIT (Automatic Detection of OMIssions in Translation) Alignment Validation
Parameter OptimizationParameter OptimizationParameter OptimizationParameter OptimizationWe use Chapter 3, 7 and 11 from the 7habits book. All together 1245 segments. It is manually aligned at the sentence level.
We use Chapter 3, 7 and 11 from the 7habits book. All together 1245 segments. It is manually aligned at the sentence level.
Simply say: Any segment whose slope is unusually low is a likely omission.Simply say: Any segment whose slope is unusually low is a likely omission.
x= character position in text1
y= c
hara
cter
pos
itio
n in
te
xt2
A O B
a
b Parameter Value
Chain size 7
Max. point ambiguity 6
0.14
Max. linear regression error
Min. Cognate length ratio 0.80
Max. angle deviation
5
Parameters valueParameters value
Results and EvaluationResults and EvaluationResults and EvaluationResults and Evaluation
ConclusionConclusionConclusionConclusionThis experiment shows that SIMR/GSA algorithms can map/align Malay-English bitexts with high accuracy as they performed on the other variety of language pairs and text genres.
This experiment shows that SIMR/GSA algorithms can map/align Malay-English bitexts with high accuracy as they performed on the other variety of language pairs and text genres. These results encourage us, as a future work, to think of extending the text alignment to word alignment aiming at the identification of correspondence between linguistic units below the sentence level within a bitext.
These results encourage us, as a future work, to think of extending the text alignment to word alignment aiming at the identification of correspondence between linguistic units below the sentence level within a bitext.Bitexts are becoming plentiful and available, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if they were treated efficiently.
Bitexts are becoming plentiful and available, both in private data warehouses and on publicly accessible sites on the WWW. They form a very useful source of knowledge if they were treated efficiently.
Visit Visit http://utmk-ultra.cs.usm.my/,http://utmk-ultra.cs.usm.my/, the URL for the URL for UUnit nit TTerjemahan erjemahan MMelalui elalui KKomputer (omputer (UTMKUTMK) – ) – USMUSM. . Visit Visit http://utmk-ultra.cs.usm.my/,http://utmk-ultra.cs.usm.my/, the URL for the URL for UUnit nit TTerjemahan erjemahan MMelalui elalui KKomputer (omputer (UTMKUTMK) – ) – USMUSM. .
Visit Visit http://www.cs.nyu.edu/~melamed/, http://www.cs.nyu.edu/~melamed/, the URL for the URL for important referencesimportant referencesVisit Visit http://www.cs.nyu.edu/~melamed/, http://www.cs.nyu.edu/~melamed/, the URL for the URL for important referencesimportant references
Thank you…..Thank you…..
Mosleh Al-Adhaileh Mosleh Al-Adhaileh and Tang Enya Kong Tang Enya KongComputer Aided Translation UnitComputer Aided Translation Unit
School of Computer Sciences School of Computer Sciences UUniversity niversity SScience cience MMalaysia alaysia
I. Dan MelamedI. Dan MelamedDepartment of Computer ScienceDepartment of Computer Science
Courant InstituteCourant InstituteNNew ew YYork ork UUniversityniversity