Morphological Pre-processing for Turkish to English Statistical Machine Translation
Arianna Bisazza, Marcello Federico
FBK - Ricerca Scientifica e Tecnologica, Trento, Italy
Tokyo, Dec 1-2, 2009
Bisazza, Federico Morphological Pre-processing for Turkish-English SMT Tokyo, Dec 1-2, 2009
Outline
• Turkish & SMT
• Morphological Segmentation
– Preprocessing chain
– Segmentation rules
• Lexical Approximation
• Experiments
• Future Work & Conclusions
Turkish & SMT
Several linguistic features of Turkish can negatively affect an SMT system:
• Agglutination → large vocabulary, built by a wide range of suffix combinations
oda [room] = ‘room’
odam [room-my] = ‘my room’
odamda [room-my-in] = ‘in my room’
odamdaydı [room-my-in-was] = ‘was in my room’
odamdaydım [room-my-in-was-I] = ‘I was in my room’
Some statistics on the IWSLT09 training corpus:
      Tokens     Dict. size
TR    139,514    17,619
EN    182,627    8,345
OOV (devset2): 6.16%
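A minimal sketch of how figures of this kind can be computed, assuming whitespace tokenization and an OOV rate counted over running test tokens (the slides do not state either choice). The function names and the toy sentences are illustrative only.

```python
from collections import Counter

def corpus_stats(sentences):
    """Return (token count, dictionary size) for whitespace-tokenized sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return sum(counts.values()), len(counts)

def oov_rate(train_sentences, test_sentences):
    """Percentage of running test tokens that never occur in the training data."""
    vocab = {tok for s in train_sentences for tok in s.split()}
    test_tokens = [tok for s in test_sentences for tok in s.split()]
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * unseen / len(test_tokens)

# Toy data: every inflected form of "oda" is a separate dictionary entry.
train = ["oda temiz", "odam temiz", "odamda"]
test = ["odamdaydım temiz"]

print(corpus_stats(train))    # (tokens, dictionary size)
print(oov_rate(train, test))  # percentage of unseen test tokens
```

On the toy data each suffix combination adds a new dictionary entry, which is the effect the TR/EN dictionary sizes above illustrate.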
• Vowel harmony and other phoneme alternation phenomena → systematic stem and suffix allomorphy
Ex. the suffix -(I)m = ‘my’:
saç+(I)m → saçım ‘my hair’
el+(I)m → elim ‘my hand’
kol+(I)m → kolum ‘my arm’
göz+(I)m → gözüm ‘my eye’
kafa+(I)m → kafam ‘my head’
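The allomorphy of -(I)m can be illustrated with a small sketch. It assumes lowercase stems in standard Turkish orthography and covers only four-way vowel harmony plus vowel drop after a vowel-final stem; consonant alternations and lexical exceptions are ignored. The names `HARMONY` and `add_my` are hypothetical.

```python
# High vowel selected by the last vowel of the stem (four-way vowel harmony).
HARMONY = {
    "a": "ı", "ı": "ı",   # back unrounded
    "e": "i", "i": "i",   # front unrounded
    "o": "u", "u": "u",   # back rounded
    "ö": "ü", "ü": "ü",   # front rounded
}

def add_my(stem):
    """Attach the 1st person singular possessive -(I)m to a lowercase stem."""
    if stem[-1] in HARMONY:
        # Stem ends in a vowel: the parenthesized (I) is not realized.
        return stem + "m"
    last_vowel = next(ch for ch in reversed(stem) if ch in HARMONY)
    return stem + HARMONY[last_vowel] + "m"

for stem in ["saç", "el", "kol", "göz", "kafa"]:
    print(stem, "→", add_my(stem))
```

One underlying suffix therefore surfaces as five different strings (-ım, -im, -um, -üm, -m), which is why a surface-level split would scatter its counts.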
If split from words, suffixes suffer from data sparseness
→ need to use a notation that factorizes the suffix allomorphs
• Word order → complex, long-span reorderings between TR and EN
Banyolu iki kişilik bir oda istiyorum.
[bath-with] [two] [people-for] [a] [room] [want-I]
‘I’d like a twin room with a bath please.’
Importance of specific linguistic preprocessing:
→ reduction of data sparseness (dict. size from 17.6K to 10.4K)
→ decrease of OOV rate by more than half
→ improvement of 5 points BLEU
Morphological Segmentation
Idea: selectively splitting or removing suffixes from the words
Already explored by:
• Habash & Sadat, 2006 [1] on an Arabic-English task
– similar method: comparison of segmentation schemes
– different language: Arabic affixation less rich
• Oflazer & Durgar El-Kahlout, 2007 [2] on an English-Turkish task
– similar preprocessing chain
– translating into a morphologically rich language
Preprocessing chain
Turkish source
→ Morph. analysis (Oflazer, 1994)
→ Morph. disambiguation (Sak & Saraçlar, 2007)
→ Suffix tags split/removal (tested 11 schemes)
→ Lexical approximation
→ Phrase-based SMT (Moses)
1. Morphological analysis (Oflazer, 1994 [3])
2. Morphological disambiguation in context (Sak & Saraçlar, 2007 [4])
‘Are there any tours of famous stars’ homes?’
Ünlü yıldızların evine turlar var mı ?
   ev+Noun+A3sg+P2sg+Dat [to your house]
-> ev+Noun+A3sg+P3sg+Dat [to his/her/its house]
   evin+Noun+A3sg+Pnon+Dat [to the kernel]
Note: some tags encode implicit features (i.e. with no surface form)
We use feature tags to:
→ abstract from suffix allomorphy
→ deal with unambiguous symbols
→ make the rules more readable
3. Rules for splitting/removal of suffix tags
• rules based on feature tags → simple regular expressions
• 11 segmentation schemes developed and tested
• mainly focus on nominal, but also some verbal inflection
Segmentation rules
Idea: Split off tags expected to have an English counterpart, remove the others.
When the decision is not straightforward → experiment
• Nominal case
– split off if expected to have an English counterpart:
- dative (oda/ya) ≈ ‘to’
- ablative (oda/dan) ≈ ‘from’
- locative (oda/da) ≈ ‘in’
- instrumental (oda/yla) ≈ ‘with/by’
– removed otherwise:
- nominative (oda-)
– doubtful cases:
- accusative (oda/yı) (≈ ‘the’) ⇒ removed
- genitive (oda/nın) (≈ ‘of/’s’) ⇒ removed
• Possessive
– split off if expected to have an English counterpart:
- 1st and 2nd sing. (oda/m, oda/n) ≈ ‘my’, ‘your’
- 1st, 2nd and 3rd plur. (oda/mız, oda/nız, oda/ları) ≈ ‘our’, ‘your’, ‘their’
– removed otherwise:
- no possessive (oda-)
– doubtful cases:
- 3rd sing. (oda/sı) (≈ ‘his/her’) ⇒ removed
• Copula ‘to be’
– always split off. Example:
- oda temiz/dir lit. [room clean-is] ‘the room is clean’
- oda temiz/di lit. [room clean-was] ‘the room was clean’
• Verb person
– split off person suffixes from finite verb forms and copula. Example:
- gidiyor/um lit. [go-I] ‘I go’
- gidiyor/sun lit. [go-you] ‘you go’
Example: ‘I was in my room’
odamdaydım → oda /m /da /ydı /m
[room-my-in-was-I] → [room] [my] [in] [was] [I]
oda+Noun+A3sg /+P1sg /+Loc /^DB+Verb+Zero+Past /+A1sg
oda+Noun+A3sg = lemma
+P1sg = poss.
+Loc = case
^DB+Verb+Zero+Past = copula
+A1sg = v.pers
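The rules above can be sketched as two regular expressions over the tag string. This is only an illustration under assumptions, not the authors' rule set: the tag names not shown on the slides (`Abl`, `Ins`, `Nom`, `Acc`, `Gen`, `P1pl`, `P2pl`, `P3pl`, `A2sg`, `A1pl`, `A2pl`) are assumed spellings, the copula is identified by the `^DB+Verb+Zero` sequence seen in the example, and the person tag is split only when it ends the analysis, without checking that it belongs to a verb.

```python
import re

# Tags removed outright (no expected English counterpart, or "doubtful" cases).
REMOVE = re.compile(r"\+(?:Nom|Acc|Gen|Pnon|P3sg)(?![A-Za-z0-9])")

# Tags split off as separate tokens.
SPLIT = re.compile(
    r"(\+(?:Dat|Abl|Loc|Ins)(?![A-Za-z0-9])"   # nominal case
    r"|\+P(?:[12]sg|[123]pl)(?![A-Za-z0-9])"   # possessive
    r"|\^DB\+Verb\+Zero"                       # copula (tense tags stay attached)
    r"|\+A[12](?:sg|pl)$)"                     # person tag ending the analysis
)

def segment(analysis):
    """Turn one disambiguated analysis into a list of tokens."""
    analysis = REMOVE.sub("", analysis)
    return SPLIT.sub(r" \1", analysis).split()

print(segment("oda+Noun+A3sg+P1sg+Loc^DB+Verb+Zero+Past+A1sg"))
print(segment("ev+Noun+A3sg+P3sg+Dat"))
```

On the slide example this yields the five tokens lemma, possessive, case, copula and verb person; on the `evine` analysis the 3rd singular possessive is dropped and the dative is split off.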
Looking into the alignments
[Figure: word alignments before segmentation and after segmentation]
Lexical Approximation
Idea: replace OOV words in the test set with morphologically similar words from the training set
Cf. previous IWSLT work on Arabic:
• Mermer & al., 2007 [5]
• Shen & al., 2008 [6]
• Possible replacers → known words sharing the same lemma
• Similarity function → priority to words sharing more contiguous tags
• Deterministic choice of 1-best candidate
Word        Gloss              Preprocessed (MS11)                    Score
çıkışlar    exits              çık+Verb+Pos^DB+Noun+Inf3+A3pl
çıkış       exit               çık+Verb+Pos^DB+Noun+Inf3+A3sg         93
çıkma       going out          çık+Verb+Pos^DB+Noun+Inf2+A3sg         66
çıkacak     will go out        çık+Verb+Pos^DB+Noun+FutPart+A3sg      66
çıkan       who goes out       çık+Verb+Pos^DB+Adj+PresPart           44
çıkıyor     is going out       çık+Verb+Pos+Prog1                     27
çıkmıyor    isn’t going out    çık+Verb+Neg+Prog1                     0
çıkarır     takes out          çık+Verb^DB+Verb+Caus+Pos+Aor          -15
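A minimal sketch of the replacement step. The slides do not give the actual similarity function, so the scores in the table are not reproduced here; as a stand-in, candidates with the same lemma are ranked by the number of leading tags they share with the OOV word, which is one simple reading of "sharing more contiguous tags". All function names are hypothetical.

```python
def parse(analysis):
    """Split 'lemma+Tag+Tag^DB+Tag' into (lemma, [tags]); '^DB' counts as a tag."""
    lemma, *tag_list = analysis.replace("^DB", "+^DB").split("+")
    return lemma, tag_list

def shared_prefix(a, b):
    """Number of leading tags the two tag sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def approximate(oov_analysis, known_analyses):
    """Return the 1-best known analysis with the same lemma, or None."""
    lemma, oov_tags = parse(oov_analysis)
    candidates = [k for k in known_analyses if parse(k)[0] == lemma]
    if not candidates:
        return None
    # max() keeps the first of several equally good candidates: deterministic.
    return max(candidates, key=lambda k: shared_prefix(oov_tags, parse(k)[1]))

known = [
    "çık+Verb+Pos+Prog1",
    "çık+Verb+Pos^DB+Noun+Inf2+A3sg",
    "çık+Verb+Pos^DB+Noun+Inf3+A3sg",
    "çık+Verb^DB+Verb+Caus+Pos+Aor",
]
print(approximate("çık+Verb+Pos^DB+Noun+Inf3+A3pl", known))
```

On this example the stand-in picks the same top candidate as the table (the analysis of çıkış), although its scores are plain prefix lengths rather than the values shown.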
Experiments
                               Training set         SMT (on devset2)
Preprocessing                  Tokens     Dict.     %OOV   %BLEU   %WER    %PER
baseline                       139,514    17,619    6.16   52.26   37.75   29.95
MS2 (case)                     151,410    14,343    4.35   53.89   37.21   28.51
MS6 (case,poss)                156,390    12,009    3.49   54.10   37.29   28.19
MS7 (case,poss,cop)            157,927    11,519    3.18   55.05   37.73   27.67
MS11 (case,poss,cop,v.pers)    168,135    10,450    2.54   56.23   36.59   26.37
• segmentation minimizes differences in word granularity between TR and EN
• reduces dictionary size and data sparseness
• important OOV decrease and consequent BLEU improvement
• WER changes are not very significant, but PER decreases consistently
→ positive effect on lexical choice rather than on reordering
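The contrast between the two metrics can be made concrete with a small sketch. WER is computed as word-level edit distance over reference length; for PER one common formulation is assumed (bag-of-words overlap, normalized by reference length), which may differ in detail from the scorer used in the evaluation.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution or match
        prev = cur
    return prev[-1] / len(r)

def per(ref, hyp):
    """Position-independent error rate: word order is ignored."""
    r, h = ref.split(), hyp.split()
    matches = sum((Counter(r) & Counter(h)).values())
    return (max(len(r), len(h)) - matches) / len(r)

ref = "i was in my room"
print(wer(ref, "in my room i was"), per(ref, "in my room i was"))    # order error only
print(wer(ref, "i was in my house"), per(ref, "i was in my house"))  # lexical error only
```

A hypothesis with the right words in the wrong order is penalized by WER but not by PER, while a wrong word raises both, so a PER drop with nearly flat WER points to better word choice rather than better ordering.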
Varying the distortion limit (DL):
Preprocess.   DL   %BLEU   ∆      %WER    %PER
baseline      6    52.26          37.75   29.95
baseline      ∞    52.96   1.3%   37.18   29.71
MS6           6    54.10          37.29   28.19
MS6           ∞    54.87   1.4%   36.69   28.35
MS11          6    56.23          36.59   26.37
MS11          ∞    57.91   3.0%   33.70   25.69
• because task is simple, unlimited distortion has reasonable decoding time
• the more segmented the text, the more improvement possible
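As a worked check, the ∆ values are consistent with the relative BLEU gain of unlimited distortion over DL = 6 for each preprocessing scheme. The short sketch below recomputes them from the BLEU scores reported above; the dictionary layout and function name are illustrative.

```python
# (BLEU with DL = 6, BLEU with unlimited distortion), as reported above.
bleu = {
    "baseline": (52.26, 52.96),
    "MS6": (54.10, 54.87),
    "MS11": (56.23, 57.91),
}

def relative_gain(limited, unlimited):
    """Relative improvement in percent when moving from DL = 6 to unlimited."""
    return 100.0 * (unlimited - limited) / limited

for name, (dl6, dl_inf) in bleu.items():
    print(f"{name}: {relative_gain(dl6, dl_inf):.1f}%")
```

The gain roughly doubles for MS11 compared with the baseline and MS6, which is the pattern the second bullet summarizes.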
Lexical approximation:
Preprocess.           DL   %BLEU
MS11                  ∞    57.91
MS11 & lex.approx.    ∞    58.12
• work in progress
• promising results in particular setting → room for improvement
• in final submission dropping OOV words gave better BLEU scores
Future Work & Conclusions
• refine segmentation schemes by better addressing verbal suffixation
• improve lexical approximation technique:
– test different similarity functions
– feed the decoder with multiple options of replacement
• repeat experiments on a more complex task
• linguistic preprocessing crucial for a morphologically rich language like Turkish
• splitting/removing suffixes from morphologically analyzed text yields large improvements
• linguistic knowledge guides hypothesis formulation before empirical validation
Thanks for your attention!
Preprocessing scripts available at: http://hlt.fbk.eu/people/bisazza
IWSLT09 TR-EN Outputs Compared
Japon Buyukelciligi ile irtibata gecmek istiyorum .
Ref: I’d like to contact the Japanese Embassy .
baseline: I’d like to contact with Japanese buyukelciligi .
MS11: I’d like to contact with Japanese embassy .
Bu film rulolarını banyo ettirip basabilir miydiniz ?
Ref: Could you develop and print these rolls of film ?
baseline: Could you reissue ettirip rulolarını this film developed ?
MS11: Could you reissue roll of film developed ?
Onu bulmaktan umidi hemen hemen kestim .
Ref: I’ve just about given up finding it .
baseline: bulmaktan umidi cut it right away .
MS11: I cut almost hope from find it .
Simdi kirazların cicek acma mevsimi .
Ref: It’s cherry blossom season .
baseline: kirazların buds mail seasons now .
MS11: cherry blossoms bloom season now .
References
[1] N. Habash and F. Sadat, “Arabic Preprocessing Schemes for Statistical Machine Translation,”
in Proc. of NAACL HLT. New York City, USA: Association for Computational Linguistics,
June 2006, pp. 49–52.
[2] K. Oflazer and I. D. El-Kahlout, “Exploring Different Representational Units in English-to-
Turkish Statistical Machine Translation,” in Proc. of Workshop on SMT. Prague, Czech
Republic: Association for Computational Linguistics, June 2007, pp. 25–32.
[3] K. Oflazer, “Two-level Description of Turkish Morphology,” Literary and Linguistic Computing,
vol. 9, no. 2, pp. 137–148, 1994.
[4] T. G. H. Sak and M. Saraclar, “Morphological Disambiguation of Turkish Text with Perceptron
Algorithm,” in Proc.of CICLing, 2007, pp. 107–118.
[5] H. K. C. Mermer and M. U. Dogan, “The TUBITAK-UEKAE Statistical Machine Translation
System for IWSLT 2007,” in Proc. of IWSLT, Trento, Italy, 2007, pp. 176–179.
[6] T. A. W. Shen, B. Delaney and R. Slyh, “The MIT-LL/AFRL IWSLT-2008 MT System,” in
Proc. of IWSLT, Hawaii, USA, 2008, pp. 69–76.