
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

Yidong Chen, Xiaodong Shi

Institute of Artificial Intelligence
Xiamen University
P. R. China

November 28, 2006 - Kyoto

Outline

• Overview
• Training
• System
  - Translation Model
  - Parameters
  - Decoder
  - Dealing with the Unknown Words
  - Recovering the Missing Punctuation
  - Translating the ASR Lattice
• Experiments
• Conclusions

Overview

• Who are we?
  - NLP group at the Institute of Artificial Intelligence, Xiamen University
  - Began research on SMT in 2004
  - Have worked on rule-based MT for more than 15 years
  - First web MT in China (1999)
  - First mobile phone MT in China (2006)
• Websites:
  - http://ai.xmu.edu.cn/
  - http://mt.xmu.edu.cn
  - http://nlp.xmu.edu.cn


Overview (Cont.)

• IWSLT 2006
  - First participation
  - We implemented a simple phrase-based statistical machine translation system.
  - We participated in the open data track for ASR lattice and Cleaned Transcripts for the Chinese-English translation direction.


Training

• Preprocessing (Chinese part)
  - Segmentation
  - Mixed (DBC/SBC) case to SBC case
• Preprocessing (English part)
  - Tokenization
  - Truecasing of the first word of an English sentence
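As a rough illustration of the character-width normalization step, the sketch below maps full-width (double-byte) variants of ASCII characters and the ideographic space to their half-width forms. The direction of the mapping and the function name `to_halfwidth` are assumptions made for illustration; the slides only name the step.

```python
def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the
    ideographic space (U+3000) to their half-width equivalents.
    Hypothetical helper; all other characters are left untouched."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width '!'..'~' -> ASCII
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)
```

Chinese characters fall outside both ranges, so segmentation input keeps its content and only digits, Latin letters and punctuation are unified.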

Training (Cont.)

• Word Alignment
  - First, we ran GIZA++ up to IBM Model 4 in both translation directions to get an initial word alignment.
  - Then we applied the "grow-diag-final" method (Koehn, 2003) to refine it and obtain an n-to-n word alignment.
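The symmetrization step can be sketched as follows. This is a minimal reading of the grow-diag-final heuristic as commonly described for (Koehn, 2003), not the system's actual code; the function name and the representation of alignments as sets of `(chinese_index, english_index)` pairs are assumptions.

```python
# Neighbouring offsets: horizontal, vertical and diagonal.
NEIGHBORS = [(-1, 0), (0, -1), (1, 0), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag_final(c2e, e2c):
    """Symmetrize two directional alignments given as sets of (c, e) links."""
    union = c2e | e2c
    alignment = set(c2e & e2c)          # start from the intersection

    def c_aligned(c):
        return any(link[0] == c for link in alignment)

    def e_aligned(e):
        return any(link[1] == e for link in alignment)

    # GROW-DIAG: add neighbouring union points that cover an unaligned word.
    added = True
    while added:
        added = False
        for c, e in sorted(alignment):
            for dc, de in NEIGHBORS:
                cand = (c + dc, e + de)
                if (cand in union and cand not in alignment
                        and (not c_aligned(cand[0]) or not e_aligned(cand[1]))):
                    alignment.add(cand)
                    added = True

    # FINAL: add remaining directional points that cover an unaligned word.
    for directional in (c2e, e2c):
        for cand in sorted(directional):
            if (cand not in alignment
                    and (not c_aligned(cand[0]) or not e_aligned(cand[1]))):
                alignment.add(cand)
    return alignment
```

Starting from the intersection keeps the high-precision links, while the grow and final steps recover links that only one direction proposed, which is what yields n-to-n alignments.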

Training (Cont.)

• Phrase Extraction
  - Similar to (Och, 2002).
  - We limited the length of phrases to between 1 and 6 words.
  - For each Chinese phrase, only the 20 best corresponding bilingual phrases were kept. The score $\sum_{i=1}^{N} \lambda_i \cdot h_i(\tilde{e}, \tilde{c})$ is used to evaluate and rank the bilingual phrases that share the same Chinese phrase.
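The 20-best pruning can be illustrated with a short sketch. The data layout (a dict from a Chinese phrase to candidate English phrases with feature vectors) and the function name are hypothetical; only the weighted-sum ranking and the cutoff of 20 come from the slide.

```python
import heapq

def prune_phrase_table(entries, weights, n_best=20):
    """Keep the n_best English candidates per Chinese phrase.

    entries: {c_phrase: [(e_phrase, [h_1, ..., h_N]), ...]}  (assumed layout)
    weights: [lambda_1, ..., lambda_N]
    """
    def score(features):
        # Weighted sum of feature values: sum_i lambda_i * h_i
        return sum(lam * h for lam, h in zip(weights, features))

    return {
        c_phrase: heapq.nlargest(n_best, candidates,
                                 key=lambda cand: score(cand[1]))
        for c_phrase, candidates in entries.items()
    }
```

`heapq.nlargest` returns the candidates sorted from best to worst, so the pruned table can be scanned in rank order during decoding.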

Training (Cont.)

• Phrase Probabilities
  - Phrase translation probability $p(\tilde{e} \mid \tilde{c})$
  - Inverse phrase translation probability $p(\tilde{c} \mid \tilde{e})$
  - Phrase lexical weight $lex(\tilde{e} \mid \tilde{c})$
  - Inverse phrase lexical weight $lex(\tilde{c} \mid \tilde{e})$

$$p(\tilde{e} \mid \tilde{c}) = \frac{N(\tilde{e}, \tilde{c})}{\sum_{\tilde{e}'} N(\tilde{e}', \tilde{c})}$$

$$lex(\tilde{e} \mid \tilde{c}) = lex(e_1^I \mid c_1^J, a) = \prod_{i=1}^{I} \frac{1}{|\{j \mid (i, j) \in a\}|} \sum_{\forall (i, j) \in a} p(e_i \mid c_j)$$
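To make the two estimates concrete, here is a small sketch of relative-frequency phrase probabilities and of the lexical weight. Function names and data layouts are assumptions. The handling of English words with no alignment link (scored against a NULL token, passed as `None`) follows the usual convention for (Koehn, 2003) and is not stated on the slide.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """Relative frequency p(e|c) = N(e, c) / sum_e' N(e', c).

    phrase_pairs: iterable of (c_phrase, e_phrase) extracted pairs.
    """
    pair_counts = Counter(phrase_pairs)
    c_totals = Counter()
    for (c, _e), n in pair_counts.items():
        c_totals[c] += n
    return {(c, e): n / c_totals[c] for (c, e), n in pair_counts.items()}

def lexical_weight(e_words, c_words, alignment, word_prob):
    """lex(e|c, a): product over English words of the average word
    translation probability over their aligned Chinese words.

    alignment: set of (i, j) linking e_words[i] to c_words[j].
    word_prob(e, c): word translation probability p(e|c); c is None
    for the NULL token (assumed convention for unaligned words).
    """
    weight = 1.0
    for i, e in enumerate(e_words):
        linked = [j for (ii, j) in alignment if ii == i]
        if not linked:
            weight *= word_prob(e, None)
        else:
            weight *= sum(word_prob(e, c_words[j]) for j in linked) / len(linked)
    return weight
```

The inverse quantities are obtained by swapping the roles of the Chinese and English sides in both functions.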


Translation Model

• We use log-linear modeling (Och, 2002):

$$\Pr(e_1^I \mid c_1^J) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m \cdot h_m(e_1^I, c_1^J)\right]}{\sum_{e'^{I}_1} \exp\left[\sum_{m=1}^{M} \lambda_m \cdot h_m(e'^{I}_1, c_1^J)\right]}$$

$$\hat{e}_1^I = \arg\max_{e_1^I} \sum_{m=1}^{M} \lambda_m \cdot h_m(e_1^I, c_1^J)$$

Translation Model (Cont.)

• Seven features
  - Phrase translation probability $p(\tilde{e} \mid \tilde{c})$
  - Inverse phrase translation probability $p(\tilde{c} \mid \tilde{e})$
  - Phrase lexical weight $lex(\tilde{e} \mid \tilde{c})$
  - Inverse phrase lexical weight $lex(\tilde{c} \mid \tilde{e})$
  - English language model $lm(e_1^I)$
  - English sentence length penalty $I$
  - Chinese phrase count penalty $-J'$
• We didn't use any features for reordering.


Parameters

• We didn't use a discriminative training method to train the parameters. We adjusted the parameters by hand.
• We didn't readjust the parameters on the development sets provided in this evaluation. We simply used an empirical setting with which our decoder had achieved good performance in translating the test set from the 2005 China's National 863 MT Evaluation.

Parameters (Cont.)

• The parameter settings for our system


Decoder

[Figure: decoder pipeline. Boxes: Source, Preprocess, Splitting the source, Sentence list, Loop (Sentence, a monotone DP-based decoder labeled "neon", Translation), Translation list, Merging the translations, Postprocess, Target. Resources: Bilingual phrases, English language model, Chinese language model.]

Decoder (Cont.)

• We used monotone search in decoding, similar to (Zens, 2002).
• Dynamic programming recursion:

$$Q(0, \$) = 1$$

$$Q(j, e) = \max_{0 \le j' < j;\ e', \tilde{e}} \left\{ Q(j', e') + \sum_{m=1}^{M} \lambda_m \cdot h_m(\tilde{e}, c_{j'+1}^{j}) \right\}$$

$$Q(J+1, \$) = \max_{e'} \left\{ Q(J, e') + p(\$ \mid e') \right\}$$


Dealing with the Unknown Words

• No special translation models are used for named entities. Named entities are translated in the same way as other unknown words.
• Unknown words were translated in two steps:
  - First, we look the word up in a dictionary containing more than 100,000 Chinese words.
  - If no translation is found in the first step, the word is translated using a rule-based Chinese-English translation system.
• All 63 unknown words in the test data for the Cleaned Transcripts task in this evaluation were translated into English.
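The two-step fallback amounts to the following sketch. The dictionary is modeled as a plain mapping and the rule-based system as a callable; both interfaces and the function name are assumptions for illustration.

```python
def translate_unknown_word(word, dictionary, rule_based_translate):
    """Step 1: dictionary lookup. Step 2: rule-based MT fallback."""
    entry = dictionary.get(word)
    if entry is not None:
        return entry
    # Hypothetical hook into a rule-based Chinese-English system.
    return rule_based_translate(word)
```

For example, `translate_unknown_word("厦门", {"厦门": "Xiamen"}, fallback)` returns the dictionary entry without ever calling `fallback`.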


Recovering the Missing Punctuation

• The Chinese input sentences contain no punctuation.
• Missing punctuation can have an adverse effect on translation quality, so we developed a preprocessing model to recover it.

Recovering the Missing Punctuation (Cont.)

• Two ways to deal with the missing punctuation
  - Method 1: Remove punctuation from the Chinese part of the training set, train the model on that training set, and then translate the unpunctuated sentences directly.
  - Method 2: Recover the punctuation of the input sentences, then translate the resulting sentences using a model trained on the normal training set.
• Experiments and results on development set 4

            BLEU-4
  Method 1  0.1936
  Method 2  0.2139

Recovering the Missing Punctuation (Cont.)

• Given a Chinese sentence with $N$ words, $w_1 w_2 \ldots w_N$, we may construct a directed graph with $2N+1$ levels.

Recovering the Missing Punctuation (Cont.)

• Given such a graph, punctuation recovery can be viewed as the problem of searching for the optimal path from the node in level 0 to the node in level 2N.
• In this problem, one path is better than another if its language model score is higher.
• We then used the Viterbi algorithm to solve the search problem.
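A minimal sketch of this search, under stated assumptions: word levels alternate with punctuation levels, each punctuation level offers "no mark" or one mark from a small candidate set, and paths are scored by a bigram language model supplied as a log-probability function. The exact graph layout, the candidate marks and all names are illustrative assumptions rather than the system's actual design.

```python
def recover_punctuation(words, bigram_logprob, marks=("，", "。", "？")):
    """Viterbi search for the best-scoring punctuated token sequence."""
    # best maps the last emitted token to (score, tokens so far).
    best = {"<s>": (0.0, [])}
    for w in words:
        # Word level: every path passes through w; keep the best predecessor.
        score, path = max(
            ((s + bigram_logprob(prev, w), p) for prev, (s, p) in best.items()),
            key=lambda item: item[0])
        after_word = (score, path + [w])
        # Punctuation level: either no mark (state w) or one of the marks.
        best = {w: after_word}
        for m in marks:
            best[m] = (after_word[0] + bigram_logprob(w, m),
                       after_word[1] + [m])
    # Close every surviving path with the sentence-end token.
    _, tokens = max(
        ((s + bigram_logprob(prev, "</s>"), p) for prev, (s, p) in best.items()),
        key=lambda item: item[0])
    return tokens

def toy_lm(prev, cur):
    """Toy bigram table for demonstration only."""
    table = {("<s>", "你好"): -0.5, ("你好", "吗"): -0.5,
             ("吗", "？"): -0.1, ("？", "</s>"): -0.1}
    return table.get((prev, cur), -5.0)
```

With a bigram model only the last token matters, so each level keeps one hypothesis per possible last token. A higher-order model would need longer histories as states.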


Translating the ASR Lattice

• In the task of translating the ASR lattice, three types of test data were given:
  - the word lattice
  - the 20-best results generated from the ASR lattice
  - the 1-best result generated from the ASR lattice
• A possibly better way is to regenerate the 1-best result from the word lattice using a Chinese language model and then translate it.

Translating the ASR Lattice (Cont.)

• We used a simpler, approximate way:
  - We first used our system to translate all the 20-best results, obtaining 20 translations for each corresponding sentence.
  - Then we used the English language model to choose the best translation for each sentence.
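The approximation can be written in a few lines. The `translate` and `lm_score` callables stand in for the decoder and the English language model; their interfaces and the function name are assumptions.

```python
def pick_best_translation(nbest_sources, translate, lm_score):
    """Translate every ASR hypothesis and keep the translation that
    the English language model scores highest."""
    translations = [translate(src) for src in nbest_sources]
    return max(translations, key=lm_score)
```

Note that this selection ignores the ASR and translation model scores entirely; only the fluency of the English output decides between hypotheses.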


Experiments

• The data we used

  Purpose                 Corpus                                                                          Statistics
  ----------------------  ------------------------------------------------------------------------------  ----------------------
  Bilingual phrases       Training set from IWSLT 2006                                                    39,952 sentence pairs
                          Training set from the 2005 China's National 863 MT Evaluation                   152,049 sentence pairs
  English language model  English part of the training set from the 2005 China's National 863 MT Evaluation  7.4M words
  Chinese language model  Chinese part of the training set from IWSLT 2006                                350K Chinese words
                          Chinese Reader (Duzhe) Corpus                                                   7.9M Chinese words

Experiments (Cont.)

• The use of additional data did help improve the performance of our system on the development sets.
• Influence of the additional bitexts (BLEU-4)

                     Training without     Training with
                     additional bitexts   additional bitexts
  development set 1  0.3305               0.3922
  development set 4  0.1869               0.2139
Experiments (Cont.)

• Scores of our system in IWSLT 2006

                                    official                     additional
                                    (with case + punctuation)    (without case + punctuation)
  Correct Recognition Result        0.1976                       0.2162
  CE read speech ASR output         0.1579                       0.1718
  CE spontaneous speech ASR output  0.1505                       0.1623

Experiments (Cont.)

• Some lessons
  - The scores on Correct Recognition Result are significantly higher than those on ASR output. This may result from the ASR errors. Another reason may be the simple method we used to translate the ASR lattice.

Experiments (Cont.)

• Some lessons (Cont.)
  - The scores on CE read speech ASR output are slightly higher than those on CE spontaneous speech ASR output. This indicates that the ASR system used to produce the ASR output may perform better on read speech than on spontaneous speech.

Experiments (Cont.)

• Some lessons (Cont.)
  - The additional scores are higher than the official scores. This indicates that post-editing models such as truecasing or punctuation correction may help improve translation quality. We will integrate such models in the future.


Conclusions

• We described the system of the Institute of Artificial Intelligence, Xiamen University, which participated in the 2006 IWSLT Speech Translation Evaluation.
• It is a rather crude phrase-based SMT baseline; for example, it does not even consider phrase reordering.
• More improvements are underway.

References

• Koehn, Philipp, Och, Franz Josef and Marcu, Daniel, "Statistical phrase-based translation", Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, 2003, pp. 127-133.
• Och, Franz Josef, "Statistical Machine Translation: From Single Word Models to Alignment Templates", Ph.D. thesis, RWTH Aachen, Germany, 2002.
• Och, Franz Josef and Ney, Hermann, "Discriminative training and maximum entropy models for statistical machine translation", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 2002, pp. 295-302.
• Och, Franz Josef, "Minimum error rate training in statistical machine translation", Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, 2003, pp. 160-167.
• Zens, Richard, Och, Franz Josef and Ney, Hermann, "Phrase-Based Statistical Machine Translation", Proceedings of the 25th German Conference on Artificial Intelligence (KI2002), ser. Lecture Notes in Artificial Intelligence (LNAI), M. Jarke, J. Koehler, and G. Lakemeyer, Eds., Vol. 2479. Aachen, Germany: Springer Verlag, September 2002, pp. 18-32.
• Koehn, Philipp, Axelrod, Amittai, Mayne, Alexandra Birch, Callison-Burch, Chris, Osborne, Miles and Talbot, David, "Edinburgh system description for the 2005 IWSLT speech translation evaluation", Proceedings of the International Workshop on Spoken Language Translation, Pittsburgh, PA, 2005.
• He, Zhongjun, Liu, Yang, Xiong, Deyi, Hou, Hongxu and Liu, Qun, "ICT System Description for the 2006 TCSTAR Run #2 SLT Evaluation", Proceedings of the TCSTAR Workshop on Speech-to-Speech Translation, Barcelona, Spain, 2006, pp. 63-68.
• Forney, G. D., "The Viterbi algorithm", Proceedings of the IEEE, 61(2): 268-278, 1973.
• Stolcke, Andreas, "SRILM - an extensible language modeling toolkit", Proceedings of the International Conference on Spoken Language Processing, 2002, volume 2, pp. 901-904.
• Chen, Stanley F. and Goodman, Joshua, "An empirical study of smoothing techniques for language modeling", Technical Report TR-10-98, Harvard University Center for Research in Computing Technology, 1998.

Thanks

This work was supported by the National Natural Science Foundation of China (Grant No. 60573189 and Grant No. 60373080), the National 863 High-tech Program (Grant No. 2004AA117010-06) and the Key Research Project of Fujian Province (Grant No. 2006H0038).