the ict statistical machine translation systems for iwslt 2007 zhongjun he, haitao mi, yang liu,...

40
The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Y un Huang, Zhixiang Ren, Yajuan Lu, Qun Liu Institute of Computing Technology Chinese Academy of Sciences 2007.09.15– 2007.08.16

Upload: hubert-norman

Post on 13-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

The ICT Statistical Machine Translation Systems for IWSLT 2007Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan Lu, Qun Liu

Institute of Computing TechnologyChinese Academy of Sciences

2007.09.15– 2007.08.16

Page 2: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Outline

OverviewMT Systems

BruinConfuciusLynx

Official EvaluationDiscussionSummary

Page 3: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Introduction of Our Group

Multilingual Interaction Technology Laboratory, Institute of Computing Technology, Chinese Academy SciencesLong history for working on MT

Rule-basedExample-based

Focus on SMT from 2004Website: http://mtgroup.ict.ac.cn/

Page 4: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

People Working on SMT at ICT

StaffsQun Liu (Researcher)Yajuan Lu (Associate Researcher) Yang Liu (Associate Researcher) Weihua Luo (Assistant Researcher)

PhD StudentsZhongjun HeHaitao MiJinsong SuYang Feng

Master StudentsYun HuangWenbin JiangZhixiang Ren…

Page 5: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

IWSLT 2007 Evaluation

Chinese-English transcript translation task

Page 6: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Systems for IWSLT 2007 Evaluation

MT Systems:Bruin (formally syntax-based)Confucius (extended phrase-based) Lynx (linguistically syntax-based)

Page 7: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Outline

OverviewMT Systems

BruinLynxConfucius

Official EvaluationDiscussionSummary

Page 8: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

BruinBruin is a formally syntax-based systemMaxEnt Reordering Model build on BTG rules

Regard reordering as a binary classificationBuilding a MaxEnt-based classifierUsing boundary words instead of whole phrases as features for the classifier

target

source

straight inverted

Page 9: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

FeaturesSource and target boundary words (lexical feature)Combinations of boundary words (collocation feature)

C1

E1

C2

E2

Target boundary head words

Source boundary head words

b1 b2

Page 10: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Training and Decoding

Training the modelLearning reordering examples from bilingual word-aligned corpusGenerating features from reordering examplesTraining a MaxEnt model on the features

DecodingCKY algorithm

For details, see Xiong et al., ACL2006

Page 11: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Confucius

An extended phrase-based systemLog-linear modelMonotone decodingWe try a phrase-based similarity model, in which a translation for a certain source phrase can be applied for other similar phrases

Page 12: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Phrase-based Similarity Model

全省 出口 总值 的 25.5%

Find the most similar phrase pair

全市 出口 总值 的 半数

half of the entire city 's export

volume

Page 13: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Phrase-based Similarity Model

全省 出口 总值 的 25.5%

Compare

全市 出口 总值 的 半数

half of the entire city 's export

volume

Page 14: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Phrase-based Similarity Model

全省 出口 总值 的 25.5%

Replace

全省 出口 总值 的 25.5%

25.5% of the entire province

's export

volume

Page 15: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Lynx

A linguistically syntax-based systemBased on tree-to-string alignment template (TAT), which map the source language tree to target language stringLog-linear Model

Page 16: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Translation Process: Parsing中国 的 经济 发展

NP

DNP NP

NP DEG NN NN

NR

中国

的 经济 发展

parsing

Page 17: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Translation Process: Detachment

NP

DNP NP

NP DEG NN NN

NR

中国

的 经济 发展

NP

DNP NP

DEG

NP

NR

中国

NP

NN NN

NN NN

经济 发展

detachment

NP

Page 18: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Translation Process : Production

NP

DNP NP

NP DEG

NP

NR

中国

NP

NN NN

NN NN

经济 发展

of

China

economic development

Page 19: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Translation Process : Combination

NP

DNP NP

DEG

NP

NR

中国

NP

NN NN

NN NN

经济 发展

of

China

economic development

economic development of China

combination

NP

Page 20: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Training and DecodingTraining

Extract TATs from word-aligned, source side parsed bilingual corpus using bottom-up strategyImpose several restrictions to decrease the magnitude

Decodingbottom-up beam search

For details, see Liu et al., ACL2006

Page 21: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

OutlineOverviewMT Systems

BruinLynxConfucius

Official EvaluationDiscussionSummary

Page 22: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Toolkits UsedWord alignment

GIZA++ plus “grow-diag-final” refinement methodLanguage model

SRILMChinese parser

Deyi Xiong’s A lexicalized PCFG model trained on PennTree bank

Chinese word segmentation ICTCLAS

Page 23: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Preprocessing and Postprocessing

PreprocessingChinese word segmentationRule-based translations of numbers, dates and Chinese namesChinese sentences Parsing (for Lynx only)

PostprocessingRemove unknown wordsCapitalize the first word of each sentence

Page 24: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Training data

Training Data List

Page 25: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Development and test set

Corpus statistics of the IWSLT 2006 and 2007 development and test set

Page 26: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Results on IWSLT 2006 developmentset and test set

Small data: The training data released by the IWSLT 2007Large data: All the training data

Page 27: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Results on IWSLT 2007 test set

Page 28: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

OutlineOverviewMT Systems

BruinLynxConfucius

Official EvaluationDiscussionSummary

Page 29: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion

Lynx(0.1777)Training Corpus:• Training data:

– About 39k sentence pairs dialogs data– Provided by IWSLT 2007

– About 5M sentence pairs newswire data – Released by LDC

• Domain is quite different – Newswire vs. Dialogs

• Newswire data is too large

Page 30: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion

Lynx (0.1777)Parser :• Trained on Penn Chinese Treebank• Domain is quite different too

– Newswire vs. Dialogs

• Parsing error (low performance of parser)• Lynx decoder

– Only depends on the 1-best parsing tree

Page 31: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion Models:

Bruin (0.3750)• MaxEnt based reordering model• Long distance word reordering

Confucius (0.2802)• Monotone search

2007 test set (2006 test set)• 6.7words/sent (12.7words/sent)

– Bruin will do better• Punctuation marks (no )

– More positive reordering information– Bruin will do better

Page 32: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion Models:

Bruin (0.3750)• MaxEnt based reordering model• Long distance word reordering

Confucius (0.2802)• Monotone search

2007 test set (2006 test set)• 6.7words/sent (12.7words/sent)

– Bruin will do better• Punctuation marks (no )

– More positive reordering information– Bruin will do better

Page 33: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion Models:

Bruin (0.3750)• MaxEnt based reordering model• Long distance word reordering

Confucius (0.2802)• Monotone search

2007 test set (2006 test set)• 6.7words/sent (12.7words/sent)

– Bruin will do better• Punctuation marks (no )

– More positive reordering information– Bruin will do better

2006 tst 2007 tstBruin 0.2283 0.3750

Confucius

0.2042 0.2802

Page 34: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion Models:

Bruin (0.3750)• MaxEnt based reordering model• Long distance word reordering

Confucius (0.2802)• Monotone search

2007 test set (2006 test set)• 6.7words/sent (12.7words/sent)

– Bruin will do better• Punctuation marks (no )

– More positive reordering information– Bruin will do better

2006 tst 2007 tstBruin 0.2283 0.3750

Confucius

0.2042 0.2802

Page 35: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Discussion Models:

Bruin (0.3750)• MaxEnt based reordering model• Long distance word reordering

Confucius (0.2802)• Monotone search

2007 test set (2006 test set)• 6.7words/sent (12.7words/sent)

– Bruin will do better• Punctuation marks (no )

– More positive reordering information– Bruin will do better

2006 tst 2007 tstBruin 0.2283 0.3750

Confucius

0.2042 0.2802

Page 36: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Results on IWSLT 2007 test set

Page 37: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Outline

OverviewSystems

BruinLynxConfucius

Official EvaluationDiscussionSummary

Page 38: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

SummaryMT

3 systems based on different translation models:• MaxEnt BTG Model• TAT model• Phrase-based Similarity Model

Future WorkMore new modelSystem combination

Page 39: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

ReferencesYang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-String Alignment Template for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 609-616, Sydney, Australia, July. Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation . In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 521-528, Sydney, Australia, July.

Page 40: The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan

Thanks!