the trgtk’s system description for patentmt at...

42
nlp.ict.ac.cn www.torangetek.com The TRGTK’s System Description for PatentMT at NTCIR-10 Hao Xiong Torangetek Inc. Institute of Computing Technology Chinese Academy of Sciences 2013-6-21 1

Upload: others

Post on 30-Jan-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

The TRGTK’s System Description for PatentMT at NTCIR-10

Hao XiongTorangetek Inc.

Institute of Computing TechnologyChinese Academy of Sciences

2013-6-21

1

Page 2: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Outline

• Something about TRGTK• Techniques used in Patent MT• Summary

2

Page 3: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Outline

• Something about TRGTK• Techniques used in Patent MT• Summary

3

Page 4: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Torangetek Inc.

• Torangetek– Translation + Orange + Technology

4

Page 5: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Torangetek Inc.

• Torangetek– Translation + Orange + Technology

5

Page 6: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Customers

• Spoken Translation Service

• Patent Translation Service

• E-business Translation Service

6

Page 7: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Outline

• Something about TRGTK• Techniques used in Patent MT• Summary

7

Page 8: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Goal

• Translation Speed Versus Translation Quality• NTCIR-9

– ICT: Academy– Quality

• NTCIR-10– Torangetek: Company– Fast and Stable Translation Service

8

Page 9: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Data Usage

System Bilingual Monolingual

C-E 1 Million 40 MillionJ-E 3 Million 40 MillionE-J 3 Million 73 Million

9

Page 10: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Final Results34.63

26.99

32.2133.46

26.34

31.4

21.52

26.34

31.97

27.28

32.91

0

5

10

15

20

25

30

35

40

C-E J-E E-J

Intrinsic Evaluation Chronological EvaluationMultilingual EvaluationNTCIR9

10

Page 11: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Overall ArchitectureBilingual Training Corpus Monolingual

Training CorpusPatent Grant

Data

Parallel Chinese Segmenter

Word Alignment

Rule Extraction

Language Model Training

Tokenization

Testing Document

Document-level Decoder

Translation History Memory

Translation Result

11

Page 12: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Traditional Chinese Segmenter

• Training is Off-line– Time Insensitive

• Testing is On-line– O(n2) for Effective Searching– Patent Sentences: more than 10 words– Time Sensitive

12

Page 13: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Parallel Chinese Segmenter

• Character-based Max-Entropy Model– Hwee Tou Ng and Jin Kiat Low, 2004– Meng et.,al, 2012– BMES– 13 Features– 10k Entities from Baidu-pedia

• Segmentation History Storage• Searching on GPUs

13

Page 14: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Segmentation History Storage

• Practical Application– Using 3% Words in Training Set

• Character-based Model– Context Length:2 characters

14

Page 15: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

Training Instance

15

Page 16: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

Testing Instance

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

16

Page 17: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

Testing Instance

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

String Matching

17

Page 18: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

与 沙龙 举行 了 会谈

Wrong Matched Segmentation

18

Page 19: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

与 沙龙 举行 了 会谈

布什与沙龙 什与沙龙举Wrong Matched Segmentation

巴马与沙龙 马与沙龙举

19

Page 20: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

举行 了 会谈

Correct Matched Segmentation

20

Page 21: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

_ _ 布什与 _ 布什与沙 布什与沙龙

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

什与沙龙举 与沙龙举行

举行 了 会谈

Context for left characters

21

Page 22: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

_ _ 布什与 _ 布什与沙 布什与沙龙

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

什与沙龙举 与沙龙举行

举行 了 会谈

布什 与 沙龙

22

Page 23: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running ExampleTraining Instance

Testing Instance

_ _ 布什与 _ 布什与沙 布什与沙龙

布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk

奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk

什与沙龙举 与沙龙举行

举行 了 会谈

布什 与 沙龙

Reduce 50% Searching Time23

Page 24: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Computing on GPUs

• Inspired by Youngmin Yi et., 2011– CKY-Parsing on GPUs

• GPUs– NVIDIA GTX480 GPU: 480 processing cores

• Compute Unified Device Architecture

24

Page 25: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Parallel Computing

• Hierarchical Computing– Layer 1: Segmentation with 1 Character– Layer 2: Segmentation with 2 Characters– …– Layer n: Segmentation with n Characters

• Computing in One Layer is Parallel

25

Page 26: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Parallel Computing

布=>B 什=>E 与=>S 沙=>B 龙=>E

布什 沙龙

布什 与 与 沙龙

布什 与 沙龙

什与 与沙

什=>B …

布 什与 …

… 布什 与沙龙

布什 与 沙 … 什 与沙龙

26

Page 27: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Experiments of Segmentation

32

40

0.3 0.505

1015202530354045

Using SHS No SHS

1GPU500GPUs

SHS: Segmentation Historical Storage(10 Million Extrinsic Patent Chinese Sentences)

27

minutes

Page 28: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Patent Translation

• Document Translation– Description of Patents

• Translation Consistency– Terminology

• Large Requirements– More than 10k docs per day

28

Page 29: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Document-level Decoder

• Document Generation– KNN Algorithm + Distributional Similarity

• Term Recognition– C-Value– Term Candidates Templates

• Documental CKY Algorithm

29

Page 30: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Documental CKY Algorithm

• Hierarchical Phrase-based Model• Linking Translation Consistent Cubes• Shared Score

– Local Score• Translation Prob, Lexical Trans Prob,…, Word

Count– Shared Language Model Score

• LMs from different contexts in each sentence

30

Page 31: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

W1,W2,..,Wi,..,Wn

Sentence 1

W1,W2,..,Wj,..,Wn

Sentence 2

W1,W2,..,Wk,..,Wn

Sentence 3

31

Page 32: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

W1,W2,..,Wi,..,Wn

Sentence 1

W1,W2,..,Wj,..,Wn

Sentence 2

W1,W2,..,Wk,..,Wn

Sentence 3

Term Recognition

32

Page 33: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

W1,..,Wi-1,Wi,Wi+1..,Wn

Sentence 1 Sentence 2 Sentence 3

Transi-1,0

Transi-1,1

W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi,0

Transi,1

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi+1,0

Transi+1,1

Generating Translation for each CubeLinking Translation Consistent Cubes

33

Page 34: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

W1,..,Wi-1,Wi,Wi+1..,Wn

Sentence 1 Sentence 2 Sentence 3

Transi-1,0

Transi-1,1

W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi,0

Transi,1

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi+1,0

Transi+1,1

Wi-1,Wi,Wi+1 :Trans0

Traditional CKY

Wi-1,Wi,Wi+1 :Trans0 Wi-1,Wi,Wi+1 :Trans0

34

Page 35: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Running Example

W1,..,Wi-1,Wi,Wi+1..,Wn

Sentence 1 Sentence 2 Sentence 3

Transi-1,0

Transi-1,1

W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi,0

Transi,1

Transi+1,0

Transi+1,1

Transi-1,0

Transi-1,1

Transi+1,0

Transi+1,1

S1:Wi-1,Wi,Wi+1 :Trans0 S2:Wi-1,Wi,Wi+1 :Trans0 S3:Wi-1,Wi,Wi+1 :Trans0

Our CKY: Push into one Cube

35

Page 36: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Parallel Training

45

8

05

101520253035404550

Training Time(hours)

1CPU100CPUs

36

Page 37: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Parallel Decoding

50

800

1000

0

200

400

600

800

1000

1200

Decoding Speed(words/sec)

1CPU100CPUs100CPUs+THS

THS: Translation Historical Storage(1 Million Training Data)Testing Sentences: Extrinsic 10k Sentences

37

Page 38: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Outline

• Something about TRGTK• Techniques used in Patent MT• Summary

38

Page 39: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Summary

• Fast and Stable Translation Service• Parallel Chinese Segmenter• Computing on GPUs• Using Seg/Trans History Storage• Document-level Decoder

39

Page 40: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com

Discussion

• Patent MT helps Translators– Google API: reduce 50% time– Our API (20M Training Sentences, 1M

Terminologies, Refined Segmenter): reduce 90% time

40

Page 41: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cnwww.torangetek.com 41

一种治疗肝纤维化及肝硬化的中成药及其制剂的制备方法,属中成药及其制备方法技术领域。其中成药原料组分:郁金、猪苓、赤芍、青皮、木香、大腹皮、泽泻、厚朴(制)、鸡内金、鳖甲。

A Chinese medicinal composition for treating hepatic fibrosis and liver cirrhosis and its preparation process, and belongs to the field of chinese medicine and its prepn technology. Wherein the medicinal material component: Curcmae Rhizoma, pig 苓, Radix Paeoniae Rubra, Pericarpium Citri Reticulatae Viride, Radix Aucklandiae, Pericarpium Arecae, Rhizoma Alismatis, Cortex Magnoliae Officinalis preparata, Endothelium Corneum GigeriaeGalli, Carapax Trionycis.

Page 42: The TRGTK’s System Description for PatentMT at NTCIR-10research.nii.ac.jp/ntcir/workshop/Online... · nlp.ict.ac.cn The TRGTK’s System Description for PatentMT at NTCIR-10 Hao

nlp.ict.ac.cn

谢谢!Thank you!ありがとうDankeJe vous remercie!감사합니다!ขอบคุณ!

42