the trgtk’s system description for patentmt at...
TRANSCRIPT
nlp.ict.ac.cnwww.torangetek.com
The TRGTK’s System Description for PatentMT at NTCIR-10
Hao XiongTorangetek Inc.
Institute of Computing TechnologyChinese Academy of Sciences
2013-6-21
1
nlp.ict.ac.cnwww.torangetek.com
Outline
• Something about TRGTK• Techniques used in Patent MT• Summary
2
nlp.ict.ac.cnwww.torangetek.com
Outline
• Something about TRGTK• Techniques used in Patent MT• Summary
3
nlp.ict.ac.cnwww.torangetek.com
Torangetek Inc.
• Torangetek– Translation + Orange + Technology
4
nlp.ict.ac.cnwww.torangetek.com
Torangetek Inc.
• Torangetek– Translation + Orange + Technology
5
nlp.ict.ac.cnwww.torangetek.com
Customers
• Spoken Translation Service
• Patent Translation Service
• E-business Translation Service
6
nlp.ict.ac.cnwww.torangetek.com
Outline
• Something about TRGTK• Techniques used in Patent MT• Summary
7
nlp.ict.ac.cnwww.torangetek.com
Goal
• Translation Speed Versus Translation Quality• NTCIR-9
– ICT: Academy– Quality
• NTCIR-10– Torangetek: Company– Fast and Stable Translation Service
8
nlp.ict.ac.cnwww.torangetek.com
Data Usage
System Bilingual Monolingual
C-E 1 Million 40 MillionJ-E 3 Million 40 MillionE-J 3 Million 73 Million
9
nlp.ict.ac.cnwww.torangetek.com
Final Results34.63
26.99
32.2133.46
26.34
31.4
21.52
26.34
31.97
27.28
32.91
0
5
10
15
20
25
30
35
40
C-E J-E E-J
Intrinsic Evaluation Chronological EvaluationMultilingual EvaluationNTCIR9
10
nlp.ict.ac.cnwww.torangetek.com
Overall ArchitectureBilingual Training Corpus Monolingual
Training CorpusPatent Grant
Data
Parallel Chinese Segmenter
Word Alignment
Rule Extraction
Language Model Training
Tokenization
Testing Document
Document-level Decoder
Translation History Memory
Translation Result
11
nlp.ict.ac.cnwww.torangetek.com
Traditional Chinese Segmenter
• Training is Off-line– Time Insensitive
• Testing is On-line– O(n2) for Effective Searching– Patent Sentences: more than 10 words– Time Sensitive
12
nlp.ict.ac.cnwww.torangetek.com
Parallel Chinese Segmenter
• Character-based Max-Entropy Model– Hwee Tou Ng and Jin Kiat Low, 2004– Meng et.,al, 2012– BMES– 13 Features– 10k Entities from Baidu-pedia
• Segmentation History Storage• Searching on GPUs
13
nlp.ict.ac.cnwww.torangetek.com
Segmentation History Storage
• Practical Application– Using 3% Words in Training Set
• Character-based Model– Context Length:2 characters
14
nlp.ict.ac.cnwww.torangetek.com
Running Example
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
Training Instance
15
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
Testing Instance
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
16
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
Testing Instance
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
String Matching
17
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
与 沙龙 举行 了 会谈
Wrong Matched Segmentation
18
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
与 沙龙 举行 了 会谈
布什与沙龙 什与沙龙举Wrong Matched Segmentation
巴马与沙龙 马与沙龙举
19
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
举行 了 会谈
Correct Matched Segmentation
20
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
_ _ 布什与 _ 布什与沙 布什与沙龙
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
什与沙龙举 与沙龙举行
举行 了 会谈
Context for left characters
21
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
_ _ 布什与 _ 布什与沙 布什与沙龙
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
什与沙龙举 与沙龙举行
举行 了 会谈
布什 与 沙龙
22
nlp.ict.ac.cnwww.torangetek.com
Running ExampleTraining Instance
Testing Instance
_ _ 布什与 _ 布什与沙 布什与沙龙
布 什 与 沙 龙 举 行 了 会 谈Bush and Sharon hold a talk
奥巴马 与 沙龙 举行 了 会谈Obama and Sharon hold a talk
什与沙龙举 与沙龙举行
举行 了 会谈
布什 与 沙龙
Reduce 50% Searching Time23
nlp.ict.ac.cnwww.torangetek.com
Computing on GPUs
• Inspired by Youngmin Yi et., 2011– CKY-Parsing on GPUs
• GPUs– NVIDIA GTX480 GPU: 480 processing cores
• Compute Unified Device Architecture
24
nlp.ict.ac.cnwww.torangetek.com
Parallel Computing
• Hierarchical Computing– Layer 1: Segmentation with 1 Character– Layer 2: Segmentation with 2 Characters– …– Layer n: Segmentation with n Characters
• Computing in One Layer is Parallel
25
nlp.ict.ac.cnwww.torangetek.com
Parallel Computing
布=>B 什=>E 与=>S 沙=>B 龙=>E
布什 沙龙
布什 与 与 沙龙
布什 与 沙龙
什与 与沙
什=>B …
布 什与 …
… 布什 与沙龙
布什 与 沙 … 什 与沙龙
26
nlp.ict.ac.cnwww.torangetek.com
Experiments of Segmentation
32
40
0.3 0.505
1015202530354045
Using SHS No SHS
1GPU500GPUs
SHS: Segmentation Historical Storage(10 Million Extrinsic Patent Chinese Sentences)
27
minutes
nlp.ict.ac.cnwww.torangetek.com
Patent Translation
• Document Translation– Description of Patents
• Translation Consistency– Terminology
• Large Requirements– More than 10k docs per day
28
nlp.ict.ac.cnwww.torangetek.com
Document-level Decoder
• Document Generation– KNN Algorithm + Distributional Similarity
• Term Recognition– C-Value– Term Candidates Templates
• Documental CKY Algorithm
29
nlp.ict.ac.cnwww.torangetek.com
Documental CKY Algorithm
• Hierarchical Phrase-based Model• Linking Translation Consistent Cubes• Shared Score
– Local Score• Translation Prob, Lexical Trans Prob,…, Word
Count– Shared Language Model Score
• LMs from different contexts in each sentence
30
nlp.ict.ac.cnwww.torangetek.com
Running Example
W1,W2,..,Wi,..,Wn
Sentence 1
W1,W2,..,Wj,..,Wn
Sentence 2
W1,W2,..,Wk,..,Wn
Sentence 3
31
nlp.ict.ac.cnwww.torangetek.com
Running Example
W1,W2,..,Wi,..,Wn
Sentence 1
W1,W2,..,Wj,..,Wn
Sentence 2
W1,W2,..,Wk,..,Wn
Sentence 3
Term Recognition
32
nlp.ict.ac.cnwww.torangetek.com
Running Example
W1,..,Wi-1,Wi,Wi+1..,Wn
Sentence 1 Sentence 2 Sentence 3
Transi-1,0
Transi-1,1
…
W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi,0
Transi,1
…
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi+1,0
Transi+1,1
…
Generating Translation for each CubeLinking Translation Consistent Cubes
33
nlp.ict.ac.cnwww.torangetek.com
Running Example
W1,..,Wi-1,Wi,Wi+1..,Wn
Sentence 1 Sentence 2 Sentence 3
Transi-1,0
Transi-1,1
…
W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi,0
Transi,1
…
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi+1,0
Transi+1,1
…
Wi-1,Wi,Wi+1 :Trans0
Traditional CKY
Wi-1,Wi,Wi+1 :Trans0 Wi-1,Wi,Wi+1 :Trans0
34
nlp.ict.ac.cnwww.torangetek.com
Running Example
W1,..,Wi-1,Wi,Wi+1..,Wn
Sentence 1 Sentence 2 Sentence 3
Transi-1,0
Transi-1,1
…
W1,..,Wi-1,Wi,Wi+1..,Wn W1,..,Wi-1,Wi,Wi+1..,Wn
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi,0
Transi,1
…
Transi+1,0
Transi+1,1
…
Transi-1,0
Transi-1,1
…
Transi+1,0
Transi+1,1
…
S1:Wi-1,Wi,Wi+1 :Trans0 S2:Wi-1,Wi,Wi+1 :Trans0 S3:Wi-1,Wi,Wi+1 :Trans0
Our CKY: Push into one Cube
35
nlp.ict.ac.cnwww.torangetek.com
Parallel Training
45
8
05
101520253035404550
Training Time(hours)
1CPU100CPUs
36
nlp.ict.ac.cnwww.torangetek.com
Parallel Decoding
50
800
1000
0
200
400
600
800
1000
1200
Decoding Speed(words/sec)
1CPU100CPUs100CPUs+THS
THS: Translation Historical Storage(1 Million Training Data)Testing Sentences: Extrinsic 10k Sentences
37
nlp.ict.ac.cnwww.torangetek.com
Outline
• Something about TRGTK• Techniques used in Patent MT• Summary
38
nlp.ict.ac.cnwww.torangetek.com
Summary
• Fast and Stable Translation Service• Parallel Chinese Segmenter• Computing on GPUs• Using Seg/Trans History Storage• Document-level Decoder
39
nlp.ict.ac.cnwww.torangetek.com
Discussion
• Patent MT helps Translators– Google API: reduce 50% time– Our API (20M Training Sentences, 1M
Terminologies, Refined Segmenter): reduce 90% time
40
nlp.ict.ac.cnwww.torangetek.com 41
一种治疗肝纤维化及肝硬化的中成药及其制剂的制备方法,属中成药及其制备方法技术领域。其中成药原料组分:郁金、猪苓、赤芍、青皮、木香、大腹皮、泽泻、厚朴(制)、鸡内金、鳖甲。
A Chinese medicinal composition for treating hepatic fibrosis and liver cirrhosis and its preparation process, and belongs to the field of chinese medicine and its prepn technology. Wherein the medicinal material component: Curcmae Rhizoma, pig 苓, Radix Paeoniae Rubra, Pericarpium Citri Reticulatae Viride, Radix Aucklandiae, Pericarpium Arecae, Rhizoma Alismatis, Cortex Magnoliae Officinalis preparata, Endothelium Corneum GigeriaeGalli, Carapax Trionycis.
nlp.ict.ac.cn
谢谢!Thank you!ありがとうDankeJe vous remercie!감사합니다!ขอบคุณ!
42