Can Controlled Language Rules increase the value of MT?
Fred Hollowood & Johann Rotourier
Symantec Dublin
Localisation Challenge
Databases filled with English content
• Large volumes
• Perishable
• Technical
Fast delivery
Cost effective
Goals
Reduce cost of translation to 30%
• Implement CL within the authoring community
• Foster the use of editor software to police the CL rule set
• Identify the most efficient MT system for each target language
• Develop Post-Editing guidelines
• Refine Symantec glossaries to assist in dictionary preparation
Controlled Language and MT
Controlled Language
• Rule Sets
• Terminology
• Style
• Editors
MT system
• Language Pairs: Jp, De, Fr, It, Es
Post Editing Assessment
Sequence of Events
Identify a corpus
Develop a test suite
Develop terminology
Work with MT engines
Assess results
Two Questions
How effective are CL rules in terms of post-editing effort?
Which CL rules provide the best results?
Corpus Selection
Origin
• Stream of XML messages
Volume
• 30,000 words
Process
• Use TM technology to pre-process raw XML to provide strings for MT
• Use macros to tidy up untranslatable text
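As a rough illustration of the pre-processing step, the sketch below pulls translatable strings out of an XML message stream using only Python's standard library. The element names (`messages`, `message`, `text`), the sample content, the placeholder syntax (`%d`, `%s`, `{0}`), and the `<PH/>` token are all assumptions made for illustration; the slides do not describe the actual XML schema, TM tool, or macros.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real message schema is not described in the slides.
SAMPLE_XML = """
<messages>
  <message id="1"><text>Scan completed. %d threats found.</text></message>
  <message id="2"><text>  Update   failed: see {0} for details.  </text></message>
  <message id="3"><text></text></message>
</messages>
"""

# Assumed placeholder syntax for untranslatable text (printf-style and {n}).
PLACEHOLDER = re.compile(r"%[sd]|\{\d+\}")


def extract_strings(xml_text):
    """Return cleaned, non-empty text strings ready to be sent to MT."""
    root = ET.fromstring(xml_text.strip())
    strings = []
    for node in root.iter("text"):
        raw = node.text or ""
        cleaned = " ".join(raw.split())  # collapse runs of whitespace
        if cleaned:                      # skip empty messages
            strings.append(cleaned)
    return strings


def protect_placeholders(segment):
    """Replace untranslatable placeholders with a stable token (assumed form)."""
    return PLACEHOLDER.sub("<PH/>", segment)


segments = [protect_placeholders(s) for s in extract_strings(SAMPLE_XML)]
for segment in segments:
    print(segment)
```

Masking placeholders before translation is one common way to keep an MT engine from altering them; whether the macros mentioned on the slide worked this way is not stated.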
Terminology Extraction
Extraction
• Tools: Wordsmith Tools 4
Removal of duplicates
• Spelling variants
• Hyphenation variants
• Capitalisation variants
• Symbol/Plain
• Abbreviation/Plain
Removal of synonyms
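The duplicate-removal step can be pictured with a small sketch that groups term candidates differing only in capitalisation or hyphenation. The candidate terms and the function names are invented for illustration. Spelling variants, symbol/plain pairs, abbreviation/plain pairs, and synonyms would need a hand-built mapping and are not covered here.

```python
import re
from collections import defaultdict


def normalise(term):
    """Map surface variants of a term to one comparison key."""
    key = term.lower()                # capitalisation variants
    key = re.sub(r"[-\s]+", "", key)  # hyphenation variants: "e-mail", "e mail"
    return key


def group_variants(terms):
    """Group candidate terms that share the same normalised key."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalise(term)].append(term)
    return dict(groups)


# Invented candidates, as a term extraction tool might list them.
candidates = ["e-mail", "Email", "email", "fire wall", "Firewall",
              "anti-virus", "antivirus"]

groups = group_variants(candidates)
duplicates = {key: forms for key, forms in groups.items() if len(forms) > 1}
for key, forms in duplicates.items():
    print(key, "<-", forms)
```

A terminologist would still choose the preferred form in each group by hand; the sketch only surfaces the candidates for merging.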
Custom Dictionaries
Current MT systems
• Systran Premium 4.0
• Logomedia Translate Pro
— Differing capabilities
— Differing function
Per target language
• Grammars
• Styles
Test Suite
59 rules examined
17 of which already encapsulated in Symantec’s writing guidelines
Classification
• 8 lexical
• 40 syntactic
• 11 textual
Controlled Language Sources
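Breakdown of CL sources (59)
[Pie chart. Sources: Attempto, Bernth & Gdaniec, Personal, PACE, AECMA, Easy English, O'Brien. Slice values as extracted: 17, 18, 4, 1, 1, 5, 13. The pairing of values with sources is not shown.]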
Testing the Rules
Process
• Find an example sentence that does not conform to the rule
• Edit it to conform to all other rules under study
• Minimize the linguistic complexity (single test)
• Apply the CL rule
• Repeat the procedure to obtain 3 test examples
Test Suite
• 59 rules expressed as 177 sentences
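One possible way to hold such a test suite in code is sketched below. The class and field names are hypothetical, and the sample sentence pairs are invented; only the rule wording, the category counts (8 lexical, 40 syntactic, 11 textual) and the three-examples-per-rule figure come from the slides.

```python
from dataclasses import dataclass, field

EXAMPLES_PER_RULE = 3  # from the slides: 3 test examples per rule


@dataclass
class SentencePair:
    uncontrolled: str  # breaks only the rule under test
    controlled: str    # the same sentence after the rule is applied


@dataclass
class Rule:
    rule_id: str
    category: str      # "lexical", "syntactic" or "textual"
    text: str
    pairs: list = field(default_factory=list)


def suite_size(rules):
    """Total number of test examples across all rules."""
    return sum(len(rule.pairs) for rule in rules)


def incomplete(rules):
    """Ids of rules that do not yet have the required number of examples."""
    return [r.rule_id for r in rules if len(r.pairs) != EXAMPLES_PER_RULE]


# Invented example pairs for one rule quoted later in the deck.
particle_rule = Rule(
    rule_id="R1",  # hypothetical identifier
    category="syntactic",
    text="Always write a verb next to its particle.",
    pairs=[
        SentencePair("Turn the firewall off.", "Turn off the firewall."),
        SentencePair("Switch the scanner on.", "Switch on the scanner."),
        SentencePair("Back the files up.", "Back up the files."),
    ],
)

# Category counts reported on the Test Suite slide.
CATEGORY_COUNTS = {"lexical": 8, "syntactic": 40, "textual": 11}
total_rules = sum(CATEGORY_COUNTS.values())
total_examples = total_rules * EXAMPLES_PER_RULE
print(total_rules, total_examples)
```

The arithmetic agrees with the slides: 8 + 40 + 11 = 59 rules, and 59 × 3 = 177 examples.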
Post Editing Guidelines
Ensure information transfer
Modify what is grammatically deviant from commercial quality
Modify what is lexically essential for understanding in target.
Avoid the use of synonyms for the sake of originality
Don’t forget that all the words are probably present in the output (possibly in the wrong order).
Remember that style does not matter but information accuracy does.
Don’t dally: if an improvement is not obvious, move along.
Metrics Generation
Quality levels
• Excellent (4), Good (3), Medium (2), Poor (1)
• Uncontrolled source generates output A
• Controlled source generates output B
Focus is on Usability
Evaluation by native speakers
Further study is being done to link into other systems of quality evaluation
• Blackjack
• SAE J2450
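To make the A/B comparison concrete, the sketch below maps the four quality levels to their numeric values and compares ratings of output A (uncontrolled source) with output B (controlled source). The ratings and rule identifiers are invented, and summing the per-example difference as a rule's "gain" is an assumption; the slides do not say how the per-rule scores were calculated.

```python
from statistics import mean

# Quality levels from the slide: Excellent (4), Good (3), Medium (2), Poor (1).
LEVELS = {"P": 1, "M": 2, "G": 3, "E": 4}

# Invented ratings for illustration only:
# (hypothetical rule id, rating of output A, rating of output B)
ratings = [
    ("R1", "M", "E"),
    ("R1", "P", "G"),
    ("R1", "G", "G"),
    ("R2", "G", "E"),
    ("R2", "M", "M"),
    ("R2", "M", "G"),
]


def gain_per_rule(rows):
    """Sum of (B - A) over all examples of each rule (assumed metric)."""
    gains = {}
    for rule_id, a, b in rows:
        gains[rule_id] = gains.get(rule_id, 0) + LEVELS[b] - LEVELS[a]
    return gains


def mean_score(rows, index):
    """Average numeric score of column `index` (1 = output A, 2 = output B)."""
    return mean(LEVELS[row[index]] for row in rows)


gains = gain_per_rule(ratings)
mean_a = mean_score(ratings, 1)
mean_b = mean_score(ratings, 2)
print(gains, round(mean_a, 2), round(mean_b, 2))
```

A positive gain means the controlled version was rated higher by the evaluators; a gain of zero means the rule made no measurable difference on those examples.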
Overall evaluation (French)
Overall evaluation of 177 examples (Systran French)
[Bar chart. X axis: Scores (P, M, G, E). Y axis: Number of examples (0 to 140). Series: MT output A, MT output B. Data labels as extracted: 38, 46, 44, 49 and 2, 2, 58, 115.]
Overall evaluation (Japanese)
Overall evaluation of 177 examples (Logomedia Japanese)
[Bar chart. X axis: Scores (P, M, G, E). Y axis: Number of examples (0 to 80). Series: MT output A, MT output B. Data labels as extracted: 32, 72, 42, 30 and 13, 50, 52, 62.]
Overall evaluation (German)
Overall evaluation of 177 examples (Systran German)
[Bar chart. X axis: Scores (P, M, G, E). Y axis: Number of examples (0 to 100). Series: MT output A, MT output B. Data labels as extracted: 25, 53, 57, 42 and 0, 20, 71, 86.]
Preliminary Results
CL has a significant impact
Benefit varies by language
• Lots of scope for further study
Some rules are more effective than others (score range: 0-17)
Symantec’s implied rules have mixed effectiveness
Recommend 7 additional rules
Additional rules
Rules with an impact in all languages
• Do not omit words within lexical items, even when the term has already been used in the sentence (12). Repeat the head noun with conjoined articles or prepositions (15).
• Do not use slashes to list lexical items (except for product names) (14).
• Always write a verb next to its particle (17).
• Only use the modal ‘could’ when the sentence contains ‘if’; otherwise use ‘can’ (10).
• Be very careful with the -ing words: if it is a gerund, use an article in front of it (7). If it is introducing a new clause, use ‘by’ in front of it (8). If it is modifying a noun in a non-finite clause, replace it with a relative clause (5).
• Make sure that every segment can stand syntactically alone (11).
• Avoid footnotes in the middle of a segment. Turn footnotes into independent segments (11).
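Two of these rules are simple enough to approximate with pattern matching, which gives a feel for what a language checker has to do. The sketch below is a deliberately naive illustration, not how Acrolinx or any other checker works: the function names are hypothetical, the regular expressions are assumptions, and the product-name exception to the slash rule is not handled.

```python
import re


def uses_slash_list(sentence):
    """Flag 'word/word' constructions (product names are NOT exempted here)."""
    return bool(re.search(r"\b[A-Za-z]+/[A-Za-z]+\b", sentence))


def uses_could_without_if(sentence):
    """Flag the modal 'could' in a sentence that contains no 'if'."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return "could" in words and "if" not in words


# Hypothetical rule names mapped to their checks.
CHECKS = {
    "no-slash-lists": uses_slash_list,
    "could-needs-if": uses_could_without_if,
}


def violations(sentence):
    """Return the names of all rules the sentence appears to break."""
    return [name for name, check in CHECKS.items() if check(sentence)]


# Invented sentences for illustration.
for example in [
    "Enable/disable the firewall.",
    "You could lose data.",
    "If the scan fails, you could lose data.",
    "You can enable or disable the firewall.",
]:
    print(example, "->", violations(example))
```

The remaining rules, such as those on -ing words or verb particles, need part-of-speech information and cannot be checked reliably with surface patterns like these.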
Next Steps
Apply subsets of rules to a larger corpus
• Language checker Acrolinx
Increase the number of MT engines studied
• Comprendium/Prompt (European languages)
• Fujitsu/Nova’s PC Transer (Japanese)
Further refine Post Editing guidelines
Keep abreast of upgrades in current systems
• Bugs fixed
• New versions of software
Move to a production pilot project