
Advanced Decision Architectures Collaborative Technology Alliance

A Task-Based Evaluation Method for Embedded Machine Translation in Instant Messaging Systems

William Ogden, New Mexico State University


MT good enough?

Since 1981, Bill Ogg with human and computer interaction research is that cognitive psychologists. He is an expert ergonomic design, evaluation, sample, including the execution of the development of software interfaces are participating in the first half.

Work with the information retrieval research for the interaction between multiple languages and language users in applications is about a communication problem. In particular, the current instant messaging applications to evaluate machine translation technology used to apply the task-centric approach has been developed. He is currently in New Mexico State University, undergraduate and graduate students how to design a communication is to continue to teach.


MT Evaluation

• Typically, machine translations are compared to human translations with computed distance metrics

• Good for system development and cross-system comparisons.

• Translations can also be rated for adequacy and fluency.
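To make "computed distance metrics" concrete, here is a minimal sketch of one such metric: a word-level edit distance between a machine translation and a human reference, normalized by reference length. The slides do not say which metric was used, so this choice, the function names, and the example sentences are all assumptions made for illustration.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    hyp = hypothesis.split()
    ref = reference.split()
    # previous[j] is the distance between the hypothesis words processed
    # so far (minus the current one) and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_distance(hypothesis: str, reference: str) -> float:
    """Edit distance divided by reference length (0.0 means identical)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


# Hypothetical example sentences, not taken from the study data.
machine_output = "send fuel truck to north bridge"
human_reference = "send the fuel truck to the north bridge"
print(normalized_distance(machine_output, human_reference))
```

A score like this is cheap to compute and useful for comparing systems, but it says nothing about whether two people could finish a job with the translation, which is the gap the task-based method is meant to fill.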


Machine Translation (MT)

• What is it good for?
– Our goal is to evaluate the usefulness of automatically translated language for Army applications.

• Proposed applications.
– Document translation
– Speech to speech dialog
– Text instant messaging
– … and others.


CCL+ (TrIM) in field tests

• CCL enabled precise, rapid reporting and consultation on illness and injury, as well as location and availability of medical resources, simultaneously in thirteen languages (Campbell and Hillenbrand, 2005).

• "The TRiM (Translingual Instant Messaging) language application tool is an example of a marvelous application that crosses the language barrier. As the Fire Support Coordinator (FSC) for JWID 03, responsible for all ground artillery, naval gun and close air support, I was required to work with the Spanish Army. TRiM effectively enabled me to write my message on a whiteboard … and send it straight to Spain. They receive it in Spanish and can then respond to … me and I receive it in English." (Joyce, 2003)

James R. Campbell and Chris Hillenbrand (2005). CCL for Operational Medical Support. Military Medical Technology Online Archives, Volume 9, Issue 1, Jan 26, 2005. Retrieved from http://www.military-medical-technology.com/article.cfm?DocID=784

John R. Joyce (2003). Coalition Interoperability Tested at Dahlgren During JWID 2003. CHIPS - The Department of the Navy Information Technology Magazine, Fall 2003. Retrieved from http://www.chips.navy.mil/archives/03_fall/PDF/JWID_2003.pdf


Evaluation Goals

• Evaluate MT in the context of Instant Messaging (TrIM - MITRE) for Army coalition coordination

• Discover applicable contexts

• Develop tasks that can be used to encourage realistic task-oriented conversation

• Characterize these task domains

• Improve the MT application technology

• Discover user expectations concerning the capabilities of MT in this environment


Task-based evaluations

• MT is embedded in the application

• Test users are given realistic tasks

• Task measures are used to evaluate MT usefulness and effectiveness

• Linguistic measures are used to evaluate MT technology


Task-based advantages

• Answers the “good enough?” question

• Provides formative evaluations of the user interface

• Provides insights for MT development.


Task-based disadvantages

• Are the tasks realistic enough?

• Are the test users representative and available?


General method

• Participants.
– Pairs of native English, Japanese, Korean, Spanish, and Chinese speakers.

• Environment.
– Pairs were seated in separate rooms with a task window and a TrIM window.
– Task window covered the English dialog for non-English pairs.


General method

• Procedure
– Collaborative task instructions were presented.
– 10–20 minute practice task.
– Tasks were presented usually in two parts.
– Participants worked until the task was completed (mutual agreement) or the time limit was reached (rarely).
– Non-English participants rated the translation quality (separate session).


Evolving tasks

Conducted a series of studies with an evolving set of information sharing tasks
– Table fill-in
– Logistic Map
  • Situation assessment
  • Open-ended logistic requests
  • Discrete trial logistic requests
– Shared whiteboard planning
– Picture identification


Analysis of Tasks

• Task characteristics determine structure of the communication

• Fixed or stylistic messaging strategies will be best served by application interfaces providing custom, human-translated versions


Typical Logistic Task Finding

Language Pair      Task Time (hrs)   Map Errors   Message Count
Chinese-English    1.4               6            162
Japanese-English   1.35              6            199
Korean-English     1.2               8            191
Spanish-English    1                 2            160
English-English    0.8               1            96

• Task performance is slowed
– But not prevented (completion rates > .90)
– Translations judged “adequate” > 80%
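One way to read the table above is to express each language pair's cost relative to the English-English pairs. The sketch below does that arithmetic with the figures copied from the table; treating English-English as the baseline and the names `RESULTS` and `relative_cost` are assumptions for illustration, not part of the original analysis.

```python
# Figures copied from the "Typical Logistic Task Finding" table.
# Each entry: (task time in hours, map errors, message count).
RESULTS = {
    "Chinese-English": (1.4, 6, 162),
    "Japanese-English": (1.35, 6, 199),
    "Korean-English": (1.2, 8, 191),
    "Spanish-English": (1.0, 2, 160),
    "English-English": (0.8, 1, 96),
}


def relative_cost(results, baseline="English-English"):
    """Return {pair: (time ratio, message ratio)} against the baseline pair."""
    base_time, _, base_messages = results[baseline]
    return {
        pair: (time / base_time, messages / base_messages)
        for pair, (time, _, messages) in results.items()
        if pair != baseline
    }


for pair, (time_ratio, message_ratio) in relative_cost(RESULTS).items():
    print(f"{pair}: {time_ratio:.2f}x time, {message_ratio:.2f}x messages")
```

On these numbers the translated pairs took between 1.25 and 1.75 times as long as the English-English pairs, which fits the summary that performance is slowed but not prevented.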


Implications

• MT works for IM but could work better
– It works because people are engaged and can negotiate meanings
– It could work better if the technology supported negotiation and repair.


Meta comment analysis

Semantic Category   Occurrences (total/unique)   Percent Poorly Translated
                    Chinese      Korean          Chinese      Korean
Yes                 54/8         107/26          15           18
No                  16/6         29/9            19           4
OK                  92/13        85/11           19           9
Ready               27/24        22/20           11           23
What?               40/32        94/71           32           32
Understand?         11/9         8/7             9            0
Wait                8/5          4/3             37           50


Meta-buttons


Meta-Button Results

                Without Meta-Buttons   With Meta-Buttons
Meta messages   31                     47
Task messages   65                     54
Task time       69 min                 52 min


Discrete trial logistic task


Comparing Korean Translators

                  Average Solution Time (Seconds)   Correct solutions   Matching solutions
English Control   109                               16.5                16.0
Korean 1          155                               16.2                17.3
Korean 2          179                               14.5                15.0


Picture Identification


Multiple Web Translations


Multiple Web Translations

• Translations obtained from three web sites
– Google
– BizLingo (from Excite in Japan)
– Amikai

• Source of initial translation balanced across trials.

• 16 pairs of Japanese-English teams

• 12 pairs of English-English teams

• 32 picture identification tasks


Multiple translations helped

Available Translations   Time (sec)   Percent correct   Message count
Single                   177          83                8.80
Multiple 2               148          83                7.78
Multiple 3               130          86                7.74
English Control          109          94                7.74
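A simple way to summarize the timing column is to ask how much of the gap between the single-translation condition and the English control each multiple-translation condition closes. The sketch below computes that from the times in the table; the "gap closed" measure and the names `TIMES` and `gap_closed` are assumptions introduced here, not a statistic reported in the slides.

```python
# Mean times in seconds, copied from the "Multiple translations helped" table.
TIMES = {
    "Single": 177,
    "Multiple 2": 148,
    "Multiple 3": 130,
    "English Control": 109,
}


def gap_closed(condition_time, single_time, control_time):
    """Fraction of the single-to-control time gap removed by a condition.

    0.0 means no better than a single translation; 1.0 means as fast as
    the English control.
    """
    return (single_time - condition_time) / (single_time - control_time)


for condition in ("Multiple 2", "Multiple 3"):
    fraction = gap_closed(TIMES[condition], TIMES["Single"], TIMES["English Control"])
    print(f"{condition}: {fraction:.0%} of the gap to English control closed")
```

By this measure, the "Multiple 3" condition removes roughly two thirds of the extra time that a single translation adds over the English control.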


Translation server comparison

Initial Translation   View count   Time (sec)   Percent correct   Message count
BizLingo              1.83         132          79                7.41
Google                3.87         150          89                7.98


Multiple helps even the best

Available Translations   Translation Service   Time (sec)   Message count
Multiple                 BizLingo              135          7.43
Single                   BizLingo              172          8.46
Multiple                 Google                150          8
Single                   Google                194          9.49


Evaluation method sensitivity

• Task-based evaluation method is sensitive to MT engine differences

• But differences may actually be a good thing when multiple translations are made available


Conclusions

• IM is a good application for MT

• Task-based evaluation is effective

• Current application technologies do not support negotiated meaning

• Improvements are possible
– e.g. meta-buttons, multiple translations


Available Technology

• Web-based task presentation and data collection

• Multiple translation chat client (Flash app)


Acknowledgments

• ARL – John Warner, Melissa Holland

• MITRE – Rod Holland, Galen Williamson

• NMSU – Sieun An, Emily Chaffin, Yuki Ishikawa, Wanying Jin, Jong Hwan Kim, Yosip Kim, Roberto Montalvo, Jeff Sorge and Ron Zacharski.
