serge astm-presentation-chicago-2014-final
TRANSCRIPT
WK46396 and WK46397
What’s behind two new ASTM work items,
Serge Gladkoff,
(GALA, Logrus International)
Chicago, November 5, 2014
The science of
Language Quality Assurance
LISA QA Model
SAE J2450
SDL TMS
Acrocheck
ApSIC XBench
CheckMate
QA Distiller
XLIFF:Doc
EN15038
{ …Proprietary metrics and
scorecards… }
…What is translation quality? All of them disagree
on what quality is. Would you dare to give a
universal definition, considering that all these
authors had their own idea of what it is?
Quality
Definition?
SOLUTION: THE BOLZANO-WEIERSTRASS
METHOD
Divide the desert by a line running from north to
south. The lion is then either in the eastern or in the
western part. Let’s assume it is in the eastern part.
Divide this part by a line running from east to west.
The lion is either in the northern or in the southern
part. Let’s assume it is in the northern part. We can
continue this process arbitrarily, thereby
constructing with each step an increasingly narrow
fence around the selected area. The diameter of the
chosen partitions converges to zero so that the lion
is caged into a fence of arbitrarily small diameter.
PROBLEM
To Catch a Lion in the Sahara Desert.
A THEORY OF BIG GAME HUNTING
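The bisection described on this slide can be sketched in a few lines of Python. This is only an illustration of the halving idea: the function name cage_lion, the rectangular desert, and the use of the longer side as the "diameter" are assumptions made for the example and are not part of the presentation.

```python
def cage_lion(lion, desert, max_diameter):
    """Halve the region holding the lion until its longer side
    is no greater than max_diameter.

    lion   -- (x, y) position; x grows eastward, y grows northward
    desert -- (west, east, south, north) bounds of the starting region
    """
    west, east, south, north = desert
    x, y = lion
    cut_north_south = True  # the first cut runs from north to south
    steps = 0
    while max(east - west, north - south) > max_diameter:
        if cut_north_south:
            mid = (west + east) / 2
            if x >= mid:
                west = mid   # lion is in the eastern part
            else:
                east = mid   # lion is in the western part
        else:
            mid = (south + north) / 2
            if y >= mid:
                south = mid  # lion is in the northern part
            else:
                north = mid  # lion is in the southern part
        cut_north_south = not cut_north_south
        steps += 1
    return (west, east, south, north), steps


fence, steps = cage_lion((700.5, 300.25), (0.0, 1024.0, 0.0, 1024.0), 1.0)
print(fence, steps)
```

Each cut halves one side of the fence, so the number of steps grows only with the logarithm of the desert size.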
Means
A. Concentrating on factors
making strongest impression
B. Separating global (holistic)
and local issues, with the
former being typically more
important and playing bigger
role
GENERAL CONSIDERATIONS
Reflecting the perception
and priorities of the target
audience
Means
A. Covering the whole spectrum of
potential uses, subject areas, and
materials;
B. From slightly post-edited MT to ultra-
polished manual translations
C. Common approach
D. Same approach to technical materials
and marketing content
E. Only adjust acceptance criteria /
thresholds based on expectations
GENERAL CONSIDERATIONS
Universal applicability
We are all humans and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we are making our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and lack of ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, but with different expectations, and this approach applies to all other criteria.
Means
A. Should be clear, not overly
complicated
B. Should be process-friendly,
i.e. reasonably economical
and applicable to the real
world
GENERAL CONSIDERATIONS
Viability of methodology
Means
A. Concentrating on methodology rather than
particular cases/uses.
B. Issue typology is not an inalienable part of
the methodology, but rather an add-on
component. It can be based for instance on
MQM or other source, or legacy criteria,
including those used/provided by the client.
C. Weights assigned to particular issues are
expected to vary within a wide range
depending on the goals set, subject matter,
type of material, etc. Particular issues might
simply prove irrelevant for the job or area of
focus, which results in zero weights being
assigned to these issues.
GENERAL CONSIDERATIONS
Flexibility of approach
The client knows what types of issues are important to his content.
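The idea of per-issue weights, with zero weights for issues that are irrelevant to a job, can be sketched as follows. The issue names, the weight values, and the normalization per 1,000 words are all assumptions chosen for illustration; the methodology itself leaves the typology and the weights to the client and the job.

```python
def weighted_penalty(issue_counts, weights, word_count):
    """Return the weighted issue penalty per 1,000 words.

    issue_counts -- issues found in the sample, e.g. {"spelling": 4}
    weights      -- weight per issue type; a missing or zero weight
                    means the issue is irrelevant for this job
    word_count   -- number of words in the reviewed sample
    """
    total = 0.0
    for issue, count in issue_counts.items():
        total += weights.get(issue, 0.0) * count
    return 1000.0 * total / word_count


# Hypothetical review of a 2,000-word sample.
found = {"spelling": 4, "mistranslation": 1, "kerning": 3}

# Hypothetical weights: kerning does not matter for this job.
job_weights = {"spelling": 1.0, "mistranslation": 5.0, "kerning": 0.0}

print(weighted_penalty(found, job_weights, 2000))
```

Swapping in a different weight table changes the score without changing the method, which is the flexibility this slide describes.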
ENTIRETY OF IMPRESSION
Reader/consumer is primarily interested in overall readability and adequacy of the whole piece, and only then in readability of parts (sentences).
THRESHOLD OF ACCEPTANCE
…is determined by usability expectations
The expectation of how readable and adequate the translated content should be determines the acceptable quality level for these key cornerstone factors.
GRADING
If a piece has a serious defect, it
has to be discarded without
wasting time on further analysis.
If text is inadequate or
unreadable, it does not make
sense to count typos or see
whether the terminology is right.
Good stuff
Substandard
Acceptance threshold
Neither Readability nor Adequacy is 100% objective
The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately, on a PASS/FAIL basis. The logical thing to do is establish an acceptance threshold that would correspond to the lower end of the statistical range.
How can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, there is a single reviewer who can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts?
The scale from 0 to 10
The smaller scale will not fit the Bell curve
An important and direct consequence is that the scale used for holistic translation ratings should run at least from 0 to 10, and by no means be smaller.
Atomistic Quality
Fluency (mechanical)
Spelling
Style Guide
Typography
Grammar
Locale convention
….
Fluency (content)
Inconsistency
Idiomatic
Duplication
Ambiguity
Accuracy
Mistranslation
Omission
Addition
Untranslated
Printing
Copying
Color and black-and-white digital printing
Internationalization
Compatibility
(other)
Design
Global font choice
Headers and footers
Margins
Page break
Kerning
….
Building the concrete LQA metrics
The methodology fully covers all types of translated content, including those produced using MT and/or MT + post-editing.
Applying LQA metrics
Applying it correctly
Three-dimensional
vector
Holistic
readability
Holistic
adequacy
Atomic
compound
detailed
metrics
Readability threshold
Pass/Fail
Adequacy threshold
Pass/Fail
Atomistic rating
Detailed score
Implementation keys
Holistic parameters cannot
be mixed
Only those materials that
pass HP are analyzed
further
Experts required to produce
precise and reliable
atomistic score
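The three-component result and the gating rule on this slide can be sketched as below. The class and function names, and the idea of passing the atomistic review in as a callable, are assumptions made for this example; the thresholds are whatever the expert sets for the content type.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LqaResult:
    readability_pass: bool
    adequacy_pass: bool
    atomistic_score: Optional[float]  # None if the holistic gate failed


def evaluate(readability: float, adequacy: float,
             readability_threshold: float, adequacy_threshold: float,
             atomistic_review: Callable[[], float]) -> LqaResult:
    """Check the two holistic parameters separately (never mixed),
    and run the detailed atomistic review only if both pass."""
    r_ok = readability >= readability_threshold
    a_ok = adequacy >= adequacy_threshold
    score = atomistic_review() if (r_ok and a_ok) else None
    return LqaResult(r_ok, a_ok, score)


# Hypothetical ratings on the 0 to 10 scale with thresholds of 6.
print(evaluate(8, 4, 6, 6, lambda: 9.0))  # adequacy fails, no atomistic score
print(evaluate(8, 7, 6, 6, lambda: 9.0))  # both pass, atomistic review runs
```

Keeping the two holistic checks as separate booleans, rather than averaging them, reflects the rule that holistic parameters cannot be mixed.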
Select content to apply the
metrics
In the vast majority of real-life cases, nobody can afford the luxury of employing an expert panel to evaluate the translation quality of any particular document or web portal. LSPs typically have to use a single reviewer who only looks at a certain percentage of the content. To produce meaningful, reliable results despite this limitation, proper sampling must be done.
[Chart: sample size (words) and % of content to be checked versus total volume (100 to 1,000,000 words), at a 95% confidence level, for confidence intervals (CI) of 0.125%, 0.25%, and 0.5%]
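The slides do not state which formula was used for the sampling chart. As a rough sketch, one common choice is Cochran's sample-size formula with a finite population correction. The code below assumes that formula, treats each word as a sampling unit, and reads the "CI" figure as the margin of error; all three are assumptions, so the numbers it prints are illustrative and need not match the chart.

```python
import math


def sample_size(total_words, margin, z=1.96, p=0.5):
    """Cochran's formula with finite population correction.

    total_words -- total volume of the content, in words
    margin      -- margin of error as a fraction (0.005 means 0.5%)
    z           -- z-score; 1.96 corresponds to 95% confidence
    p           -- assumed proportion; 0.5 is the most conservative
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size
    n = n0 / (1 + (n0 - 1) / total_words)      # finite population correction
    return min(total_words, math.ceil(n))


for volume in (100, 1_000, 10_000, 100_000, 1_000_000):
    n = sample_size(volume, margin=0.005)
    print(f"{volume:>9,} words: check {n:>6,} words ({n / volume:.0%})")
```

Under these assumptions the share of content to check falls as the total volume grows, which is the general shape the chart shows.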
HOW FASTIDIOUS ARE YOU?
THE LUXURY OF FULL METRICS
Significant research
- Know your area
- Sustain R&D
- Develop metrics
- Develop processes
- How much and
what to QA?
Develop metrics
Professional LSP
Professional linguists
Provide training
Provide reference
materials
Build supply chain
Terminology
maintenance &
support
Translation Memory
maintenance and
management
Localization Quality
Assurance costs
Pay to apply
CONSTRAINTS
A. Professional LQA would require a global
federal program to develop applicable
LQA metrics, allocate funding, and book
professional LQA with specially trained
LSPs.
B. Yet, there’s still an acute need for that,
which has been demonstrated by the
CUIDADO DE SALUD web site of the
Affordable Care Act.
C. The methodology of public LQA is very
much needed.
D. There IS feedback on the site from
the public; how is it to be handled?
THE NEED AND THE CONSTRAINTS
Executive Order 13166
http://www.justice.gov/crt/about/cor/13166.php
www.lep.gov
“requires Federal agencies to examine the services they provide, identify any need for services to those with limited English proficiency (LEP), and develop and implement a system to provide those services so LEP persons can have meaningful access to them”
Cannot be trained
Is not ready to spend
a lot of time
Opinionated by
definition
The Crowd
Is limited by volume
Is random by nature
Arbitrary issue
classification
Can be large in
number of reviewers
The Feedback
Using the statistical
approach to turn the
tables and gain in
another area what
we have lost
The Approach
CONSTRAINTS of public feedback
THE METRICS
1. Quality square approach
There MAY be showstopper errors.
2. The parameters are simplified (no detailed issue definitions)
No detailed Atomistic quality issue definitions can be applied.
3. Each reviewer produces four ratings on 0 – 10 scale
“0 – 10” scale is the smallest one to accommodate the Bell Curve.
(Each reviewer is asked to provide examples.)
4. The calibration:
(a) Showstopper: 0 = two or more major errors, 10 = no major errors
(b) Holistic readability (fluency): 0 = incomprehensible, 10 = a poem
(c) Holistic adequacy (accuracy): 0 = inadequate, 10 = perfectly conveying meaning
(d) Atomistic (small specific errors): 0 = full of small errors, 10 = completely error-free
For crowdsourced LQA the atomistic quality category is not formalized in any way whatsoever.
THE PROCESS
1. LQA review scope is defined and briefly and clearly explained
To prevent reviewers from straying to other areas.
2. The content needs to be final
Updates and scope changes are outside of the scope of crowdsourced review.
3. Communication is done via simple online portal
No bandwidth to manage the crowd manually.
4. Better if volunteers are language professionals
It would compensate for the lack of special training.
5. Proper sampling
No fewer than 10 reviewers for each area; the more, the better.
6. Proper processing
The results are manually vetted to remove outliers:
- discard outliers without explanation and obvious reviewer errors
- are major errors statistically significant? A 30% threshold instead of 5% is recommended.
- apply statistics to analyze results
…an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2, and Adequacy of 6.5 out of 10 with a standard deviation of 1.9.
The conclusion would be: The text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. Again, it is up to the expert who is doing the analysis to define the threshold; for example, for this type of content a proper target for average readability is at least 8 out of 10.
…the adjusted value for technical errors of 4.7 out of 10 for the average atomistic quality rating, with a 2.4 standard deviation
Despite the fact that the review was by design a less than ideal
community feedback-based LQA, resulting in rating inconsistency
among reviewers, most reviewers found too many noticeable and annoying
technical/minor mistakes in the text, as reflected in the low
average rating, which is unsatisfactory. Substantial
remedial work is clearly called for in this area.
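The statistical step of the process above can be sketched as follows, applied to ratings that have already been vetted by hand. The function names and the sample ratings are invented for illustration and are not data from the review. The code also assumes that the 30% threshold refers to the share of reviewers who reported a major error, and it uses the pass mark of 5 and target of 8 from the example conclusion.

```python
from statistics import mean, stdev


def summarize(ratings):
    """Return (average, sample standard deviation), rounded to 0.1."""
    return round(mean(ratings), 1), round(stdev(ratings), 1)


def major_errors_significant(reported, threshold=0.30):
    """reported holds one boolean per reviewer: True if that reviewer
    reported a major error. Significant when the share of such
    reviewers reaches the threshold."""
    return sum(reported) / len(reported) >= threshold


def verdict(average, pass_mark=5.0, target=8.0):
    """Compare an average rating with the expert-defined levels."""
    if average <= pass_mark:
        return "fail"
    if average < target:
        return "pass, but below target"
    return "meets target"


# Invented readability ratings from ten reviewers (0 to 10 scale).
readability = [7, 5, 9, 6, 4, 8, 6, 3, 7, 7]
avg, sd = summarize(readability)
print(avg, sd, verdict(avg))

# Invented showstopper reports: four of ten reviewers flagged one.
print(major_errors_significant([True] * 4 + [False] * 6))
```

The removal of outliers is left out on purpose, since the process calls for manual vetting before any statistics are applied.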
THE ONLY PUBLIC LQA METRICS AVAILABLE
POSITIVES
• Both holistic measures can be relied upon with reasonable confidence
• Good overall assessment
• Makes it possible to identify showstoppers
• Good general idea of the level of technical errors
• Affordable and available for US federal agencies
Is it appropriate? CONTRA
• Only rough judgment
• Not a good quantitative assessment
• No complete roster of errors, even in the selected sample
• No concrete process recommendations
YOU DECIDE!
MORE INFORMATION ABOUT MQM
http://www.qt21.eu/mqm-definition/definition-2014-08-19.html