
WK46396 and WK46397

What’s behind two new ASTM work items,

Serge Gladkoff,

(GALA, Logrus International)

Chicago, November 5, 2014

The science of

Language Quality Assurance

LISA QA Model

SAE J2450

SDL TMS

Acrocheck

ApSIC XBench

CheckMate

QA Distiller

XLIFF:Doc

EN15038

{ …Proprietary metrics and

scorecards… }

…What is translation quality? All of them disagree

on what quality is. Would you dare to give a

universal definition, considering all these

authors had their own idea of what it is?

Quality

Definition?

SOLUTION: THE BOLZANO-WEIERSTRASS

METHOD

Divide the desert by a line running from north to

south. The lion is then either in the eastern or in the

western part. Let's assume it is in the eastern part.

Divide this part by a line running from east to west.

The lion is either in the northern or in the southern

part. Let's assume it is in the northern part. We can

continue this process arbitrarily, thereby

constructing with each step an increasingly narrow

fence around the selected area. The diameter of the

chosen partitions converges to zero so that the lion

is caged into a fence of arbitrarily small diameter.

PROBLEM

To Catch a Lion in the Sahara Desert.

A THEORY OF BIG GAME HUNTING

Means

A. Concentrating on factors

making strongest impression

B. Separating global (holistic)

and local issues, with the

former being typically more

important and playing bigger

role

GENERAL CONSIDERATIONS

Reflecting the perception

and priorities of the target

audience

Means

A. Covering the whole spectrum of

potential uses, subject areas, and

materials;

B. From slightly post-edited MT to ultra-

polished manual translations

C. Common approach

D. Same approach to technical materials

and marketing content

E. Only adjust acceptance criteria /

thresholds based on expectations


Universal applicability

We are all humans and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we are making our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and lack of ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, but with different expectations, and this approach applies to all other criteria.

Means

A. Should be clear, not overly

complicated

B. Should be process-friendly,

i.e. reasonably economical

and applicable to the real

world


Viability of methodology

Means

A. Concentrating on methodology rather than

particular cases/uses.

B. Issue typology is not an inalienable part of

the methodology, but rather an add-on

component. It can be based for instance on

MQM or other source, or legacy criteria,

including those used/provided by the client.

C. Weights assigned to particular issues are

expected to vary within a wide range

depending on the goals set, subject matter,

type of material, etc. Particular issues might

simply prove irrelevant for the job or area of

focus, which results in zero weights being

assigned to these issues.


Flexibility of approach

The client knows what types of issues are important to their content.

ENTIRETY OF IMPRESSION

The reader/consumer is primarily interested in the overall readability and adequacy of the whole piece, and only then in the readability of its parts (sentences).

TWO KEY FACTORS

ADEQUACY

READABILITY

THRESHOLD OF ACCEPTANCE

…is determined by usability expectations

The expectation of how readable and adequate the translated content should be determines the acceptable quality level for these two cornerstone factors.

GRADING

If a piece has a serious defect, it

has to be discarded without

wasting time on further analysis.

If the text is inadequate or

unreadable, it does not make

sense to count typos or see

whether the terminology is right.

Good stuff

Substandard

Acceptance threshold

Neither Readability nor Adequacy is 100% objective

The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately, on a PASS/FAIL basis. The logical thing to do is to establish an acceptance threshold that corresponds to the lower end of the statistical range.

How can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, there is a single reviewer who can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts?

The scale from 0 to 10

A smaller scale will not fit the bell curve

An important and direct consequence is that the scale used for holistic translation ratings should run at least from 0 to 10, and by no means be smaller.

Atomistic Quality

Fluency (mechanical)

Spelling

Style Guide

Typography

Grammar

Locale convention

….

Fluency (content)

Inconsistency

Idiomatic

Duplication

Ambiguity

Accuracy

Mistranslation

Omission

Addition

Untranslated

Printing

Copying

Color and black and white digital printing

Internationalization

Compatibility

(other)

Design

Global font choice

Headers and footers

Margins

Page break

Kerning

….

MQM Tree

Atomistic Quality

$$Q_A = \frac{\sum_{i=0}^{n} N_i \cdot W_i}{V}$$
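As a rough illustration of the atomistic formula above, the sketch below computes a weighted issue count divided by volume. The slide does not define its symbols, so reading N_i as the number of issues of type i, W_i as the weight of that type, and V as the volume of reviewed text (here, words) is an assumption, and the issue names, counts, and weights are hypothetical.

```python
def atomistic_score(issue_counts, weights, volume):
    """Return sum(N_i * W_i) / V for the given issue counts.

    issue_counts: mapping issue type -> number of occurrences (N_i, assumed)
    weights:      mapping issue type -> weight (W_i); missing types weigh 0
    volume:       size of the reviewed text, e.g. in words (V, assumed)
    """
    if volume <= 0:
        raise ValueError("volume must be positive")
    weighted_total = 0.0
    for issue, count in issue_counts.items():
        # An issue type that is irrelevant for the job gets a zero weight.
        weighted_total += count * weights.get(issue, 0.0)
    return weighted_total / volume


# Hypothetical review of a 1,000-word sample.
counts = {"spelling": 4, "mistranslation": 1, "kerning": 3}
weights = {"spelling": 1.0, "mistranslation": 5.0, "kerning": 0.0}
score = atomistic_score(counts, weights, volume=1000)
print(score)
```

With this reading, a lower value means fewer weighted issues per word; setting a weight to zero removes an issue type from the score without changing the typology.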

Quality

Triangle?

SHOWSTOPPER PROBLEM

..or quality

square!

There are things that you will

know when you see them…

Showstoppers…

Building the concrete LQA metrics


The methodology fully covers all types of translated content, including those produced using MT and/or MT + post-editing.

Applying LQA metrics

Applying it correctly

Three-dimensional

vector

Holistic

readability

Holistic

adequacy

Atomic

compound

detailed

metrics

Readability threshold

Pass/Fail

Adequacy threshold

Pass/Fail

Atomistic rating

Detailed score

Implementation keys

Holistic parameters cannot

be mixed

Only those materials that

pass HP are analyzed

further

Experts required to produce

precise and reliable

atomistic score

Select content to apply the

metrics
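The implementation keys above can be sketched as a gated evaluation: the two holistic ratings are checked separately against their own thresholds, and the atomistic score is computed only when both pass. This is a minimal sketch; the names `LqaResult` and `evaluate`, the use of `>=` for passing, and the example numbers are illustrative assumptions, not part of the methodology's text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LqaResult:
    readability_pass: bool
    adequacy_pass: bool
    atomistic_score: Optional[float]  # None when a holistic check fails


def evaluate(readability: float, adequacy: float,
             readability_threshold: float, adequacy_threshold: float,
             atomistic_fn: Callable[[], float]) -> LqaResult:
    # Holistic parameters are never averaged or otherwise mixed:
    # each one is compared with its own threshold.
    r_ok = readability >= readability_threshold
    a_ok = adequacy >= adequacy_threshold
    # The (expensive, expert-driven) atomistic review runs only for
    # material that passed both holistic checks.
    score = atomistic_fn() if (r_ok and a_ok) else None
    return LqaResult(r_ok, a_ok, score)


# Hypothetical ratings on a 0-10 scale with thresholds of 6.
rejected = evaluate(7, 4, 6, 6, atomistic_fn=lambda: 0.009)
accepted = evaluate(7, 8, 6, 6, atomistic_fn=lambda: 0.009)
print(rejected)
print(accepted)
```

The result keeps all three components side by side, which mirrors the idea of a three-dimensional vector rather than a single blended number.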

In the vast majority of real-life cases, nobody can afford the luxury of employing an expert panel to evaluate the translation quality of any particular document or web portal. LSPs typically have to use a single reviewer who only looks at a certain percentage of the content. To produce meaningful, reliable results despite this limitation, proper sampling must be done.

[Chart: sample size (words) and % to be checked, plotted against total volume (100 to 1,000,000 words) at a 95% confidence level, for confidence intervals (CI) of 0.125%, 0.25%, and 0.5%.]
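The slides do not say how the sample sizes in the chart were calculated. One common way to obtain such curves is Cochran's sample-size formula with a finite population correction; the sketch below uses it purely as an assumed stand-in, treats each word as a sampling unit, and takes the most conservative proportion p = 0.5. The function name and the example margins are hypothetical.

```python
import math


def sample_size(total_words, margin, z=1.96, p=0.5):
    """Words to check for a given margin of error (as a fraction).

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the most
    conservative assumption about the underlying proportion.
    """
    if total_words <= 0 or margin <= 0:
        raise ValueError("total_words and margin must be positive")
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: small volumes need a larger share.
    n = n0 / (1 + (n0 - 1) / total_words)
    return min(total_words, math.ceil(n))


for volume in (100, 1_000, 10_000, 100_000, 1_000_000):
    n = sample_size(volume, margin=0.05)
    print(volume, n, f"{100 * n / volume:.1f}%")
```

Whatever formula is actually used, the general shape is the same: the share of content to be checked falls as the total volume grows, while a tighter confidence interval pushes the required sample up.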

HOW FASTIDIOUS ARE YOU?

THE LUXURY OF FULL METRICS

Significant research

- Know your area

- Sustain R&D

- Develop metrics

- Develop processes

- How much and

what to QA?

Develop metrics

Professional LSP

Professional linguists

Provide training

Provide reference

materials

Build supply chain

Terminology

maintenance &

support

Translation Memory

maintenance and

management

Localization Quality

Assurance costs

Pay to apply

The scene of

PUBLIC SITE

CONSTRAINTS

A. Professional LQA would require a federal program to develop applicable LQA metrics, allocate funding, and book professional LQA with specially trained LSPs.

B. Yet there is still an acute need for it, as demonstrated by the CUIDADO DE SALUD web site of the Affordable Care Act.

C. A methodology for public LQA is very much needed.

D. There IS feedback on the site from the public; how is it to be handled?

THE NEED AND THE CONSTRAINTS

Executive Order 13166

http://www.justice.gov/crt/about/cor/13166.php

www.lep.gov

“requires Federal agencies to examine the services they provide, identify any need for services to those with limited English proficiency (LEP), and develop and implement a system to provide those services so LEP persons can have meaningful access to them”


Cannot be trained

Is not ready to spend

a lot of time

Opinionated by

definition

The Crowd

Is limited by volume

Is random by nature

Arbitrary issue

classification

Can be large in

number of reviewers

The Feedback

Using the statistical

approach to turn the

tables and gain in

another area what

we have lost

The Approach

CONSTRAINTS of public feedback

THE METRICS

1. Quality square approach

There MAY be showstopper errors.

2. The parameters are simplified (no detailed issue definitions)

No detailed atomistic quality issue definitions can be applied.

3. Each reviewer produces four ratings on a 0 – 10 scale

The “0 – 10” scale is the smallest one that can accommodate the bell curve.

(Each reviewer is asked to provide examples.)

4. The calibration:

(a) Showstopper: 0 = two or more major errors, 10 = no major errors

(b) Holistic readability (fluency): 0 = incomprehensible, 10 = a poem

(c) Holistic adequacy (accuracy): 0 = inadequate, 10 = perfectly conveying meaning

(d) Atomistic (small specific errors): 0 = full of small errors, 10 = completely error-free

For crowdsourced LQA, the atomistic quality category is not formalized in any way whatsoever.

THE PROCESS

1. LQA review scope is defined and briefly and clearly explained

To prevent reviewers from straying into other areas.

2. The content needs to be final

Updates and scope changes are outside of the scope of crowdsourced review.

3. Communication is done via a simple online portal

No bandwidth to manage the crowd manually.

4. Better if volunteers are language professionals

It would compensate for the lack of special training.

5. Proper sampling

No fewer than 10 reviewers for each area; the more, the better.

6. Proper processing

The results are manually vetted to remove outliers:

- discard outliers that come without explanation, as well as obvious reviewer errors

- are major errors statistically significant? 30% threshold instead of 5% is recommended.

- apply statistics to analyze results

…an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2, and Adequacy of 6.5 out of 10 with a standard deviation of 1.9.

The conclusion would be: the text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. Again, it is up to the expert doing the analysis to define the threshold, for example, that for this type of content a proper target for average readability is at least 8 out of 10.

…the adjusted value for technical errors of 4.7 out of 10 for the average atomistic quality rating, with a 2.4 standard deviation

Despite the fact that the review was by design a less than ideal, community feedback-based LQA, which resulted in rating inconsistency among reviewers, most reviewers found too many noticeable and annoying technical/minor mistakes in the text. This is reflected in the low average rating, which is unsatisfactory. Substantial remedial work is clearly called for in this area.
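The statistical step described above can be sketched as follows: average the crowd ratings for one criterion, report their standard deviation, and compare the average with a pass mark and an expert-defined target. The ratings in the example are invented for illustration (chosen to average 6.2) and are not the actual review data; the function name, the verdict labels, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def summarize(ratings, pass_mark=5.0, target=8.0):
    """Return (average, sample standard deviation, verdict) for 0-10 ratings.

    pass_mark: average must be above it for the text to count as acceptable
    target:    expert-defined goal for this type of content
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    avg = mean(ratings)
    sd = stdev(ratings)
    if avg <= pass_mark:
        verdict = "fail"
    elif avg < target:
        verdict = "acceptable, below target"
    else:
        verdict = "meets target"
    return avg, sd, verdict


# Hypothetical readability ratings from ten reviewers, after outliers
# have been vetted manually.
readability = [6, 8, 4, 7, 5, 9, 6, 3, 7, 7]
avg, sd, verdict = summarize(readability)
print(f"mean={avg:.1f} sd={sd:.1f} verdict={verdict}")
```

A large standard deviation relative to the scale is itself a finding: it signals the rating inconsistency among reviewers that a crowd-based review has to live with.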

THE ONLY PUBLIC LQA METRICS AVAILABLE

POSITIVES

• Both holistic measures can be relied upon with reasonable confidence

• Good overall assessment

• Allows showstoppers to be identified

• Good general idea of the level of technical errors

• Affordable and available for US federal agencies

Is it appropriate? CONTRA

• Only rough judgment

• Not a good quantitative assessment

• Not a complete roster of errors even in the selected sample

• No concrete process recommendations

YOU DECIDE!

MORE INFORMATION ABOUT MQM

http://www.qt21.eu/mqm-definition/definition-2014-08-19.html


MORE INFORMATION ABOUT METHODOLOGY

WK46397=The guide for LQA Methodology

WK46396=MQM

The Proposal

THANK YOU
