Machine Learning Quality Management Guideline extended abstract

Based on Japanese ver. 1.0.1 (rev. 37). English abstract 2020-09-29 (rev. 2a).

© National Institute of Advanced Industrial Science and Technology (AIST)

This is a summary of the Machine Learning Quality Management Guideline ver. 1.0.1, published in

Japanese on June 30, 2020 as AIST CPSEC technical report CPSEC-TR-2020001.

The original document is available from «https://www.cpsec.aist.go.jp/achievements/aiqm/».

The final version of the translated guideline is to be published in late 2020.

Translations of technical terms might change in the final version of the translated guideline.

1. Overview

This document establishes a basis for quality goals for machine learning-based

products/services, and provides procedural guidance for realizing quality through development

process management and system evaluations.

This document aims to enable providers of products and services to evaluate and improve the quality of their systems so as to reduce accidents and losses caused by AI malfunctions in society. Furthermore, it enables stakeholders to express the quality of their products using the provided norms, which can serve both commercial purposes (e.g. quoting prices for AI-based products) and social purposes (e.g. demonstrating their responsibility to society).

2. Scope

This document establishes required levels of quality goals and quality assurance aspects of a

software component using machine learning technology (hereafter “a machine learning

component”). It also mentions the quality-in-use of the whole product system which makes

use of machine learning components.

This document assumes the following layered model of system quality:

“Quality-in-use” is required by end users (or by society) and provided by system providers. It is realized by the “externally-visible quality properties” of the system, such as safety, dependability, and functional completeness.

The externally-visible quality properties required of a system or a component are supported by the internal quality properties that the component intrinsically holds. The component, in turn, may depend on the externally-visible quality of its sub-components.

The internal, intrinsic quality properties of a system component are evaluated by a variety of means, such as testing, verification, or process management.

In this document, we call the externally-visible quality properties required of a machine learning component the “quality goals”, and the internal, intrinsic quality properties of a machine learning component the “quality management targets”.

3. Quality goals of the machine learning component

In conventional systems, the quality requirement on most software components is “correctness with respect to the given design specifications”, whatever quality-in-use is required of the whole system. This is because most quality-in-use requirements are addressed at the system design stage, not at the implementation stage.

In contrast, implementors of machine learning components are in many cases required to consider quality aspects closely related to the quality-in-use.

This document identifies the following three properties as the quality goals specific to machine

learning components. It also defines “levels” of quality requirements for each property.

3-1 Safety

Safety, at the machine learning component level, is the property of reducing the possibility that a machine learning component generates undesirable, potentially harmful outputs. In contrast with the “AI performance” requirement below, it is not concerned with the quality of outputs that fall within an acceptable range.

[Figure: relations between product quality-in-use (e.g. safety, effectiveness, fairness, others), the quality goals that form the external quality of the ML component (safety/risk avoidance, AI performance, fairness), and the quality management aspects that realize them (completeness of domain analysis, coverage for distinguished problem cases, diversity of test data, well distribution of training data, accuracy, robustness, stability, dependability of underlying software). For general quality as software (e.g. security, reliability, maintainability) and for the quality of other components, the figure refers to other standards and guidelines, e.g. IEC 61508, ISO/IEC 15408.]

For the safety requirement, the document defines 7 levels of requirements (named AISL: AI

safety level): the top four levels (named AISL 4–AISL 1) correspond to the levels 4–1 of the safety

integrity level (SIL) defined in IEC 61508 series; the bottom three levels (named AISL 0.2, AISL

0.1 and AISL 0) correspond to the requirement strengths weaker than the defined SILs.
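As a minimal sketch (not part of the guideline), the seven levels can be represented as an ordered enumeration. The class name `AISL`, the helper `corresponding_sil`, and the numeric encoding are illustrative assumptions; only the level names and the correspondence of AISL 1 to 4 with SIL 1 to 4 come from the text above. The sketch also assumes that AISL 0.2 is stronger than AISL 0.1, which is stronger than AISL 0.

```python
from enum import Enum
from typing import Optional


class AISL(Enum):
    """AI safety levels; the numeric values are an illustrative encoding."""
    AISL_0 = 0.0
    AISL_0_1 = 0.1
    AISL_0_2 = 0.2
    AISL_1 = 1.0
    AISL_2 = 2.0
    AISL_3 = 3.0
    AISL_4 = 4.0


def corresponding_sil(level: AISL) -> Optional[int]:
    """Return the IEC 61508 SIL for AISL 1 to 4, or None for the weaker levels."""
    return int(level.value) if level.value >= 1 else None


print(corresponding_sil(AISL.AISL_3))    # 3
print(corresponding_sil(AISL.AISL_0_2))  # None
```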

3-2 AI performance

AI performance is the property of a machine learning component achieving high values of the performance indicators defined by the application specification; it does not directly address worst-case outputs. In typical applications, safety and AI-performance requirements may both apply at the same time.

For the AI-performance requirement, the document defines three levels (named AIPL: AI performance level). AIPL 2 means that the AI-performance requirement is to be strongly guaranteed, for example by contract or by financial means. AIPL 1 means that the requirement is on a best-effort basis: minor shortfalls will not cause major damage to the system's quality-in-use. AIPL 0 does not specify any such requirement.

3-3 Fairness

Fairness requirements on machine learning components are a special case: because fairness can be evaluated only statistically, not logically, it cannot be guaranteed in the design phase alone.

Most fairness requirements on components derive from ethical requirements on the whole system. This document does not state what kind of equality requirement is ethically favorable for a system using machine learning technology: that is outside the scope of this document and is to be determined separately. Instead, we focus on how to realize and assure the satisfaction of whatever requirements have been determined.

For fairness, the quality goal levels defined by the document (named AIFL: AI fairness level) are analogous to those of AIPL.

3-4 Other possible quality goals

Because a machine learning component is still a software component, various other quality aspects may be required as well. Most of these aspects are to be treated in the same way as for conventional software components.

The document includes some remarks on aspects such as security, privacy, and resistance to attack.

4. Quality management targets

Given a quality level for each of the three quality goals, the required quality is to be achieved through the development process.

For quality management and assessment during the development process, the document establishes eight quality management targets to be checked.

For each quality management target, the document specifies a set of abstract-level requirements for each quality goal level, and summarizes guidance on technical methods for achieving the goal.

[Figure: overview of the eight quality management targets]

Data quality design:
Prop. 1: Domain analysis. To determine the operational input conditions of the ML component: identify the expected range of inputs; provide a concrete notion of conditions, e.g. using data labels; distinguish between unsupported and rare conditions.
Prop. 2: Case coverage. To identify the combinations of conditions used for data quality management: exhaustively cover high-risk combinations of conditions; limit the total number of combinations to a tractable one.

Data quality:
Prop. 3: Data coverage. To ensure that good data is available for each identified combination of conditions: a sufficient amount of data; unbiased data within each combination. This ensures good training for important conditions (such as risk conditions).
Prop. 4: Data distribution. To ensure a good distribution of data over the whole domain of input data, so as to achieve a model with good performance.
(Compromise or balancing is needed between Prop. 3 and Prop. 4.)

Model quality:
Prop. 5: Accuracy. Good outputs on data points available in the training/test datasets; evaluated directly by test results.
Prop. 6: Robustness. Good, stable outputs on points not available in the training/test datasets; evaluated directly by numerical analysis, or indirectly through management of test methods.

Prop. 7: Software dependability. The quality of the software other than the trained model is ensured.
Prop. 8: Quality stability. The quality at the start of operation is to be maintained during long operations.

Note: the labels for eight properties are shortened: refer to the main text for the full titles.

4-1 Completeness of problem domain analysis

The first target, “completeness of problem domain analysis”, concerns how well the surrounding environment and situations from which input data originate are analyzed and understood conceptually. The analysis results should cover all situations expected at run time and should be formalized in a clean, comprehensible way, e.g. using a feature tree. The analysis should be detailed enough to distinguish situations that lead to different final outputs, and situations that differ in risk criticality with respect to a possible failure to meet the required quality goals.

(example) Car-driving applications should distinguish night-time from daylight and rainy conditions from dry conditions. For that purpose, weather, time, and other application-specific attributes should be taken into account.
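The car-driving example can be sketched as a small feature tree. This is an illustration only: the tree layout, the attribute names and values, and the helpers `leaf_attributes` and `enumerate_situations` are assumptions made here, not definitions taken from the guideline.

```python
from itertools import product

# Hypothetical feature tree for a car-driving application.
FEATURE_TREE = {
    "environment": {
        "time": ["daylight", "night-time"],
        "weather": ["dry", "rainy"],
    },
}


def leaf_attributes(tree, prefix=""):
    """Flatten the tree into {"path/to/attribute": [value, ...]}."""
    leaves = {}
    for name, node in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(node, dict):
            leaves.update(leaf_attributes(node, path))
        else:
            leaves[path] = list(node)
    return leaves


def enumerate_situations(tree):
    """List every combination of leaf attribute values as a dict."""
    leaves = leaf_attributes(tree)
    return [dict(zip(leaves, values)) for values in product(*leaves.values())]


situations = enumerate_situations(FEATURE_TREE)
print(len(situations))  # 4
```

Writing the analysis down in a machine-readable form like this makes it possible to check later steps (case selection, data counting) against the same attribute definitions.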

4-2 Coverage for distinguished problem cases

Combining the situations identified in the domain analysis above often causes a combinatorial explosion. In that case, the number of distinguished cases should be reduced to a tractable number while still covering critical conditions well.

This kind of technique is often used in the test design of conventional software.

(example) In car-driving applications, it might be hard to collect data for all applicable combinations of attributes such as “daytime, rainy, winter, highway, sun-facing, light in weight, etc.”

However, we should use data corresponding to every high-risk combination of situations, such as “snowy winter”, “nighttime rainy”, “summer sun-facing”, “rainy and heavy highway”, etc.
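A rough sketch of this reduction: instead of enumerating every full combination, high-risk cases are written as partial assignments that leave the remaining attributes free. The attribute values, the two case patterns, and the function names below are hypothetical and chosen only to mirror the example above.

```python
from math import prod

# Hypothetical attributes and values, for illustration only.
ATTRIBUTES = {
    "time": ["daytime", "nighttime"],
    "weather": ["dry", "rainy", "snowy"],
    "season": ["spring", "summer", "autumn", "winter"],
    "road": ["highway", "urban"],
}

# Distinguished high-risk cases as partial assignments: attributes that a
# pattern does not mention may take any value.
HIGH_RISK_CASES = {
    "snowy winter": {"weather": "snowy", "season": "winter"},
    "nighttime rainy": {"time": "nighttime", "weather": "rainy"},
}


def full_combination_count(attributes):
    """Number of full combinations if every attribute value were crossed."""
    return prod(len(values) for values in attributes.values())


def matching_cases(situation, cases):
    """Names of the distinguished cases that a concrete situation falls into."""
    return [name for name, pattern in cases.items()
            if all(situation.get(key) == value for key, value in pattern.items())]


print(full_combination_count(ATTRIBUTES))  # 48
situation = {"time": "nighttime", "weather": "rainy",
             "season": "summer", "road": "highway"}
print(matching_cases(situation, HIGH_RISK_CASES))  # ['nighttime rainy']
```

Even this toy domain has 48 full combinations, while the two patterns give only two cases to manage; which patterns count as high-risk is a judgment that comes from the risk analysis, not from the code.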

4-3 Diversity of test data

Given the problem domain design and the considerations in the above two properties, the test dataset (and often also the training dataset) should contain enough data to cover every distinguished case, and the data should be sufficiently diverse within the real-world area of each distinguished case.

This is critical to ensure that the machine learning component is well trained for all identified risk situations.
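One simple way to check the "enough data for every distinguished case" part is to count labeled samples per case. The sketch below assumes that each sample carries attribute labels and that a minimum count per case has been chosen; the function name `coverage_gaps`, the case patterns, and the threshold are hypothetical. It checks only the amount of data, not the diversity within a case.

```python
from collections import Counter


def coverage_gaps(samples, cases, min_count):
    """Return the names of cases with fewer than min_count matching samples."""
    counts = Counter()
    for sample in samples:
        for name, pattern in cases.items():
            if all(sample.get(key) == value for key, value in pattern.items()):
                counts[name] += 1
    return [name for name in cases if counts[name] < min_count]


# Hypothetical distinguished cases and attribute-labeled test samples.
cases = {
    "snowy winter": {"weather": "snowy", "season": "winter"},
    "nighttime rainy": {"time": "nighttime", "weather": "rainy"},
}
samples = [
    {"time": "nighttime", "weather": "rainy", "season": "summer"},
    {"time": "nighttime", "weather": "rainy", "season": "winter"},
    {"time": "daytime", "weather": "dry", "season": "winter"},
]

print(coverage_gaps(samples, cases, min_count=2))  # ['snowy winter']
```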

4-4 Well-distribution of training data

In addition to the above, training data should be well distributed, i.e. unbiased with respect to real-world input. This is often said to be one of the basic principles of good machine learning, but it can cause a problem, especially when a rarely occurring but critically important case exists among the possible inputs. If we prepare enough data for such a rare critical case, the dataset will be biased toward that particular case. If we do not, the machine learning component might not be well trained for such rare cases. These two demands conflict, and some compromise is needed.
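The tension can be made concrete with a little arithmetic. The sketch below shows one common compromise, per-sample reweighting, which is named here as an illustration and is not a method prescribed by the guideline. The real-world rate and the sample counts are made-up numbers.

```python
def dataset_share(n_rare, n_common):
    """Fraction of the dataset taken up by the rare case."""
    return n_rare / (n_rare + n_common)


def reweighting_factors(real_rare_rate, n_rare, n_common):
    """Per-sample weights that restore the real-world proportions."""
    share = dataset_share(n_rare, n_common)
    w_rare = real_rare_rate / share
    w_common = (1 - real_rare_rate) / (1 - share)
    return w_rare, w_common


# Hypothetical numbers: the rare case occurs in 0.1% of real inputs, but
# 500 of 10,000 training samples were collected for it.
share = dataset_share(500, 9500)
w_rare, w_common = reweighting_factors(0.001, 500, 9500)
print(share)             # 0.05, i.e. 50 times the real-world rate
print(round(w_rare, 4))  # 0.02
```

With these weights the rare case still contributes 500 distinct samples to training, while the weighted proportions match the assumed real-world rate. Whether such weighting is appropriate depends on the application and on how critical the rare case is.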

4-5 Correctness of trained model

Given input data whose distribution is well designed with regard to the above two properties, the machine learning model should produce satisfactory outputs for a sufficiently large proportion of the prepared test inputs.

This property also assumes (implicitly) that the training and test datasets do not contain a critical amount of ill-valued data.

4-6 Robustness of trained model

Furthermore, because the amount of data in the datasets is quite limited compared to the hypothetically possible real-world inputs, we should separately ensure that the model behaves well even for input data points not contained in any dataset. Problems related to this property include overfitting, but are not limited to it.
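A crude empirical probe of this property is to perturb known inputs slightly and see whether the predicted label changes. The sketch below is an assumption-laden illustration: `stability_rate`, `toy_model`, the perturbation size, and the number of trials are all hypothetical, and random sampling of this kind gives evidence only, not a guarantee.

```python
import random


def stability_rate(model, inputs, epsilon, trials=20, seed=0):
    """Fraction of inputs whose predicted label stays unchanged when every
    feature is randomly perturbed by at most epsilon, over several trials."""
    rng = random.Random(seed)
    stable = 0
    for x in inputs:
        base = model(x)
        if all(model([v + rng.uniform(-epsilon, epsilon) for v in x]) == base
               for _ in range(trials)):
            stable += 1
    return stable / len(inputs)


def toy_model(x):
    """Stand-in classifier: thresholds the first feature at 0.5."""
    return int(x[0] > 0.5)


# Both points lie far from the decision boundary relative to epsilon.
print(stability_rate(toy_model, [[0.1], [0.9]], epsilon=0.05))  # 1.0
```

Inputs close to the decision boundary lower the rate, which is the kind of signal this property is concerned with.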

4-7 Dependability of underlying software system

In addition to the prepared datasets and the trained model, both the training software and the inference (decision-making) software should be dependable. Machine learning is tricky in that a training process sometimes hides bugs in the underlying software infrastructure by adjusting the trained model to the misbehavior. Therefore, separate assurance of software quality is quite important.

4-8 Stability/maintainability of quality

Machine learning-based systems will lose their expected quality levels during continuous operation, either through changes in the nature of the input environment or through insufficient analysis coverage of the input situations at initial design time. In most machine learning applications, continuous monitoring and continuous relearning are crucial for long-running operation. At the same time, runtime updates often cause problems (e.g. degraded performance), so the quality of updates also needs systematic monitoring and handling.

The document provides considerations for both on-line updates and off-line updates.
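As a minimal monitoring sketch, one can track accuracy over a sliding window of labeled outcomes and flag a drop below the quality measured at release. The class name `AccuracyMonitor`, the baseline, the tolerance, and the window size are hypothetical choices; the sketch also assumes that ground-truth labels become available during operation, which is not always the case.

```python
from collections import deque


class AccuracyMonitor:
    """Flags when windowed accuracy falls below baseline minus tolerance."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline
        self.tolerance = tolerance
        self.results = deque(maxlen=window)

    def record(self, correct):
        """Record one labeled outcome; return True if a review is suggested."""
        self.results.append(bool(correct))
        if len(self.results) < self.results.maxlen:
            return False  # not enough observations yet
        accuracy = sum(self.results) / len(self.results)
        return accuracy < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.9, tolerance=0.05, window=4)
flags = [monitor.record(outcome) for outcome in [True, True, True, True, False]]
print(flags)  # [False, False, False, False, True]
```

The same check can be run on a candidate model before an update is deployed, so that the update itself is monitored as well.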

5. Process model

The document assumes a hypothetical, abstract AI development process model as a reference. It abstracts the development process as a hybrid of waterfall development and iterative agile development.

The model construction phase, which corresponds to the implementation design and coding phases in conventional software development, is modeled as an iterative process. Quality design and quality checkpoints are inserted before and after the model construction phase, similarly to the waterfall development process model.

The proof of concept (PoC) phase is also explicitly modeled on the “source side” of the waterfall, while continuous learning and management are on the “sink side”.

[Figure: reference development process model. The whole lifecycle is in the waterfall style, with modifications including updates to system requirements. Boxes in the figure: proof of concept; system definition; risk analysis; requirements definition; data planning and preparation; iterative development (data preparation, machine learning training, training experiments, result evaluation); quality check; product test; DevOps and continuous training (operation, monitoring, additional data acquisition, continuous training, system updating); abandonment planning.]

6. Relations to other documents

The EU, the OECD, the Japanese government, and others have published several documents on principles for the ethics of artificial intelligence. This document aims to provide a basis for achieving, by technical and engineering means, the social-level requirements specified in those documents.

The QA4AI consortium in Japan published its “AI product quality assurance guideline” in 2019. From our point of view, their guideline is complementary to ours and is beneficial mainly for AI engineers in understanding their problems and finding appropriate technologies to solve them. It roughly corresponds to Section 7 (Guidance for the adoption of quality management technology) of the full version of our guideline. Our guideline is mainly targeted at the entities (companies) that plan to produce AI-based products and supply them to society, and supports their quality management activities through a systematic process.

(attached: the table of contents, translated to English, of the original Japanese guideline)

Machine Learning Quality Management Guideline, version 1.0.1

Table of Contents (translated)

Page numbers are those for the Japanese version of Machine Learning Quality Management

Guideline, revision 1.0.1.37, published on June 30, 2020.

1 Summary of guideline .......................................................................................................................................... 1

1.1 Purpose and background of guideline ................................................................................................ 1

1.2 Use of this guideline .................................................................................................................................... 2

1.3 Issues on the quality management of machine learning systems ........................................... 4

1.3.1 Importance of environment analysis ......................................................................................... 4

1.3.2 Continuous assessment of risks ................................................................................................... 5

1.3.3 Quality management depending on data ................................................................................. 6

1.4 Basic concept of quality management ................................................................................................. 6

1.5 External quality properties as quality goals .................................................................................. 10

1.5.1 Risk avoidance .................................................................................................................................. 11

1.5.2 AI performance ................................................................................................................................. 12

1.5.3 Fairness ................................................................................................................................................ 12

1.6 Treatments of other “AI quality” aspects ........................................................................................ 13

1.6.1 Security and privacy ....................................................................................................................... 14

1.6.2 Resistance to attacks ...................................................................................................................... 14

1.6.3 Ethics and other social aspects .................................................................................................. 15

1.6.4 Complexity of external environments .................................................................................... 15

1.7 Internal quality properties as quality management target ..................................................... 16

1.7.1 Completeness of problem domain analysis .......................................................................... 18

1.7.2 Coverage for distinguished problem cases ........................................................................... 19

1.7.3 Diversity of test data ...................................................................................................................... 20

1.7.4 Distribution of training data ....................................................................................................... 21

1.7.5 Accuracy of trained model ........................................................................................................... 23

1.7.6 Robustness of trained model ...................................................................................................... 23

1.7.7 Dependability of underlying software .................................................................................... 23

1.7.8 Stability/maintainability of quality ......................................................................................... 24

1.8 Thoughts on development processes ............................................................................................... 24

1.8.1 Iterative training and quality management lifecycle ........................................................ 24

1.8.2 Work divisions and development process managements .............................................. 27

1.9 Relations to other guidelines and principles ................................................................................. 28

1.9.1 “Social principle on human-centric AI” (Japan) ................................................................. 28

1.9.2 Overseas/international principles and guidelines about AI technology ................... 29

1.10 Structure of this guideline ..................................................................................................................... 29

2 Introduction .......................................................................................................................................................... 31

2.1 Scope of guideline ..................................................................................................................................... 31

2.1.1 Target products and systems ...................................................................................................... 31

2.1.2 Target of quality management ................................................................................................... 31

2.1.3 Extent of quality management ................................................................................................... 32

2.2 Relations to existing standards about system quality ............................................................... 33

2.2.1 Security standard: ISO/IEC 15408 ........................................................................................... 33

2.2.2 Software quality models: ISO/IEC 25000 series ................................................................ 33

2.3 Definitions of terms.................................................................................................................................. 34

2.3.1 Terms on machine learning system structure ..................................................................... 34

2.3.2 Terms on development stakeholders and their roles ...................................................... 36

2.3.3 Terms on quality concepts ........................................................................................................... 37

2.3.4 Terms on development process ................................................................................................ 39

2.3.5 Terms on usage environments ................................................................................................... 40

2.3.6 Terms related to data for machine learning ......................................................................... 41

2.3.7 Other terms ........................................................................................................................................ 43

3 Level assignments for external quality of machine learning systems .......................................... 45

3.1 Risk avoidance ............................................................................................................................................ 45

3.2 AI performance .......................................................................................................................................... 46

3.3 Fairness ......................................................................................................................................................... 47

4 Reference model for the development process of systems using machine learning ............. 49

4.1 Proof of Concept phase ........................................................................................................................... 49

4.1.1 Handling of PoC with trial operations .................................................................................... 49

4.2 Main development phase ....................................................................................................................... 50

4.2.1 Model construction phase ........................................................................................................... 51

4.2.2 System construction and testing phase ................................................................................. 56

4.3 Deployment, operation and monitoring phase ............................................................................ 57

5 Procedure for applying the guideline ......................................................................................................... 58

5.1 Basic process ............................................................................................................................................... 58

5.1.1 Identifying functions of ML components in the whole system .................................... 58

5.1.2 Assigning external quality levels for ML components ..................................................... 59

5.1.3 Determining requirement levels for internal quality aspects ...................................... 60

5.1.4 Realizing internal quality ............................................................................................................. 60

5.2 (informative) Outsourcing AI developments ................................................................................ 60

5.2.1 Handling of exploratory approaches ....................................................................................... 61

5.2.2 Clarifications of actions in each development step ........................................................... 62

5.2.3 Considerations on the division of works ............................................................................... 64

5.3 (informative) Considerations on multi-step (differential) development .......................... 65

6 Requirements for quality assessment ........................................................................................................ 68

6.1 Completeness of problem domain analysis ................................................................................... 68

6.1.1 Basic concept ..................................................................................................................................... 68

6.1.2 Guidance .............................................................................................................................................. 69

6.1.3 Requirements for assigned quality levels ............................................................................. 71

6.2 Coverage for distinguished problem cases ..................................................................................... 73

6.2.1 Basic concept ..................................................................................................................................... 73

6.2.2 Guidance .............................................................................................................................................. 74


6.3 Diversity of test data ................................................................................................................................ 76

6.3.1 Basic concept ..................................................................................................................................... 76

6.3.2 Guidance .............................................................................................................................................. 76


6.4 Distribution of training data ................................................................................................................. 78

6.4.1 Basic concept ..................................................................................................................................... 78

6.4.2 Guidance .............................................................................................................................................. 79


6.5 Accuracy of trained model .................................................................................................................... 80

6.5.1 Basic concept ..................................................................................................................................... 80

6.5.2 Guidance .............................................................................................................................................. 81


6.6 Robustness of trained model ............................................................................................................... 83

6.6.1 Basic concept ..................................................................................................................................... 83

6.6.2 Guidance .............................................................................................................................................. 83


6.7 Dependability of underlying software ............................................................................................. 85

6.7.1 Basic concept ..................................................................................................................................... 85

6.7.2 Guidance .............................................................................................................................................. 86


6.8 Stability/maintainability of quality ................................................................................................... 87

6.8.1 Basic concept ..................................................................................................................................... 87

6.8.2 Guidance .............................................................................................................................................. 89


7 Guidance for the adoption of quality management technology ...................................................... 93

7.1 Completeness of problem domain analysis ................................................................................... 93

7.1.1 General provisions .......................................................................................................................... 93

7.1.2 Presumption of risk factors on input side ............................................................................. 94

7.1.3 Estimation of the output-side structure of data ................................................................. 95

7.2 Coverage for distinguished problem cases ..................................................................................... 96

7.2.1 General provisions .......................................................................................................................... 96

7.3 Diversity of test data ................................................................................................................................ 97

7.3.1 Considerations on data acquisition ......................................................................................... 97

7.3.2 Addition of test data at data preprocess phase .................................................................. 98

7.3.3 Addition of test data at testing phase ..................................................................................... 98

7.4 Distribution of training data ................................................................................................................. 98

7.5 Accuracy and Robustness of models ................................................................................................. 98

7.5.1 Software testing for machine learning components ........................................................ 99

7.5.2 Technology for stability evaluation and improvements .............................................. 101

7.6 (skipped).................................................................................................................................................... 106

7.7 Dependability of underlying software .......................................................................................... 106

7.7.1 General provisions ....................................................................................................................... 106

7.7.2 Quality management of open source software ................................................................ 107

7.7.3 Tracking of software configuration and tracing of bug information ...................... 107

7.7.4 Possible detection of software malfunction using testing techniques ................... 107

7.7.5 Software updates and their risks ........................................................................................... 108

7.7.6 References ....................................................................................................................................... 108

7.8 Stability/maintainability of quality ................................................................................................ 109

7.8.1 Monitoring ....................................................................................................................................... 109

7.8.2 Detection of concept drifts ....................................................................................................... 110

7.8.3 Retraining ........................................................................................................................................ 111

7.8.4 Preparation of additional training data .............................................................................. 112

8 (informative) Related documents .............................................................................................................. 114

8.1 Relations to/from other guidelines ................................................................................................ 114

8.1.1 AI contractors’ guideline from METI, Japan ...................................................................... 114

8.1.2 Guideline published by QA4AI consortium ....................................................................... 115

8.2 International standardization activities related to AI quality ............................................. 118

8.2.1 Quality, safety ................................................................................................................................. 118

8.2.2 Transparency ................................................................................................................................. 119

8.2.3 Fairness, bias .................................................................................................................................. 119

8.2.4 Other ethical aspects ................................................................................................................... 120

9 (informative) Analyses ................................................................................................................................... 121

9.1 Analysis of internal quality aspects for risk avoidance and AI performance ............... 121

9.2 Quality level assignments for AI performance .......................................................................... 122

9.3 Considerations on fairness properties .......................................................................................... 123

10 Figures and tables ............................................................................................................................................ 124

10.1 Corresponding chart between external/internal quality properties ............................... 124
