Machine Learning Quality Management Guideline extended abstract
Based on Japanese ver. 1.0.1 (rev. 37) English abstract 2020-09-29 (rev. 2a)
© National Institute of Advanced Industrial Science and Technology (AIST)
This is a summary of the Machine Learning Quality Management Guideline ver. 1.0.1, published in
Japanese on June 30, 2020 as AIST CPSEC technical report CPSEC-TR-2020001.
The original document is available from «https://www.cpsec.aist.go.jp/achievements/aiqm/».
The final version of the translated guideline is to be published in late 2020.
Translations of technical terms might change in the final version of the translated guideline.
1. Overview
This document establishes a basis for quality goals for machine learning-based
products/services, and provides procedural guidance for realizing quality through development
process management and system evaluations.
This document aims to enable providers of products and services to evaluate and improve the
quality of their systems so as to reduce accidents and losses caused by AI malfunctions in
society. Furthermore, it enables stakeholders to express the quality of their products using the
provided norms, which can serve both commercial purposes (e.g. quoting prices for AI-based
products) and social purposes (e.g. expressing their responsibility to society).
2. Scope
This document establishes required levels of quality goals and quality assurance aspects of a
software component using machine learning technology (hereafter “a machine learning
component”). It also mentions the quality-in-use of the whole product system which makes
use of machine learning components.
This document assumes the following layered model of system quality:
• “Quality-in-use” is required by end users (or society) and provided by system providers. It is
realized by the “externally-visible quality properties” of the system, such as safety,
dependability, and functional completeness.
• The externally-visible quality properties required of a system or a component are supported by
the internal quality properties that the component intrinsically holds. The component, in turn,
may depend on the externally-visible quality of its sub-components.
• The internal, intrinsic quality properties of a system component are evaluated by various
means, such as testing, verification, or process management.
In this document, we call the externally-visible quality properties required of a machine learning
component the “quality goals”. We call the internal, intrinsic quality properties of a machine
learning component the “quality management targets”.
3. Quality goals of the machine learning component
In conventional systems, the quality requirement on most software components is “correctness
with respect to the given design specifications”, whatever quality-in-use is required of the whole
system. This is because most quality-in-use requirements are considered at the system
design stage, not at the implementation stage.
In contrast, implementors of machine learning components are in many cases required to
consider quality aspects closely related to the quality-in-use.
This document identifies the following three properties as the quality goals specific to machine
learning components. It also defines “levels” of quality requirements for each property.
3-1 Safety
Safety, at the machine learning component level, is the property of reducing the possibility that
a machine learning component generates undesirable, possibly harmful outputs. In contrast with
the “AI performance” requirement below, it is not concerned with the quality of outputs that fall
within an acceptable range.
[Figure: layered quality model. The product quality-in-use (example: safety, effectiveness,
fairness, others) requires the quality goals, i.e. the external quality of the ML component
(safety/risk avoidance, AI performance, fairness). The quality goals are realized by the quality
management aspects: completeness of domain analysis, coverage for distinguished problem cases,
diversity of test data, well-distribution of training data, accuracy, robustness, stability, and
dependability of underlying software. General quality as software (e.g. security, reliability,
maintainability) and the quality of other components refer to other standards/guidelines,
e.g. IEC 61508, ISO/IEC 15408.]
For the safety requirement, the document defines seven levels of requirements (named AISL: AI
safety level). The top four levels (AISL 4–AISL 1) correspond to levels 4–1 of the safety
integrity level (SIL) defined in the IEC 61508 series; the bottom three levels (AISL 0.2, AISL
0.1 and AISL 0) correspond to requirement strengths weaker than the defined SILs.
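The level scheme above can be written down as a small lookup, sketched here in Python. The class and function names and the numeric encoding of the levels are assumptions made for illustration only; the guideline itself defines the levels in prose.

```python
from enum import Enum
from typing import Optional


class AISL(Enum):
    """AI safety levels, listed from the weakest to the strongest requirement.

    The float values are an assumed encoding chosen for this sketch.
    """
    AISL_0 = 0.0
    AISL_0_1 = 0.1
    AISL_0_2 = 0.2
    AISL_1 = 1.0
    AISL_2 = 2.0
    AISL_3 = 3.0
    AISL_4 = 4.0


def corresponding_sil(level: AISL) -> Optional[int]:
    """Return the IEC 61508 SIL (1 to 4) matching an AISL.

    Returns None for AISL 0, 0.1 and 0.2, which are weaker than any defined SIL.
    """
    return int(level.value) if level.value >= 1 else None
```

For example, corresponding_sil(AISL.AISL_3) gives 3, while corresponding_sil(AISL.AISL_0_2) gives None.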
3-2 AI performance
AI performance is the property of raising the value that a machine learning component achieves on
the performance indicators defined by an application specification; it does not directly consider
worst-case outputs. In typical applications, both safety and AI-performance
requirements may apply at the same time.
For the AI-performance requirement, the document defines three levels (named AIPL: AI
performance level). AIPL 2 means that the AI-performance requirement is to be strongly
guaranteed, for example by contract or by financial means. AIPL 1 means that the requirement is
on a best-effort basis: a small shortfall will not cause large damage to the quality-in-use of the
system. AIPL 0 does not specify any such requirement.
3-3 Fairness
A fairness requirement on machine learning components is rather special: because it cannot be
evaluated logically, but only statistically, fairness cannot be guaranteed in the design phase alone.
Most fairness requirements on the components derive from ethical requirements on the whole
system. This document does not state what kind of equality requirement is ethically
favorable for a system using machine learning technology: that is out of the scope of this
document and is to be determined separately. Instead, we focus on how to realize and assure the
satisfaction of whatever requirements have been determined.
For fairness, the quality goal levels defined by the document (named AIFL: AI fairness level)
are analogous to those of AIPL.
3-4 Other possible quality goals
Since a machine learning component is, after all, a software component, various other quality
aspects may be required as well. Most of these aspects are to be treated in the same way as for
conventional software components.
The document includes some remarks on aspects such as security, privacy, and resistance to attack.
4. Quality management targets
Given a quality level for each of the three quality goals, the required quality is to be achieved
through the development process.
For quality management and assessment during the development process, the document
establishes eight quality management targets to be checked.
For each quality management target, the document specifies a set of abstract-level requirements
for each quality goal level, and summarizes guidance on the technologies and methods available
to achieve the goal.
[Figure: overview of the eight quality management targets, with panels labeled Data Quality
Design, Data Quality, and Model Quality. The schematic grids of the recognition target (cells A
to I) are omitted here; the panel texts follow.]
Prop. 1: Domain Analysis. To determine the operational input conditions of the ML component:
• Identify the expected range of inputs
• Provide a concrete notion of conditions, e.g. using data labels
• Distinguish between unsupported and rare conditions
Prop. 2: Case Coverage. To identify the combinations of conditions used for data quality management:
• Exhaustively cover high-risk combinations of conditions
• Limit the total number of combinations to a tractable one
Prop. 3: Data Coverage. To ensure that good data is available for each identified combination of conditions:
• Enough data
• Unbiased data within each combination
→ Ensuring good training for important conditions (such as risk conditions)
Prop. 4: Data Distribution. To ensure a good distribution of data over the whole domain of input data
→ To achieve a model with good performance
(A compromise or balance between Prop. 3 and Prop. 4 is needed.)
Prop. 5: Accuracy. Good outputs on data points available in the training/test datasets:
• Directly evaluated by testing results
Prop. 6: Robustness. Good, stable outputs on points not available in the training/test datasets:
• Directly: by numerical analysis
• Indirectly: through management of test methods
Prop. 7: Software Dependability. The quality of the software other than the trained model is ensured.
Prop. 8: Quality Stability. The quality at the start of operation is to be maintained during long operation.
Note: the labels for eight properties are shortened: refer to the main text for the full titles.
4-1 Completeness of problem domain analysis
The first target, “completeness of problem domain analysis”, concerns how the surrounding
environment and situations from which input data originate are analyzed and understood
conceptually. The analysis results should cover all situations expected at run time
and should be formalized in a clean, comprehensible way, e.g. using a feature tree. The
analysis should be detailed enough to distinguish situations that lead to different final outputs,
as well as situations that differ in how critical the risk is if the required quality goals are not met.
(example) Car driving applications should distinguish night-time from daylight, and rainy conditions
from dry conditions. For that purpose, weather, time, and other application-specific attributes
should be taken into account.
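As a minimal sketch of what a formalized analysis result might look like, the attributes of the car-driving example can be recorded as a flat attribute table in Python. The attribute names, their values, and the helper is_within_domain are hypothetical; a real feature tree would be richer and is not specified in this abstract.

```python
# Hypothetical analysis result: each attribute and the values it may take.
DOMAIN = {
    "time": {"daylight", "night"},
    "weather": {"dry", "rainy"},
}


def is_within_domain(situation: dict) -> bool:
    """True if the situation gives a known value for every analyzed attribute."""
    return situation.keys() == DOMAIN.keys() and all(
        value in DOMAIN[name] for name, value in situation.items()
    )
```

A situation such as {"time": "night", "weather": "snowy"} is rejected, which signals that either the input is unsupported or the analysis missed a condition.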
4-2 Coverage for distinguished problem cases
Combining the situations identified in the above domain analysis often causes a
combinatorial explosion. In that case, the number of distinguished cases should be reduced
to a tractable number while still covering critical conditions well.
This kind of technique is often used in the test design of conventional software.
(example) In car-driving applications, it might be hard to collect data for all
applicable combinations of attributes such as “daytime, rainy, winter, highway, sun-facing, light
in weight, etc.”
However, we should use data corresponding to every high-risk combination of situations, such
as “snowy winter”, “nighttime rainy”, “summer sun-facing”, “rainy and heavy highway”, etc.
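The following Python sketch illustrates the size of the full combination space and a check that a reduced selection still covers the high-risk combinations. The attribute values and the list of high-risk patterns are assumptions based loosely on the example above, and the helper names are hypothetical.

```python
from itertools import product

# Assumed attributes and values for a car-driving application.
ATTRIBUTES = {
    "time": ["daytime", "nighttime"],
    "weather": ["dry", "rainy", "snowy"],
    "season": ["summer", "winter"],
    "road": ["city", "highway"],
    "sun": ["sun-facing", "not sun-facing"],
}

# Assumed high-risk partial combinations that must be covered.
HIGH_RISK = [
    {"weather": "snowy", "season": "winter"},
    {"time": "nighttime", "weather": "rainy"},
    {"season": "summer", "sun": "sun-facing"},
]


def all_cases():
    """Enumerate every full combination of attribute values."""
    names = list(ATTRIBUTES)
    return [dict(zip(names, values)) for values in product(*ATTRIBUTES.values())]


def matches(case, pattern):
    """True if the case has all attribute values required by the pattern."""
    return all(case[name] == value for name, value in pattern.items())


def uncovered_patterns(selected_cases):
    """Return the high-risk patterns that no selected case satisfies."""
    return [p for p in HIGH_RISK if not any(matches(c, p) for c in selected_cases)]
```

Even these five small attributes yield 48 full combinations, so in practice one would select a subset of cases and use a check like uncovered_patterns to confirm that no high-risk pattern was dropped.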
4-3 Diversity of test data
Given the problem domain design and the considerations in the above two properties, the test
dataset (and often also the training dataset) should contain enough data to cover every
distinguished case, and the data should be sufficiently diverse within the corresponding area of
the real-world problem domain.
This is critical to ensure that the machine learning component is well trained against all
identified risk situations.
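A simple first check for this property is to count the data available per distinguished case. The Python sketch below assumes that each datum carries a case label and that a minimum count per case has been chosen; both are assumptions of this example, and the function name is hypothetical. Counting alone does not establish diversity within a case, which needs separate review.

```python
from collections import Counter


def under_covered_cases(samples, required_cases, min_count):
    """Return the required cases that have fewer than min_count samples.

    samples: iterable of case labels, one per test datum.
    The result maps each under-covered case to its current sample count.
    """
    counts = Counter(samples)
    return {case: counts[case] for case in required_cases if counts[case] < min_count}
```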
4-4 Well-distribution of training data
In addition to the above, training data should be well distributed, i.e. unbiased with respect to
the real-world input. This is often said to be one of the basic principles of good machine learning,
but it can sometimes cause a problem, especially when the possible input includes a case that
rarely occurs but is critically important. If we prepare enough data for such a rare critical case,
the dataset will be biased toward that particular case. If we do not, the machine learning
component might not be well trained for such rare cases. These two demands conflict, and some
compromise is needed.
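The tension can be made concrete with a small calculation, sketched in Python below. The sample counts are invented for illustration and do not come from the guideline; the function names are hypothetical.

```python
def case_share(counts, case):
    """Share of the given case among all samples."""
    return counts[case] / sum(counts.values())


def over_representation(train_counts, real_counts, case):
    """How many times more frequent a case is in training data than in reality."""
    return case_share(train_counts, case) / case_share(real_counts, case)


# Assumed, illustrative sample counts.
real_world = {"ordinary": 9990, "rare_critical": 10}
augmented = {"ordinary": 9990, "rare_critical": 1000}
```

With these numbers the rare case makes up 0.1% of real-world input but about 9% of the augmented training set, roughly a 91-fold over-representation. Whether such a bias is acceptable is exactly the compromise discussed above.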
4-5 Correctness of trained model
Given input data whose distribution is well designed with regard to the above two properties, the
machine learning model should produce satisfactory outputs for a sufficiently large proportion
of the prepared test inputs.
This property also (implicitly) assumes that the training and test datasets do not contain a
critical amount of ill-valued data.
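One way to read “a good proportion of test data” is as an accuracy figure, reported both overall and per distinguished case so that a weak case is not hidden by a strong average. The Python sketch below assumes test results are available as (case label, correct or not) pairs; this record format and the function name are assumptions of the example.

```python
from collections import defaultdict


def accuracy_by_case(records):
    """Compute overall accuracy and accuracy per distinguished case.

    records: iterable of (case_label, is_correct) pairs from a test run.
    Returns (overall_accuracy, {case_label: accuracy}).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for case, ok in records:
        totals[case] += 1
        hits[case] += bool(ok)
    if not totals:
        raise ValueError("no test records given")
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {case: hits[case] / totals[case] for case in totals}
```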
4-6 Robustness of trained model
Furthermore, because the amount of data in the datasets is quite limited compared to the
hypothetically possible real-world input, we should separately ensure that the model
behaves well even for input data points not contained in any dataset. Problems related
to this property include, but are not limited to, overfitted models.
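One indirect, test-based way to probe this property is to perturb known inputs slightly and observe whether the output changes. The Python sketch below is a sampling check under assumed settings (uniform noise per feature, a fixed number of trials) and gives no formal guarantee; toy_model is a hypothetical stand-in for a trained classifier.

```python
import random


def stability_rate(model, inputs, epsilon, trials=20, seed=0):
    """Fraction of inputs whose predicted label stays the same under all
    sampled perturbations of at most epsilon per feature."""
    rng = random.Random(seed)
    stable = 0
    for x in inputs:
        base = model(x)
        if all(
            model([v + rng.uniform(-epsilon, epsilon) for v in x]) == base
            for _ in range(trials)
        ):
            stable += 1
    return stable / len(inputs)


def toy_model(x):
    # Hypothetical stand-in for a trained classifier: thresholds the feature sum.
    return int(sum(x) > 1.0)
```

Inputs far from the decision threshold, such as [0.0, 0.0] or [2.0, 2.0], remain stable for epsilon = 0.1, whereas an input sitting on the threshold, such as [0.5, 0.5], does not.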
4-7 Dependability of underlying software system
In addition to the prepared datasets and the trained model, both the training software and the
inference (decision-making) software should be dependable. Machine learning
is tricky in that the training process sometimes hides bugs in the underlying software
infrastructure by adjusting the trained model to the misbehavior. Therefore, separate
assurance of the software quality is quite important.
4-8 Stability/maintainability of quality
Machine learning-based systems can lose their expected quality levels during continuous
operation, either through changes in the nature of the input environment or because the analysis
of input situations at initial design time had insufficient coverage. In most machine learning
applications, continuous monitoring and continuous relearning are crucial for long-run operation.
At the same time, runtime updates often cause trouble (degraded performance), so the quality
of updates also needs systematic monitoring and handling.
The document provides considerations for both on-line updates and off-line updates.
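Continuous monitoring can be sketched as a rolling comparison against the quality measured at release time. The Python example below assumes that correctness of each output can be determined during operation, which is often not the case; the class name, the window size, and the tolerance are all hypothetical choices for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Flags degradation when rolling accuracy drops below baseline - tolerance."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline
        self.tolerance = tolerance
        self.results = deque(maxlen=window)  # keeps only the latest outcomes

    def record(self, is_correct):
        """Store the outcome of one operational prediction."""
        self.results.append(bool(is_correct))

    def degraded(self):
        """True once a full window shows accuracy below the allowed level."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough observations yet
        rolling = sum(self.results) / len(self.results)
        return rolling < self.baseline - self.tolerance
```

The same monitor could be run on a candidate model before and after an update, so that a degrading update is detected before it is put into operation.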
5. Process model
The document assumes a hypothetical, abstract AI development process model as a reference.
It models the development process as a hybrid of waterfall development and iterative agile
development.
The model construction phase, which corresponds to the implementation design and coding phases
of conventional software development, is modeled as an iterative process. Quality design
and quality checkpoints are inserted before and after the model construction phase, similarly
to the waterfall development process model.
The proof of concept (PoC) phase is also explicitly modeled on the “source side” of the waterfall,
while continuous learning and management are placed on the “sink side”.
[Figure: reference development process model. The whole lifecycle is drawn in the waterfall
style, with the stages system definition, risk analysis, requirements definition, data planning
and preparation, product test, quality check, operation and monitoring, and abandonment planning.
Proof of Concept block: training experiments, result evaluation. Iterative development block:
data preparation, machine learning training, quality check. DevOps and continuous training block:
additional data acquisition, continuous training, system updating. Modifications include updates
to system requirements.]
6. Relations to other documents
The EU, the OECD, the Japanese government, and others have published several documents of
principles on the ethics of artificial intelligence. This document aims to provide a basis for
achieving, by technical and engineering means, the social-level requirements specified in those
documents.
The QA4AI consortium in Japan published its “AI product quality assurance guideline” in 2019.
From our point of view, their guideline is complementary to ours: it is beneficial mainly for AI
engineers in understanding their problems and finding appropriate technologies to solve them. It
seems to correspond roughly to Section 7 (Guidance for the adoption of quality management
technology) of the full version of our guideline. Our guideline is mainly targeted at the entities
(companies) that plan to produce AI-based products and supply them to society, and supports
their quality management activities through a systematic process.
(attached: the table of contents, translated to English, of the original Japanese guideline)
Machine Learning Quality Management Guideline, version 1.0.1
Table of Contents (translated)
The section structure below is that of the Japanese version of Machine Learning Quality
Management Guideline, revision 1.0.1.37, published on June 30, 2020; its page numbers are
omitted here.
1 Summary of guideline
1.1 Purpose and background of guideline
1.2 Use of this guideline
1.3 Issues on the quality management of machine learning systems
1.3.1 Importance of environment analysis
1.3.2 Continuous assessment of risks
1.3.3 Quality management depending on data
1.4 Basic concept of quality management
1.5 External quality properties as quality goals
1.5.1 Risk avoidance
1.5.2 AI performance
1.5.3 Fairness
1.6 Treatments of other “AI quality” aspects
1.6.1 Security and privacy
1.6.2 Resistance to attacks
1.6.3 Ethics and other social aspects
1.6.4 Complexity of external environments
1.7 Internal quality properties as quality management target
1.7.1 Completeness of problem domain analysis
1.7.2 Coverage for distinguished problem cases
1.7.3 Diversity of test data
1.7.4 Distribution of training data
1.7.5 Accuracy of trained model
1.7.6 Robustness of trained model
1.7.7 Dependability of underlying software
1.7.8 Stability/maintainability of quality
1.8 Thoughts on development processes
1.8.1 Iterative training and quality management lifecycle
1.8.2 Work divisions and development process managements
1.9 Relations to other guidelines and principles
1.9.1 “Social principle on human-centric AI” (Japan)
1.9.2 Overseas/international principles and guidelines about AI technology
1.10 Structure of this guideline
2 Introduction
2.1 Scope of guideline
2.1.1 Target products and systems
2.1.2 Target of quality management
2.1.3 Extent of quality management
2.2 Relations to existing standards about system quality
2.2.1 Security standard: ISO/IEC 15408
2.2.2 Software quality models: ISO/IEC 25000 series
2.3 Definitions of terms
2.3.1 Terms on machine learning system structure
2.3.2 Terms on development stakeholders and their roles
2.3.3 Terms on quality concepts
2.3.4 Terms on development process
2.3.5 Terms on usage environments
2.3.6 Terms related to data for machine learning
2.3.7 Other terms
3 Level assignments for external quality of machine learning systems
3.1 Risk avoidance
3.2 AI performance
3.3 Fairness
4 Reference model for the development process of systems using machine learning
4.1 Proof of Concept phase
4.1.1 Handling of PoC with trial operations
4.2 Main development phase
4.2.1 Model construction phase
4.2.2 System construction and testing phase
4.3 Deployment, operation and monitoring phase
5 Procedure for applying the guideline
5.1 Basic process
5.1.1 Identifying functions of ML components in the whole system
5.1.2 Assigning external quality levels for ML components
5.1.3 Determining requirement levels for internal quality aspects
5.1.4 Realizing internal quality
5.2 (informative) Outsourcing AI developments
5.2.1 Handling of exploratory approaches
5.2.2 Clarifications of actions in each development step
5.2.3 Considerations on the division of works
5.3 (informative) Considerations on multi-step (differential) development
6 Requirements for quality assessment
6.1 Completeness of problem domain analysis
6.1.1 Basic concept
6.1.2 Guidance
6.1.3 Requirements for assigned quality levels
6.2 Coverage for distinguished problem cases
6.2.1 Basic concept
6.2.2 Guidance
6.2.3 Requirements for assigned quality levels
6.3 Diversity of test data
6.3.1 Basic concept
6.3.2 Guidance
6.3.3 Requirements for assigned quality levels
6.4 Distribution of training data
6.4.1 Basic concept
6.4.2 Guidance
6.4.3 Requirements for assigned quality levels
6.5 Accuracy of trained model
6.5.1 Basic concept
6.5.2 Guidance
6.5.3 Requirements for assigned quality levels
6.6 Robustness of trained model
6.6.1 Basic concept
6.6.2 Guidance
6.6.3 Requirements for assigned quality levels
6.7 Dependability of underlying software
6.7.1 Basic concept
6.7.2 Guidance
6.7.3 Requirements for assigned quality levels
6.8 Stability/maintainability of quality
6.8.1 Basic concept
6.8.2 Guidance
6.8.3 Requirements for assigned quality levels
7 Guidance for the adoption of quality management technology
7.1 Completeness of problem domain analysis
7.1.1 General provisions
7.1.2 Presumption of risk factors on input side
7.1.3 Estimation of the output-side structure of data
7.2 Coverage for distinguished problem cases
7.2.1 General provisions
7.3 Diversity of test data
7.3.1 Considerations on data acquisition
7.3.2 Addition of test data at data preprocess phase
7.3.3 Addition of test data at testing phase
7.4 Distribution of training data
7.5 Accuracy and Robustness of models
7.5.1 Software testing for machine learning components
7.5.2 Technology for stability evaluation and improvements
7.6 (skipped)
7.7 Dependability of underlying software
7.7.1 General provisions
7.7.2 Quality management of open source software
7.7.3 Tracking of software configuration and tracing of bug information
7.7.4 Possible detection of software malfunction using testing techniques
7.7.5 Software updates and their risks
7.7.6 References
7.8 Stability/maintainability of quality
7.8.1 Monitoring
7.8.2 Detection of concept drifts
7.8.3 Retraining
7.8.4 Preparation of additional training data
8 (informative) Related documents
8.1 Relations to/from other guidelines ................................................................................................ 114
8.1.1 AI contractors’ guideline from METI, Japan ...................................................................... 114
8.1.2 Guideline published by QA4AI consortium ....................................................................... 115
8.2 International standardization activities related to AI quality ............................................. 118
8.2.1 Quality, safety ................................................................................................................................. 118
8.2.2 Transparency ................................................................................................................................. 119
8.2.3 Fairness, bias .................................................................................................................................. 119
8.2.4 Other ethical aspects ................................................................................................................... 120
9 (informative) Analyses ................................................................................................................................... 121
9.1 Analysis of internal quality aspects for risk avoidance and AI performance ............... 121
9.2 Quality level assignments for AI performance .......................................................................... 122
9.3 Considerations on fairness properties .......................................................................................... 123
10 Figures and tables ............................................................................................................................................ 124
10.1 Corresponding chart between external/internal quality properties ............................... 124