
DATA VALIDATION FOR MACHINE LEARNING

Eric Breck 1 Neoklis Polyzotis 1 Sudip Roy 1 Steven Euijong Whang 2 Martin Zinkevich 1

ABSTRACT
Machine learning is a powerful tool for gleaning knowledge from massive amounts of data. While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, less attention has been paid to the equally important problem of monitoring the quality of the data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any benefits on speed and accuracy for training and inference. This argument points to a data-centric approach to machine learning that treats training and serving data as an important production asset, on par with the algorithm and infrastructure used for learning.

In this paper, we tackle this problem and present a data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines. This system is deployed in production as an integral part of TFX (Baylor et al., 2017), an end-to-end machine learning platform at Google. It is used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day. We faced several challenges in developing our system, most notably around the ability of ML pipelines to soldier on in the face of unexpected patterns, schema-free data, or training/serving skew. We discuss these challenges, the techniques we used to address them, and the various design choices that we made in implementing the system. Finally, we present evidence from the system's deployment in production that illustrates the tangible benefits of data validation in the context of ML: early detection of errors, model-quality wins from using better data, savings in engineering hours to debug problems, and a shift towards data-centric workflows in model development.

1 INTRODUCTION

Machine Learning (ML) is widely used to glean knowledge from massive amounts of data. The applications are ever-increasing and range from machine perception and text understanding to health care, genomics, and self-driving cars. Given the critical nature of some of these applications, and the role that ML plays in their operation, we are also observing the emergence of ML platforms (Baylor et al., 2017; Chandra, 2014) that enable engineering teams to reliably deploy ML pipelines in production.

In this paper we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. Furthermore, it is often the case that the predictions from the generated models are logged and used to generate more data for training. Such feedback loops have the potential to amplify even "small" data errors and lead to gradual regression of model performance over a period of time.

1 Google Research, 2 KAIST. Work done while at Google Research. Correspondence to: Neoklis Polyzotis <[email protected]>.

Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).

Therefore, it is imperative to catch data errors early, before they propagate through these complex loops and taint more of the pipeline's state. The importance of error-free data also applies to the task of model understanding, since any attempt to debug and understand the output of the model must be grounded on the assumption that the data is adequately clean. All these observations point to the fact that we need to elevate data to a first-class citizen in ML pipelines, on par with algorithms and infrastructure, with corresponding tooling to continuously monitor and validate data throughout the various stages of the pipeline.

Data validation is neither a new problem nor unique to ML, and so we borrow solutions from related fields (e.g., database systems). However, we argue that the problem acquires unique challenges in the context of ML and hence we need to rethink existing solutions. We discuss these challenges through an example that reflects an actual data-related production outage at Google.

Example 1.1 Consider an ML pipeline that trains on new training data arriving in batches every day, and pushes a fresh model trained on this data to the serving infrastructure. The queries to the model servers (the serving data) are logged and joined with labels to create the next day's training data. This setup ensures that the model is continuously updated and adapts to any changes in the data characteristics on a daily basis.

Now, let us assume that an engineer performs a (seemingly) innocuous code refactoring in the serving stack, which, however, introduces a bug that pins the value of a specific int feature to -1 for some slice of the serving data (e.g., imagine that the feature's value is generated by doing an RPC into a backend system and the bug causes the RPC to fail, thus returning an error value). The ML model, being robust to data changes, continues to generate predictions, albeit at a lower level of accuracy for this slice of data.

Since the serving data eventually becomes training data, this means that the next version of the model gets trained with the problematic data. Note that the data looks perfectly fine for the training code, since -1 is an acceptable value for the int feature. If the feature is important for accuracy then the model will continue to under-perform for the same slice of data. Moreover, the error will persist in the serving data (and thus in the next batch of training data) until it is discovered and fixed.

The example illustrates a common setup where the generation (and ownership!) of the data is decoupled from the ML pipeline. This decoupling is often necessary as it allows product teams to experiment and innovate by joining data sources maintained and curated by other teams. However, this multiplicity of data sources (and corresponding code paths populating the sources) can lead to multiple failure modes for different slices of the data. The ML pipeline has no visibility into this data-generation logic except through its side effects (e.g., the fact that -1 became more common on a slice of the data), which makes detecting such slice-specific problems significantly harder. Furthermore, the pipeline often receives the data in a raw-value format (e.g., tensorflow.Example or CSV) that strips out any semantic information that can help identify errors. Going back to our example, -1 is a valid value for the int feature and does not carry with it any semantics related to the backend errors. Overall, there is little a-priori information that the pipeline can leverage to reason about data errors.

The example also illustrates a very common source of data errors in production: bugs in code. This has several important implications for data validation. First, data errors are likely to exhibit some "structure" that reflects the execution of the faulty code (e.g., all training examples in the slice get the value of -1). Second, these errors tend to be different from the type of errors commonly considered in the data-cleaning literature (e.g., entity resolution). Finally, fixing an error requires changes to code, which is immensely hard to automate reliably. Even if it were possible to automatically "patch" the data to correct an error (e.g., using a technique such as HoloClean (Rekatsinas et al., 2017)), this would have to happen consistently for both training and serving data, with the additional challenge that the latter is a stream with stringent latency requirements. Instead, the common approach in practice is for the on-call engineer to investigate the problem, isolate the bug in the code, and submit a fix for both the training and serving side. In turn, this means that the data-validation system must generate reliable alerts with high precision, and provide enough context so that the on-call engineer can quickly identify the root cause of the problem. Our experience shows that on-calls tend to ignore alerts that are either spammy or not actionable, which can cause important errors to go unnoticed.

A final observation on Example 1.1 is that errors can happen and propagate at different stages of the pipeline. Catching data errors early is thus important, as it helps debug the root cause and also roll back the pipeline to a working state. Moreover, it is important to rely on mechanisms specific to data validation rather than on detection of second-order effects. Concretely, suppose that we relied on model-quality validation as a failsafe for data errors. The resilience of ML algorithms to noisy data means that errors may result in a small drop in model quality, one that can be easily missed if the data errors affect the model only on specific slices of the data but the aggregate model metrics still look okay.

Our Contributions. In this paper we present a data-validation system whose design is driven by the aforementioned challenges. Our system is deployed in production as an integral part of TFX. Hundreds of product teams use our system to validate trillions of training and serving examples per day, amounting to several petabytes of data per day.

The data-validation mechanisms that we develop are based on "battle-tested" principles from data management systems, but tailored to the context of ML. For instance, a fundamental piece of our solution is the familiar concept of a data schema, which codifies the expectations for correct data. However, as mentioned above, existing solutions do not work out of the box and need to be adapted to the context of ML. For instance, the schema encodes data properties that are unique to ML and are thus absent from typical database schemas. Moreover, the need to surface high-precision, actionable alerts to a human operator has influenced the type of properties that we can check. As an example, we found that statistical tests for detecting changes in the data distribution, such as the chi-squared test, are too sensitive and also uninformative for the typical scale of data in ML pipelines, which led us to seek alternative methods to quantify changes between data distributions. Another difference is that in database systems the schema comes first and provides a mechanism to verify both updates and queries against the data. This assumption breaks in the context of ML pipelines where data generation is decoupled from the pipeline.

Figure 1: An overview of our data validation system and its integration with TFX. (The figure shows the Data Analyzer, Data Validator, and Model Unit Testing components, the schema, the training and serving data, the user's training code, and TensorFlow Serving, along with data statistics, anomaly alerts, suggested schema edits, user-approved schema updates, and the user's code fixes flowing between them.)

In essence, in our setup the data comes first. This necessitates new workflows where the schema can be inferred from and co-evolve with the data. Moreover, this co-evolution needs to be user friendly so that we can allow human operators to seamlessly encode their domain knowledge about the data as part of the schema.

Another novel aspect of our work is that we use the schema to perform unit tests for the training algorithm. These tests check that there are no obvious errors in the code of the training algorithm, but most importantly they help our system uncover any discrepancies between the codified expectations over the data and the assumptions made by the algorithm. Any discrepancies mean that either the schema needs to change (to codify the new expectations) or the training code needs to change to be compliant with the actual shape of the data. Overall, this implies another type of co-evolution between the schema and the training algorithm.

As mentioned earlier, our system has been deployed in production at Google and so some parts of its implementation have been influenced by the specific infrastructure that we use. Still, we believe that our design choices, techniques, and lessons learned generalize to other setups and are thus of interest to both researchers and practitioners. We have also open-sourced the libraries1 that implement the core techniques that we describe in the paper so that they can be integrated in other platforms.

1 GitHub repository for TensorFlow Data Validation: https://github.com/tensorflow/data-validation

2 SYSTEM OVERVIEW

Figure 1 shows a schematic overview of our data-validation system and how it integrates with an end-to-end machine learning platform. At a high level, the platform instantiates a pipeline that ingests training data, passes it to data validation (our system), then pipes it to a training algorithm that generates a model, and finally pushes the latter to a serving infrastructure for inference. These pipelines typically work in a continuous fashion: a new batch of data arrives periodically, which triggers a new run of the pipeline. In other words, each batch of input data will trigger a new run of the data-validation logic and hence potentially a new set of data anomalies. Note that our notion of a batch is different from the mini-batches that constitute chunks of data used to compute and update model parameters during training. The batches of data ingested into the pipeline correspond to larger intervals of time (say a day). While our system supports validating only a sample of the data ingested into the pipeline, this option is disabled by default since our current response times for a single batch are acceptable for most users and within the expectations of end-to-end ML. Furthermore, by validating over the entire batch we ensure that anomalies that are infrequent or manifest in small but important slices of data are not silently ignored.

The data validation system consists of three main components: a Data Analyzer that computes a predefined set of data statistics sufficient for data validation, a Data Validator that checks the properties of the data as specified through a Schema (defined in Section 3), and a Model Unit Tester that checks for errors in the training code using synthetic data generated through the schema. Overall, our system supports the following types of data validation:

• Single-batch validation answers the question: are there any anomalies in a single batch of data? The goal here is to alert the on-call about the error and kick-start the investigation.

• Inter-batch validation answers the question: are there any significant changes between the training and serving data, or between successive batches of the training data? This tries to capture the class of errors that occur between software stacks (e.g., training-data generation follows different logic than serving-data generation), or to capture bugs in the rollout of new code (e.g., a new version of the training-data generation code results in different semantics for a feature, which requires old batches to be backfilled).

• Model testing answers the question: are there any assumptions in the training code that are not reflected in the data (e.g., are we taking the logarithm of a feature that turns out to be a negative number or a string?).

The following sections discuss the details of how our system performs these data validation checks and how our design choices address the challenges discussed in Section 1. Now, we acknowledge that these checks are not exhaustive, but our experience shows that they cover the vast majority of errors we have observed in production and so they provide a meaningful scope for the problem.

3 SINGLE-BATCH VALIDATION

The first question we answer is: are there data errors within each new batch that is ingested by the pipeline?

We expect the data characteristics to remain stable within each batch, as the latter corresponds to a single run of the data-generation code. We also expect some characteristics to remain stable across several batches that are close in time, since it is uncommon to have frequent drastic changes to the data-generation code. For these reasons, we consider any deviation within a batch from the expected data characteristics, given expert domain knowledge, as an anomaly.

In order to codify these expected data characteristics, our system generalizes the traditional notion of a schema from database systems. The schema acts as a logical model of the data and attempts to capture some of the semantics that are lost when the data are transformed to the format accepted by training algorithms, which is typically key-value lists. To illustrate this, consider a training example with a key-value pair ("age", 150). If this feature corresponds to the age of a person in years then clearly there is an error. If, however, the feature corresponds to the age of a document in days then the value can be perfectly valid. The schema can codify our expectations for "age" and provide a reliable mechanism to check for errors.

message Schema {
  // Features described in this schema.
  repeated Feature feature;

  // String domains referenced in the features.
  repeated StringDomain string_domain;
}

message Feature {
  // The name of the feature.
  string name;

  // Type of the feature's values.
  FeatureType type;

  // Constraints on the number of examples that have this
  // feature populated.
  FeaturePresence presence;

  // Minimum and maximum number of values.
  ValueCount value_count;

  // Domain for the values of the feature.
  oneof domain_info {
    // Reference to a domain defined at the schema level.
    string domain;

    // Inline definitions of domains.
    IntDomain int_domain;
    FloatDomain float_domain;
    StringDomain string_domain;
    BoolDomain bool_domain;
  }

  LifecycleStage lifecycle_stage;
}

Figure 2: Simplified schema as a protocol buffer. Note that tag numbers are omitted to simplify exposition. For further explanation of constraints, see Appendix A.

Figure 2 shows the schema formalism used by our data validation system, defined as a protocol buffer message (pro, 2017). (We use the protocol-buffer representation as it corresponds to our implementation and is also easy to follow.) The schema follows a logical data model where each training or serving example is a collection of features, with each feature having several constraints attached to it. This flat model has an obvious mapping to the flat data formats used by popular ML frameworks, e.g., tensorflow.Example or CSV. We have also extended the schema to cover structured examples (e.g., JSON objects or protocol buffers) but we do not discuss this capability in this paper.

The constraints associated with each feature cover some basic properties (e.g., type, domain, valency) but also constraints that are relevant to ML and that also reflect code-centric error patterns that we want to capture through data validation (see also our discussion in Section 1). For instance, the presence constraint covers bugs that cause the data-generation code to silently drop features from some examples (e.g., failed RPCs that cause the code to skip a feature). As another case, the domain constraints can cover bugs that change the representation of values for the same feature (e.g., a code change that lower-cases the string representation of country codes for some examples). We defer a more detailed discussion of our schema formalism to Appendix A. However, we note that we do not make any claims on completeness: there are reasonable data properties that our schema cannot encode. Still, our experience so far has shown that the current schema is powerful enough to capture the vast majority of use cases in production pipelines.
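To make these constraints concrete, the sketch below applies presence, valency, type, and domain checks of this kind to a small batch of key-value examples. It is a simplified, self-contained Python illustration rather than the production implementation; the feature names, example values, thresholds, and schema encoding are hypothetical.

# Simplified single-batch checks in the spirit of the constraints in Figure 2
# (presence, valency, type, domain). Illustrative only; the feature names,
# values, and thresholds are made up.

batch = [
    {"event": ["CLICK"], "age": [34]},
    {"event": ["CONVERSION"], "age": [29]},
    {"event": ["IMPRESSION"]},            # 'age' is missing in this example
]

schema = {
    "event": {"type": str, "min_fraction": 1.0, "min_count": 1, "max_count": 1,
              "domain": {"CLICK", "CONVERSION"}},
    "age":   {"type": int, "min_fraction": 1.0, "min_count": 1, "max_count": 1,
              "domain": range(0, 120)},
}

def validate(batch, schema):
    anomalies = []
    n = len(batch)
    for name, spec in schema.items():
        present = [ex[name] for ex in batch if name in ex]
        # Presence: fraction of examples that populate the feature.
        if len(present) / n < spec["min_fraction"]:
            anomalies.append(f"{name}: present in only {len(present)}/{n} examples")
        for values in present:
            # Valency: number of values per example.
            if not spec["min_count"] <= len(values) <= spec["max_count"]:
                anomalies.append(f"{name}: unexpected value count {len(values)}")
            for v in values:
                # Type and domain checks.
                if not isinstance(v, spec["type"]):
                    anomalies.append(f"{name}: unexpected type {type(v).__name__}")
                elif v not in spec["domain"]:
                    anomalies.append(f"{name}: value {v!r} out of domain")
    return anomalies

print(validate(batch, schema))

Running the sketch flags the out-of-domain value 'IMPRESSION' and the example that is missing the 'age' feature, mirroring the kinds of anomalies listed in Table 1.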

Schema Validation. The Data Validator component of our system validates each batch of data by comparing it against the schema. Any disagreement is flagged as an anomaly and triggers an alert to the on-call for further investigation. For now, we assume that the pipeline includes a user-curated schema that codifies all the constraints of interest. We discuss how to arrive at this state later.

In view of the design goals of data validation discussed in Section 1, the Data Validator component:

• attempts to detect issues as early in the pipeline as possible to avoid training on bad data. In order to ensure it can do so scalably and efficiently, we rely on the per-batch data statistics computed by a preceding Data Analyzer module. Our strategy of validating using these pre-computed statistics allows the validation itself to be fairly lightweight. This enables near-instantaneous revalidation of the data using an updated schema.

• is easily interpretable and narrowly focused on the exact anomaly. Table 1 shows the different categories of anomalies that our Data Validator reports to the user. Each anomaly corresponds to a violation of some property specified in the schema and has an associated description template that is populated with concrete values from the detected anomaly before being presented to the user.

• includes suggested updates to the schema to "eliminate" anomalies that correspond to the natural evolution of the data. For instance, as shown in Figure 4, the domain of "event" seems to have acquired a new IMPRESSIONS value, and so our validator generates a suggestion to extend the feature's domain with the same value (this is shown with the red text and "+" sign in Figure 4).

Training/Serving data:

  0: { features: { feature: { name: 'event' string_list: { 'CLICK' } } } ... }
  1: { features: { feature: { name: 'event' string_list: { 'CONVERSION' } } } ... }

Schema:

  feature {
    name: 'event'
    presence: { min_fraction: 1 }
    value_count: { min: 1 max: 1 }
    type: BYTES
    string_domain { value: 'CLICK' value: 'CONVERSION' }
  }

Figure 3: An example schema and corresponding data in the tf.train.Example (tfe) format.

Serving data:

  feature { key: 'event' value { bytes_list { 'IMPRESSIONS' } } }

Anomaly against the schema of Fig. 3: Examples contain missing values. Fix: Add value to domain.

Suggested schema edit:

  string_domain {
    name: 'event'
    value: 'CLICKS'
    value: 'CONVERSIONS'
  + value: 'IMPRESSIONS'
  }

Figure 4: Schema-driven validation.

• avoids false positives by focusing on the narrow set of constraints that can be expressed in our schema and by requiring that the schema is manually curated, thus ensuring that the user actually cares about the encoded constraints.

Schema Life Cycle. As mentioned earlier, our assumption is that the pipeline owners are also responsible for curating the schema. However, many machine learning pipelines use thousands of features, and so constructing a schema manually for such pipelines can be quite tedious. Furthermore, the domain knowledge of the features may be distributed across a number of engineers within the product team or even outside of the team. In such cases, the upfront effort in schema construction can discourage engineers from setting up data validation until they run into a serious data error.

To overcome this adoption hurdle, the Data Validator component synthesizes a basic version of the schema based on all available batches of data in the pipeline. This auto-generated schema attempts to capture salient properties of the data without overfitting to a particular batch of data. Avoiding overfitting is important: an overfitted schema is more likely to cause spurious alerts when validating a new batch of data, which in turn increases the cognitive overhead for the on-call engineers, reduces their trust in the system, and may even lead them to switch off data validation altogether. We currently rely on a set of reasonable heuristics to perform this initial schema inference. A more formal solution, perhaps with guarantees or controls on the amount of overfitting, is an interesting direction for future work.
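The open-sourced TensorFlow Data Validation library (footnote 1) exposes this infer-then-validate workflow. The snippet below is a rough sketch of how it might be used; the exact function names and arguments may differ across library versions, and the file paths are hypothetical.

# Sketch of the schema life cycle using the open-sourced TensorFlow Data
# Validation library (see footnote 1). API names reflect the public library
# and may vary across versions; the paths are hypothetical.
import tensorflow_data_validation as tfdv

# Infer an initial schema from the statistics of the first batch(es).
train_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train/batch-0*')   # hypothetical path
schema = tfdv.infer_schema(statistics=train_stats)

# ...the user reviews/edits the schema and commits it to version control...

# Validate each new batch against the curated schema.
new_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='gs://my-bucket/train/batch-1*')   # hypothetical path
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)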

Once the initial schema is in place, the Data Validator will recommend updates to the schema as new data is ingested and analyzed. To help users manage the schema easily, our system also includes a user interface and tools that aid users by directing their attention to important suggestions and providing a click-button interface to apply the suggested changes. The user can then accept these suggested updates or manually edit the schema using her judgement. We expect owners of pipelines to treat the schema as a production asset on par with source code and adopt best practices for reviewing, versioning, and maintaining the schema. For instance, in our pipelines the schema is stored in the version-control system for source code.

4 INTER-BATCH VALIDATION

There are certain anomalies that only manifest when two or more batches of data are considered together, e.g., drift in the distribution of feature values across multiple batches of data. In this section, we first cover the different types of such anomalies, discuss the reasons why they occur, and finally present some common techniques that can be used to detect them.

Training-Serving Skew. One of the issues that frequently occurs in production pipelines is skew between training and serving data. Several factors contribute to this issue, but the most common is the use of different code paths for the generation of training and serving data. These different code paths are required due to the widely different latency and throughput characteristics of offline training-data generation versus online serving-data generation.

Based on our experience, we can broadly categorize training-serving skew into three main categories.

Feature skew occurs when a particular feature for an example assumes different values at training versus at serving time. This can happen, for instance, when a developer adds or removes a feature from the training-data code path but inadvertently forgets to do the same to the serving path. A more interesting mechanism through which feature skew occurs is termed time travel. This happens when the feature value is determined by querying a non-static source of data. For instance, consider a scenario where each example in our data corresponds to an online ad impression. One of the features for the impression is the number of clicks, obtained by querying a database. Now, if the training data is generated by querying the same database then it is likely that the click count for each impression will appear higher compared to the serving data, since it includes all the clicks that happened between when the data was served and when the training data was generated. This skew means the resulting model is trained on a distribution of click rates different from the one observed at serving time, which is likely to affect model quality.
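The toy sketch below illustrates the time-travel mechanism: the feature is read from a mutable store at serving time, but the training example is built later, after the store has changed. The identifiers and values are made up.

# Toy illustration of "time travel" feature skew. Names and values are
# hypothetical; the point is that the same query yields different feature
# values at serving time and at training-data generation time.
click_store = {"ad-42": 3}

# Serving time: the model sees the current click count.
serving_features = {"ad-42": {"clicks": click_store["ad-42"]}}

# ...more clicks arrive before the training data is generated...
click_store["ad-42"] += 5

# Training-data generation: querying the same store now yields a higher value.
training_features = {"ad-42": {"clicks": click_store["ad-42"]}}

assert serving_features["ad-42"]["clicks"] != training_features["ad-42"]["clicks"]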

Distribution skew occurs when the distribution of feature values over a batch of training data is different from that seen at serving time. To understand why this happens, consider the following scenario. Let us assume that we have a setup where a sample of today's serving data is used as the training data for the next day's model. The sampling is needed since the volume of data seen at serving time is prohibitively large to train over. Any flaw in the sampling scheme can result in training data distributions that look significantly different from serving data distributions.

Scoring/Serving Skew occurs when only a subset of the scored examples are actually served. To illustrate this scenario, consider a list of ten recommended videos shown to the user out of a hundred that are scored by the model. If the user subsequently clicks on one of the ten videos, then we can treat that as a positive example and the other nine as negative examples for the next day's training. However, the ninety videos that were never served may not have associated labels and therefore may never appear in the training data. This establishes an implicit feedback loop which further increases the chances of misprediction for lower-ranked items and, consequently, their less frequent appearance in the training data. Detecting this type of skew is harder than the other two.

In addition to validating individual batches of training data, the Data Validator also continuously monitors for skew between training and serving data. Specifically, our serving infrastructure is configured to log a sample of the serving data, which is imported back into the training pipeline. The Data Validator component continuously compares batches of incoming training and serving data to detect different types of skew. To detect feature skew, the Data Validator component does a key-join between corresponding batches of training and serving data followed by a feature-wise comparison. Any detected skew is summarized and presented to the user using the ML platform's standard alerting infrastructure. To detect distribution skew, we rely on measures that quantify the distance between distributions, which we discuss below. In addition to skew detection, each serving data batch is validated against the schema to detect the anomalies listed in Section 3. We note that in all cases, the parameters for skew detection are encoded as additional constraints in the schema.
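As an illustration of the feature-skew check, the sketch below joins corresponding training and serving examples on an identifier and compares features value by value. It is a toy, self-contained version of the idea; the identifiers, features, and values are made up.

# Toy feature-skew check: key-join corresponding training and serving
# examples and compare each feature. All keys, features, and values are
# hypothetical.
training_batch = {
    "id-1": {"country": "US", "clicks": 12},
    "id-2": {"country": "de", "clicks": 7},
}
serving_batch = {
    "id-1": {"country": "US", "clicks": 10},
    "id-2": {"country": "DE", "clicks": 7},
}

def feature_skew(training, serving):
    skew = []
    for key in training.keys() & serving.keys():          # key-join
        train_ex, serve_ex = training[key], serving[key]
        for feature in train_ex.keys() | serve_ex.keys(): # feature-wise comparison
            if train_ex.get(feature) != serve_ex.get(feature):
                skew.append((key, feature, train_ex.get(feature), serve_ex.get(feature)))
    return skew

for key, feature, train_val, serve_val in feature_skew(training_batch, serving_batch):
    print(f"{feature} differs for {key}: training={train_val!r} serving={serve_val!r}")

In this toy batch the check surfaces both a time-travel style mismatch (the click count) and a representation change (lower-cased country code), the two error patterns discussed above.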

Quantifying Distribution Distance. As mentioned above, our distribution-skew detection relies on metrics that can quantify the distance between the training and serving distributions. It is important to note that, in practice, we expect these distributions to be different (and hence their distance to be positive), since the distribution of examples on each day is in fact different from the last. So the key question is whether the distance is high enough to warrant an alert.

A first-cut solution here might be to use typical distance metrics such as KL divergence or cosine similarity and fire an alert only if the metric crosses a threshold. The problem with this solution is that product teams have a hard time understanding the natural meaning of the metric and thus tuning the threshold to avoid false positives. Another approach might be to rely on statistical goodness-of-fit tests, where the alert can be tuned based on common confidence levels. For instance, one might apply a chi-square test and alert if the null hypothesis is rejected at the 1% confidence level. The problem with this general approach, however, is that the variance of test statistics typically has an inverse square relationship to the total number of examples, which makes these tests sensitive to even minute differences in the two distributions.

We illustrate the last point with a simple experiment. We begin with a control sample of 100 million points from a N(0,1) distribution and then randomly replace 10 thousand points by sampling from a N(0,2) distribution. This replacement introduces a minute amount of noise (0.01%) into the original data, to which ML is expected to be resilient. In other words, a human operator would consider the two distributions to be the "same". We then perform a chi-square test between the two distributions to test their fit. Figure 5 shows the p-value of this test over 10 trials of this experiment, along with the threshold for the 1% confidence level. Note that any p-value below the threshold implies that the test rejects the null hypothesis (that the two distributions are the same) and would therefore result in a distribution-skew alert. As shown, this method would needlessly fire an alert 7 out of 10 times and would most likely be considered a flaky detection method. We have repeated the same experiment with actual error-free data from one of our product teams and obtained qualitatively the same results.

To overcome these limitations we focus on distribution-distance metrics which are resilient to the differences we observe in practice and whose sensitivity can be easily tuned by product teams. In particular, we use d∞(p, q) = max_{i∈S} |p_i − q_i| as the metric to compare two distributions p and q with probabilities p_i and q_i, respectively, for each value i (from some universe S). Notice that this metric has several desirable properties. First, it has a simple natural interpretation as the largest change in probability for a value across the two distributions, which makes it easier for teams to set a threshold (e.g., "allow changes of only up to 1% for each value"). Moreover, an alert comes with a "culprit" value that can be the starting point of an investigation. For instance, if the highest change in frequency is observed for the value '-1' then this might be an issue with some backend failing and producing a default value. (This is an actual situation observed in one of our production pipelines.)
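A minimal sketch of computing d∞ between the value frequencies of two batches, together with the culprit value, is shown below; the feature values and the 1% threshold are illustrative.

# d_inf(p, q) = max_i |p_i - q_i|: the largest change in probability of any
# single value between two batches, together with the "culprit" value.
# The feature values and the threshold are made up for illustration.
from collections import Counter

def d_infinity(batch_p, batch_q):
    p, q = Counter(batch_p), Counter(batch_q)
    n_p, n_q = len(batch_p), len(batch_q)
    values = p.keys() | q.keys()
    # Track the value attaining the maximum along with the distance.
    culprit, dist = max(((v, abs(p[v] / n_p - q[v] / n_q)) for v in values),
                        key=lambda t: t[1])
    return culprit, dist

training = ["CLICK"] * 900 + ["CONVERSION"] * 100
serving = ["CLICK"] * 880 + ["CONVERSION"] * 70 + ["-1"] * 50

culprit, dist = d_infinity(training, serving)
if dist > 0.01:  # e.g., "allow changes of only up to 1% for each value"
    print(f"Distribution-skew alert: value {culprit!r} changed by {dist:.1%}")

Here the alert reports '-1' as the culprit with a 5% change in frequency, which is exactly the kind of starting point for an investigation described above.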

Figure 5: Sensitivity of the chi-square test. (The plot shows the p-value over the 10 trials, with 10K examples altered out of 100M, against the alpha = 0.01 threshold.)

The next question is whether we can estimate the metric in a statistically sound manner based on observed frequencies. We develop a method to do that based on Dirichlet priors. For more details, see Appendix B.
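Purely as an illustration of the general idea (and not necessarily the exact estimator of Appendix B), observed counts can be smoothed with a symmetric Dirichlet(alpha) prior before computing d∞, using the standard posterior-mean estimate (c_i + alpha) / (n + alpha * K):

# Illustrative only: smoothing observed frequencies with a symmetric
# Dirichlet(alpha) prior before computing d_inf. This is the standard
# posterior-mean estimate and not necessarily the method of Appendix B.
from collections import Counter

def smoothed_frequencies(batch, support, alpha=1.0):
    counts = Counter(batch)
    n, k = len(batch), len(support)
    # Posterior mean under a symmetric Dirichlet(alpha) prior.
    return {v: (counts[v] + alpha) / (n + alpha * k) for v in support}

def d_infinity_smoothed(batch_p, batch_q, alpha=1.0):
    support = set(batch_p) | set(batch_q)
    p = smoothed_frequencies(batch_p, support, alpha)
    q = smoothed_frequencies(batch_q, support, alpha)
    return max(abs(p[v] - q[v]) for v in support)

print(d_infinity_smoothed(["a"] * 99 + ["b"], ["a"] * 100))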

5 MODEL UNIT TESTING

Up to this point, we have focused on detecting mismatches between the expected and the actual state of the data. Our intent has been to uncover software errors in generating either training or serving data. Here, we shift gears and focus on a different class of software errors: mismatches between the expected data and the assumptions made in the training code. Specifically, recall that our platform (and similar platforms) allows the user to plug in their own training code to generate the model (see also Figure 1). This code is mostly a black box for the remaining parts of the platform, including the data-validation system, and can perform arbitrary computations over the data. As we explain below, these computations may make assumptions that do not agree with the data and cause serious errors that propagate through the ML infrastructure.

To illustrate this scenario, suppose that the training code applies a logarithm transform on a numeric feature, thus making the implicit assumption that the feature's value is always positive. Now, let us assume that the schema does not encode this assumption (e.g., the feature's domain includes non-positive values) and also that the specific instance of training data happens to carry only positive values for this feature. As a result, two things happen: (a) data validation does not detect any errors, and (b) the generated model includes the same transform and hence the same assumption. Consider now what happens when the model is served and the serving data happens to have a non-positive value for the feature: the error is triggered, resulting in a (possibly crashing) failure inside the serving infrastructure.

The previous example illustrates the dangers of these hidden assumptions and the importance of catching them before they propagate through the served models. Note that the training code can make several other types of assumptions that are not reflected in the schema, e.g., that a feature is always present even though it is marked as optional, or that a feature has a dense shape even though it may take a variable number of values, to name a few that we have observed in production. Again, the danger of these assumptions is that they are not part of the schema (so they cannot be caught through schema validation) and they may be satisfied by the specific instance of training data (and so will not trigger during training).

We base our approach on fuzz testing (Miller et al., 1990), using the schema to guide the generation of synthetic inputs. Specifically, we generate synthetic training examples that adhere to the schema constraints. For instance, if the schema in Figure 3 is used to generate data, each example will have an event feature, each of which would have one value, chosen uniformly at random to be either 'CLICK' or 'CONVERSION'. Similarly, integral features would be random integers from the range specified in the schema, to name another case. The generation can be seeded so that it provides a deterministic output.
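A minimal sketch of such schema-driven generation is shown below. The schema is encoded as a plain dictionary that loosely mirrors the 'event' feature of Figure 3 plus a hypothetical integer feature; the field names and the training hook are illustrative, not the production implementation.

# Sketch of schema-driven generation of synthetic examples for model unit
# tests. The schema dictionary and field names are simplified; 'age' is a
# hypothetical integer feature added for illustration.
import random

schema = {
    "event": {"kind": "string", "domain": ["CLICK", "CONVERSION"],
              "min_count": 1, "max_count": 1},
    "age":   {"kind": "int", "min": 0, "max": 120,
              "min_count": 1, "max_count": 1},
}

def generate_examples(schema, num_examples, seed=0):
    rng = random.Random(seed)              # seeded -> deterministic output
    examples = []
    for _ in range(num_examples):
        example = {}
        for name, spec in schema.items():
            count = rng.randint(spec["min_count"], spec["max_count"])
            if spec["kind"] == "string":
                # String features draw uniformly from the declared domain.
                example[name] = [rng.choice(spec["domain"]) for _ in range(count)]
            else:
                # Integral features draw from the declared range.
                example[name] = [rng.randint(spec["min"], spec["max"])
                                 for _ in range(count)]
        examples.append(example)
    return examples

# Drive a few training iterations on the synthetic data; an assumption such
# as "this feature is strictly positive" is triggered here rather than in
# the serving infrastructure.
for example in generate_examples(schema, num_examples=100):
    pass  # train_step(example)  -- hypothetical training hook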

The generated data is then used to drive a few iterations of the training code. The goal is to trigger the hidden assumptions in the code that do not agree with the schema constraints. Returning to our earlier example with the logarithm transform, a synthetic input with non-positive feature values will cause an error and hence uncover the mismatch between the schema and the training code. At that point, the user can fix the training code, e.g., apply the transform on max(value, 1), or extend the schema to mark the feature as positive (so that data validation catches this error). Doing both provides for additional robustness, as it enables alerts if the data violates the stated assumptions and protects the generated model from crashes if the data is wrong.

This fuzz-testing strategy is obviously not fool-proof, as the random data can still pass through the training code without triggering errors. In practice, however, we found that fuzz testing can trigger common errors in the training code even with a modest number of randomly generated examples (e.g., in the 100s). In fact, it has worked so well that we have packaged this type of testing as a unit test over training algorithms, and included the test in the standard templates of our ML platform. Our users routinely execute these unit tests to validate changes to the training logic of their pipelines. To our knowledge, this application of unit testing in the context of ML and for the purpose of data validation is a novel aspect of our work.

6 EMPIRICAL VALIDATION

As mentioned earlier, our data-validation system has been deployed in production as part of the standard ML platform used in a large organization. The system currently analyzes several petabytes of training and serving data per day and has resulted in significant savings in engineering hours in two ways: by catching important data anomalies early (before they result in a bad model being pushed to production), and by helping teams diagnose model-quality problems caused by data errors. In what follows we provide empirical evidence to back this claim. We first report aggregate results from our production deployment and then discuss individual use cases in some detail.

Figure 6: Number of manual schema changes by users over the analyzed pipelines. (Histogram: the x-axis is the number of schema edits, bucketed as 1, [2-5], [5-10], [10-20], [20-50]; the y-axis is the percentage of pipelines, ranging from 0% to 50%.)

6.1 Results from Production Deployment

We present an analysis of our system in production, based on a sample of more than 700 ML pipelines that employ data validation and train models that are pushed to our production serving infrastructure.

We first consider the co-evolution of data and schema in the lifecycle of these pipelines, which is a core tenet of our approach. Recall that our users go through the following workflow: (i) when the pipeline is set up, they obtain an initial schema through our automatic inference, (ii) they review and manually edit the schema to compensate for any missing domain knowledge, (iii) they commit the updated schema to a version control system and start the training part of the pipeline. Subsequently, when a data alert fires that indicates a disagreement between the data and the schema, the user has the option to either fix the data generation process or update the schema, if the alert corresponds to a natural evolution of the data. Here we are interested in understanding the latter changes that reflect the co-evolution we mentioned.

Figure 6 shows a histogram of the number of schema changes across the >700 pipelines that we considered. The graph illustrates that the schema evolves in most of the examined pipelines, in line with our hypothesis of data-schema co-evolution.

The majority of cases have up to five schema revisions, with the distribution tapering off after that point. This evidence suggests that the input data has evolving properties but is not completely volatile, which is reasonable in practice. As a side note, we also conclude that engineers treat the schema as an important production asset and hence are willing to put in the effort to keep it up to date. In fact, anecdotal evidence from some teams suggests a mental shift towards a data-centric view of ML, where the schema is not solely used for data validation but also provides a way to document new features that are used in the pipeline and thus disseminate information across the members of the team.

Anomaly Category                                             Used   Fired   Fixed given Fired
New feature column (in data but not in schema)               100%    10%    65%
Out of domain values for categorical features                 45%     6%    66%
Missing feature column (in schema but not in data)            97%     6%    53%
The fraction of examples containing a feature is too small    97%     3%    82%
Too small feature value vector for example                    98%     2%    56%
Too large feature value vector for example                    98%    <1%    28%
Data completely missing                                       100%    3%    65%
Incorrect data type for feature values                         98%   <1%    100%
Non-boolean value for boolean feature type                     14%   <1%    100%
Out of domain values for numeric features                      67%    1%    77%

Table 1: Analysis of data anomalies over the most recent 30-day period for the evaluation pipelines. First, we checked the schemas to determine what fraction of pipelines could possibly fire a particular kind of alert (Used). Then, we looked at each day, saw what kinds of anomalies Fired, and calculated what fraction of pipelines had an anomaly fire on any day. If the anomaly did not fire again on a pipeline for two days afterward, then we considered the problem Fixed. This methodology can miss some fixes, if an anomaly is fixed but a new anomaly of the same type arrives the next day. It is also possible that an anomaly appears fixed but was not, if a pipeline stopped or example validation was turned off, but this is less likely.

In our next analysis we examine in more detail how users interact with the schema. Specifically, we instrumented our system to monitor the types of data anomalies raised in production and the subsequent reactions of the users in terms of schema updates. Table 1 summarizes the results. As shown, the most common anomalies are new feature columns, unexpected string values, and missing feature columns. The first two are unsurprising: even in a healthy pipeline, there will be new values for fields such as "postal code", and new feature columns are constantly being created by feature engineers. Missing features and missing data are more cause for concern. We can see from the table that it is very rare for the physical type of a feature column to be wrong (for example, someone manually added a feature column with the wrong type, causing this anomaly); nonetheless, by checking this we check agreement between the prescriptive nature of the schema and the actual data on disk. Even in cases where the anomalies almost never fire, this check is useful.

Table 1 also shows that product teams fix the majority of the detected anomalies. Now, some anomalies remain unfixed, and we postulate that this happens for two reasons. First, in some cases a reasonable course of action is to let the anomalous batch slide out of the pipeline's rolling window, e.g., in the case of a data-missing anomaly where regenerating the data is impossible. Second, as we mentioned repeatedly, we are dealing with a human operator in the loop who might miss or ignore the alert depending on their cognitive load. This re-emphasizes the observation that we need to be mindful of how and when we deliver the anomaly alerts to the on-call operators.

Finally, we turn our attention to the data-validation workflow through model unit testing. Again, this unit testing is part of the standard ML platform in our organization and so we were able to gather fleet-wide statistics on this functionality. Specifically, more than 70% of pipelines had at least one model unit test defined. Based on an analysis of test logs over a period of one month, we determined that these tests were executed more than 80K times (including runs executed as part of the continuous test framework). Of all of these executions, 6% had failures, indicating that either the training code had incorrect assumptions about the data or the schema itself was under-specified.

6.2 Case Studies

In addition to the previous results, we present a few case studies that illustrate the benefits of our system in production.

Missing features in Google Play recommender pipeline. The goal of the Google Play recommender system is to recommend relevant Android apps to the Play app users when they visit the homepage of the store, with an aim of driving discovery of apps that will be useful to the user. Using the Data Validation system of TFX, Google Play discovered a few features that were always missing from the logs, but always present in training. The results of an online A/B experiment showed that removing this skew improved the app install rate on the main landing page of the app store by 2%.

Data debugging leads to model wins. One of the product teams in our organization employs ML to generate video recommendations. The product team needed to migrate their existing training and serving infrastructure onto the new ML platform that also includes data validation. A prerequisite for the migration was of course to achieve parity in terms of model quality. After setting up the new infrastructure, our data-validation system started generating alerts about missing features over the serving data. During the ensuing investigation, the team discovered that a backend system was storing some features in a different format than expected, which caused these features to be silently dropped in their data-parsing code. In this case, the hard part was identifying the features which exhibited the problem (and this is precisely the information provided by our data-validation service); the fix was easy and resulted in achieving performance parity. The total time to diagnose and fix the problem was two days, whereas similar issues have taken months to resolve.

Stand-alone data validation. Some product teams are setting up ML pipelines solely for the purpose of running our data-validation system, without doing any training or serving of models. These teams have existing infrastructure (predating the ML platform) to train and serve models, but they are lacking a systematic mechanism for data validation and, as a result, they are suffering from the data-related issues that we described earlier in the paper. To address this shortcoming in their infrastructure, these teams have set up pipelines that solely monitor the training and serving data and alert the on-call when an error is detected (while training and serving still happen using the existing infrastructure).

Feature-store migration. Two product teams have used our system to validate the migration of their features to new stores. In this setup, our system was used to apply schema validation and drift detection on datasets from the old and new storage systems, respectively. The detected data errors helped the teams debug the new storage system and consequently complete the migration successfully.

7 RELATED WORK

Compared to a similar system from Amazon (Schelter et al., 2018), our design choices and techniques are informed by the wide deployment of our system at Google. While Amazon's system allows users to express arbitrary constraints, we opted for a restrictive schema definition language that captures the data constraints of most of our users while permitting us to develop effective tools to help users manage their schema. Another differentiating factor of our work is the emphasis on the co-evolution of the schema with the data and the model, with the user in the loop.

While model training (Abadi et al., 2016; ker; mxn) is a central topic, it is only one component of a machine learning platform and represents a small fraction of the code (Sculley et al., 2015). As a result, there is an increasing effort to build end-to-end machine learning systems (Chandra, 2014; Baylor et al., 2017; Sparks et al., 2017; Bose et al., 2017) that cover all aspects of machine learning including data management (Polyzotis et al., 2017), model management (Vartak, 2017; Fernandez et al., 2016; Schelter et al., 2017), and model serving (Olston et al., 2017; Crankshaw et al., 2015), to name a few. In addition, testing and monitoring are necessary to reduce technical debt and ensure production readiness of machine learning systems (Breck et al., 2017). In comparison, this paper specifically focuses on data validation for production machine learning. In addition, we go into detail on a particular solution to the problem, compared to previous work that covered in broad strokes the issues related to data management and machine learning (Polyzotis et al., 2017).

A topic closely related to data validation is data cleaning (Rekatsinas et al., 2017; Stonebraker et al., 2013; Khayyat et al., 2015; Volkovs et al., 2014), where the state of the art is to incorporate multiple signals, including integrity constraints, external data, and quantitative statistics, to repair errors. More recently, several cleaning tools target machine learning. BoostClean (Krishnan et al., 2017) selectively cleans out-of-range errors that negatively affect model accuracy. The data linter (Hynes et al., 2017) uses best practices to identify potential issues and inefficiencies in machine learning data. In comparison, our system uses a data-driven schema based on previous data to generate actionable alerts for users. An interesting direction is to perform root cause analysis (Wang et al., 2015) that is actionable as well.

Schema generation is commonly done in database systems (DiScala & Abadi, 2016), where the main focus is on query processing. As discussed in Section 1, our context is unique in that the schema is reverse-engineered and has to co-evolve with the data. Moreover, we introduce schema constraints unique to ML.

Skew detection is relevant to various statistical approaches including homogeneity tests (Pearson, 1992), analysis of variance (Fisher, 1921; 1992), and time series analysis (Basseville & Nikiforov, 1993; Ding et al., 2008; Brodersen et al., 2015). As discussed, our approach is to avoid statistical tests that lead to false positive alerts and instead rely on more interpretable metrics of distribution distance, coupled with sound approximation methods.

The traditional approach to model testing is to select a random set of examples from manually labeled datasets (Witten et al., 2011). More recently, adversarial deep learning techniques (Goodfellow et al., 2014) have been proposed to generate examples that fool deep neural networks. DeepXplore (Pei et al., 2017) is a whitebox tool for generating data on which models make different predictions. There are also many tools for traditional software testing. In comparison, our model testing uses a schema to generate data and can thus work for any type of model. The process of iteratively adjusting the schema is similar in spirit to version spaces (Russell & Norvig, 2009) and has connections with teaching learners (Frazier et al., 1996).


REFERENCES

Keras. https://keras.io/.

Mxnet. https://mxnet.incubator.apache.org/.

Tensorflow examples. https://www.tensorflow.org/programmers_guide/datasets.

Protocol buffers. https://developers.google.com/protocol-buffers/, 2017.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In OSDI, pp. 265–283, 2016. ISBN 978-1-931971-33-1.

Basseville, M. and Nikiforov, I. V. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Inc., 1993. ISBN 0-13-126780-9.

Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C. Y., Lew, L., Mewald, C., Modi, A. N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S. E., Wicke, M., Wilkiewicz, J., Zhang, X., and Zinkevich, M. TFX: A tensorflow-based production-scale machine learning platform. In SIGKDD, pp. 1387–1395, 2017. ISBN 978-1-4503-4887-4. doi: 10.1145/3097983.3098021. URL http://doi.acm.org/10.1145/3097983.3098021.

Bose, J.-H., Flunkert, V., Gasthaus, J., Januschowski, T., Lange, D., Salinas, D., Schelter, S., Seeger, M., and Wang, Y. Probabilistic demand forecasting at scale. PVLDB, 10(12):1694–1705, August 2017. ISSN 2150-8097.

Breck, E., Cai, S., Nielsen, E., Salib, M., and Sculley, D. The ml test score: A rubric for ml production readiness and technical debt reduction. In IEEE Big Data, 2017.

Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., and Scott, S. L. Inferring causal impact using bayesian structural time-series models. Annals of Applied Statistics, 9:247–274, 2015.

Chandra, T. Sibyl: a system for large scale machine learning at Google. In Dependable Systems and Networks (Keynote), Atlanta, GA, 2014. URL http://www.youtube.com/watch?v=3SaZ5UAQrQM.

Crankshaw, D., Bailis, P., Gonzalez, J. E., Li, H., Zhang, Z., Franklin, M. J., Ghodsi, A., and Jordan, M. I. The missing piece in complex analytics: Low latency, scalable model management and serving with velox. In CIDR, 2015. URL http://cidrdb.org/cidr2015/Papers/CIDR15Paper19u.pdf.

Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., and Keogh, E. Querying and mining of time series data: Experimental comparison of representations and distance measures. PVLDB, 1(2):1542–1552, August 2008. ISSN 2150-8097. doi: 10.14778/1454159.1454226. URL http://dx.doi.org/10.14778/1454159.1454226.

DiScala, M. and Abadi, D. J. Automatic generation of normalized relational schemas from nested key-value data. In SIGMOD, pp. 295–310, 2016. ISBN 978-1-4503-3531-7.

Fernandez, R. C., Abedjan, Z., Madden, S., and Stonebraker, M. Towards large-scale data discovery: Position paper. In ExploreDB, pp. 3–5, 2016. ISBN 978-1-4503-4312-1.

Fisher, R. A. On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1:3–32, 1921.

Fisher, R. A. Statistical Methods for Research Workers, pp. 66–70. Springer-Verlag New York, 1992.

Frazier, M., Goldman, S. A., Mishra, N., and Pitt, L. Learning from a consistently ignorant teacher. J. Comput. Syst. Sci., 52(3):471–492, 1996. doi: 10.1006/jcss.1996.0035. URL https://doi.org/10.1006/jcss.1996.0035.

Frigyik, B. A., Kapila, A., and Gupta, M. R. Introduction to the dirichlet distribution and related processes. Technical report, University of Washington Department of Electrical Engineering, 2010. UWEETR-2010-0006.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.

Hynes, N., Sculley, D., and Terry, M. The data linter: Lightweight, automated sanity checking for ml data sets. In Workshop on ML Systems at NIPS 2017, 2017.

Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiane-Ruiz, J.-A., Tang, N., and Yin, S. Bigdansing: A system for big data cleansing. In SIGMOD, pp. 1215–1230, 2015.

Krishnan, S., Franklin, M. J., Goldberg, K., and Wu, E. Boostclean: Automated error detection and repair for machine learning. CoRR, abs/1711.01299, 2017. URL http://arxiv.org/abs/1711.01299.

Miller, B. P., Fredriksen, L., and So, B. An empirical study of the reliability of unix utilities. Commun. ACM, 33(12):32–44, December 1990. ISSN 0001-0782. doi: 10.1145/96267.96279. URL http://doi.acm.org/10.1145/96267.96279.

Olston, C., Li, F., Harmsen, J., Soyke, J., Gorovoy, K., Lao, L., Fiedel, N., Ramesh, S., and Rajashekhar, V. Tensorflow-serving: Flexible, high-performance ml serving. In Workshop on ML Systems at NIPS 2017, 2017.

Pearson, K. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling, pp. 11–28. Springer-Verlag New York, 1992.

Pei, K., Cao, Y., Yang, J., and Jana, S. Deepxplore: Automated whitebox testing of deep learning systems. In SOSP, pp. 1–18, 2017. ISBN 978-1-4503-5085-3.

Polyzotis, N., Roy, S., Whang, S. E., and Zinkevich, M. Data management challenges in production machine learning. In SIGMOD, pp. 1723–1726, 2017.

Rekatsinas, T., Chu, X., Ilyas, I. F., and Re, C. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, August 2017. ISSN 2150-8097.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009. ISBN 0136042597, 9780136042594.

Schelter, S., Boese, J.-H., Kirschnick, J., Klein, T., and Seufert, S. Automatically tracking metadata and provenance of machine learning experiments. In Workshop on ML Systems at NIPS 2017, 2017.

Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., and Grafberger, A. Automating large-scale data quality verification. Proc. VLDB Endow., 11(12):1781–1794, August 2018. ISSN 2150-8097. doi: 10.14778/3229863.3229867. URL https://doi.org/10.14778/3229863.3229867.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden technical debt in machine learning systems. In NIPS, pp. 2503–2511, 2015. URL http://dl.acm.org/citation.cfm?id=2969442.2969519.

Sparks, E. R., Venkataraman, S., Kaftan, T., Franklin, M. J., and Recht, B. Keystoneml: Optimizing pipelines for large-scale advanced analytics. In ICDE, pp. 535–546, 2017.

Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., Pagan, A., and Xu, S. Data curation at scale: The data tamer system. In CIDR, 2013.

Vartak, M. MODELDB: A system for machine learning model management. In CIDR, 2017.

Volkovs, M., Chiang, F., Szlichta, J., and Miller, R. J. Continuous data cleaning. In ICDE, pp. 244–255, 2014. doi: 10.1109/ICDE.2014.6816655.

Wang, X., Dong, X. L., and Meliou, A. Data x-ray: A diagnostic tool for data errors. In SIGMOD, pp. 1231–1245, 2015. ISBN 978-1-4503-2758-9. doi: 10.1145/2723372.2750549. URL http://doi.acm.org/10.1145/2723372.2750549.

Witten, I. H., Frank, E., and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011. ISBN 0123748569, 9780123748560.


A CONSTRAINTS IN THE DATA SCHEMA

message Feature {
  ...

  // Limits the distribution drift between training
  // and serving data.
  FeatureComparator skew_comparator;

  // Limits the distribution drift between two
  // consecutive batches of data.
  FeatureComparator drift_comparator;
}

Figure 7: Extensions to the Feature message of Schema to check for distribution drifts.
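As a concrete illustration, a feature entry could bound the training/serving skew of the "event" feature as in the sketch below. The skew_comparator and drift_comparator fields come from Figure 7; the inner infinity_norm/threshold layout is an assumption modeled on the open-source variant of the schema, and the 0.01 bound on the distance is purely illustrative:

  feature: {
    name: "event"
    # Alert if the d-infinity distance between the training and
    # serving distributions of "event" exceeds 0.01.
    skew_comparator: { infinity_norm: { threshold: 0.01 } }
    # Apply the same bound across consecutive batches of training data.
    drift_comparator: { infinity_norm: { threshold: 0.01 } }
  }

Expressing the bound as an interpretable distance threshold, rather than as a p-value from a statistical test, follows the design choice discussed earlier of avoiding alerts that users cannot act on.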

We explain some of these feature-level characteristics below using an instance of the schema shown in Figure 3:

Feature type: One of the key invariants of a feature is its data type. For example, a change in the data type from integer to string can easily cause the trainer to fail and is therefore considered a serious anomaly. Our schema allows specification of feature types as INT, FLOAT, and BYTES, which are the allowed types in the tf.train.Example (tfe) format. In Figure 3, the feature "event" is marked as of type BYTES. Note that features may have richer semantic types, which we capture in a different part of the schema (explained later).

Feature presence: While some features are expected to be present in all examples, others may only be expected in a fraction of the examples. The FeaturePresence field can be used to specify this expectation of presence. It allows specification of a lower limit on the fraction of examples in which the feature must be present. For instance, the property presence: {min_fraction: 1} for the "event" feature in Figure 3 indicates that this feature is expected to be present in all examples.

Feature value count: Features can be single-valued or lists. Furthermore, features that are lists may or may not all be of the same length. These value counts are important to determine how the values can be encoded into the low-level tensor representation. The ValueCount field in the schema can be used to express such properties. In the example in Figure 3, the feature "event" is indicated to be a scalar, as expressed by setting the min and max values to 1.
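Putting the three constraints above together, a feature entry in the schema might look like the following sketch. It mirrors the "event" instance described for Figure 3, but since that figure is not reproduced here, the exact field spellings (type, presence, value_count) should be read as assumptions based on the open-source variant of the schema:

  feature: {
    name: "event"
    type: BYTES                      # raw data type of the feature
    presence: { min_fraction: 1 }    # must appear in every example
    value_count: { min: 1 max: 1 }   # scalar: exactly one value per example
  }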

Feature domains: While some features may not have a restricted domain (for example, a feature for "user queries"), many features assume values only from a limited domain. Furthermore, there may be related features that assume values from the same domain. For instance, it makes sense for two features like "apps installed" and "apps used" to be drawn from the same set of values. Our schema allows specification of domains both at the level of individual features as well as at the level of the schema. The named domains at the level of the schema can be shared by all relevant features. Currently, our schema only supports shared domains for features with string domains.

A domain can also encode the semantic type of the data, which can be different from the raw type captured by the TYPE field. For instance, a bytes feature may use the values "TRUE" or "FALSE" to essentially encode a boolean feature. Another example is an integer feature encoding categorical ids (e.g., enum values). Yet another example is a bytes feature that encodes numbers (e.g., values of the sort "123"). These patterns are fairly common in production and reflect common practices in translating structured data into the flat tf.train.Example format. These semantic properties are important for both data validation and understanding, and so we allow them to be marked explicitly in the domain construct of each feature.
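A sketch of how shared and semantic domains might be written down follows. The feature names "apps_installed" and "apps_used" follow the example in the text; the domain values and the "is_click" feature are hypothetical, and the string_domain/bool_domain spellings are again assumptions based on the open-source variant of the schema:

  string_domain: {              # named domain declared once at the schema level
    name: "APP_IDS"
    value: "com.example.mail"
    value: "com.example.maps"
  }
  feature: {
    name: "apps_installed"
    type: BYTES
    domain: "APP_IDS"           # shared with "apps_used"
  }
  feature: {
    name: "apps_used"
    type: BYTES
    domain: "APP_IDS"
  }
  feature: {
    name: "is_click"
    type: BYTES                 # raw type is BYTES ...
    bool_domain: { true_value: "TRUE" false_value: "FALSE" }  # ... but semantically a boolean
  }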

Feature life cycle: The feature set used in a machine learning pipeline keeps evolving. For instance, initially a feature may be introduced only for experimentation. After sufficient trials, it may be promoted to a beta stage before finally being upgraded to a production feature. The gravity of anomalies in features at different stages is different. Our schema allows tagging of features with the stage of the life cycle that they are currently in. The current set of supported stages is UNKNOWN_STAGE, PLANNED, ALPHA, BETA, PRODUCTION, DEPRECATED, and DEBUG_ONLY.
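For instance, a feature that is still being experimented with could be tagged as sketched below (the lifecycle_stage spelling is an assumption based on the open-source variant of the schema, and the feature name is hypothetical), allowing the validation logic to treat anomalies on it as less severe than anomalies on a PRODUCTION feature:

  feature: {
    name: "experimental_click_score"   # hypothetical feature under experimentation
    type: FLOAT
    lifecycle_stage: ALPHA             # anomalies here carry lower severity
  }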

Figure 2 shows only a fragment of the constraints that can be expressed by our schema. For instance, our schema can encode how groups of features encode logical sequences (e.g., the sequence of queries issued by a user, where each query can be described with a set of features), or can express constraints on the distribution of values over the feature's domain. We cover some of these extensions in Section 4, but omit a full presentation of the schema in the interest of space.

B STATISTICAL SIGNIFICANCE OF MEASUREMENTS OF DRIFT

Suppose that we have two days of data and some measure of their distance. As we discussed in Section 4, this measure will never be zero, as some real drift is always expected. However, how do we know if we can trust such a measurement?

Specifically, suppose that there is a set $S$ of distinct observations we can make, with $|S| = n$, and let $\Delta(S)$ be the set of all distributions over $S$. Without loss of generality, we assume $S = \{1, \ldots, n\}$. In one dataset, there are $m_p$ observations $P_1, \ldots, P_{m_p} \in S$ with an empirical distribution $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_n)$. In a second dataset, there are $m_q$ observations $Q_1, \ldots, Q_{m_q}$ with an empirical distribution $\hat{q} = (\hat{q}_1, \ldots, \hat{q}_n)$. We can assume that the elements $P_1, \ldots, P_{m_p}$ were drawn independently from some distribution $p = (p_1, \ldots, p_n)$ that was in turn drawn from a fixed Dirichlet prior (Frigyik et al., 2010) $\mathrm{Dir}(\alpha)$ where $\alpha = (1, \ldots, 1)$ (a uniform density over the simplex), and similarly, $Q_1, \ldots, Q_{m_q}$ were drawn independently from some distribution $q = (q_1, \ldots, q_n)$ that was in turn drawn from the same fixed Dirichlet prior.

If we have a particular measure $d : \Delta(S) \times \Delta(S) \to \mathbb{R}$, we have two options. First, we could measure the empirical drift $d(\hat{p}, \hat{q})$. Or we could attempt to estimate the theoretical drift $d(p, q)$. Although $p$ and $q$ are not directly observed, the latter can be achieved more easily than one might expect. First of all, the posterior distribution over $p$ given the observations $P_1, \ldots, P_{m_p}$ is simply $\mathrm{Dir}(\alpha + m_p \hat{p})$, and the posterior distribution of $q$ is $\mathrm{Dir}(\alpha + m_q \hat{q})$ (see Section 1.2 in (Frigyik et al., 2010)). Thus, one can get a sample from the posterior distribution of $d(p, q)$ by sampling $p'$ from $\mathrm{Dir}(\alpha + m_p \hat{p})$ (see Section 2.2 in (Frigyik et al., 2010)) and $q'$ from $\mathrm{Dir}(\alpha + m_q \hat{q})$ and calculating $d(p', q')$.
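Written out, the sampling scheme above is the following Monte Carlo procedure (a restatement of the text; the number of samples $K$ is a parameter we introduce only for exposition):

  $p \mid P_1, \ldots, P_{m_p} \sim \mathrm{Dir}(\alpha + m_p \hat{p}), \qquad q \mid Q_1, \ldots, Q_{m_q} \sim \mathrm{Dir}(\alpha + m_q \hat{q}),$

  $d^{(k)} = d\big(p'^{(k)}, q'^{(k)}\big), \quad p'^{(k)} \sim \mathrm{Dir}(\alpha + m_p \hat{p}), \quad q'^{(k)} \sim \mathrm{Dir}(\alpha + m_q \hat{q}), \quad k = 1, \ldots, K,$

so that $d^{(1)}, \ldots, d^{(K)}$ is a sample from the posterior distribution of the theoretical drift $d(p, q)$.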

One metric we use is $d_1(p, q) = \sum_{i=1}^{n} |p_i - q_i|$. The advantage of $d_1$ is that it corresponds to how visually distinguishable two distributions are (see Figure 8). Consider a field that has 100 values that are all equally likely. If all values are uppercase instead of lowercase, then $d_\infty(p, q) = 0.01$, but $d_1(p, q) = 1$ (the highest possible value).

What we find, if we estimate the theoretical drift as described above, is that when $m_p$ and $m_q$ are at least a billion and $n < 100$, then $d(\hat{p}, \hat{q})$ is very close to $d(p, q)$. In other words, for the datasets that we care about, the empirical drift $d_\infty(\hat{p}, \hat{q})$ is a sound approximation of the theoretical drift.

Moreover, the empirical drift of $d_1$ can be visualized:

Figure 8: Two distributions, white and black, are compared. When overlaying them, the difference can be "seen". The sum of the magnitudes of these visible differences is the $d_1$ distance.

The connection to visualization is important: for instance, if we observe a high KL divergence, a high Kolmogorov-Smirnov statistic, or a low cosine similarity, it is likely that the first thing a human will do is look at the two distributions and visually inspect them. The scenarios where the $L_1$ distance is low but the KL divergence is high correspond to cases where very small probabilities ($10^{-6}$) become less small ($10^{-3}$). It is unclear whether such small fluctuations in the probability of rare features are cause for concern in general, whereas the magnitude of $d_1$ directly corresponds to the number of examples impacted, which in turn can affect performance.
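As a rough worked example of this contrast (added here for illustration, using the $\mathrm{KL}(q \,\|\, p)$ direction for concreteness), consider a single value whose probability moves from $p_i = 10^{-6}$ to $q_i = 10^{-3}$:

  $|q_i - p_i| \approx 10^{-3}, \qquad q_i \ln\frac{q_i}{p_i} = 10^{-3} \ln 10^{3} \approx 6.9 \times 10^{-3},$

and the KL term grows without bound as $p_i \to 0$, while the $d_1$ contribution remains bounded by $q_i$.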