

arXiv:2008.12138v2 [cs.CY] 7 Aug 2021

HOW TO “IMPROVE” PREDICTION USING BEHAVIOR MODIFICATION

Galit Shmueli
Institute of Service Science
National Tsing Hua University
Hsinchu 30013, Taiwan
[email protected]

Ali Tafti
College of Business Administration
University of Illinois at Chicago
Chicago, Illinois 60607
[email protected]

Abstract

Many internet platforms that collect behavioral big data use it to predict user behavior for internal purposes and for their business customers (e.g., advertisers, insurers, security forces, governments, political consulting firms) who utilize the predictions for personalization, targeting, and other decision-making. Improving predictive accuracy is therefore extremely valuable. Data science researchers design algorithms, models, and approaches to improve prediction. Prediction is also improved with larger and richer data. Beyond improving algorithms and data, platforms can stealthily achieve better prediction accuracy by “pushing” users’ behaviors towards their predicted values, using behavior modification techniques, thereby demonstrating more certain predictions. Such apparent “improved” prediction can unintentionally result from employing reinforcement learning algorithms that combine prediction and behavior modification. This strategy is absent from the machine learning and statistics literature. Investigating its properties requires integrating causal with predictive notation. To this end, we incorporate Pearl’s causal do(.) operator into the predictive vocabulary. We then decompose the expected prediction error given behavior modification, and identify the components impacting predictive power. Our derivation elucidates implications of such behavior modification to data scientists, platforms, their customers, and the humans whose behavior is manipulated. Behavior modification can make users’ behavior more predictable and even more homogeneous; yet this apparent predictability might not generalize when customers use predictions in practice. Outcomes pushed towards their predictions can be at odds with customers’ intentions, and harmful to manipulated users.

Keywords behavior modification · behavioral big data · machine learning · prediction error · causal intervention · internet platforms

1 Introduction: Prediction, Prediction Products, and Behavior Modification

Recent years have seen an incredible growth in predictive modeling of user behavior using behavioral big data in both industry and academia. Behavioral big data are large and highly detailed datasets on human and social actions and interactions (Shmueli, 2017). Predictions based on such data now shape almost every aspect of modern life (Agrawal et al., 2018). Internet platforms, such as Google and Facebook, predict user behavior for internal purposes and for their customers who utilize the predictions for personalization, targeting, and other decision-making. Predicted behaviors of interest to platforms and their customers include probability of purchase, churn, engagement, and even behaviors such as voting intentions and life events such as pregnancy1.

1.1 Prediction Products

In The Age of Surveillance Capitalism2, Zuboff (2019) describes the processes used by several large internet platforms that collect behavioral big data to package the raw material of users’ actively shared data and passively generated data (e.g. location data, video usage, friendship ties) into “prediction products”, which are then sold in “behavioral futures markets” to business customers such as insurance companies, marketers, advertisers, security forces, governments, and political consulting firms. The predictions are used by platform customers to modify and shape users’ behavior toward desired commercial outcomes.3

1 www.nytimes.com/2012/02/19/magazine/shopping-habits.html

2 Zuboff’s (2019) book sounds an alarm about the increasingly intrusive and exploitative practices of digital platform firms. It has been received with both acclaim and criticism. The work takes a condemnatory stance towards the major digital platforms, arguing that their practices are trending not just towards influence but pervasive control of human behavior, and thus pose a great danger to the autonomous and lived experience of humanity. We take a sober and analytical approach, by considering the potential scope and effects of behavioral modification should digital platforms choose to engage in such strategies. Regardless of one’s disposition towards the arguments made in the book, there is a need for better understanding and careful analysis of how the scenarios depicted in that work might play out in practice.

3 www.theguardian.com/technology/2019/jan/20/shoshana-zuboff-age-of-surveillance-capitalism-google-facebook



The more accurate these prediction products, the higher value they provide to the platforms’ customers, and in turn the higher the revenues for the platforms. Hence, platforms have a strong incentive to improve prediction accuracy.

One example of a prediction product is the recently launched Google Analytics “predictive metrics” service that “automatically enriches your data by bringing Google machine-learning expertise to bear on your dataset to predict the future behavior of your users”.4 Another example is Facebook’s “loyalty prediction” service that offers advertisers the ability to target users based on how they will behave, what they will buy, and what they will think.

1.2 Modifying User Behavior

Digital platforms now routinely use behavior modification (BMOD) techniques to change their users’ behaviors both online and offline. Behaviors include clicking an ad, purchasing an item, posting sensitive information, visiting a doctor, voting, and more. BMOD techniques are classically defined as “an observable, replicable and irreducible component of an intervention designed to alter or redirect causal processes that regulate behavior” (Michie et al., 2013). BMOD techniques derive from principles of behaviorist psychology and include nudging, herding, and operant conditioning, among others. The most popular technique is the nudge, defined as “any aspect of the choice architecture that alters people’s behavior in a predictable way without forbidding any options or significantly changing their economic incentives” (Thaler and Sunstein, 2009, p. 6). Zuboff (2019) identifies two more types of behavior modification: herding, which is controlling key elements in a person’s immediate context in order to guide their behavior towards a predictable one; and operant conditioning, a term coined by the famous behavioral psychologist BF Skinner, which uses positive and negative reinforcement to encourage certain behaviors and extinguish others. BMOD techniques integrated into digital platforms are commonly termed persuasive technology (Fogg, 2002), but are often no longer “observable, replicable and irreducible.” Behavioral interventions range in their transparency: some are visible to users, such as chatbots, suggestions by recommender systems, and app notifications, while others are less so, such as A/B testing, feed filtering, comment moderation on social networks, and deceptive interface design choices (Mathur et al., 2019). Platforms employ BMOD to provide personalized services, increase user engagement, “hook” users by habit formation (Eyal, 2014), and generate further behavioral big data, among other purposes. Stanford University’s Behavior Design Lab director BJ Fogg lists seven types of “persuasive” technology tools (Fogg, 2002). While the field of marketing has used behavior modification even prior to the advent of the internet (Nord and Peter, 1980), today’s technologies and big data enable more covert, pervasive, and powerful manipulation due to their networked, continuously updated, dynamic and pervasive nature (Yeung, 2017). Zuboff (2019) explains,

“These interventions are designed to enhance certainty by doing things: they nudge, tune, herd, manipulate, and modify behavior in specific directions by executing actions as subtle as inserting a specific phrase into your Facebook news feed, timing the appearance of a BUY button on your phone, or shutting down your car engine when an insurance payment is late.” (p. 200)

While these examples do not necessarily involve prediction, prediction-based behavior modification is common in recommendation systems, targeted advertising, precision marketing, and other “personalized” interventions intended to cause human users to change their behavior in a specific direction that is beneficial to the intervention initiator: towards longer online engagement, higher purchase propensity, or increased information sharing.

1.3 Combining Prediction and Behavior Modification

Our focus is on the new capability that results from combining prediction and BMOD strategies that are now available to platforms. Zuboff (2019) explains how a platform that uses prediction and behavior modification for its own commercial gain can be in conflict with users’ well-being and agency. Engineering decisions made to maximize profitability can lead to changes in social systems that are drastic, opaque, effectively unregulated, and massive in scale (Bak-Coleman et al., 2021). Historian Yuval Noah Harari suggests that such a combination of powers provides the capability to “hack” human beings5:

“The ability to hack human beings means the ability to understand humans better than they understand themselves. Which means being able to predict their choices. . . to manipulate their emotions, to make decisions for them.”

Scholars from computer science, behavioral science, media studies, and law have pointed out the strong incentive for platforms to “make predictions true”. We are interested in studying the possibilities of “improving” prediction by using BMOD, in the sense of “making predictions come true”.

4 “Purchase Probability, which predicts the likelihood that users who have visited your app or site will purchase in the next seven days. . . Churn Probability, predicts how likely it is that recently active users will not visit your app or site in the next seven days.” https://blog.google/products/marketingplatform/analytics/new-predictive-capabilities-google-analytics/ archived at https://archive.ph/Zms2O

5 The TED Interview: “Yuval Noah Harari reveals the real dangers ahead” https://www.ted.com/talks/the_ted_interview_yuval_noah_harari_reveals_the_real_dangers_ahead, minute 34:15



Commenting on Facebook’s loyalty prediction service,6 Law Professor Frank Pasquale said he

“worried how the company could turn algorithmic predictions into “self-fulfilling prophecies,” since “once they’ve made this prediction, they have a financial interest in making it true.” That is, once Facebook tells an advertising partner you’re going to do some thing or other next month, the onus is on Facebook to either make that event come to pass, or show that they were able to help effectively prevent it (how Facebook can verify to a marketer that it was indeed able to change the future is unclear).”

Computer scientist and AI expert Stuart Russell describes how platforms improve prediction by making human users more predictable (Russell, 2019):

“Content selection algorithms on social media. . . are designed to maximize click-through. The solution is simply to present items that the user likes to click on, right? Wrong. The solution is to change the user’s preferences so that they become more predictable.”

Media theorist Douglas Rushkoff explains that the better the platform is able to make users conform to their algorithmically determined destiny, the more it can “boast both its predictive accuracy and its ability to induce behavior change” (Rushkoff, 2019, p. 69).

Examining the combined prediction and BMOD capabilities from a data science point of view, we study how such a combined strategy optimized to minimize prediction error could affect the different stakeholders: the platform, its customers purchasing prediction products, and the platform users. By describing the scenario in statistical terms, we show that aiming to minimize prediction error can result in misleading platform customers and manipulating humans in possibly dangerous directions. This is especially the case when predicting unwanted or risky behaviors, in the form of risk scores. An extreme example is predicting mental health risk for a healthcare stress reduction app. While the app maker aims to lower the stress of high-risk users, the platform can achieve low prediction error by turning high-risk predictions into high-risk realities.

The goal of this work is not to offer new evidence nor to characterize all the ways in which digital platforms engage in behavioral modification strategies. Given that some firms have already demonstrated both the capabilities and incentives to implement various forms of behavior modification, our goal rather is to introduce a technical vocabulary and notation that enables investigation of such strategies. Technical terminology and notation are needed in order to identify the properties and implications of the behavior modification approach to resulting predictive power. The thought process gives rise to various questions that we shall consider here and that future research can investigate in even more detail, including: Can behavior modification mask poor predictive algorithms? Can one infer from the manipulated predictive power the counterfactual of non-manipulated predictive power? Can customers running routine A/B tests on the platform detect this scheme? What are the roles of personalized predictions and of personalized behavior modifications within the error minimization strategy?

Our goal is to make transparent the effects of behavior modification on predictive power, thereby enabling the study of its impact on business, social, and humanistic aspects, as well as potential implications. In order to enable the analysis and evaluation of the effect of behavior modification on predictive power, we use the do(.) operator by Pearl (2009). This operator is useful for our approach because it expresses manipulation of a variable within a system of causal relations. While do-calculus is well developed for causal effect identification (Pearl, 2009), the challenge here lies in integrating the causal do(.) operator into the existing correlation-based predictive framework.

The paper proceeds as follows. In Section 2 we describe the traditional statistical and machine learning approach to reducing prediction error. Section 3 introduces the new approach to reducing prediction error that relies on behavior modification. In Section 4 we formalize the approach using statistical language and notation and analyze its implications. Section 5 discusses implications, from technical and business implications to humanistic and societal ones. Section 6 provides conclusions.

2 Reducing Prediction Error by Improving Prediction

The fields of machine learning and statistics have been introducing new and improved models, algorithms, approaches, and even data, aimed at improving predictive power. Approaches such as regularization, boosting, and ensembles have proven highly useful in generating more precise predictions. From transparent regression models and tree-based algorithms, to more black-box support vector machines, k-nearest neighbors, neural nets and especially deep learning algorithms, their justification and adoption lies in their ability to capture intricate signals linking inputs and a to-be-predicted output.

Predictive performance is typically measured by out-of-sample prediction errors, which compare predicted values with actual values for new observations. More formally, the prediction error $e_i$ for record $i$ is defined as the difference between the actual outcome value $y_i$ and its prediction $\hat{y}_i$, that is $e_i = y_i - \hat{y}_i$. For a sample of $n$ records,

6https://theintercept.com/2018/04/13/facebook-advertising-data-artificial-intelligence-ai/



Figure 1: Prediction error with no behavior modification (top) vs. with behavior modification (bottom). Manipulating platform behavior do(B) pushes the observed user behavior towards its predicted value. Note that only orange arrows denote causal effects; squiggly black arrows denote a correlation-based predictive relationship.

we have a set of actual outcome values $\vec{y} = [y_1, y_2, \ldots, y_n]$, a set of predicted outcome values $\hat{\vec{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]$, and a set of prediction errors $\vec{e} = [e_1, e_2, \ldots, e_n]$. For each record $i$ we also have predictor information in the form of $p$ measurements $\vec{x}_i = [x_{i,1}, \ldots, x_{i,p}]$. The predictor information for the $n$ records is contained in the matrix $X$. Predicted values are obtained from $\hat{f}$, the algorithm trained on (or model estimated from) data on inputs $X$ and actual outcomes $\vec{y}$, so that $\hat{\vec{y}} = \hat{f}(X)$.
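To make this notation concrete, here is a minimal sketch of computing out-of-sample prediction errors on synthetic data, assuming scikit-learn is available; the data-generating process and all coefficients are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: out-of-sample prediction errors e_i = y_i - y_hat_i.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # predictor matrix X (n x p)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=1000)

# Train f_hat on the training partition only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
f_hat = LinearRegression().fit(X_train, y_train)

# Evaluate on the holdout ("new") observations
y_hat = f_hat.predict(X_test)
e = y_test - y_hat                                  # vector of prediction errors
print("out-of-sample RMSE:", np.sqrt(np.mean(e ** 2)))
```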

2.1 Targets of statistical and machine learning efforts

Predictive algorithms and methods are designed and tuned to minimize some aggregation of the error values ($\vec{e}$) by operating on the predicted values ($\hat{\vec{y}}$). Improving predicted values is typically achieved by improving three components:

1. structure of the algorithm/model $\hat{f}$ that relates the predictor information $X$ to the outcome (e.g., new algorithms and methods),

2. estimation/computation of $\hat{f}$,

3. quality and quantity of predictor information $X$. Larger, richer behavioral datasets have been shown to improve predictive accuracy (Martens et al., 2016).

Towards this end, companies such as Google, Facebook, Uber, Netflix, and Amazon have been investing in improving prediction algorithms through collecting, buying, storing and processing unprecedented amounts and types of data. They have also hired top data science talent, acquired AI companies, and developed in-house predictive algorithms, platforms, and computational infrastructure.

In all these approaches, the actual outcome values $\vec{y}$ are considered fixed. The top panel of Figure 1 illustrates the statistical and machine learning approach of improving the above three components in order to minimize prediction error.

2.2 Components affecting predictive power: dissecting the expected prediction error

When predicting an outcome $y$ that is not expected to be manipulated between training and deployment, we anticipate prediction error due to the inability of the model $\hat{f}$ to (1) correctly capture the underlying $f$ even with unlimited training data (bias), (2) precisely estimate $f$ due to insufficient data (variance), and (3) capture the errors of individual observations $\vec{\varepsilon}$ (noise). For predicting a numerical outcome or probability for a new observation, these three sources are formalized through a bias-variance decomposition of the expected prediction error (EPE), using squared-error loss7 (Geman et al., 1992):

$$EPE(\vec{x}) = E\left[\left(y|\vec{x} - \hat{f}(\vec{x})\right)^2\right] = E(\varepsilon^2) + \left(f(\vec{x}) - E[\hat{f}(\vec{x})]\right)^2 + E\left[\left(\hat{f}(\vec{x}) - E[\hat{f}(\vec{x})]\right)^2\right] = \sigma^2 + Bias^2(\hat{f}(\vec{x})) + Var(\hat{f}(\vec{x})). \quad (1)$$

7 Assuming the underlying model $y|\vec{x} = f(\vec{x}) + \varepsilon$, where $\varepsilon$ has zero mean and variance $\sigma^2$.
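A Monte Carlo sketch of Equation (1): under an assumed synthetic setting (quadratic true $f$, a deliberately biased linear $\hat{f}$), repeatedly retraining $\hat{f}$ lets us estimate the noise, bias, and variance components at a point $x_0$ and check that they sum to the directly simulated EPE.

```python
# Monte Carlo check of EPE = sigma^2 + Bias^2 + Var at a fixed point x0.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2              # assumed true function
sigma = 0.5                       # assumed noise standard deviation
x0, n, reps = 1.5, 100, 2000      # evaluation point, training size, repetitions

preds = np.empty(reps)
sq_errs = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)                 # fresh training sample
    y = f(x) + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)              # fit a (biased) linear f_hat
    preds[r] = b1 * x0 + b0                   # f_hat(x0) for this training set
    y0 = f(x0) + rng.normal(0, sigma)         # a fresh test outcome at x0
    sq_errs[r] = (y0 - preds[r]) ** 2

bias2 = (f(x0) - preds.mean()) ** 2           # Bias^2(f_hat(x0))
var = preds.var()                             # Var(f_hat(x0))
print("EPE (direct):          ", sq_errs.mean())
print("sigma^2 + Bias^2 + Var:", sigma ** 2 + bias2 + var)  # should roughly match
```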



In statistics and machine learning, prediction is based on an assumption of continuity, where the predicted observations come from the same underlying processes and environment as the data used for training the predictive algorithm and testing its predictive performance. The deterministic underlying function $f$ and the random noise distribution are both assumed to remain unchanged between the time of model training and evaluation and the time of deployment. This assumption underlies the practice of randomly partitioning the data into separate training and test sets (or into multiple “folds” in cross validation), where the model is trained on the training data and evaluated on the separate test data. Of course, the continuity assumption is often violated to some degree depending on the distance (temporal, geographical, etc.) between the training/test data and the to-be-predicted data and how fast or abruptly the environment changes between these two contexts. These “dataset shift” challenges (see e.g. Moreno-Torres et al., 2012; Quionero-Candela et al., 2009) can increase prediction errors beyond the disparity observed between training and test prediction errors. Hence, predictive power based on the test data might provide an overly optimistic estimate compared to actual performance at deployment.

3 Reducing Prediction Error by Modifying User Behavior (“Improving" Prediction)

Platforms now have the incentive and technology8 to minimize prediction errors in a direction that is absent from academic prediction research: by modifying actual behavior ($\vec{y}$).

BMOD techniques can be used to minimize prediction error by pushing users’ behaviors towards their predicted values, thereby “improving” apparent prediction capability. This can be achieved via a two-step process:

1. Predict a user’s behavior, then

2. Modify the user’s behavior towards the predicted value.9

The two enabling mechanisms for such a process are that (1) platforms have a plethora of powerful and tested BMOD tools, and (2) BMOD techniques are designed to modify behavior in a predictable way – here “pushing” outcome values towards their predicted values – in order “to shape individual, group, and population behavior in ways that continuously improve their approximation to guaranteed outcomes” (Zuboff, 2019, p. 339).
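A minimal sketch of this two-step process on synthetic data; the modification model (shifting each realized outcome a fraction push of the way toward its prediction) and all coefficients are illustrative assumptions.

```python
# Step 1: predict; Step 2: push realized behavior toward the prediction.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=2.0, size=n)   # un-manipulated behavior

y_hat = 1.5 * x                               # step 1: a mediocre predictive model
push = 0.7                                    # assumed BMOD strength in [0, 1]
y_tilde = y + push * (y_hat - y)              # step 2: behavior pushed toward y_hat

print("RMSE without BMOD:", np.sqrt(np.mean((y - y_hat) ** 2)))
print("RMSE with BMOD:   ", np.sqrt(np.mean((y_tilde - y_hat) ** 2)))
# The second RMSE is smaller by construction: prediction was "improved"
# without touching the algorithm or the data.
```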

We note that this ‘improved’ prediction strategy, which combines prediction and behavior modification, differs from adversarial attacks, where an attacker interferes in the training or prediction stages (influencing the training data, prediction model, or predicted values) (Huang et al., 2011). It also differs from gaming by strategic users, who manipulate their behavior (input data) to obtain a favorable prediction (e.g. Hardt et al., 2016; Munro, 2020; Frankel and Kartik, 2019). The first key difference is that adversarial attacks and gaming are aimed at modifying predictions, whereas the strategy we describe modifies the actual behavior. In this scenario, rather than directly manipulating data, the platforms influence the behavioral processes that generate the data. Secondly, in adversarial attacks and gaming, the intervention is performed by an attacker or a user, whereas in our case both prediction and behavior modification are performed by the platform.

3.1 BMOD via reinforcement learning algorithms

In theory, the financial incentives and technical capabilities of internet platforms might entice platform management or a data scientist under pressure to showcase performance to engage in this prediction “improvement” strategy. But more likely, such a strategy might be implemented unintentionally on a platform via the now-popular reinforcement learning (RL) algorithms, which combine prediction, behavior modification, and user feedback to optimize a predetermined objective function. Setting the objective function to minimize prediction error can lead to such apparent “improved” prediction.

RL is a third paradigm of machine learning—different from supervised and unsupervised approaches—where an “agent” learns to perform a task by interacting with an unknown dynamic environment. It is considered a machine learning approach because the agent takes a series of decisions to maximize the cumulative reward for a predefined task without being explicitly programmed to achieve the task. In other words, the goal is for the agent to learn the optimal behavior through repeated trial-and-error interactions with the environment without human involvement (MathWorks, 2021). An RL agent performs actions on the environment and receives observations and a reward from the environment, where the reward is a simple pre-specified function intended to measure how successful an action is with respect to completing the task.

In the case of RL on platforms, the unknown dynamic environment is the user. RL formalizes human-machine interaction histories as sequences of {state, action, reward} trajectories, generated while interacting in real-time

8 For example, Facebook’s “AI backbone” FBLearner Flow combines machine learning and experimentation capabilities that can be applied to the entire Facebook userbase https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/ archived at https://archive.ph/P4kkE

9 While the BMOD in step 2 might itself involve prediction, such a prediction would be conditional on a platform action. For example, a recommender system might compute predictions of user actions given a certain recommendation. In contrast, the step 1 prediction is not conditional on a platform action.



Figure 2: Schematic of reinforcement learning as applied to users of an internet platform.

with users as well as from previously collected user interactions (Sutton and Barto, 2018). The RL agent’s goal is to intervene in its environment of human users to learn an optimal policy maximizing the accumulation of designer-specified rewards (e.g., clicks). Figure 2 illustrates the operation of RL in a platform context. RL algorithms govern personalized interventions on many platforms. For example, recommender systems used by popular commercial platforms such as TikTok, Pandora, Instagram, and YouTube use interactive user data to select an optimal recommendation policy for a given user (Zhou et al., 2020; Chen et al., 2019). LinkedIn uses multi-armed bandits, a simplified type of RL, for automating ad placement decisions (Tang et al., 2013), and Yahoo! uses contextual multi-armed bandits for personalized news recommendations (Li et al., 2010).
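A toy sketch of how this can happen unintentionally: an epsilon-greedy multi-armed bandit (the simplified form of RL mentioned above) whose reward is negative squared prediction error drifts toward whichever platform action pushes user behavior closest to the predicted value. The user-response model and all numbers are assumptions for illustration, not estimates of any real platform.

```python
# Epsilon-greedy bandit whose reward is prediction accuracy.
import numpy as np

rng = np.random.default_rng(3)
y_hat = 5.0                          # the platform's prediction for this user
baseline = 2.0                       # assumed mean behavior with no intervention
arm_effects = [0.0, 1.5, 3.0]        # assumed mean behavior shift of each action
eps, T = 0.1, 5000

q = np.zeros(3)                      # estimated mean reward of each arm
counts = np.zeros(3)
for t in range(T):
    a = rng.integers(3) if rng.random() < eps else int(np.argmax(q))
    y_obs = baseline + arm_effects[a] + rng.normal(scale=1.0)
    reward = -(y_obs - y_hat) ** 2          # reward = "make the prediction true"
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]     # incremental mean update

print("estimated arm values:", np.round(q, 2))
print("learned action:", int(np.argmax(q)))  # arm 2, which pushes behavior to ~y_hat
```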

3.2 Three Predict-and-Modify Scenarios

Consider an insurance company interested in acquiring new customers, but trying to avoid high-risk people. Now consider an internet platform interested in selling to the insurance company the risk scores of their users. To showcase their predictive power, the platform can employ the two-step process of generating risk scores for a set of users, and then using behavior modification to “push” users’ behaviors towards their predicted scores (e.g. encouraging/discouraging engagement with the app during driving; displaying/avoiding ads for alcoholic beverages during work hours). Such a strategy would turn high-risk predictions into high-risk realities. Note that in this scenario, the gap between the platform’s goal (showcase accurate predictions) and the insurance company’s objective (avoid high-risk people) leads to pushing high-risk users’ behaviors in a direction that is not only ethically dubious but also at odds with the insurance company’s interests.

Another scenario is a political consulting firm interested in reaching likely voters for candidate A. An internet platform interested in selling predicted “Voter for Candidate A” scores might prove their predictive power by the two-step process of generating “Voter for Candidate A” predictions, and then using behavior modification to push users’ behaviors towards/away from voting for A. Hence, the platform’s strategy would turn voting predictions into voting realities. In this scenario, aside from serious ethical and legal implications, the platform’s strategy might be in line with the political consulting firm’s goal if the firm is trying to promote voting for A, but at odds if the firm is non-partisan, or trying to promote a different candidate.

A third scenario is a government interested in predicting which citizens have a high risk of becoming severely ill with COVID-19, for purposes of public health (e.g. vaccination priority). An internet platform interested in selling “Covid-19 susceptibility” scores can showcase their predictive power by the two-step process of generating scores for a set of their users, and then using BMOD to push users towards/away from infection by encouraging/discouraging physical proximity with others and travelling to crowded locations. In this scenario too, the platform’s BMOD of users with high Covid-19 susceptibility scores not only harms the individual users and their community, but is in opposition to the government’s public health protection goal.

3.3 Dataset shift that reduces prediction error

When prediction is not followed by behavior modification, differences between the training and deployment environments introduce uncertainty, typically by increasing bias and/or changing the noise distribution. As described earlier, such dataset shift challenges can cause larger prediction errors at deployment, and are therefore a major challenge in adversarial attacks and gaming scenarios, which involve “mischief during training or prediction” (Huang et al., 2011). In contrast, in behavior modification, training and deployment are made closer by design. While dataset shifts arising from uncontrollable and unforeseeable conditions increase uncertainty, behavior modification intends to shift actual outcome values $y$ closer to $\hat{y}$, thereby reducing uncertainty. For example, when predicting that a user is likely to become depressed, displaying depressing news, friends’ posts, and depression-related ads increases that user’s chance of depression (Facebook’s emotional contagion experiment by Kramer et al. (2014) displays such capability). When predicting the arrival time of a delivery, incentivizing faster/slower driving can increase the accuracy of the predicted arrival time. Displaying donation amounts by “friends” with amounts similar



to the user’s predicted amount can increase the chance the user donates the predicted amount (e.g., Nook et al., 2016).

4 Formalization and Analysis

To study prediction error under the prediction-based behavior modification strategy, it is useful to decompose the new form of expected prediction error (under behavior modification) into separate meaningful sources. This helps identify components such as bias, variance, and noise. However, we need a technical vocabulary that can encode both correlation-based prediction and causal-based actions.

The challenge is that the standard notation and terminology used in statistics and machine learning for predictive modeling is insufficient for formalizing the problem of minimizing prediction error by intentionally manipulating the actual outcome values by way of behavior modification. The bottom panel of Figure 1 illustrates this new scenario. Specifically, predictive terminology conveys correlation-based relationships, but is not well-suited for denoting intentional manipulations. Figure 1, which includes both causal arrows (orange) and a correlational connector (depicted as a squiggly black arrow, but with no causal interpretation10) is incoherent in the world of causal diagrams, as well as in the world of prediction. We therefore propose to integrate causal notation into the existing predictive terminology and context in a parsimonious way. We do this by adopting the do(.) operator by Pearl (2009), where do(B) denotes that variable $B$ is not simply observed but rather intentionally modified.11 This allows us to incorporate intentional behavioral modification into the predictive modeling context. We then use this notation to decompose the expected prediction error, in order to identify the different components that affect predictive power.

4.1 Notation

Step 1 (behavior prediction) uses the ordinary notation $\hat{y}$ to denote the predicted user behavior, given the user’s profile $\vec{x}$ (which can also include contextual information such as location and time-of-day). The predictions are generated by a predictive algorithm aimed at capturing the predictive (correlation-based) relationship between the outcome and the predictors from a training dataset:

$$y|\vec{x} = f(\vec{x}) + \varepsilon. \quad (2)$$

Using the trained predictive algorithm $\hat{f}$, a predicted outcome can be generated for a new observation $i$:

$$\hat{y}_i|\vec{x}_i = \hat{f}(\vec{x}_i). \quad (3)$$

Step 2 (behavior modification) requires introducing interventional notation. We use do(B) to denote the intentionally-modified platform behavior. Specifically, we use Pearl’s “intervention as variable” formulation (Pearl, 2009, Section 3.2.2) that conceptualizes the intervention as an external force that alters the function between $B$ and $X$. The important advantage of this formulation (over the simpler conceptualization of forcing $B$ to take on a fixed value) is that “it contains information about richer types of interventions and is applicable to any change in the functional relationship $f_i$ and not merely to the replacement of $f_i$ by a constant” (p. 71).

Figure 3 illustrates the two steps (generating a prediction, and modifying behavior toward the prediction) using causal diagrams. Nodes represent the different variables. The arrows mean that the node with incoming arrow(s) is a function of the nodes pointing to it. Next, we denote the manipulated behavior as $\tilde{y}$. We note that it is incorrect to write $\tilde{y} \doteq do(y)$ because the user’s outcome $y$ is not directly manipulated. Instead, the modified behavior is fully mediated: the platform tailors its behavior $B$ ($do(B)$ or personalized $do(B_i)$) to manipulate the user’s mental state (emotion, feeling, mood, thought, etc.), which in turn leads to the modified behavior $\tilde{y}_i$. This modification is specifically aimed at pushing the outcome towards its prediction, and thus $do(B)$ inherently uses $\hat{y}$. Using $\sim$ on top of terms affected by $do(B)$, we therefore write12

$$\tilde{y}_i \doteq y_i|do(B_i), \vec{x}_i, \quad (4)$$

where $B_i = h(\hat{y}_i, \vec{x}_i) + \upsilon_i$ and $\upsilon_i$ is an error term. Including $\vec{x}_i$ in $h(\cdot)$ represents personalized behavioral modification, based on the user’s specific predictor information $\vec{x}_i$ (e.g., user $i$’s browsing history, demographics,

10 We chose to use a single-headed squiggly arrow rather than a bi-directional straight arrow for the correlation-based predictive relationship to convey the asymmetric input-output roles of $X$ and $y$. In causal diagrams, bi-directional arrows convey an unobservable variable affecting the two variables at the arrowheads, and there is no way to represent an asymmetric correlation-based predictive relationship.

11 “The do(x) operator is a mathematical device that helps us specify explicitly and formally what is held constant, and what is free to vary” (Pearl, 2009, p. 358).

12 Expressions using the do(.) operator are typically written within causal queries expressed in the form of conditional expectations or probabilities, such as $E[y_i|do(B_i), \vec{x}_i]$. We take some license to show it outside of that usual casing because our focus here is on representing the manipulated outcome rather than estimating a causal effect.



(a) Generating a predicted value

(b) Modifying behavior towards the predicted value

Figure 3: Causal diagrams representing the two steps of (a) generating a predicted value and (b) modifying behavior towards the predicted value. The double-headed dotted arrow represents correlation between $X$ and $y$ that occurs by means of their being influenced by a common set of unobserved or latent variables.

location). Note that $\vec{x}_i$ can impact the modified behavior both directly and indirectly via the platform’s personalized BMOD. In fact, $\vec{x}_i$ is a collection of variables; some might be used to personalize $B_i$ and others (or even the same ones) determine the effect of $B_i$ on $\tilde{y}_i$. This causal relationship is graphically depicted in the bottom diagram in Figure 3.

For the modified outcome, we use $f_{do}$ to denote the underlying function, which can be a completely different function from $f$:

$$\tilde{y}_i = f_{do}(do(B_i), \vec{x}_i) + \tilde{\varepsilon}_i = g(do(B_i), \vec{x}_i) + \tilde{\varepsilon}_i, \quad (5)$$

where $\tilde{\varepsilon}_i$ has mean 0 and variance $\tilde{\sigma}^2$.

We note that the quantity $E(\tilde{y}_i|\vec{x}_i) - E(y_i|\vec{x}_i) = E(\tilde{y}_i - y_i|\vec{x}_i)$ is called the (population) Conditional Average Treatment Effect (CATE) (Athey and Imbens, 2016; Imbens and Rubin, 2015) or Individual Treatment Effect (ITE) (Shalit et al., 2017) and is of key interest in treatment effect estimation and testing.13
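A minimal simulation sketch instantiating Equations (2)-(5) and the CATE on synthetic data; the functions f, f_hat, h, and g below are illustrative assumptions, chosen only to make the objects concrete.

```python
# Objects of Section 4.1: y, y_hat, do(B) = h(y_hat, x), y_tilde, and CATE(x).
import numpy as np

rng = np.random.default_rng(4)
n = 10000
x = rng.normal(size=n)

f = lambda x: 1.0 + 2.0 * x                  # true no-manipulation function
y = f(x) + rng.normal(scale=1.0, size=n)     # Eq. (2): y | x

f_hat = lambda x: 0.5 + 2.2 * x              # a trained (imperfect) model
y_hat = f_hat(x)                             # Eq. (3): y_hat | x

h = lambda y_hat, x: 0.8 * (y_hat - f(x))    # personalized do(B_i) = h(y_hat_i, x_i)
B = h(y_hat, x) + rng.normal(scale=0.1, size=n)

g = lambda B, x: f(x) + B                    # assumed post-intervention function
y_tilde = g(B, x) + rng.normal(scale=1.0, size=n)   # Eq. (5): manipulated outcome

# CATE(x) = E(y_tilde - y | x); here it should track 0.8 * (f_hat(x) - f(x))
mask = np.abs(x - 1.0) < 0.1                 # users with profile x near 1
print("estimated CATE at x=1:  ", (y_tilde - y)[mask].mean())
print("theoretical CATE at x=1:", 0.8 * (f_hat(1.0) - f(1.0)))
```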

The prediction errors after step 2 are obtained by comparing the predicted values $\hat{y}$ to the manipulated outcomes $\tilde{y}$. Table 1 provides the short notation, full notation and description for each of the above terms.14 Together with Equations (4)-(5), we now have a sufficient vocabulary for examining the prediction error under behavior modification.

13 Note that $y_i|\vec{x}_i$ assumes no manipulation. Pearl (2009, p. 70-72) offers an alternative formulation to encode manipulation vs. no manipulation, by adding a binary intervention indicator $I_B$ that obtains values in $\{do(B_i), idle\}$. In our case $I_B = idle$ for $y_i|\vec{x}_i$.

14 It is possible to use Rubin’s potential outcomes notation intended for estimating treatment effects (e.g. Imbens and Rubin, 2015, p. 33). This requires defining $B = \{0, 1, 2, \ldots\}$ as the intervention assignment, and denoting by $y_i(B)$ the outcome, where $y_i(0)$ is the un-manipulated outcome. The quantity $y_i|do(B), \vec{x}$ is written as $y_i(B)|B, \vec{x}$. We prefer the do(.) operator since it conveys the causal nature of the manipulation $B$ and clearly differentiates it from the correlation-based prediction components $\vec{x}$. Another possibility is using the counterfactual notation in Pearl et al. (2016, chapter 4) recently used in algorithmic fairness research, e.g. Zhang and Bareinboim (2018); Kusner et al. (2017). While the counterfactual notation is more general than the do(.) operator, since we focus on interventions that do not require counterfactual notions, the do(.) operator provides a more parsimonious and elegant choice.



Table 1: Short and full notation

Short notation | Full notation/definition | Description
$y_i$ | $y_i|\vec{x}_i = f(\vec{x}_i) + \varepsilon_i$ | Outcome under no manipulation
$f$ | $f(\vec{x}) = E(y|\vec{x})$ | True function under no manipulation
$\hat{y}$ | $\hat{y}|\vec{x} = \hat{f}(\vec{x})$ | Predicted outcome under no manipulation
$\sigma^2$ | $Var(\varepsilon) = E(\varepsilon^2)$ | Noise variance under no manipulation
$f_{do}$ | $g(do(B), \vec{x}) = E(\tilde{y}|do(B), \vec{x})$ | True function under do(B)
$\tilde{y}_i$ | $\tilde{y}_i|do(B_i), \vec{x}_i = g(do(B_i), \vec{x}_i) + \tilde{\varepsilon}_i$ | Manipulated outcome
$\tilde{\sigma}^2$ | $Var(\tilde{\varepsilon}) = Var(\tilde{y}) = E(\tilde{\varepsilon}^2)$ | Noise variance under do(B)

4.2 Expected prediction error of modified outcomes ($\widetilde{EPE}$)

When outcome values are intentionally “pushed” towards their predicted values, it is intuitive that the resulting expected prediction error will be lower than for the no-manipulation outcome values.15 We can now formalize the following questions: Given a specific prediction algorithm $\hat{f}$, trained on data with no behavior modification $(X, \vec{y})$, when will the expected prediction error for a manipulated user with predictors $\vec{x}$ and manipulation $do(B_i) = b$ be lower than if the user was not manipulated? That is, for a p-norm loss function $L_p$, when will we get

$$E\left[L_p\left(\tilde{y}, \hat{f}(\vec{x})\right)\right] < E\left[L_p\left(y, \hat{f}(\vec{x})\right)\right]? \quad (6)$$

When might the manipulation lead to worse predictive power?

To answer these questions, we proceed to break down the $\widetilde{EPE}$ into several non-overlapping components. Using the standard $L_2$ loss function, we can obtain the EPE under behavior modification as follows (the full derivation is given in A):

$$\widetilde{EPE}(\vec{x}) = E\left(\tilde{y}|do(B), \vec{x} - \hat{f}(\vec{x})\right)^2 = \tilde{\sigma}^2 + \left[CATE(\vec{x}) + Bias(\hat{f}(\vec{x}))\right]^2 + Var(\hat{f}(\vec{x})). \quad (7)$$
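The full derivation appears in A; a condensed sketch, assuming $\tilde{\varepsilon}$ has mean zero and is independent of the trained $\hat{f}$, and writing $g \equiv g(do(B), \vec{x})$, $f \equiv f(\vec{x})$, $\hat{f} \equiv \hat{f}(\vec{x})$:

\begin{align*}
\widetilde{EPE}(\vec{x}) &= E\big(\tilde{y} - \hat{f}\big)^2 = E\big(\tilde{\varepsilon} + g - \hat{f}\big)^2 \\
&= E(\tilde{\varepsilon}^2) + E\big(g - \hat{f}\big)^2 \\
&= \tilde{\sigma}^2 + \big(g - E[\hat{f}]\big)^2 + Var(\hat{f}) \\
&= \tilde{\sigma}^2 + \big[(g - f) + (f - E[\hat{f}])\big]^2 + Var(\hat{f}) \\
&= \tilde{\sigma}^2 + \big[CATE(\vec{x}) + Bias(\hat{f}(\vec{x}))\big]^2 + Var(\hat{f}(\vec{x})),
\end{align*}

where the cross term in the second line vanishes because $E(\tilde{\varepsilon}) = 0$ and $\tilde{\varepsilon}$ is independent of $\hat{f}$, and the third line adds and subtracts $E[\hat{f}]$.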

Each of the terms in Equation (7) has an interesting meaning and different implications on the effect of behavior modification on $\widetilde{EPE}$. The additive nature of this formulation provides insights on the roles of data size, predictive algorithm properties, and behavior modification qualities. By comparing $\widetilde{EPE}(\vec{x})$ to $EPE(\vec{x})$ (the manipulated and non-manipulated scenarios), we can see the following:

• Data size: Whether manipulating or not, data size affects EPE via the variance of the predictive algorithm,16 indicating that larger training samples can improve not only predictions, but also the average manipulated prediction error. Pushing the outcome towards a more stable prediction leads to smaller errors.

• Magnitude of behavior modification effect: The second term shows the role of the average behavior modification magnitude (CATE) in countering the bias of the predictive algorithm. This term is minimized when $CATE = -Bias(\hat{f})$, that is, when, on average, $do(B)$ pushes the user’s behavior in a direction and magnitude that exactly counters the prediction algorithm’s bias. Thus, an effective behavior modification can improve predictive power by combating the predictive algorithm’s bias, as long as $0 < CATE < -2Bias$ or $-2Bias < CATE < 0$.

• Noise (homogeneity of prediction errors): Compared to $\sigma^2$ in the no-manipulation EPE, the first term in $\widetilde{EPE}$ is $\tilde{\sigma}^2$, the noise variance under behavior modification. This means behavior modification can also decrease/increase the variability of prediction errors (or equivalently, of $\tilde{y}$ relative to $\hat{y}$) across different users, as the numerical sketch below illustrates.
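A quick numerical sketch checking Equation (7); the bias, CATE, variance, and noise values below are assumed for illustration, with $CATE = -Bias$ so that the middle term vanishes.

```python
# Numerical check of Equation (7) on simulated (y_tilde, f_hat) pairs.
import numpy as np

rng = np.random.default_rng(5)
f_x, bias, var_fhat = 3.0, 0.4, 0.05   # f(x), Bias(f_hat(x)), Var(f_hat(x))
cate, sigma_t = -0.4, 0.8              # CATE(x) = -Bias minimizes the middle term
reps = 200000

f_hat = (f_x - bias) + rng.normal(0, np.sqrt(var_fhat), reps)  # E[f_hat] = f - Bias
y_tilde = (f_x + cate) + rng.normal(0, sigma_t, reps)          # E[y_tilde] = f + CATE

epe_direct = np.mean((y_tilde - f_hat) ** 2)
epe_formula = sigma_t ** 2 + (cate + bias) ** 2 + var_fhat
print(epe_direct, epe_formula)          # the two should agree closely
```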

4.3 Modification Strategies

We now examine the trade-offs and implications of the four $\widetilde{EPE}$ sources ($\tilde{\sigma}$, $CATE$, $Bias(\hat{f})$, $Var(\hat{f})$) on the expected prediction error. The resulting insights help determine the ideal modification under different scenarios of algorithm bias and variance.

15 In the prediction minimization process all subjects are initially not B-manipulated and later a sample is B-manipulated using personalized modifications.

16 The machine learning “bias” is asymptotic in sample size: an algorithm is biased “if no matter how much training data we give it, the learning curve will never reach perfect accuracy” (Provost and Fawcett, 2013).



Figure 4: Hypothetical prediction of risky driving behaviors given distance, by a ride-sharing platform. Illustrates the effect of behavior modification on shifting the average outcome by $CATE = -Bias$ (blue circles) or on both shifting the average outcome and shrinking the variance $\tilde{\sigma}^2$ (red circles). Yellow ‘X’ symbols represent predicted values $\hat{f}(x_i)$. (The schematic assumes a very large training sample, and thus $\hat{f} \approx E(\hat{f})$.)

To this end, we provide a simplistic example: A ride-sharing or social media platform can modify drivers’ behavior by manipulating the driver’s engagement with their app while driving. Figure 4 shows a scatter plot of driver risk (y-axis) vs. daily distance (x-axis) for a fictitious sample of drivers (hollow circles). Suppose predictions are drivers’ risk scores in terms of risky driving behaviors and the goal is to minimize the (squared) differences between the predicted and actual values. Suppose that risk is a quadratic function of driving distance, but that the predictive model estimates a linear relationship. The linear model’s predicted risk scores are shown as yellow ‘x’ in Figure 4. We consider three specific distances: $x_1$, $x_2$, and $x_3$. While $\hat{f}$ is a biased algorithm, for $x_1$ there is no bias, for $x_2$ the bias is negative, and for $x_3$ the bias is positive (for simplicity, the schematic assumes a very large training sample and thus $\hat{f} \approx E(\hat{f})$). Next, the blue and red circles represent two types of BMOD: the first BMOD affects only CATE while the second affects both CATE and $\tilde{\sigma}$.

Scenario 1: Low-bias $\hat{f}$ trained on a very large sample.

This scenario would be akin to deep learning algorithms applied to massive training data. The very large sample means $Var(\hat{f}) \approx 0$ and $Bias(\hat{f})$ is very small. The strategy of setting $CATE = -Bias(\hat{f})$ is optimal if the behavior modification also decreases error heterogeneity so that $\tilde{\sigma}^2 \leq \sigma^2$. Because the bias is low, the optimal behavior modification should have a small effect. In Figure 4, $\hat{f}(x_1)$ has no bias, and therefore applying behavior modification to drivers with distance $x_1$ will introduce bias, and is only useful if it can sufficiently shrink the variability of the resulting risky behaviors (red circles).

Scenario 2: High-bias $\hat{f}$ trained on a very large sample.

High-bias algorithms include models with relatively few parameters (e.g. naive Bayes, linear regression, shallow trees, and k-NN with large k). As in scenario 1, here too $Var(\hat{f}) \approx 0$. The strategy of setting $CATE = -Bias(\hat{f})$ (e.g. in Figure 4 increasing average risky behaviors for $x_2$ by $|Bias(\hat{f}(x_2))|$ and decreasing it for $x_3$ by $|Bias(\hat{f}(x_3))|$) is optimal if the behavior modification does not increase error heterogeneity, so that $\tilde{\sigma}^2 \leq \sigma^2$. While a small modification effect (in the right direction) can help counter bias, the ideal modification effect must be as large as the bias. Note that $\widetilde{EPE}$ is computed for a specific $\vec{x}$, and therefore generalizing the above rule to any $\vec{x}$ requires either assuming homoskedastic errors $\tilde{\varepsilon}$, or that the inequality holds for all $\vec{x}$ ($\forall \vec{x}: \tilde{\sigma}^2_{\vec{x}} \leq \sigma^2_{\vec{x}}$). A worked sketch of this setting follows.
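A worked sketch of the Figure 4 setting under assumed functions: quadratic true risk, a high-bias linear $\hat{f}$ fit to a very large sample (so $Var(\hat{f}) \approx 0$), and a BMOD that sets $CATE(x) = -Bias(\hat{f}(x))$ at each distance.

```python
# Optimal BMOD magnitude under a high-bias linear model (Figure 4 style).
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: 0.5 * x ** 2                 # assumed true risk vs. daily distance
x = rng.uniform(0, 10, 100000)             # very large sample => Var(f_hat) ~ 0
y = f(x) + rng.normal(0, 1.0, x.size)      # sigma = 1
b1, b0 = np.polyfit(x, y, 1)               # high-bias linear model
f_hat = lambda x: b1 * x + b0

for x0 in (2.0, 5.0, 8.0):                 # three distances, like x1, x2, x3
    bias = f(x0) - f_hat(x0)               # Bias = f - E[f_hat] (variance ~ 0)
    cate = -bias                           # the optimal modification magnitude
    epe_no_bmod = 1.0 + bias ** 2          # sigma^2 + Bias^2
    epe_bmod = 1.0 + (cate + bias) ** 2    # assumes sigma_tilde = sigma
    print(f"x={x0}: Bias={bias:+.2f}, EPE {epe_no_bmod:.2f} -> {epe_bmod:.2f}")
```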

Scenario 3: High-variance $\hat{f}$.

If the predictive model has high variance and is estimated on a relatively small sample, then potential minimization of $\tilde{\sigma}^2$ and/or $[CATE + Bias(\hat{f})]^2$ by way of behavior modification might be negligible relative to $Var(\hat{f})$. Because behavior modification is based on “pushing” behavior towards $\hat{f}(\vec{x}_i)$, a highly volatile $\hat{f}$ might result in erratic $do(B_i)$ modifications in terms of magnitude or even direction. Hence the choice of predictive model or algorithm can be detrimental to the effectiveness of behavior modification.



5 Discussion

We described a new strategy that platforms can use for reducing prediction error, which is completely different from the approaches taken by the fields of statistics and machine learning. This strategy combines prediction with behavior modification, and therefore formalizing it into technical language requires supplementing predictive notation with causal terminology. Using the do(.) operator, we are able to describe the entire system that includes the training dataset, the predictive algorithm, and the behavior modification.

While our $\widetilde{EPE}$ formula also applies to behavior modification for commercial benefit (e.g. advertising), we have focused on the more general case which can lead to dangerous and perhaps unintentional outcomes that not only harm users but also platform customers. Such an outcome can be achieved intentionally by a rogue platform, but more likely unintentionally, when the platform uses automated personalization algorithms, such as reinforcement learning, that combine prediction and behavior modification. In such cases the platform and customer goals can be misaligned, as in risk prediction applications where the customer aims to reduce risk, while the platform pushes risky users towards risky actions.

We mentioned earlier how minimizing prediction error differs from adversarial attacks and gaming (the latter involve only prediction, and only in the former does the platform apply both the prediction and the manipulation). We also distinguish between this strategy and two related, but different, behavior modification usages commonly employed by companies: A/B testing and uplift modeling. A/B testing is used for estimating the overall effect of a behavior modification intervention. Uplift modeling is used in precision marketing for predicting personalized user reactions to a behavior intervention. These two purposes differ from the focus in this paper, and are also different from each other. Using the do(.) operator terminology, we briefly describe these approaches in B, summarizing the key differences in Table 2.

5.1 Technical and business implications

Contrasting the bias-variance decomposition of the manipulated and non-manipulated scenarios highlighted two key sources of the manipulated prediction error: the CATE-bias relationship, and its tradeoff with the manipulated noise variance. We now use these insights to return to the questions we posed earlier.

1. Can behavior modification mask poor predictive algorithm performance? Behavioral big data is noisy, sparse, and high-dimensional (De Cnudde et al., 2020). Behavior modification can improve $\widetilde{EPE}$ by countering the predictive algorithm’s bias as well as by reducing the noise variance. This means that poor performance of a predictive algorithm, due to the algorithm’s bias and variance, and/or due to the data noisiness, can be masked by do(B). Therefore, platform customers wanting to achieve the (manipulated) prediction accuracy level demonstrated by the platform must purchase both the predictions and the ability to apply BMOD similar to the one performed by the platform. Purchasing the predictions alone might uncover a much weaker predictive performance when deployed to non-manipulated users (or by applying a less effective BMOD).

2. Can one infer the counterfactual EPE from the manipulated $\widetilde{EPE}$? The difference between the two quantities of no-manipulation EPE and behavior-modified $\widetilde{EPE}$ involves $CATE$, $Bias(\hat{f})$, $\sigma$, and $\tilde{\sigma}$.17 Some of these quantities can be estimated by the platform (e.g., CATE), while others are more difficult, if not impossible, to estimate. Hence, it is unlikely the no-manipulation predictive power can be ascertained from the manipulated $\widetilde{EPE}$. This means platform customers who want to evaluate the no-manipulation predictive power will need to acquire information about the estimated EPE at the time of algorithm testing.

3. Can customers detect the manipulation via A/B tests? Platform customers who run routine A/B tests on the platform are not likely to detect the error minimization strategy, because of the random allocation of users in an A/B test. This randomization spreads BMOD-affected users across the A and B conditions, and therefore the difference between the group averages will cancel out the BMOD effect. The A/B test statistic and its statistical significance are therefore not impacted by BMOD (see the simulation sketch after this list).

4. What are the roles of personalized prediction and personalized behavior modification in error minimization? An ideal BMOD reduces not only the average magnitude of the errors, but also their variability, so that errors are more consistent across users. This highlights the role of personalized prediction that companies now invest in: a user’s personal $\vec{x}_i$ data is used to select the best modification $do(B_i) = h(\hat{y}_i, \vec{x}_i)$, where the range of $B_i$ choices includes not only different stimuli18 (e.g. different content display), but also different

17 $\widetilde{EPE} - EPE = \tilde{\sigma}^2 - \sigma^2 + CATE^2 + 2 \times CATE \times Bias(\hat{f})$

18 “People who do a lot of research on products may see an ad that features positive product reviews, whereas those who have signed up for regular deliveries of other products in the past might see an ad offering a discount for those who “Subscribe & Save.”” www.nytimes.com/2019/01/20/technology/amazon-ads-advertising.html



types of reinforcement (e.g. positive reinforcement such as rewards, recognition, praise, vs. negative reinforcement such as time pressure, social pressure). For example, Kosinski et al. (2013) showed how Facebook users’ Likes can predict their psychological attributes, ranging from sexual orientation to intelligence, and suggested that including such attributes can improve personalized interventions.19

Personalized interventions have become even more powerful with the introduction of reinforcement learning, which personalizes the system’s behavior by utilizing users’ traits ($\vec{x}_i$) combined with their implicit feedback (den Hengst et al., 2020).

Personalized predictions have the potential to minimize $\widetilde{EPE}$ more equally both within a certain user profile $\vec{x}$ and across different user profiles, by lowering conditional bias via manipulating CATE, and by shrinking the (manipulated) outcomes’ variance.

Finally, the bias-variance decomposition highlights the important role of a large training dataset in minimizing the predictive algorithm’s variance $Var(\hat{f})$. Since predictive models trade off bias and variance, in the behavior modification context low-bias algorithms are advantageous in terms of requiring a smaller BMOD effect to minimize $\widetilde{EPE}$. One avenue for further research is the effect of behavior modification on classification accuracy (for binary outcomes), where the effects of bias and variance on EPE are multiplicative rather than additive, and the literature reports conflicting results on their roles (e.g. Friedman, 1997; Domingos, 2000).
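A small simulation sketch of the claim in question 3: because the customer’s A/B test randomizes users into arms, a platform-wide BMOD (modeled here as an assumed additive push toward each user’s predicted value, assigned independently of the experiment) shifts both arms equally and leaves the estimated A-vs-B lift essentially unchanged.

```python
# BMOD applied platform-wide is invisible in a randomized A/B contrast.
import numpy as np

rng = np.random.default_rng(7)
n = 20000
y_hat = rng.normal(5.0, 1.0, n)            # platform predictions per user
y_base = y_hat + rng.normal(0, 2.0, n)     # behavior before any intervention

# Platform-wide BMOD: additive push toward each user's predicted value,
# assigned independently of the customer's experiment.
bmod_shift = 0.7 * (y_hat - y_base)

arm = rng.integers(0, 2, n)                # customer's A/B randomization
y_obs = y_base + bmod_shift + 0.3 * arm    # true lift of condition B: +0.3

lift_bmod = y_obs[arm == 1].mean() - y_obs[arm == 0].mean()
y_no = y_base + 0.3 * arm                  # counterfactual: no BMOD
lift_no = y_no[arm == 1].mean() - y_no[arm == 0].mean()
print(f"lift with BMOD:    {lift_bmod:.3f}")   # both close to the true +0.3
print(f"lift without BMOD: {lift_no:.3f}")
```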

5.2 Humanistic and societal implications

Behavior modification, now pervasively applied by platforms to their “data subjects”, is geared towards optimizing the platform’s commercial interest, often at the cost of users’ well-being and agency. AI expert Russell (2019) highlights the advantage of making users’ behavior more predictable: “A more predictable user can be fed items that they are likely to click on, thereby generating more revenue. People with more extreme political views tend to be more predictable in which items they will click on.”

Persuasive technology, a design philosophy now implemented on platforms ranging from e-commerce sites and social networks to smartphones and fitness wristbands, aims at generating "behavioral change" and "habit formation", most often without users' knowledge or consent (Rushkoff, 2019). This application of behavior modification to platform users is therefore fundamentally different from an organization applying BMOD to its employees to increase the organization's productivity and workers' job satisfaction. And, clearly, such use diverges from the original intention of behavior modification procedures "to change socially significant behaviors, with the goal of improving some aspect of a person's life" (Miltenberger, 2015).

Given the often conflicting goals of data subjects and the platforms that collect and use their data as well as manipulate their behavior, it is important to introduce causal notation into the predictive environment, so that the data science (statistics, machine learning, etc.), management, and computational social science communities can study the technical properties and implications of such strategies. By introducing and integrating causal notation into the predictive terminology, we can start studying how behavior modification can appear to create "better" predictions. This allows examining the effects of different behavior modification types, magnitudes, variations, and directions on anticipated outcomes.

6 Conclusion

Behavior modification can make users' behavior not only more predictable but also more homogeneous. However, this apparent "predictability" can be deceptive to the platform customers, to the extent that predictive performance based on manipulated behavior does not generalize outside of the platform environment. It also does not generalize within-platform if the exact same BMOD is no longer used. Moreover, outcomes pushed towards their predictions may be at odds with the platform customers' interests, and harmful to the manipulated users. While platforms have the incentive and capabilities to minimize prediction errors, it is plausible that such minimization occurs without the platforms' knowledge, due to the growing use of automated personalization techniques that combine prediction and behavior modification. It is therefore critical to have a useful technical vocabulary that integrates intentional behavior modification into the correlation-based predictive framework, to enable studying such contemporary strategies. We demonstrated how this notation can be used to explore issues such as how behavior modification can mask poor predictive algorithms, whether one can infer the counterfactual non-manipulated predictive power from the manipulated predictive power, whether customers running routine A/B tests on the platform can detect behavioral modification schemes, and the possible role of personalized behavioral modification schemes. Future research can adopt the technical vocabulary and notation we proposed in this paper to further examine the potential scope and effects of behavioral modification by digital platforms.

^19 "online insurance advertisements might emphasize security when facing emotionally unstable (neurotic) users but stress potential threats when dealing with emotionally stable ones." (Kosinski et al., 2013)


References

Agrawal, A., Gans, J., and Goldfarb, A. (2018). Prediction machines: The simple economics of artificial intelligence. Harvard Business Press.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.

Bak-Coleman, J. B., Alfano, M., Barfuss, W., Bergstrom, C. T., Centeno, M. A., Couzin, I. D., Donges, J. F., Galesic, M., Gersick, A. S., Jacquet, J., et al. (2021). Stewardship of global collective behavior. Proceedings of the National Academy of Sciences, 118(27).

Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F., and Chi, E. H. (2019). Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464.

De Cnudde, S., Martens, D., Evgeniou, T., and Provost, F. (2020). A benchmarking study of classification techniques for behavioral data. International Journal of Data Science and Analytics, 9(2):131–173.

den Hengst, F., Grua, E. M., el Hassouni, A., and Hoogendoorn, M. (2020). Reinforcement learning for personalization: A systematic literature review. Data Science, pages 1–41.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pages 231–238.

Eyal, N. (2014). Hooked: How to build habit-forming products. Penguin.

Fogg, B. J. (2002). Persuasive technology: Using computers to change what we think and do. Morgan Kaufmann.

Frankel, A. and Kartik, N. (2019). Improving information from manipulable data. arXiv preprint arXiv:1908.10330.

Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58.

Hardt, M., Megiddo, N., Papadimitriou, C., and Wootters, M. (2016). Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 111–122.

Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., and Tygar, J. D. (2011). Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58.

Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Kohavi, R. and Longbotham, R. (2017). Online controlled experiments and A/B testing. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning and Data Mining. Springer.

Kohavi, R. and Thomke, S. (2017). The surprising power of online experiments. Harvard Business Review.

Kosinski, M., Stillwell, D., and Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15):5802–5805.

Kramer, A. D., Guillory, J. E., and Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790.

Kusner, M., Loftus, J., Russell, C., and Silva, R. (2017). Counterfactual fairness. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4069–4079.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670.

Lo, V. S. (2002). The true lift model: A novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2):78–86.

Martens, D., Provost, F., Clark, J., and de Fortuny, E. J. (2016). Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, 40(4):869–888.

Mathur, A., Acar, G., Friedman, M. J., Lucherini, E., Mayer, J., Chetty, M., and Narayanan, A. (2019). Dark patterns at scale: Findings from a crawl of 11K shopping websites. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–32.

MathWorks (2021). What is reinforcement learning? MathWorks documentation.

Michie, S., Richardson, M., Johnston, M., Abraham, C., Francis, J., Hardeman, W., Eccles, M. P., Cane, J., and Wood, C. E. (2013). The behavior change technique taxonomy (v1) of 93 hierarchically clustered techniques: Building an international consensus for the reporting of behavior change interventions. Annals of Behavioral Medicine, 46(1):81–95.

Miltenberger, R. G. (2015). Behavior modification: Principles and procedures. Cengage Learning, 6th edition.

Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530.

Munro, E. (2020). Learning to personalize treatments when agents are strategic. arXiv preprint arXiv:2011.06528.

Nook, E. C., Ong, D. C., Morelli, S. A., Mitchell, J. P., and Zaki, J. (2016). Prosocial conformity: Prosocial norms generalize across behavior and empathy. Personality and Social Psychology Bulletin, 42(8):1045–1062.

Nord, W. R. and Peter, J. P. (1980). A behavior modification perspective on marketing. Journal of Marketing, 44(2):36–47.

Pearl, J. (2009). Causality. Cambridge University Press, 2nd edition.

Pearl, J., Glymour, M., and Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.

Provost, F. and Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift in machine learning. The MIT Press.

Radcliffe, N. J. and Surry, P. D. (2011). Real-world uplift modelling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions, pages 1–33.

Rushkoff, D. (2019). Team Human. W. W. Norton & Company.

Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control. Penguin.

Rzepakowski, P. and Jaroszewicz, S. (2012). Uplift modeling in direct marketing. Journal of Telecommunications and Information Technology, pages 43–50.

Shalit, U., Johansson, F. D., and Sontag, D. (2017). Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.

Shmueli, G. (2017). Research dilemmas with behavioral big data. Big Data, 5(2):98–119.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

Tang, L., Rosales, R., Singh, A., and Agarwal, D. (2013). Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1587–1594.

Thaler, R. H. and Sunstein, C. R. (2009). Nudge: Improving decisions about health, wealth, and happiness. Penguin.

Yeung, K. (2017). 'Hypernudge': Big data as a mode of regulation by design. Information, Communication & Society, 20(1):118–136.

Zhang, J. and Bareinboim, E. (2018). Fairness in decision-making—the causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhou, S., Dai, X., Chen, H., Zhang, W., Ren, K., Tang, R., He, X., and Yu, Y. (2020). Interactive recommender system via knowledge graph-enhanced reinforcement learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 179–188.

Zuboff, S. (2019). The age of surveillance capitalism: The fight for a human future at the new frontier of power. Profile Books.

Acknowledgements

We thank Foster Provost, Rob Hyndman, Pierre Pinson, Soumya Ray, Sam Ransbothan, Carlos Cinelli, Travis Greene, Boaz Shmueli, Patricia Kuo, Aurélie Lemmens, David Budescu, Raquelle Azran, Bill Meeker, and participants of 2020 SCECR, 2020 ENBIS conference, 2021 IIF Symposium, and seminars at University of Iowa and Erasmus University for invaluable feedback. We thank Noa Shmueli for Figure 1 artwork. G. Shmueli's research is partially supported by Ministry of Science and Technology, Taiwan [Grant 108-2410-H-007-091-MY3].


Appendix

A Derivation of the EPE Bias-Variance Decomposition (Equation (7))

The derivation for Equation (7) (bias-variance decomposition under BMOD) is as follows. For convenience, we use $f$ and $\hat{f}$ to denote the no-manipulation true function and its estimated model. For the manipulated scenario, we use $f_{do}$ to denote the true function under BMOD.

For a new observation with inputs $\vec{x}$ and manipulated outcome $y \mid do(B), \vec{x}$, we can decompose the expected prediction error as follows (for convenience, we drop subscript $i$):

$$
\begin{aligned}
EPE(\vec{x}) &= E\big((y \mid do(B), \vec{x}) - \hat{f}(\vec{x})\big)^2 = E(y - \hat{f})^2 \\
&= E(y - f_{do} + f_{do} - \hat{f})^2 \\
&= E(y - f_{do})^2 + E(f_{do} - \hat{f})^2 + 2E\big[(y - f_{do})(f_{do} - \hat{f})\big]. \qquad (8)
\end{aligned}
$$

These three terms can be further simplified. The first term can be simplified by noting that $y = f_{do} + \varepsilon$:

$$
E(y - f_{do})^2 = E(\varepsilon^2) = \sigma^2. \qquad (9)
$$

The second term can be written as:

$$
\begin{aligned}
E(f_{do} - \hat{f})^2 &= E\big(f_{do} - E(\hat{f}) + E(\hat{f}) - \hat{f}\big)^2 \\
&= \big(f_{do} - E(\hat{f})\big)^2 + E\big(\hat{f} - E(\hat{f})\big)^2 \\
&= \big(f_{do} - E(\hat{f})\big)^2 + Var(\hat{f}) \qquad (10)
\end{aligned}
$$

because the cross product is zero:

$$
2E\big[\big(f_{do} - E(\hat{f})\big)\big(E(\hat{f}) - \hat{f}\big)\big] = 2\big(f_{do} - E(\hat{f})\big)\big(E(\hat{f}) - E(\hat{f})\big) = 0. \qquad (11)
$$

We can further write Equation (10) as a function of the bias and variance of $\hat{f}$:

$$
\begin{aligned}
\big[f_{do} - E(\hat{f})\big]^2 + Var(\hat{f}) &= E\big[f_{do} - f + f - E(\hat{f})\big]^2 + Var(\hat{f}) \\
&= \big[CATE + Bias(\hat{f})\big]^2 + Var(\hat{f}). \qquad (12)
\end{aligned}
$$

Finally, using the independence of the new observation's prediction error $\varepsilon$ from the prediction $\hat{f}$ based on the training data ($E(\varepsilon \hat{f}) = 0$), the third term can be shown to be zero:

$$
2E\big[(y - f_{do})(f_{do} - \hat{f})\big] = 2E\big[\varepsilon(f_{do} - \hat{f})\big] = 2 f_{do} E(\varepsilon) - 2E(\varepsilon \hat{f}) = 0. \qquad (13)
$$

Therefore, we can write EPE from Equation (8) as:

$$
EPE(\vec{x}) = E\big((y \mid \vec{x}, do(B)) - \hat{f}(\vec{x})\big)^2 = \sigma^2 + \big[CATE + Bias(\hat{f})\big]^2 + Var(\hat{f}). \qquad (14)
$$
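As a numerical sanity check on Equation (14), the following simulation sketch (our illustration; the linear true function, the deliberately biased constant estimator, and all parameter values are assumptions) estimates $\sigma^2$, $[CATE + Bias(\hat{f})]^2$, and $Var(\hat{f})$ by Monte Carlo at a fixed point $\vec{x}$, and compares their sum to the directly estimated EPE:

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n_train, sigma = 50_000, 50, 1.0

f = lambda x: 2.0 * x            # assumed true (non-manipulated) function
x0, cate = 1.5, 1.0              # fixed test point; assumed constant CATE
f_do = f(x0) + cate              # true function under BMOD at x0

fhat = np.empty(reps)
sq_err = np.empty(reps)
for r in range(reps):
    x_tr = rng.uniform(0, 2, n_train)
    y_tr = f(x_tr) + rng.normal(scale=sigma, size=n_train)
    fhat[r] = y_tr.mean()                    # deliberately biased constant model
    y_new = f_do + rng.normal(scale=sigma)   # manipulated test outcome
    sq_err[r] = (y_new - fhat[r]) ** 2

bias = f(x0) - fhat.mean()       # sign convention from Equation (12): f - E(fhat)
lhs = sq_err.mean()              # direct EPE estimate
rhs = sigma**2 + (cate + bias)**2 + fhat.var()
print(f"EPE = {lhs:.3f} vs. sigma^2 + [CATE+Bias]^2 + Var(fhat) = {rhs:.3f}")
```

Both sides agree up to Monte Carlo error, as Equation (14) predicts.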

B Differences Between A/B Tests, Uplift Modeling, and “Improving" Prediction

Behavior modification techniques are used by platforms for two purposes other than reducing prediction error: estimating the overall effect of a behavior modification intervention (A/B testing), and predicting personalized user reactions to a behavior intervention for precision targeting (uplift modeling). We show how these two purposes and error minimization differ from each other. Table 2 summarizes the key differences.

B.1 Estimating the overall effect of an intervention for product improvement (A/B testing)

Platforms try to improve their products and users' experience on an ongoing basis, from updating website design to introducing new features, new ad layouts, marketing emails, and more. To do this, they employ A/B tests, which compare the impact of a new design/feature (version B) vs. an existing one (version A).


Table 2: Differences between A/B testing, uplift modeling, and "improving" prediction via behavior modification

                    | A/B testing                               | Uplift modeling                                                  | Error minimization
Business goal       | Test effectiveness of new design/feature  | Effective user targeting                                         | Increasing value of predictive products
Analysis goal       | Test average effect of new feature: ATE = (1/n_B) Σ_{i=1}^{n_B} y_i − (1/n_A) Σ_{i=1}^{n_A} y_i | Predict uplift for each user i: uplift_i = ŷ_{i,do(B=1)} − ŷ_{i,do(B=0)} | Minimize overall prediction error, e.g. MSE = (1/n) Σ_i (y_i − ŷ_i)²
Sequence of events  | intervention → estimation                 | intervention → prediction                                        | prediction → intervention
Intervention levels | do(B = 0), do(B = 1)                      | do(B = 0), do(B = 1)                                             | do(B_i) (personalized)
X used for          | not used, or for CATE                     | training predictive model(s)                                     | training f̂ and personalizing B_i

Amazon, Microsoft, Facebook, Google, and similar companies conduct thousands to tens of thousands of A/B tests each year, on millions of users, testing user interface changes, enhancements to algorithms (search, ads, personalization, recommendation, etc.), changes to apps, content management systems, and more (Kohavi and Longbotham, 2017; Kohavi and Thomke, 2017). The A/B test is a simple randomized experiment comparing the average outcome of a treatment group to that of a control group. Suppose that version A is the current website functionality (B = 0), and version B is a new feature (B = 1). The A/B testing process is as follows:

1. Randomly assign $n_A$ users to condition A ($do(B = 0)$) and $n_B$ users to condition B ($do(B = 1)$).
2. Measure the outcome for users in condition A ($y_i, i = 1, \ldots, n_A$) and in condition B ($y_i, i = 1, \ldots, n_B$).
3. Estimate the Average Treatment Effect $ATE = \frac{1}{n_B}\sum_{i=1}^{n_B} y_i - \frac{1}{n_A}\sum_{i=1}^{n_A} y_i$.
4. Use statistical inference and effect magnitude estimates to determine whether the new feature adds value.

When there is interest in the ATE for certain subgroups, such as by the user's language, steps 3-4 can be supplemented with CATE. In summary: A/B tests are used to determine the effectiveness of a new feature compared to an existing one. This is done by comparing the average outcome of the two randomly assigned groups, and estimating and testing ATE or CATE.
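For concreteness, a minimal sketch of steps 3-4 on synthetic outcome data (the group means and the use of Welch's t-test for the inference step are our choices, not mandated by the text):

```python
import numpy as np
from scipy import stats

def ab_test(y_a: np.ndarray, y_b: np.ndarray):
    """Estimate ATE (step 3) and test its significance (step 4)."""
    ate = y_b.mean() - y_a.mean()
    t_stat, p_val = stats.ttest_ind(y_b, y_a, equal_var=False)  # Welch's t-test
    return ate, p_val

# Illustrative outcomes measured after random assignment (steps 1-2).
rng = np.random.default_rng(2)
y_a = rng.normal(10.0, 3.0, size=5_000)   # version A, do(B=0)
y_b = rng.normal(10.4, 3.0, size=5_000)   # version B, do(B=1)
ate, p_val = ab_test(y_a, y_b)
print(f"ATE = {ate:.2f}, p-value = {p_val:.4f}")
```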

B.2 Predicting personalized behavior modification effects for maximizing conversion/revenue (uplift modeling)

Uplift modeling, also known as differential response analysis or true lift modeling (Rzepakowski and Jaroszewicz, 2012; Radcliffe and Surry, 2011), is used in precision marketing and in political persuasion for identifying people who will modify their behavior (e.g. purchase a product, or vote for a candidate) conditional on being given a particular treatment (e.g. receiving a coupon, or a phone call), assuming the treatment can also cause a negative outcome for some people. In uplift modeling, an intervention is applied to a randomly sampled group of people. The resulting data, along with data from a control group, are used to build predictive model(s) for predicting a person's change in response, or uplift, due to the intervention. The predictive model(s)^20 produce predicted outcomes under $do(B = 0)$ and $do(B = 1)$, which are then combined to obtain $uplift_i = \hat{y}_{i,do(B=1)} - \hat{y}_{i,do(B=0)}$. Finally, the uplift values are used to determine which users to treat ($do(B = 1)$) and for which to avoid treatment ($do(B = 0)$). This process is as follows:

1. Randomly assign $n_A$ users to condition A ($do(B = 0)$) and $n_B$ users to condition B ($do(B = 1)$).
2. Measure the outcome and predictors for users in conditions A ($\{y_i, \vec{x}_i\}, i = 1, \ldots, n_A$) and B ($\{y_i, \vec{x}_i\}, i = 1, \ldots, n_B$).
3. Train a predictive model of $y$ on $X, B$ (e.g. Lo, 2002).
4. Use the model to predict $\hat{y}_{i,do(B=1)}$ and $\hat{y}_{i,do(B=0)}$, and compute $uplift_i = \hat{y}_{i,do(B=1)} - \hat{y}_{i,do(B=0)}$.
5. Based on uplift, determine who should be treated.

^20 Uplift modeling includes a two-model approach, where models of the form $y = g(X) + \varepsilon$ are trained separately on the treatment and control groups, and a single-model approach that trains a single model on the combined dataset ($y = h(X, B) + \delta$).
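The following sketch illustrates the two-model approach from footnote 20 (the simulated response surface and the choice of gradient boosting are our illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n, p = 4_000, 5
X = rng.normal(size=(n, p))
b = rng.random(n) < 0.5                  # random assignment: do(B=1) vs. do(B=0)
# Assumed response: baseline g(X) plus a heterogeneous effect when treated.
y = X[:, 0] + b * (0.5 + X[:, 1]) + rng.normal(scale=0.5, size=n)

# Two-model approach: separate models on the treatment and control groups.
m_treat = GradientBoostingRegressor().fit(X[b], y[b])
m_ctrl = GradientBoostingRegressor().fit(X[~b], y[~b])

# Step 4: predicted outcomes under both conditions, combined into uplift.
uplift = m_treat.predict(X) - m_ctrl.predict(X)

# Step 5: treat only users whose predicted uplift is positive.
treat = uplift > 0
print(f"target {treat.mean():.0%} of users; "
      f"mean predicted uplift among them: {uplift[treat].mean():.2f}")
```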

B.3 Behavior modification for minimizing prediction error

The process of minimizing prediction error can be summarized as follows:



1. Collect observational predictors $X$ and outcomes $\vec{y}$ for a sample of $n$ users.
2. Build predictive model $\hat{y} = \hat{f}(X)$. Compute personalized predicted scores $\hat{y}_i = \hat{f}(\vec{x}_i)$ for each user.
3. Apply behavior modification $do(B_i)$ to each user to push their outcome towards their predicted value $\hat{y}_i$ ($B_i = h(\hat{y}_i, \vec{x}_i) + \upsilon_i$, where $\upsilon_i$ is an error term).
4. Compute the manipulated prediction errors $y_i - \hat{y}_i$, and summarize them to demonstrate (impressive) predictive power.
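To contrast with the two processes above, here is a minimal sketch of these four steps (the linear data-generating process, the proportional form of the push $h$, and its strength are our assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 5_000

# Step 1: observational predictors X and outcomes y.
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=n)

# Step 2: build f-hat and compute personalized predicted scores.
f_hat = LinearRegression().fit(X, y)
y_pred = f_hat.predict(X)

# Step 3: personalized BMOD do(B_i) pushes each outcome towards its
# prediction (B_i = h(y_hat_i, x_i) + noise); here h is a proportional push.
push = 0.7                                   # assumed BMOD strength
y_manip = y + push * (y_pred - y) + rng.normal(scale=0.2, size=n)

# Step 4: the manipulated errors look impressively small.
mse_no_bmod = np.mean((y - y_pred) ** 2)
mse_bmod = np.mean((y_manip - y_pred) ** 2)
print(f"MSE without BMOD: {mse_no_bmod:.2f}; "
      f"'improved' MSE with BMOD: {mse_bmod:.2f}")
```

Note that the small post-BMOD errors reflect the push, not better modeling: the same $\hat{f}$ would perform at its pre-BMOD level wherever the modification is absent.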
