Proceedings of Machine Learning Research 81:1–15, 2018. Conference on Fairness, Accountability, and Transparency.

A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions

Alexandra Chouldechova [email protected]
Heinz College, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Emily Putnam-Hornstein [email protected]
Suzanne Dworak-Peck School of Social Work, University of Southern California, Los Angeles, CA 90089, USA

Diana Benavides-Prado [email protected]

Oleksandr Fialko [email protected]

Rhema Vaithianathan [email protected]

Centre for Social Data Analytics

Auckland University of Technology

Auckland, New Zealand

Editors: Sorelle A. Friedler and Christo Wilson

Abstract

Every year there are more than 3.6 million referrals made to child protection agencies across the US. The practice of screening calls is left to each jurisdiction to follow local practices and policies, potentially leading to large variation in the way in which referrals are treated across the country. Whilst access to linked administrative data is increasing, it is difficult for welfare workers to make systematic use of historical information about all the children and adults on a single referral call. Risk prediction models that use routinely collected administrative data can help call workers to better identify cases that are likely to result in adverse outcomes. However, the use of predictive analytics in the area of child welfare is contentious. There is a possibility that some communities, such as those in poverty or from particular racial and ethnic groups, will be disadvantaged by the reliance on government administrative data. On the other hand, these analytics tools can augment or replace human judgments, which themselves are biased and imperfect. In this paper we describe our work on developing, validating, fairness auditing, and deploying a risk prediction model in Allegheny County, PA, USA. We discuss the results of our analysis to date, and also highlight key problems and data bias issues that present challenges for model evaluation and deployment.

1. Introduction

Every year there are more than 3.6 million referrals made to child protection agencies across the US. It is estimated that 37% of US children are investigated for child abuse and neglect by age 18 years (Kim et al., 2017). These statistics indicate that, far from being a rare occurrence, many more children are being pulled into the child welfare system than previously thought. Currently, screening these referral calls is left to each jurisdiction to follow local practices and policies. These practices usually involve caseworkers gathering details about the adults and children associated with the alleged victim. Often, the decision on whether to investigate or not is made without ever visiting the family or speaking with them.

Whilst electronic case management systems and linked administrative data are increasingly available, it is difficult for child welfare workers to make systematic use of historical information about all the children and adults on a single referral call.

Fully interrogating a case history for all the people named on a call could take hours. The increasing pressure to ensure that investigative resources are focused on the highest-risk children means that there is some potential for predictive analytics to assist call screeners in assessing each referral more quickly and accurately.

Predictive Risk Modelling (PRM) uses routinely collected administrative data to predict the likelihood of future adverse outcomes. By strategically targeting services to the riskiest cases, it is hoped that many of the adverse events can be prevented. Additionally, unnecessary investigations, which are burdensome for families and costly for the system, could be avoided. PRM has been used previously in health and hospital settings (Panattoni et al., 2011; Billings et al., 2012) and has been suggested as a potentially useful tool that could be translated into child protection settings (Vaithianathan et al., 2013).

However, the use of predictive analytics in the area of child welfare is contentious. There is the possibility that some communities, such as those in poverty or from particular racial or ethnic groups, will be disadvantaged by the reliance on government administrative data because they will typically have more data kept about them simply by dint of being poor and on welfare. Such families could then be flagged as high risk and be more frequently investigated. If the algorithm uses past investigations to produce a high risk score for a family, then this will exacerbate the original bias.

On the other hand, these analytics tools can augment or replace human judgments, which themselves are potentially biased. There is a possibility that caseworkers are basing their screening decisions in part on personal experiences or current caseloads. Caseworker decisions may also be affected by cognitive biases, for example over-weighting recent cases, unrelated to the current case, where a child has been fatally harmed. When making decisions under time pressure, caseworkers might be guilty of statistical discrimination, where they use easily observed features (e.g. living in a neighborhood with high crime rates) as proxies for unobservable but more pertinent attributes (e.g. drug use). Bias in human decision-making is often difficult to assess, and the existing research does not provide a consensus view of racial bias in the child welfare system (Fluke et al., 2011).

A subject where there is greater consensus concerns the relative accuracy of so-called "clinical" versus "actuarial" judgment.1 Some of the earliest research comparing human predictions to those of statistical models goes back to the pioneering work of Meehl (1954). Decades of research and several large scale meta-analyses have largely upheld the original conclusions: When it comes to prediction tasks, statistical models are generally significantly more accurate than human experts (Dawes et al., 1989; Grove et al., 2000; Kleinberg et al., 2017).

Our goal in Allegheny County is to improve both the accuracy and equity of screening decisions by taking a fairness-aware approach to incorporating prediction models into the decision-making pipeline. The present paper reports on the lessons that we have learned so far, our approaches to predictive bias assessment, and several outstanding challenges. To be sure, at certain points we offer more questions than answers. Our hope is that this report contributes to the rich ongoing conversation concerning the use of algorithms in supporting critical decisions in government, and the importance of considering fairness and discrimination in data-driven decision making. While the work presented here is firmly grounded in the child maltreatment hotline context, much of the discussion and our general analytic approach are broadly applicable to other domains where predictive risk modeling may be used. Readers are also encouraged to refer to the recent work of Shroff (2017) for another perspective on predictive analytics in the child welfare domain. This related work provides an excellent report on various considerations that are important for model development, as well as strategies for effectively engaging with agency leadership.

1.1. Organization of this paper

We begin in the next section with some background on the model development. As part of this discussion we describe both the tool that is currently deployed in Allegheny County and the competing models that are being developed as part of an ongoing redesign process.

1. The term "actuarial" has fallen out of fashion, and has in many cases been replaced with "machine learning".

Then in Section 3 we investigate the predictive bias properties of the current tool and a Random forest model that has emerged in the redesign as one of the best performing competing models. Our predictive bias assessment is motivated both by considerations of human bias and by recent work on fairness criteria that has emerged in the algorithmic fairness literature. Section 4 discusses some of the challenges in incorporating algorithms into human decision making processes and reflects on the predictive bias analysis in the context of how the model is actually being used. We discuss some of the concerns that have arisen as part of the redesign, and propose an "oracle test" as a tool for clarifying whether particular concerns pertain to the statistical properties of a model or are targeted at other potential deficiencies. We also briefly describe an independent ethical review of the modeling work that was commissioned by the County. Section 5 concludes with a reflection on several of the key outstanding technical challenges affecting the evaluation and implementation of the model.

2. The Allegheny Models

Allegheny County is a medium-sized county in Pennsylvania, USA, centered on the city of Pittsburgh. In 2014, Allegheny County's Department of Human Services issued a Request for Proposals focused on the development and implementation of tools that would enhance use of the County's integrated data system. Specifically, the County sought proposals that would: (1) improve the ability to make efficient and consistent data-driven service decisions based on County records, (2) ensure public sector resources were being equitably directed to the County's most vulnerable clients, and (3) promote improvements in the overall health, safety and well-being of County residents. A consortium of researchers from four universities was awarded the contract in the Fall of 2014 and commenced work in close concert with the Allegheny County team.

In mid-2015, it was decided that the most promising, ethical, and readily implemented use of PRM within the Allegheny County child protection context was one in which a model would be deployed at the time a referral call was received by the County.

The objective was to help call workers determine whether a maltreatment referral is of sufficient concern to warrant an in-person investigation (referred to as "screening-in the call"). Calls that are thought to be innocuous and are not further investigated are said to be "screened-out". In this section we describe the development and implementation of the Allegheny Family Screening Tool (AFST).

2.1. Then and now

Allegheny County's Department of Human Services is fairly unique in the United States: it has an integrated client service record and data management system. This means that the County's child protection hotline staff are in principle already able to access and use historical and cross-sector administrative data (e.g., child protective services, mental health services, drug and alcohol services, homeless services) related to individuals associated with a report of child abuse or neglect. Although this information is critical to assessing child risk and safety concerns, it is challenging for County staff to efficiently access, review, and make meaning of all available records. Beyond the time required to scrutinize data for every individual associated with a given referral (e.g., child victim, siblings, biological parents, alleged perpetrator, other adults living at the address where the incident occurred), the County has no means of ensuring that available information is consistently used or weighted by staff when making hotline screening decisions. As such, for example, recent paternal criminal justice involvement that surfaces in the context of one child's referral may be a deciding factor in one case, while for another child with a similar referral that same information may be completely ignored.

Prior to the implementation of the call screening tool, for the period from April 1, 2010 through May 4, 2016, the majority of CPS reports (52%) were screened out. Of those children who were screened out, 53% were re-referred for a new allegation within 2 years. Of those who were initially screened-in, 13% were placed outside of the home within 2 years.

Retrospective analysis of cases that resulted in critical incidents has revealed many instances of multiple calls having been repeatedly screened out. While each screen-out decision was individually defensible, looking at the full case history paints the picture that child protective services should have gotten involved.

The primary aim of introducing a prediction model is to supplement the often limited information received during the call with a risk assessment that takes into account a broader set of information available in the integrated data system.

The implementation of the AFST presents call screening staff with a score from 1 to 20 that reflects the likelihood that the child will be placed (removed from home) within 2 years conditional on being screened in (further investigated). Out-of-home placement was chosen as the target for two primary reasons. First, it is a directly observable event that serves as a good proxy for severe child maltreatment. Second, placement decisions are not made by the call screening staff. By predicting an outcome that cannot be directly determined by the staff, we reduce the risk of getting trapped in a feedback loop wherein workers effect the outcome predicted by the model (e.g., substantiate cases that the model tells them they likely should).
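To make the target concrete, the following sketch shows one way a placement-within-two-years label of this kind could be constructed from linked administrative records. The column names (child_id, record_id, referral_date, placement_date) are illustrative assumptions, not the County's actual schema, and this is not the production pipeline.

import pandas as pd

# Illustrative sketch only: label each screened-in referral record with 1 if
# the associated child had an out-of-home placement within two years of the
# referral date. Column names are assumed for illustration.
def label_placement_within_two_years(referrals, placements):
    merged = referrals.merge(placements[["child_id", "placement_date"]],
                             on="child_id", how="left")
    qualifies = (
        (merged["placement_date"] >= merged["referral_date"]) &
        (merged["placement_date"] <= merged["referral_date"] + pd.DateOffset(years=2))
    )
    # Any qualifying placement episode counts for that referral record.
    flags = qualifies.groupby(merged["record_id"]).any()
    return referrals["record_id"].map(flags).fillna(False).astype(int)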

The original implementation also included a score reflecting the likelihood of re-referral conditional on the case being screened out. This model never gained traction in practice, largely due to lack of buy-in from County leadership. There are many reasons why a case might be re-referred, and many of them do not merit protective services involvement. A County review of how cases were scored by the re-referral model indicated that re-referral risk was a poor measure of whether a referral merits further investigation.

Even if re-referral risk were deemed to be a reasonable measure, there would be cause to doubt the predictive validity of the model. Effective December 31, 2014, the state of Pennsylvania amended the Child Protective Services Law (CPSL) to broaden the category of mandated reporters. This significantly changed reporting patterns across the state, resulting in an increase in the number of referrals. Since the initial training data predated the policy change, the re-referral model would not be able to take this surge in referrals into account.

There is of course good reason to suspect that the predictive performance of the placement model will also be negatively affected by the mandatory reporting amendment.

On the other hand, one might also expect that the placement model would not be as severely compromised as the re-referral model. That is, while more referrals are coming in, there may be considerable consistency in the patterns of risk factors associated with placement risk. We will be better positioned to reevaluate the performance of the model as more post-amendment data becomes available for retrospective analysis.

These considerations are relevant to understanding how the AFST was deployed, and how it is intended to be used in the call screening context. While in some settings machine learning systems have been used to replace decisions that were previously made by humans, this is not the case for the Allegheny Family Screening Tool. It was never intended or suggested that the algorithm would replace human decision-making. Rather, the model should help to inform, train and improve the decisions made by the staff.

As we have previously noted, the AFST and the call worker rely on very different information in assessing referrals. Whereas the call workers' screening decisions are based in large part on the content of the allegation, the AFST does not use information about the allegation per se. The AFST instead relies entirely on administrative data available on the individuals associated with the referral. So while the AFST is able to pick up on cases that have high long-arc risk, it may just as easily miss acute incidents described in the allegation that necessitate immediate investigation. Until such a time that an AI system is developed that can properly assess all of the risks relevant to screening decisions, full automation of the call screening process remains out of the question. In the meantime, the AFST continues to be tested as a form of decision support, as well as a way to help leadership get better insight into call screening decisions, which historically have been relatively opaque.

2.2. Modeling methodology

2.2.1. Data

The full data set consists of all n = 76,964 referral records collected by Allegheny County between April 2010 and July 2014. A distinct referral record is generated for each child associated with an allegation. These records correspond to 36,840 distinct referrals, involving 47,305 distinct children. In total the data set contains over 800 variables providing demographics, past welfare interaction, public welfare, county prison, juvenile probation, and behavioral health information on all persons associated with each referral.

Figure 1: An overview of the modeling process. Data: child and family history (demographics, previous referrals, previous protective services received, findings during previous investigations, program involvement, previous placements, behavioural health records) for the child victim, other children, the parent, and the perpetrator. Modelling: 46,503 records of screened-in referrals spanning April 2010 to July 2014, with around 800 predictors; 32,086 training records and 14,417 test records, based on independent children; candidate models are a logistic regression model, a Random Forest model (Breiman, 2001; 500 trees, splits based on entropy), an XGBoost model (Chen and Guestrin, 2016; 1,000 trees, 0.01 learning rate, 0.9 subsample ratio of training instances), and an SVM model (Vapnik, 1998; radial-basis function kernel with gamma = 1 / number of features, class weights of 0.8 for placement and 0.2 for no placement, probability estimation using a sigmoid function (Platt, 1999)). Validation: performance metrics (AUC, TPR, FPR) on predicted probabilities for the test set, and expert validation against the current process.

19,869 of the observed referrals, involving 31,438 distinct children, were screened in for investigation. This corresponds to 46,503 referral records. All of the model training and evaluation presented in this report is conducted on the subset of screened-in cases. Since our models are trained and tested on the screened-in population, we encounter a version of the selective labels problem (Kleinberg et al., 2017) in trying to generalize our evaluation results to the entire set of referrals. We discuss this further in Section 5.

2.2.2. Current model

The initial model (the one that has been deployed in Allegheny County since August 2016) is based on a logistic regression fit to a selected subset of 71 features. Instead of presenting call workers with raw probability estimates or coarse classifications, it was decided that a derived score from 1 to 20 would be more suitable.2 The derived score corresponds to the ventiles (5% percentiles) of the estimated probability distribution. In the initial implementation, cases whose risk is assessed in the most risky 15% of the calls (scores of 18 or higher) were flagged for call screeners as "mandatory screen-ins".3

2. This was not a principled decision, and is being reconsidered in the model redesign.

The top panel of Figure 2 shows the ventile and quantile cutoffs along with red dashed lines representing the mandatory screen-in cutpoints for the logistic regression model. Owing to how the scores are constructed, two cases that both receive a score of 19 may have a greater absolute risk difference than two cases that receive scores of 6 and 10, respectively.
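As a concrete illustration of the score construction described above, the following sketch converts predicted probabilities into 1-20 ventile scores and flags the top 15% (scores of 18 and higher). The variable names are ours and the data is synthetic; this is not the deployed implementation.

import numpy as np

def to_ventile_scores(probs):
    """Map predicted placement probabilities to 1-20 ventile scores.

    Cutoffs are the 5%, 10%, ..., 95% quantiles of the estimated probability
    distribution, so roughly 5% of cases fall into each score bin.
    """
    cutoffs = np.quantile(probs, np.linspace(0.05, 0.95, 19))
    return 1 + np.searchsorted(cutoffs, probs, side="right")

# Example: flag the riskiest 15% of calls (ventile scores of 18 or higher)
# as "mandatory screen-ins", mirroring the initial deployment rule.
probs = np.random.default_rng(0).beta(2, 8, size=1000)
scores = to_ventile_scores(probs)
mandatory_screen_in = scores >= 18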

2.2.3. Issues with model validation

As we only recently discovered, two mistakes were made in evaluating the predictive performance of the original version of the AFST, the version that went into deployment in 2016. These errors meant that reported estimates of the predictive performance of the AFST models (e.g., AUCs) were over-optimistic. To be clear, uncovering these issues earlier would not have affected how or whether the AFST was deployed. Correcting the validation scheme has produced more realistic estimates of model performance under ideal conditions, but does not change the models themselves. We describe the issues briefly here, and outline a corrected validation scheme in the section that follows.

The first problem comes from the way in which the train-test split was initially performed. The split was obtained by randomly holding out 30% of the referral records for model validation, retaining the remaining 70% of records for model training.

3. Despite what the name suggests, "mandatory" screen-ins can be overridden by hotline supervisors. We discuss overrides in Section 5.

This has two potential issues: (1) the same children appeared both in the training and test set (but for different referrals); and (2) referrals involving multiple children had their records split between the training and test set (so that, say, siblings on the same referral could have been allocated between the train and test sets). Issue (2) is the primary cause of over-optimism in estimating predictive performance. This is because placement outcomes, although not universally the same for all children associated with a referral, are nonetheless highly correlated.

Secondly, the 71 features used in the final placement model were selected by looking at regression coefficient t-statistics4 using the full set of data. The train-test split was performed only after the set of features had been selected, so even if the split itself were valid, the model evaluation would not have accounted for the variable selection step. This further contributed to over-optimism in the prediction performance of the original AFST model.

The validation results presented in the remainder of this paper use a corrected train-test split, and do not suffer from these issues. For comparison purposes throughout, we use a copycat model of the AFST ("AFST copy") obtained by carrying out the variable selection procedure using only the data in the train split.

Method                AUC    TPR    FPR
Logistic (all vars)   0.70   0.49   0.21
AFST copy             0.74   0.54   0.20
SVM                   0.77   0.57   0.20
Random Forest         0.77   0.58   0.20
XGBoost               0.80   0.61   0.19

Table 1: Performance results for methods under analysis. TPR and FPR correspond to the 25% highest risk cutoff (ventile scores of 16 and higher).

2.2.4. Model rebuild.

In April 2017 the research team began a rebuild of the AFST with the aim of improving model accuracy by using the full set of available features and applying more flexible machine learning models.

4. For more details, see p. 13 of https://tinyurl.com/y8n5m9kg

Figure 1 provides a summary of the data, model options and validation metrics used during the rebuild.

As part of this rebuild, we also revised the train-test splitting approach. Sample splitting was performed to ensure that there were no overlapping children or referrals across the training and test data. 32,086 of the referral records were used in training the models, and 14,417 referral records were held out for validation purposes. More details on this splitting approach, along with validation results from holding out the most recent year of data as a test set, are provided in the Supplement.
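One way to implement a split of this kind (a sketch under assumed column names, not necessarily the exact procedure used for the AFST rebuild) is to link records that share a child or a referral into clusters and then assign whole clusters to one side of the split, so that no child and no referral straddles the train/test boundary:

import numpy as np
import pandas as pd

def grouped_train_test_split(records, test_frac=0.3, seed=0):
    """Split referral records so no child or referral appears on both sides."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link each record's child id and referral id into one cluster.
    for child, referral in zip(records["child_id"], records["referral_id"]):
        union(("child", child), ("referral", referral))

    roots = records["referral_id"].map(lambda r: find(("referral", r)))
    cluster_id = pd.factorize(roots)[0]
    n_clusters = cluster_id.max() + 1

    # Assign whole clusters to test; the record-level test fraction is
    # therefore only approximately test_frac.
    rng = np.random.default_rng(seed)
    test_ids = rng.choice(n_clusters, size=int(test_frac * n_clusters), replace=False)
    is_test = np.isin(cluster_id, test_ids)
    return records.loc[~is_test], records.loc[is_test]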

The machine learning methods we considered included support vector machines (Vapnik, 1998) with probability prediction (Platt et al., 1999), random forests (Breiman, 2001) and XGBoost (Chen and Guestrin, 2016). We used available implementations of these methods in R, Python and LibSVM (Chang and Lin, 2011). These competing models produced considerable gains over the logistic regression model. Table 1 shows several test set classification metrics for the different methods.
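Metrics of the kind reported in Table 1 can be computed from held-out labels and predicted probabilities along the following lines. This is a sketch: y_true and y_prob are placeholder arrays, and the cutoff mirrors the 25% highest-risk rule (ventile scores of 16 and higher) used in the table.

import numpy as np
from sklearn.metrics import roc_auc_score

def cutoff_metrics(y_true, y_prob, top_fraction=0.25):
    """AUC plus TPR/FPR when the highest-risk `top_fraction` of cases is flagged."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    threshold = np.quantile(y_prob, 1 - top_fraction)  # boundary of ventile 16
    flagged = y_prob >= threshold
    tpr = flagged[y_true == 1].mean()  # recall / sensitivity
    fpr = flagged[y_true == 0].mean()
    return {"AUC": roc_auc_score(y_true, y_prob), "TPR": tpr, "FPR": fpr}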

In the redesign it was also decided that the mandatory screen-in threshold would be lowered to capture the top 25% highest risk cases (cases scoring 16 or higher). This lower threshold corresponds to a TPR (Recall/Sensitivity) of 58% for the random forest model and 61% for the XGBoost model. That is, of the test set cases that were observed to result in placement within 2 years of the call, around 3 in 5 are flagged as mandatory screen-ins by the two best performing models.

3. Predictive bias

Predictive bias is a major concern when deploying predictive modeling in the child welfare context. In this section we reflect on our analysis to date of the racial bias5 properties of the Allegheny County models. To set the stage for our discussion we begin with an overview of how human bias is thought to contribute to observed disproportionalities in the child welfare system.

5. We performed similar analyses looking instead at poverty status and sex. There was no strong indication of predictive bias with respect to these other variables.

This provides a context for thinking about how models can be helpful in mitigating human bias. We then evaluate the predictive bias of the AFST copy model and the Random forest model being considered in the redesign. While this work is presented in the child welfare context, our analysis translates to many other risk prediction settings.

Figure 2: Estimated probabilities from the logistic regression model. Vertical grey lines indicate ventile cutoffs. Coloured segments correspond to quartiles of the risk distribution. Dashed red line indicates the mandatory screen-in threshold in the initial deployment of the model.

3.1. Setting the stage: Human bias.

To frame a discussion of model bias, it is important to first understand how disparate outcomes might arise in the existing process. Racial disproportionality and disparity are widely acknowledged problems in the child welfare system. A 2016 report from the Child Welfare Information Gateway describes four major factors that significantly contribute to explaining the observed racial differences: (1) disproportionate and disparate need among families of color; (2) geographic context; (3) child welfare system factors affecting the ability to provide resources for families of colour; and (4) implicit or explicit racial bias and discrimination by child welfare professionals. Statistical modeling can help to better understand and quantify factors (1) and (2), but it cannot do anything to counteract them.

As we have argued in the previous sections, by providing task-relevant information to help case workers better prioritize cases, we can hope to move the needle on (3) through a more strategic allocation of resources. In this section we describe how the racial bias factor is also one that we can directly address.

Recent studies have found that Black children are more likely than White children to be screened in, even in cases where Black children had lower risk levels than White children (Dettlaff et al., 2011). These disproportionalities are typically attributed to two plausible causes: (1) caseworkers may be applying different risk thresholds depending on the child's race ("applying different cutoffs"); and (2) caseworkers may be overestimating the risk for Black children relative to White children ("miscalibration").

By introducing accurate and carefully deployed risk assessment tools into the decision-making pipeline, we can get finer control over both of these sources of bias. The first source of bias is the most straightforward to mitigate once one has a numeric risk score. One need only ensure that the same risk threshold is being systematically applied in each case. While such a strategy may result in disparate impact if risk profiles differ across groups, it is at the very least enforceable. The arguably greater concern is that the underlying model may be miscalibrated, and may thus overestimate risk for some groups relative to others. We explore the calibration properties of the Allegheny models in the next section.

Before continuing to the next section, we pause to briefly touch on a third source of bias, which is known as "statistical discrimination". Statistical discrimination arises when caseworkers use the few observable factors most easily available to them, such as zip code, race and gender, to make inferences about relevant but unobservable factors such as risk of sexual abuse, exposure to gun violence or access to resources. On the one hand, PRM can be seen as a formalized version of precisely this sort of bias. The models we construct are grounded in correlations, not causation, and many of the predictive variables may simply be proxies for underlying causal factors. On the other hand, the models are constructed from hundreds of different case-level features and the derived risk scores turn out to be highly predictive. The only way to avoid "statistical discrimination" is to require that predictive risk models only use the provably causal features of a case.

However, this would decrease model accuracy and degrade process outcomes.

Figure 3: Placement rates by race/ethnicity of the victim (Black, White, Unknown, Other, Mixed, Hispanic, Asian). X-axis values are sorted in descending order of count. Error bars reflect 95% confidence intervals.

3.2. Calibration

The notion that a fair instrument is one that has equal predictive accuracy across groups has a long history in the Standards for Educational and Psychological Testing (Association et al., 1999). In this literature instruments are tested for what is known as differential prediction, otherwise known as predictive bias (Skeem and Lowenkamp, 2016). For the purpose of the present discussion we adopt instead the term calibration, adapted from Kleinberg et al. (2016), as it more clearly reflects the type of predictive accuracy with which equality is being required. More formally: We say that a risk assessment model is well-calibrated with respect to race if, for each score s, the proportion of cases scoring s that are observed to have the adverse event (here, placement) is the same for every race/ethnicity group. Formally, given a group variable G ∈ {g_1, ..., g_k}, a score S and an outcome Y, calibration refers to the condition:

    P(Y = 1 | S = s, G = g_j) = P(Y = 1 | S = s)   for all j

Another way of formulating this criterion is as a conditional independence statement: We require the outcome (placement) to be statistically independent of race conditional on the score.
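An empirical check of this condition amounts to comparing, for each score value, the observed placement rates across groups. A minimal sketch follows, assuming a data frame with score, race and placed columns (our names, not the study's):

import pandas as pd

def calibration_table(df):
    """Observed placement rate per (score, race) cell.

    Under calibration, the columns of the returned table should agree
    up to sampling noise at every score value.
    """
    rates = (
        df.groupby(["score", "race"])["placed"]
          .agg(placement_rate="mean", n="size")
          .reset_index()
    )
    # The counts (n) can be used to attach confidence intervals to each cell.
    return rates.pivot(index="score", columns="race", values="placement_rate")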

We now present our empirical findings. The results in this section are obtained using a validation sample of 14,417 screened-in referrals that were not used in training the placement prediction models. Figure 4 displays the observed placement rates for the AFST copy and Random forest models. These plots focus only on children whose race is recorded as Black or White. In the case of the AFST copy model, we find evidence of poor calibration around the top 2 score ventiles, and in the lower ventiles 1-5. Screened-in referrals that score a 20 on the AFST ventile scale are observed to result in placement in 50% of cases involving Black children and only 30% of cases involving White children. That is, at the highest ventile, the model appears to overestimate risk for White children compared to Black children. Put differently, a White child who scores 20 on the AFST has comparable placement risk to a Black child who scores 18. Looking at the Random forest results, we again find evidence of miscalibration in the upper score levels. Differences observed at the lower ventiles are much smaller for the Random forest than for the AFST copy model. We are presently delving deeper to try to understand the reason for the miscalibration. Depending on what the investigation reveals, we may take steps to correct the scores by applying within-group recalibration techniques.
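One standard form such a correction could take (a generic recipe sketched here, not necessarily the approach that will be adopted for the AFST) is to fit a separate monotone calibration map per group on held-out data and use it to remap that group's scores. The sketch assumes numpy arrays of scores, observed outcomes, and group labels.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_group_calibrators(scores, outcomes, groups):
    """Fit one isotonic (monotone) calibration map per group on held-out data."""
    calibrators = {}
    for g in np.unique(groups):
        mask = groups == g
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(scores[mask], outcomes[mask])
        calibrators[g] = iso
    return calibrators

def recalibrate(scores, groups, calibrators):
    """Replace each case's score with its group-specific calibrated estimate."""
    out = np.empty(len(scores), dtype=float)
    for g, iso in calibrators.items():
        mask = groups == g
        out[mask] = iso.predict(scores[mask])
    return out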

Since the models are intended to serve as decision-support tools and do not themselves produce decisions, it is important to ensure that the scores reflect meaningful (furthermore, equally meaningful) information about placement risk. Calibration is thus often the primary or only predictive fairness criterion considered. However, it is important to note that one can manipulate scores in ways that preserve calibration but greatly increase the disparity in outcomes (Corbett-Davies et al., 2017). Miscalibration is certainly a concern, but it is not the only one that is relevant.

3.3. Accuracy Equity and Error Rates

This issue was most publicly brought to light by a team at ProPublica in their investigation into the COMPAS recidivism prediction instrument (Angwin et al., 2016). In this report the authors found that the false positive rate of the instrument was considerably higher (and the false negative rate was considerably lower) for Black defendants than for White defendants.

Figure 4: Observed placement rates by AFST copy model (left) and Random forest model (right) risk score ventile, broken down by the victim's race (Black or White). Error bars correspond to 95% confidence intervals.

Whether error rate imbalance is an indication of predictive bias remains a widely contested issue, and is beyond the scope of this paper. Instead, we focus on presenting and contextualizing an assessment of the Allegheny County models across several fairness metrics that emerged in the debate surrounding the ProPublica article.

We begin with a look at the "accuracy equity" properties of the Allegheny models. The term accuracy equity was used in Dieterich et al. (2016) to refer to equality of AUC across groups. Figure 5 shows the ROC curves for both the AFST copy model and the Random forest model stratified by race/ethnicity group. Looking at the left panel, we find that there are some differences in the logistic regression ROC curves (and the implied AUCs) across groups. Similar variation appears in the random forest ROC curves, though the most significant difference, that between the "Unknown" category and the rest, is not one that directly corresponds to a salient race/ethnicity group.

Figure 5 also displays points corresponding to the value of (FPR, 1 - FNR) at the 25% highest risk cutoff used for determining "mandatory screen-ins". We see that even though the ROC curves lie quite close to one another for most groups, the chosen threshold corresponds to different points on the ROC curves for different groups. This is most clearly pronounced in the case of the AFST copy model.

While the ROC curves for the four non-Unknown groups lie close together, at the chosen risk cutoff the FPR and FNR rates differ considerably. False positive rates are significantly higher for the Mixed race subgroup. This FPR imbalance could lead to a perception that Mixed-race families are over-investigated relative to White families. This is a complicated matter, in part because risk of placement in foster care is not the only relevant consideration when deciding whether to screen in a case. Thus a screen-in that is a false positive from the viewpoint of placement may not be an unjustified investigation once other considerations are taken into account.
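The group-wise quantities behind Figure 5 and this discussion, namely per-group AUC ("accuracy equity") and the FPR/FNR obtained when flagging ventile scores of 16 and higher, can be computed along the following lines. This is a sketch over illustrative arrays, not the code used for the paper's analysis.

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def group_error_rates(y_true, ventile, race, cutoff=16):
    """Per-group AUC and the FPR/FNR at the mandatory screen-in cutoff."""
    rows = []
    for g in np.unique(race):
        m = race == g
        flagged = ventile[m] >= cutoff
        y = y_true[m]
        rows.append({
            "group": g,
            "AUC": roc_auc_score(y, ventile[m]),  # requires both outcomes in the group
            "FPR": flagged[y == 0].mean(),
            "FNR": (~flagged)[y == 1].mean(),
        })
    return pd.DataFrame(rows)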

Furthermore, as the recent work of Kleinberg et al. (2016), Chouldechova (2017) and Berk et al. (2017) has shown, some level of imbalance on other predictive accuracy criteria is unavoidable when a model satisfying calibration-type properties is applied to a population where prevalence differs across groups. Figure 3 shows the observed placement rates broken down by race/ethnicity category. The rate differences may not appear large in an absolute sense, but even small differences in placement rates can lead to significant trade-offs across different fairness metrics. While we do not explicitly consider the costs of different trade-offs in the current work, it is worth bearing in mind that any observed differences in the error metrics that we present in this section would come at a cost to rectify. Which trade-offs are worth making is the subject of ongoing discussion, but falls beyond the scope of the present paper.

4. Fairness, Processes, and Ethics

As we have seen, the latest rebuild of the Allegheny prediction tool has improved the accuracy and, at least by some metrics, the predictive fairness of the model in comparison to the model used in the initial deployment. This improvement largely resulted from transitioning away from logistic regression to ensemble methods such as random forests and boosted decision trees. Yet these accuracy gains come at a cost. While logistic regression models are traditionally viewed as being interpretable, in the sense that one can write down a clear formula for the estimated model, ensemble methods make predictions in an opaque manner. Due to interpretability concerns, decision-makers may be reluctant to adopt a more complex model despite evidence of improved prediction accuracy. In this section we comment further on the merit of improved accuracy, and offer a thought experiment that we refer to as the "oracle test" to help better frame this issue. Furthermore, since no expected benefit can be realized if the tool is not used as expected, we offer a look at how the tool has been used in practice. This brings us to a discussion of the limitations of our fairness analysis from a process perspective. We conclude the section with a discussion of ethics, and the role that ethical review has played in model development.

4.1. Oracle test

To begin, it is worth noting that overall accuracy metrics, and comparisons made on the basis thereof, may fail to present a complete picture of the differences between competing models. This issue was recently explored in Chouldechova and G'Sell (2017), who showed that model disagreement may be highly pronounced on particular salient subgroups, and hence the decision to adopt one model rather than another may come down to what those subgroups look like and how they are affected.

Even in cases where one model outperforms all other competitors and is thus the clearly preferred choice by some metric, many other questions may remain, especially those concerning the fairness of the chosen model. Some of these questions may be answerable using some of the fairness metrics presented in the previous section, while others may point to different concerns. To better separate concerns about predictive fairness properties from concerns about other possible deficiencies of the model, we find it helpful to apply what we call the Oracle Test. This is a simple thought experiment that proceeds as follows. Imagine that you are given access to an oracle, which for every individual informs you with perfect accuracy whether the individual will have an event (e.g., will be placed in foster care). Do any of the concerns you previously had remain when handed this oracle? Often the answer is yes. Even if we had perfect prediction accuracy, many valid and reasonable concerns might remain. We discuss a few of these below.

Target variable bias. The prediction target in the Allegheny models is the risk of placement in foster care for cases that are screened in. One might be concerned that this outcome variable is simply a proxy for an unobserved or difficult to observe outcome of greater interest such as, say, severe maltreatment or neglect. Race-related differences in reporting rates, screen-in rates, and investigations may mean that the target variables are in closer alignment with the outcome of interest for some groups than for others. This is arguably an even bigger issue in criminal recidivism prediction, where one is interested in re-offense but instead predicts re-arrest, than in the child welfare context.

Disconnect between the prediction target and decision criteria. A related issue is that of omitted objectives or payoffs. Depending on the range of functions performed and services offered by the given child welfare system, many of the considerations that enter in the decision process may go beyond the risk of placement. For instance, while the Allegheny models are well-suited to supporting decision-making at the referral screening phase, they are not used and may be inadequate for supporting decisions about how and whether cases should be accepted for services.

Explainability. In addition to knowing what the outcome is going to be, one may also need to know why this is the case. Even the simplest prediction models can only speak to the question of why in a limited sense.

Figure 5: Race-specific ROC curves (true positive rate vs. false positive rate) for the AFST copy (left) and Random forest (right) models, for the groups Black, Mixed, Other, Unknown, and White. Points overlaid on the curves correspond to the (FPR, 1 - FNR) values at the 25% highest risk cutoff delineating mandatory screen-ins (see Section 2).

A model that is decomposable or simulatable in the sense of Lipton (2016) may nevertheless fail to offer a satisfactory answer to why more penetrating than that the particular values of input variables combined to produce a high-risk prediction. One may be able to understand the risk factors involved and how they combine in the model, but the models have no claim to being causal. The overall utility of such an understanding may be quite limited.

Effects of interventions. Knowing that a case will result in placement if one does not intervene also leaves open the question of how, or even if, some form of preventative intervention could help divert the case from this anticipated adverse outcome. To answer the latter question one requires an understanding (a model) of what would happen under different counterfactual conditions. This is not something that standard supervised learning approaches provide. They model the data as it is, not as it might be.

4.2. Fairness is a Process Property

Our discussion so far has largely focused on the accuracy and predictive fairness properties of the scoring tool. However, the scoring tool is, as the name suggests, merely a decision-support tool that is presented to call screeners at a specific juncture in the decision-making pipeline. The business process associated with a screening decision, and where the tool enters it, is illustrated in Figure 7.

In the Allegheny process, call screeners are free to ignore the score altogether or only use it in selective cases. If this happens, then the fairness and accuracy properties of the tool will not carry over into equitable and effective decision-making. Indeed, data from the post-implementation period makes it clear that screening decisions are presently not as strongly associated with the AFST risk score as we might have anticipated. While some fear that human assessors may fall into a pattern of rubber-stamping the recommendations of the tool, this does not appear to be happening in Allegheny County.

The left panel of Figure 6 shows a breakdown of the screening decisions over the entire range of AFST ventile scores. This plot shows that screen-out rates are decreasing in the AFST score. Furthermore, we find evidence of a sharp decrease in screen-out rates at the screen-in threshold6 of 18. Even so, screen-out rates for this highest risk population remain non-negligible. A review of the data shows that supervisors are overriding nearly 1 in 4 mandatory screen-ins, and the rate at which overrides occur varies from one supervisor to the next. We also find that screen-in rates are consistently above 25% across the entire range of ventile scores.

Prior to seeing the AFST score, call workers are asked to enter two scores reflecting their assessment of the risk and safety of the alleged victim(s).

6. Calls that scored 18 or above were labeled as "mandatory screen-ins" and only supervisors were allowed to screen them out.

Figure 6: Left: Breakdown of screening decisions (screen in, screen out, already active, unknown/pending) by AFST ventile score. Ventile scores of 18-20 indicate "mandatory" screen-ins. Right: Breakdown of screening decisions by the call worker assessed risk-safety score. Data consists of 11,157 referrals that have been screened since the AFST was first deployed.

Figure 7: Referral progression process, with the decision point that relies on the AFST highlighted. Call screening steps: call information is received and processed; the assigned call screener collects additional information (including from the individual who reported the maltreatment and the Client View application, which displays individual-level prior service involvement); the call screener assigns risk and safety ratings based on the information collected; as a new step, the call screener runs the Allegheny Screening Tool; the call screener consults with the call screening supervisor; in limited cases a field screen is conducted; and the child welfare call screening decision is made.

The risk score denotes whether the call worker believes the case to be Low (1), Moderate (2), or High risk (3). The safety score denotes whether the call worker believes the alleged victims to be at No Safety Threat (1), Impending danger (2) or Present danger (3). An overall risk-safety score is obtained by multiplying the risk and safety scores together. Figure 8 shows the risk-safety score breakdown across the range of AFST ventile scores, and the right panel of Figure 6 shows a breakdown of the screening decisions across the range of risk-safety scores. The two panels in Figure 9 show a breakdown of the risk and safety scores by AFST score ventile. We find that there is a much stronger correspondence between screening decisions and the call worker assessed risk-safety score than with the AFST score. This suggests that call workers are largely continuing to rely on their own assessments rather than those of the AFST tool. Furthermore, there is no clear association between the risk and safety assessments provided by the call workers and the AFST ventile score.

An analysis is underway to better understand what factors influence override decisions, and whether the overrides and weak reliance on the AFST tool are negatively affecting the overall quality of the screening process. Given that decisions are being made with only weak adherence to the AFST tool, it is far from clear that the predictive fairness properties of the models would translate into equitable decision making. When it comes to data-driven decision making, it is important to bear in mind that fairness is a process property, not just a model property.

Figure 8: Overall risk-safety scores as assessed by the call workers prior to viewing the AFST ventile score, across AFST score ventiles. The risk-safety score is a product of two scores on a scale of 1-3 that call workers are asked to provide. Higher scores indicate higher assessment of risk and lower assessments of safety.

4.3. Ethical review

Prior to the deployment of the AFST, the County commissioned an independent ethical review of the work, to be conducted by Tim Dare and Eileen Gambrill.7 The authors adopted a comparative lens, arguing that ethical questions about the PRM entail an analysis of the costs and benefits of the proposed approach relative to the existing process or other reasonable alternatives. The review emphasized the potential risks of confirmation bias and stigmatization, and this in turn shaped decisions about how and when the tool would be used. For instance, in order to avoid confirmatory bias, scores are not shared with workers who investigate cases. Furthermore, in training it was clearly indicated that the scores do not reflect anything about the certainty of the present allegations, and do not themselves provide evidence of case substantiation. It was also determined that race could be included as a predictor variable if it substantively improved the predictive performance of the model.

Throughout the project, the research team and County leaders had a strong commitment to transparency.

7. See p. 46 of https://tinyurl.com/y8n5m9kg for the full report.

They met multiple times with community groups, stakeholders, and families who were in the welfare system. This has led to strong community acceptance of the project.

5. Discussion

In this final section we reflect on several of the key outstanding technical challenges affecting the evaluation and implementation of the AFST.

5.1. Selective labels

As we discussed in Section 2, the current models have been trained and evaluated on the subset of referrals that were observed to be screened in. However, the purpose of the AFST is not to convey risk once the screening decision has been made, but rather to help the call worker make that decision in the first place. A key challenge is that we do not get to observe placement outcomes for a large fraction of cases that are screened out. This makes it difficult to assess the accuracy of the models on the full set of referrals, not just those that were screened in. A version of this "selective labels" problem was studied by Kleinberg et al. (2017) in the bail decision context, but their problem setting and analytic approach do not directly translate to our setting. We are actively working on methodology to address this problem in the call screening context.

There are two key challenges. First, unlike in the bail setting, where it may be reasonable to assume that cases are randomly assigned to judges who make decisions independently, the call workers and supervisors are physically co-located and their observed decisions are much more intertwined. Second, screened-out referrals may be followed by another referral at a later point in time. Thus we do get to at least partially observe outcomes for a subset of screened-out cases. Preliminary analyses indicate that the models continue to perform well on the screened-out cases that were re-referred shortly thereafter and then screened in. However, considerably more work is required to ensure that the model performs well on the entire set of referrals.
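To make this concrete, the following is a minimal sketch, under hypothetical column names, of the kind of comparison described above; it is not the evaluation code used for the AFST.

# A minimal sketch (not the AFST evaluation code) of comparing model
# performance on screened-in referrals with performance on the partially
# observed screened-out referrals that were later re-referred and screened in.
# All column names below are hypothetical:
#   score        -- model risk score for the referral
#   screened_in  -- True if the referral was screened in at the hotline
#   re_referred  -- True if a screened-out referral was later re-referred
#                   and screened in, so its outcome is eventually observed
#   outcome      -- observed adverse outcome label (defined only where the
#                   case was eventually screened in)
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_on_subset(referrals: pd.DataFrame, mask: pd.Series) -> float:
    """AUC of the risk score on the subset of referrals selected by mask."""
    subset = referrals[mask]
    return roc_auc_score(subset["outcome"], subset["score"])


def selective_labels_check(referrals: pd.DataFrame) -> dict:
    """Compare performance on screened-in vs. re-referred screened-out cases."""
    screened_in = referrals["screened_in"]
    re_referred_only = ~screened_in & referrals["re_referred"]
    return {
        "auc_screened_in": auc_on_subset(referrals, screened_in),
        # Outcomes are only observed for screened-out cases that came back,
        # so this second number is an informative but biased check.
        "auc_screened_out_re_referred": auc_on_subset(referrals, re_referred_only),
    }

Such a comparison only partially addresses the selective labels problem: outcomes for screened-out referrals that never return remain unobserved, which is precisely why the methodology discussed above is still under development.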

5.2. Implementation challenges

Figure 9: Call worker assessed Risk (left) and Safety (right) scores across the AFST score ventiles. (Left panel legend: High Risk (H), Moderate Risk (M), Low Risk (L); right panel legend: Present Danger, Impending Danger, No Safety Threat; x-axis: AFST ventile score.)

As discussed in Section 4, data logs from the first year of implementation indicate that supervisors are overriding 1 in every 4 mandatory screen-ins. Furthermore, these override rates differ from one supervisor to the next. The data indicate that the tool is resulting in lower screen-in rates for the lowest-risk cases, but the override decisions for the highest-risk cases have been difficult to explain.
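As an illustration of the kind of monitoring such logs support, the sketch below computes override rates for mandatory screen-ins, overall and per supervisor. The log schema (supervisor_id, mandatory_screen_in, screened_in) is a hypothetical assumption, not the County's actual data model.

# A minimal sketch of computing override rates for mandatory screen-ins from
# decision logs. Hypothetical columns:
#   supervisor_id       -- identifier of the supervisor on the referral
#   mandatory_screen_in -- True if the AFST score triggered a mandatory screen-in
#   screened_in         -- True if the referral was actually screened in
import pandas as pd


def override_rates_by_supervisor(decisions: pd.DataFrame) -> pd.Series:
    """Fraction of mandatory screen-ins that were overridden, per supervisor."""
    mandatory = decisions[decisions["mandatory_screen_in"]]
    overridden = ~mandatory["screened_in"]  # screened out despite the mandate
    return overridden.groupby(mandatory["supervisor_id"]).mean().sort_values()


def overall_override_rate(decisions: pd.DataFrame) -> float:
    """Overall override rate; roughly 0.25 given the 1-in-4 figure above."""
    mandatory = decisions[decisions["mandatory_screen_in"]]
    return float((~mandatory["screened_in"]).mean())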

A challenge in the implementation of the AFST is that it coincided with the major reform of child welfare laws in the State of Pennsylvania that we previously discussed. One effect of these changes is that they established a statewide hotline. These State calls are handled slightly differently by County staff, and the algorithm's deployment was not well integrated into this new business process. This could partially explain the levels of overrides. A new business process is being planned as part of the rebuild.

Additionally, the challenge remains that however carefully one deploys an algorithm, data elements and processes change over time in ways that impact the model's performance in the field. Constant auditing of scores and predictor variables is necessary, and monthly reports are given to leadership on a range of descriptive statistics and performance measures. It could be that performance of the algorithm in the field has deteriorated, leading to justifiable overrides.
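As one concrete form such auditing could take, the sketch below compares the score and predictor distributions in the current month against a reference window using a two-sample Kolmogorov-Smirnov test. The threshold and field names are illustrative assumptions, not the County's actual monitoring pipeline.

# A minimal sketch of a month-over-month drift check on scores and predictors.
# `reference` and `current` are pandas DataFrames of scored referrals with a
# hypothetical "score" column plus numeric predictor columns.
from typing import List

import pandas as pd
from scipy.stats import ks_2samp


def drift_report(reference: pd.DataFrame, current: pd.DataFrame,
                 predictors: List[str]) -> pd.DataFrame:
    """Flag variables whose distribution shifted between the two windows."""
    rows = []
    for col in ["score"] + predictors:
        stat, pvalue = ks_2samp(reference[col].dropna(), current[col].dropna())
        rows.append({
            "variable": col,
            "ks_statistic": stat,
            "p_value": pvalue,
            "mean_shift": current[col].mean() - reference[col].mean(),
            "flagged": stat > 0.1,  # illustrative threshold, not a County policy
        })
    return pd.DataFrame(rows)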

There is a persistent tension between the need to allow staff to override the algorithm and the need for staff to trust the model risk assessments when the latter indicate high certainty in the likely outcome. On the one hand, research has shown that statistical models tend to outperform human decision makers. However, there are also cases when the call worker receives important additional information during the referral call to which the model does not have access. Finding the right balance will undoubtedly involve continued conversations with the frontline staff on how to redesign the AFST to best enable them to make the challenging decisions with which they are tasked.

Acknowledgments

We thank Erin Dalton at the Allegheny County Department of Human Services for her leadership and oversight of this project. Thanks to Wei Zhu for assistance with the overrides analysis. We also wish to thank the anonymous referees for their comments and suggestions for improving the manuscript.

References

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. How we analyzed the COMPAS recidivism algorithm. 2016. URL https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

American Educational Research Association, American Psychological Association, National Council on Measurement in Education, et al. Standards for educational and psychological testing. American Educational Research Association, 1999.

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. arXiv preprint arXiv:1703.09207, 2017.

John Billings, Ian Blunt, Adam Steventon, Theo Georghiou, Geraint Lewis, and Martin Bardsley. Development of a predictive model to identify inpatients at risk of re-admission within 30 days of discharge (PARR-30). BMJ Open, 2(4):e001667, 2012.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 2017.

Alexandra Chouldechova and Max G'Sell. Fairer and more accurate, but for whom? arXiv preprint arXiv:1707.00046, 2017.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.

Robyn M Dawes, David Faust, and Paul E Meehl. Clinical versus actuarial judgment. Science, 243(4899):1668–1674, 1989.

Alan J Dettlaff, Stephanie L Rivaux, Donald J Baumann, John D Fluke, Joan R Rycraft, and Joyce James. Disentangling substantiation: The influence of race, income, and risk on the substantiation decision in child welfare. Children and Youth Services Review, 33(9):1630–1637, 2011.

William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. 2016.

John Fluke, Brenda Jones Harden, Molly Jenkins, and Ashleigh Reuhrdanz. Disparities and disproportionality in child welfare: Analysis of the research. 2011.

William M Grove, David H Zald, Boyd S Lebow, Beth E Snitz, and Chad Nelson. Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1):19, 2000.

Hyunil Kim, Christopher Wildeman, Melissa Jonson-Reid, and Brett Drake. Lifetime prevalence of investigating child maltreatment among US children. American Journal of Public Health, 107(2):274–280, 2017. ISSN 0090-0036.

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. Working Paper 23180, National Bureau of Economic Research, February 2017.

Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

Paul E Meehl. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. 1954.

Laura E Panattoni, Rhema Vaithianathan, Toni Ashton, and Geraint H Lewis. Predictive risk modelling in health: Options for New Zealand and Australia. Australian Health Review, 35(1):45–51, 2011.

John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Ravi Shroff. Predictive analytics for city agencies: Lessons from children's services. Big Data, 2017.

Jennifer L Skeem and Christopher T Lowenkamp. Risk, race, and recidivism: Predictive bias and disparate impact. Criminology, 54(4):680–712, 2016.

Rhema Vaithianathan, Tim Maloney, Emily Putnam-Hornstein, and Nan Jiang. Children in the public benefit system at risk of maltreatment: Identification via predictive modeling. American Journal of Preventive Medicine, 45(3):354–359, 2013.

Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
