A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data

Inbal Yahav (Bar Ilan University, Israel)
Galit Shmueli (National Tsing Hua U, Taiwan)
Deepa Mani (Indian School of Business, India)

Randomization or self-selection?

Self-selection: the challenge

• Large impact studies of an intervention
• Individuals/firms self-select intervention group/duration
• In controlled experiments, some variables might remain unbalanced

How to identify and adjust for self-selection?

Three Applications

Impact of training on earnings (Experiment)
Field experiment by US govt
• LaLonde (1986) compared to observational control
• Re-analysis by PSM (Dehejia & Wahba, 1999, 2002)

Impact of e-Gov service in India (Pseudo-experiment)
New online passport service
• survey of online + offline users
• bribes, travel time, etc.

Impact of outsourcing contract features on financial performance (Observational)
• pricing mechanism
• contract duration

Common Approaches

• Heckman-type modeling
• Propensity Score Approach (PS)

Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples

Y = performance measure(s)
T = intervention
X = pre-intervention variables

Propensity Scores Approach

Self-selection: P(T|X) ≠ P(T)

Step 1: Estimate selection model logit P(T=1|X) = f(X) to compute propensity scores P(T|X)

Step 2: Use scores to create matched samples
• PSM = use matching algorithm
• PSS = divide scores into bins

Step 3: Estimate effect on Y (compare groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e
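The stratification variant (PSS) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the propensity scores have already been estimated, and the function name `stratified_effect`, the equal-count binning, and the toy data are all hypothetical.

```python
from statistics import mean

def stratified_effect(scores, treated, outcomes, n_bins=5):
    """PSS sketch: sort records by propensity score, cut them into n_bins
    equal-count bins, and average the within-bin difference in mean Y
    (treated minus control), weighting each bin by its record count."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [order[k * len(order) // n_bins:(k + 1) * len(order) // n_bins]
            for k in range(n_bins)]
    total, used = 0.0, 0
    for idx in bins:
        y1 = [outcomes[i] for i in idx if treated[i] == 1]
        y0 = [outcomes[i] for i in idx if treated[i] == 0]
        if not y1 or not y0:  # a bin without both groups gives no comparison
            continue
        total += len(idx) * (mean(y1) - mean(y0))
        used += len(idx)
    return total / used if used else float("nan")

# Toy data (made up): the true effect is 2, but Y also rises with the score,
# and high-score records are treated more often (self-selection).
scores = [0.1] * 4 + [0.3] * 4 + [0.5] * 4 + [0.7] * 4 + [0.9] * 4
treated = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
outcomes = [2 * t + 10 * s for t, s in zip(treated, scores)]

naive = (mean(y for y, t in zip(outcomes, treated) if t == 1)
         - mean(y for y, t in zip(outcomes, treated) if t == 0))
pss = stratified_effect(scores, treated, outcomes)
print(round(naive, 2), round(pss, 2))
```

In this toy setting the naive comparison overstates the effect, while the stratified estimate recovers it because each bin holds records with the same score.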

Challenges of PS in Big Data

1. Matching leads to severe data loss

2. PS methods suffer from “data dredging”

3. No variable selection (cannot identify variables that drive the selection)

4. Assumes constant intervention effect

5. Sequential nature is computationally costly

6. Logistic model requires researcher to specify exact form of selection model

Proposed Solution: Tree-based approach

[Diagram: Y, T, X → propensity scores P(T|X) → E(Y|T), even E(Y|T, X)]

“Kill the Intermediary”

Classification Tree
Output: T (treat/control)
Inputs: X's (income, edu, family, …)

Records in each terminal node share the same profile (X) and the same propensity score P(T=1|X)

Tree-Based Approach

Four steps (plus an optional fifth):
1. Run selection model: fit tree T = f(X)
2. Present resulting tree; see unbalanced X's
3. Treat each terminal node as sub-sample for measuring Y; conduct terminal-node-level performance analysis
4. Present terminal-node analyses visually
5. [optional]: combine analyses from nodes with homogeneous effects

Like PS, assumes observable self-selection
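Step 3 (terminal-node-level analysis) can be sketched as below. The sketch assumes the tree has already been fitted and that each record carries its terminal-node label. The helper name `node_level_effects`, the unequal-variance t statistic, and the toy data are illustrative assumptions; the slides do not specify which test is used.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance

def node_level_effects(node_ids, treated, outcomes):
    """For each terminal node, compare mean Y of treated vs. control records
    and report the difference with a Welch-type t statistic."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for node, t, y in zip(node_ids, treated, outcomes):
        groups[node][t].append(y)
    results = {}
    for node, g in groups.items():
        y0, y1 = g[0], g[1]
        if len(y0) < 2 or len(y1) < 2:  # too few records to estimate variance
            continue
        diff = mean(y1) - mean(y0)
        se = sqrt(variance(y1) / len(y1) + variance(y0) / len(y0))
        results[node] = {"n": len(y0) + len(y1),
                         "effect": diff,
                         "t_stat": diff / se if se > 0 else None}
    return results

# Toy data (made up): two terminal nodes with opposite effects.
node_ids = ["A"] * 6 + ["B"] * 6
treated = [0, 0, 0, 1, 1, 1] * 2
outcomes = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 4.0, 5.0, 6.0]

effects = node_level_effects(node_ids, treated, outcomes)
for node, res in effects.items():
    print(node, res)
```

Reporting one row per node keeps heterogeneous effects visible instead of averaging them away; nodes with similar effects could then be combined, as in the optional step.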

Solves challenges of PS in Big Data

1. Matching leads to severe data loss

2. PS methods suffer from “data dredging”

3. No variable selection (cannot identify variables that drive the selection)

4. Assumes constant intervention effect

5. Sequential nature is computationally costly

6. Logistic model requires researcher to specify exact form of selection model

Why Trees in Explanatory Study?

Flexible non-parametric selection model (f)

Automated detection of unbalanced pre-intervention variables (X)

Easy to interpret, transparent, visual

Applicable to binary, polytomous, continuous intervention (T)

Useful in Big Data context

Identify heterogeneous effects (effect of T on Y)

Tree Creation

Which algorithm?
Conditional-Inference trees (Hothorn et al., 2006)
– Stop tree growth using statistical tests of independence
– Binary splits

Big Data Simulation

Intervention: binary, T = {0, 1}; continuous, T ∼ N
Sample sizes (n): 10K, 100K, 1M
# Pre-intervention variables (p): 4, 50 (+ interactions)
Pre-intervention variable types: binary, Likert-scale, continuous
Outcome variable types: binary, continuous

Selection models
#1: logit P(T=1) = b0 + b1 x1 + … + bp xp
#2: logit P(T=1) = b0 + b1 x1 + … + bp xp + interactions

Intervention effects

1. Homogeneous
   Control: E(Y | T = 0) = 0.5
   Intervention: E(Y | T = 1) = 0.7
2. Heterogeneous
   Control: E(Y | T = 0) = 0.5
   Intervention: E(Y | T = 1, X1=0) = 0.7
                 E(Y | T = 1, X1=1) = 0.3

1. Homogeneous
   Control: E(Y | T = 0) = 0
   Intervention: E(Y | T = 1) = 1
2. Heterogeneous
   Control: E(Y | T = 0) = 0
   Intervention: E(Y | T = 1, X1=0) = 1
                 E(Y | T = 1, X1=1) = -1
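One of these settings (selection model #1 with the heterogeneous effect 0 / 1 / -1) can be simulated as in the sketch below. Several details are assumptions made only for illustration and are not taken from the slides: binary X's, the coefficient values, unit-variance Gaussian noise on Y, and the function name `simulate`.

```python
import random
from math import exp

def simulate(n, coefs, intercept=0.0, seed=0):
    """Draw n records (x, t, y): binary X's, T assigned through a logistic
    selection model, and a heterogeneous effect of T on Y that flips sign
    with X1 (the first pre-intervention variable)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in coefs]
        linear = intercept + sum(b * xi for b, xi in zip(coefs, x))
        p_treat = 1.0 / (1.0 + exp(-linear))  # inverse logit
        t = 1 if rng.random() < p_treat else 0
        if t == 0:
            mu = 0.0                           # E(Y | T = 0) = 0
        else:
            mu = 1.0 if x[0] == 0 else -1.0    # E(Y | T = 1, X1) = 1 or -1
        rows.append((x, t, rng.gauss(mu, 1.0)))
    return rows

def mean_y(rows, t, x1=None):
    """Average Y among records with intervention t (optionally X1 = x1)."""
    ys = [y for x, tt, y in rows if tt == t and (x1 is None or x[0] == x1)]
    return sum(ys) / len(ys)

def share_treated(rows, x1):
    """Share of treated records among those with X1 = x1."""
    ts = [tt for x, tt, _ in rows if x[0] == x1]
    return sum(ts) / len(ts)

rows = simulate(20000, coefs=[1.0, -0.5, 0.5, 0.25])
print(mean_y(rows, 0), mean_y(rows, 1, x1=0), mean_y(rows, 1, x1=1))
print(share_treated(rows, 0), share_treated(rows, 1))
```

Because X1 drives both selection and the sign of the effect, a pooled treated-vs-control comparison on such data mixes the two subgroup effects.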

Results for selection model logit P(T=1|X) = b0 + b1 X1 + … + bp Xp

PSS (5 bins)

Big Data Scalability

Theoretical Complexity:
• O(mn/p) for binary X
• O(m/p n log(n)) for continuous X

Runtime as function of sample size, dimension

Scaling Trees Even Further

• “Big Data” in research vs. industry
• Industrial scaling
  – Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest)
  – Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET)
    “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”

Heterogeneous Effect

Continuous Intervention

16 nodes

Study 1: Impact of training on financial gains (LaLonde 1986; Dehejia & Wahba 1999, 2002)

Experiment: US govt program randomly assigns eligible candidates to training program
• Goal: increase future earnings
• LaLonde (1986) shows:
  ✓ Groups statistically equal in terms of demographics & pre-training earnings
  ✓ ATE = $1794 (p < 0.004)

Tree reveals…

Split variable: High school degree (no / yes)

                      LaLonde's naïve approach   Tree approach:        Tree approach:
                      (experiment)               HS dropout (n=348)    HS degree (n=97)
Not trained (n=260)   $4,554                     $4,495                $4,855
Trained (n=185)       $6,349                     $5,649                $8,047
Training effect       $1,794 (p=0.004)           $1,154 (p=0.063)      $3,192 (p=0.015)

Tree approach, overall: $1,598 (p=0.017)

1. Unbalanced variable (HS degree)
2. Heterogeneous effect
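As an arithmetic check, the overall tree estimate is consistent with averaging the two node-level training effects weighted by node size. The slide does not state the combination rule, so the weighting below is an assumption that happens to reproduce the reported figure.

```python
# Node sizes and node-level training effects as reported on the slide.
nodes = {
    "HS dropout": {"n": 348, "effect": 1154},
    "HS degree": {"n": 97, "effect": 3192},
}

total_n = sum(v["n"] for v in nodes.values())
overall = sum(v["n"] * v["effect"] for v in nodes.values()) / total_n

print(total_n, round(overall))  # 445 1598
```

The total of 445 also matches the experiment's 260 untrained plus 185 trained participants.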

Training effect: Observational control group

• LaLonde also compared with observational control groups (PSID, CPS)
  – experimental training group + obs control
  – shows training effect not estimated correctly with structural equations
• Dehejia & Wahba (1999, 2002) re-analyze CPS control group (n=15,991), using PSM
  – Effects in range $1122-$1681, depends on settings
  – “Best” setting effect: $1360
  – Uses only 119 control group members (out of 15,991)

Tree for obs control group reveals…

1. Unbalanced variables
2. Heterogeneous effect in u74
3. Outlier
4. Eligibility issue

unemployed in 1974 (u74=0) -> negative effect

Study 2: Impact of eGov Initiative (India)

Survey commissioned by Govt of India in 2006
• > 9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• “Quasi-experimental, non-equivalent groups design”
• Equal number of offline and online users, matched by geography and demographics

Current Practice

Assess impact by comparing online/offline performance stats

[Figure: Awareness of electronic services provided by Government of India; outcomes shown: % bribe RPO, % use agent, % prefer online, % bribe police]

Simpson's Paradox

1. Demographics properly balanced
2. Unbalanced variable (Aware)
3. Heterogeneous effects on various y's + even Simpson's paradox
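How an unbalanced variable can produce Simpson's paradox is easiest to see with numbers. The counts below are entirely made up for illustration and are not the survey's data; the two strata stand in for the levels of an unbalanced variable such as Aware.

```python
# Hypothetical counts: (number of users, number reporting a bribe).
counts = {
    "stratum 1": {"online": (200, 20), "offline": (800, 120)},
    "stratum 2": {"online": (800, 320), "offline": (200, 100)},
}

# Bribe rate within each stratum, by channel.
within = {s: {g: events / users for g, (users, events) in groups.items()}
          for s, groups in counts.items()}

# Bribe rate after pooling the strata, by channel.
pooled = {}
for g in ("online", "offline"):
    users = sum(counts[s][g][0] for s in counts)
    events = sum(counts[s][g][1] for s in counts)
    pooled[g] = events / users

print(within)
print(pooled)
```

Within each stratum the online rate is lower than the offline rate, yet the pooled online rate is higher, because online users are concentrated in the high-rate stratum. A node-level analysis reports the within-stratum comparison and so avoids the reversal.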

PSM

Awareness of electronic services provided by Government of India

Would we detect this with PSM?

Heterogeneous effect

Scaling Up to Big Data
• We inflated eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration: runtime for tree

[Figure: runtime for tree; annotation: 20 sec]
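Inflating a dataset by bootstrap amounts to resampling records with replacement until the target size is reached. The sketch below covers the row dimension only; how the number of variables was increased is not described on the slide. The function name `inflate_by_bootstrap` and the toy records are hypothetical.

```python
import random

def inflate_by_bootstrap(records, target_n, seed=0):
    """Return target_n records drawn from `records` with replacement."""
    rng = random.Random(seed)
    return rng.choices(records, k=target_n)

# Toy records (made up): (record id, online user flag).
survey = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
inflated = inflate_by_bootstrap(survey, target_n=1000)
print(len(inflated))
```

Resampling preserves the empirical distribution of the original records, so the inflated data exercise runtime at scale without adding new information.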

Tree Approach
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (data dredging)
4. Nuanced insights
   • Detect unbalanced variables
   • Detect heterogeneous effect from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not remove record
8. Binary, multiple, continuous interventions
9. Post-analysis of experiments, observational studies

• Assumes selection on observables
• Need sufficient data
• Continuous variables – large tree
• Instability – use variable importance scores (forest)

Insights from tree-approach in the three applications

Labor (LaLonde '86)
Heterogeneous effect: Impact of training depends on High school diploma

Contract Duration
First attempt to study effect of duration on contract performance

Price Mechanism
Heterogeneous effect: Fixed-price creates long-term market value (not productivity), but only in high-trust contracts

eGov
Heterogeneous effect: Impact of online system depends on user awareness

Impact of IT Outsourcing Contract Attributes

How does financial performance of outsourcing contracts vary with two attributes of the contract:
• Pricing mechanisms (6 options)
• Contract duration (continuous)

Observational Data
• > 1400 contracts, implemented 1996-2008
• 374 vendors and 710 clients
• Obtained from IDC database, Lexis-Nexis, COMPUSTAT, etc.

T = Six Pricing Mechanisms (polytomous intervention)

Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture

Fixed Price / Variable Price

Pre-Intervention Variables (X):
• Task Type
• Bid Type
• Contract Value
• Uncertainty in business requirements
• Outsourcing Experience
• Firm Size (market value of equity)

Outcomes (Y):
• Announcement Returns
• Long Term Returns
• Median Income Efficiency

Six Pricing Mechanisms (polytomous intervention)

Interventions (T):
1. Fixed Price - Fixed payment per billing cycle
2. Transactional Price - Fixed payment per transaction per billing cycle
3. Time-and-Materials - Payment based on input time and materials used during billing cycle
4. Incentive - Payment based on output improvements against key performance indicators or any combination of indicators
5. Combination - A combination of any of the above contract types, largely fixed price and time and materials
6. Joint Venture - A separately incorporated entity, jointly owned by the client and the vendor, used to govern the outsourcing relationship

Six Price Mechanisms

Questions of interest:
1) Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value?
2) What types of complex outsourcing engagements create value for the client?
3) How do firms mitigate risks inherent to these engagements?

Impact Measures (Y)

Announcement Returns
Firm-specific daily abnormal returns ($\hat{\varepsilon}_{it}$, for firm i on day t)
• Computed as $\hat{\varepsilon}_{it} = r_{it} - \hat{r}_{it}$, where $r_{it}$ = daily return and $\hat{r}_{it}$ is estimated from the market model $r_{it} = \alpha_i + \beta_i r_{mt} + \varepsilon_{it}$, with $r_{mt}$ the daily return to the value-weighted S&P 500.
• Model used to predict daily returns for each firm over announcement period [-5, +5].

Long Term Returns
Monthly abnormal returns
• Estimated from the Fama-French three-factor model as excess of that achieved by passive investments in systematic risk factors.
• Expected to be zero under the null hypothesis of market efficiency.
• Used to estimate the implied three-year abnormal return following the contract.

Median Income Efficiency
Income efficiency is estimated as earnings before interest and taxes divided by number of employees.
• Median of income efficiency for the three-year period following contract implementation
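The announcement-return calculation can be sketched as follows: fit the market model by ordinary least squares on an estimation window, then subtract the predicted return from the realized return on each event-window day. The estimation window, the function names, and all numbers are assumptions for illustration; the slides specify only the [-5, +5] announcement period.

```python
from statistics import mean

def fit_market_model(firm, market):
    """OLS estimates (alpha_i, beta_i) for r_it = alpha_i + beta_i * r_mt + e_it."""
    mbar, fbar = mean(market), mean(firm)
    sxx = sum((m - mbar) ** 2 for m in market)
    sxy = sum((m - mbar) * (f - fbar) for m, f in zip(market, firm))
    beta = sxy / sxx
    alpha = fbar - beta * mbar
    return alpha, beta

def abnormal_returns(firm_event, market_event, alpha, beta):
    """Realized return minus market-model prediction, per event-window day."""
    return [f - (alpha + beta * m) for f, m in zip(firm_event, market_event)]

# Made-up estimation window: the firm tracks the market exactly.
market_est = [0.01, -0.02, 0.005, 0.015, -0.01, 0.0]
firm_est = [0.001 + 1.2 * m for m in market_est]
alpha, beta = fit_market_model(firm_est, market_est)

# Made-up event window: a 3% jump on the second day only.
market_evt = [0.01, -0.005, 0.02]
shock = [0.0, 0.03, 0.0]
firm_evt = [0.001 + 1.2 * m + s for m, s in zip(market_evt, shock)]

ar = abnormal_returns(firm_evt, market_evt, alpha, beta)
print([round(a, 4) for a in ar])
```

Summing the daily abnormal returns over the event window gives a cumulative abnormal return, a common way to summarize the announcement effect.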

Six pricing methodologies: Selection Model

[Figure: selection tree; node annotations: Large custom IT tasks (complex); BPO + simple tasks; high-trust]

Node-level Performance Analysis

Combination contracts create value for complex engagements

[Figure annotation: Custom IT (complex, high trust)]

Contract Duration (Continuous Intervention)

T = Contract duration (months)

Pre-Intervention Variables (X):
• Task Type
• Bid Type
• Contract Value
• Uncertainty in business requirements
• Outsourcing Experience
• Firm Size (market value of equity)

Outcomes (Y):
• Announcement Returns
• Long Term Returns
• Median Income Efficiency

Contract duration has no impact on performance gains from outsourcing

Contract Duration Selection Model (regression tree)

Duration proportional to contract value (1, 6, 8 vs. 4, 5)

Node-Level Performance Analysis (Reg)

Markets reward long-term for high-value contracts and low-value minimal scope contracts

contracts requiring specific or non-contractible investments (costs outweigh benefits)

Price methodologies: Main insight

Prior research:
• Role of trust only in complex contracts
• Fixed price known to create value; unrelated to trust

Tree finding:
Fixed-price creates long-term market value (not productivity), but only in high-trust contracts!
