A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data
Inbal Yahav, Bar Ilan University, Israel
Galit Shmueli, National Tsing Hua University, Taiwan
Deepa Mani, Indian School of Business, India
Self-selection: the challenge
• Large impact studies of an intervention
• Individuals/firms self-select intervention group/duration
• In controlled experiments, some variables might remain unbalanced
How to identify and adjust for self-selection?
Three Applications
Experiment: Impact of training on earnings
Field experiment by US govt
• LaLonde (1986) compared to observational control
• Re-analysis by PSM (Dehejia & Wahba, 1999, 2002)
Pseudo-experiment: Impact of e-Gov service in India
New online passport service
• survey of online + offline users
• bribes, travel time, etc.
Observational: Impact of outsourcing contract features on financial performance
• pricing mechanism
• contract duration
Common Approaches
• Heckman-type modeling
• Propensity Score Approach (PS)
Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples
Y = performance measure(s)
T = intervention
X = pre-intervention variables
Propensity Scores Approach
Self-selection: P(T|X) ≠ P(T)
Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X)
Step 2: Use scores to create matched samples
• PSM = use matching algorithm
• PSS = divide scores into bins
Step 3: Estimate effect on Y (compare groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e
Y = performance measure(s)
T = intervention
X = pre-intervention variables
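The three steps above can be sketched in a few lines. This is a minimal illustration of the PSS variant (binning the scores), not the authors' implementation: the data are simulated, the coefficients and the choice of ten bins are assumptions, and the logistic selection model is fitted with a hand-written Newton-Raphson loop so that only NumPy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: one pre-intervention variable x drives both selection and outcome.
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-0.8 * x))           # self-selection: P(T|X) != P(T)
t = rng.binomial(1, p_true)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)    # assumed true intervention effect = 1.0

# Step 1: selection model logit(T) = b0 + b1*x, fitted by Newton-Raphson.
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))
    grad = X.T @ (t - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    b += np.linalg.solve(hess, grad)
ps = 1 / (1 + np.exp(-X @ b))                 # propensity scores P(T=1|X)

# Step 2 (PSS): divide the scores into bins (here: deciles).
edges = np.quantile(ps, np.linspace(0, 1, 11))
bins = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, 9)

# Step 3: compare groups within each bin, then average weighted by bin size.
effects, weights = [], []
for k in range(10):
    m = bins == k
    if t[m].min() == t[m].max():              # skip bins lacking treated or control records
        continue
    effects.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
    weights.append(m.sum())
pss_effect = np.average(effects, weights=weights)
naive_effect = y[t == 1].mean() - y[t == 0].mean()
print(f"naive: {naive_effect:.2f}, stratified: {pss_effect:.2f}")
```

In this simulated setting the naive comparison is inflated by self-selection, while the binned estimate should land much closer to the assumed effect of 1.0; some residual bias remains because records within a bin still differ in x.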
Challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact form of selection model
Proposed Solution: Tree-based approach
“Kill the Intermediary”
[Diagram: data (Y, T, X); intermediary: propensity scores P(T|X); target: E(Y|T), even E(Y|T,X)]
Classification Tree
Output: T (treat/control)
Inputs: X’s (income, edu, family…)
Records in each terminal node share the same profile (X) and the same propensity score P(T=1|X)
Tree-Based Approach
Four steps:
1. Run selection model: fit tree T = f(X)
2. Present resulting tree; see unbalanced X’s
3. Treat each terminal node as sub-sample for measuring Y; conduct terminal-node-level performance analysis
4. Present terminal-node analyses visually
5. [optional]: combine analyses from nodes with homogeneous effects
Like PS, assumes observable self-selection
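Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses scikit-learn's CART (`DecisionTreeClassifier`) as a stand-in for a conditional-inference tree, so tree growth is stopped by `min_samples_leaf` and `min_impurity_decrease` rather than by statistical tests of independence. The data, the variable names and all tuning values are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 40_000

# Hypothetical data: binary x1 drives self-selection AND moderates the effect; x2 is noise.
x1 = rng.integers(0, 2, size=n)
x2 = rng.normal(size=n)
t = rng.binomial(1, np.where(x1 == 1, 0.7, 0.3))
y = np.where(x1 == 1, -1.0, 1.0) * t + rng.normal(size=n)
X = np.column_stack([x1, x2])

# Step 1: selection model T = f(X), fitted as a tree (CART stand-in, see lead-in).
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2_000,
                              min_impurity_decrease=1e-3, random_state=0)
tree.fit(X, t)

# Step 2: the variables the tree splits on are the unbalanced X's.
split_vars = sorted({int(f) for f in tree.tree_.feature if f >= 0})

# Step 3: treat each terminal node as a sub-sample and estimate the effect within it.
leaf = tree.apply(X)
node_effects = {}
for node in np.unique(leaf):
    m = leaf == node
    node_effects[int(node)] = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()

print("split on columns:", split_vars)
print("node-level effects:", node_effects)
```

With these simulated data the tree should split only on the first column (x1), and the two terminal nodes should show effects of opposite sign, which a single pooled estimate would hide.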
Solves challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact form of selection model
Why Trees in Explanatory Study?
• Flexible non-parametric selection model (f)
• Automated detection of unbalanced pre-intervention variables (X)
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, continuous intervention (T)
• Useful in Big Data context
• Identify heterogeneous effects (effect of T on Y)
Tree Creation
Which algorithm? Conditional-inference trees (Hothorn et al., 2006)
– Stop tree growth using statistical tests of independence
– Binary splits
Big Data Simulation
Binary intervention: T = {0,1}
Continuous intervention: T ∼ N
Sample sizes (n): 10K, 100K, 1M
# Pre-intervention variables (p): 4, 50 (+ interactions)
Pre-intervention variable types: binary, Likert-scale, continuous
Outcome variable types: binary, continuous
Selection models:
#1: logit P(T=1) = b0 + b1 x1 + … + bp xp
#2: logit P(T=1) = b0 + b1 x1 + … + bp xp + interactions
Intervention effects (two settings shown)
First setting:
1. Homogeneous
Control: E(Y | T = 0) = 0.5
Intervention: E(Y | T = 1) = 0.7
2. Heterogeneous
Control: E(Y | T = 0) = 0.5
Intervention: E(Y | T = 1, X1=0) = 0.7; E(Y | T = 1, X1=1) = 0.3
Second setting:
1. Homogeneous
Control: E(Y | T = 0) = 0
Intervention: E(Y | T = 1) = 1
2. Heterogeneous
Control: E(Y | T = 0) = 0
Intervention: E(Y | T = 1, X1=0) = 1; E(Y | T = 1, X1=1) = -1
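One cell of this simulation design could be generated roughly as below. The sketch assumes the smallest configuration (n = 10K, p = 4), binary pre-intervention variables, selection model #1, and reads the 0.5 / 0.7 / 0.3 values as success probabilities of a binary outcome. The coefficient values `b0` and `b` are hypothetical, since the slides do not report them.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 10_000, 4

# Pre-intervention variables (binary here; the slides also list Likert and continuous).
X = rng.integers(0, 2, size=(n, p))

# Selection model #1: logit P(T=1) = b0 + b1*x1 + ... + bp*xp.
b0, b = -1.0, np.full(p, 0.5)                 # hypothetical coefficients
prob_t = 1 / (1 + np.exp(-(b0 + X @ b)))
T = rng.binomial(1, prob_t)

# Heterogeneous effect (assumed binary outcome):
# E(Y|T=0) = 0.5, E(Y|T=1, X1=0) = 0.7, E(Y|T=1, X1=1) = 0.3
prob_y = np.where(T == 0, 0.5, np.where(X[:, 0] == 0, 0.7, 0.3))
Y = rng.binomial(1, prob_y)

# Check that the simulated group means follow the design.
groups = {
    "T=0": T == 0,
    "T=1, X1=0": (T == 1) & (X[:, 0] == 0),
    "T=1, X1=1": (T == 1) & (X[:, 0] == 1),
}
group_means = {name: float(Y[mask].mean()) for name, mask in groups.items()}
for name, mean in group_means.items():
    print(name, round(mean, 3))
```

Because x1 raises the probability of self-selecting into the intervention and also flips the sign of the effect, a pooled comparison of T=1 against T=0 would mix the two subgroups.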
Big Data Scalability
Theoretical complexity:
• O(mn/p) for binary X
• O((m/p) n log(n)) for continuous X
[Figure: runtime as function of sample size, dimension]
Scaling Trees Even Further
• “Big Data” in research vs. industry
• Industrial scaling
– Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest)
– Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET): “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”
Study 1: Impact of training on financial gains (LaLonde 1986; Dehejia & Wahba 1999, 2002)
Experiment: US govt program randomly assigns eligible candidates to training program
• Goal: increase future earnings
• LaLonde (1986) shows:
✓ Groups statistically equal in terms of demographic & pre-train earnings
✓ ATE = $1794 (p<0.004)
Tree reveals…
Tree split: High school degree (no / yes)
                      LaLonde’s naïve approach   Tree approach          Tree approach
                      (experiment)               HS dropout (n=348)     HS degree (n=97)
Not trained (n=260)   $4,554                     $4,495                 $4,855
Trained (n=185)       $6,349                     $5,649                 $8,047
Training effect       $1,794 (p=0.004)           $1,154 (p=0.063)       $3,192 (p=0.015)
Overall: $1,598 (p=0.017)
1. Unbalanced variable (HS degree)
2. Heterogeneous effect
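The node-level numbers in the table can be checked with simple arithmetic. The sketch below recomputes each node's training effect as a difference of the reported group means and then averages the two effects weighted by node size. That weighting is an assumption about how the overall figure was obtained; it happens to reproduce the reported $1,598.

```python
# Mean earnings as reported on the slide (US dollars).
nodes = {
    "HS dropout": {"n": 348, "not_trained": 4495, "trained": 5649},
    "HS degree": {"n": 97, "not_trained": 4855, "trained": 8047},
}

# Node-level training effect = trained mean minus not-trained mean.
effects = {name: v["trained"] - v["not_trained"] for name, v in nodes.items()}

# Overall effect as the node-size-weighted average of the node-level effects (assumed).
total_n = sum(v["n"] for v in nodes.values())
overall = sum(v["n"] * effects[name] for name, v in nodes.items()) / total_n

print(effects)          # {'HS dropout': 1154, 'HS degree': 3192}
print(round(overall))   # 1598
```

The same subtraction on the naive column gives 6349 - 4554 = 1795, one dollar off the reported $1,794, presumably because the displayed means are rounded.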
Training effect: Observational control group
• LaLonde also compared with observational control groups (PSID, CPS)
– experimental training group + obs control
– shows training effect not estimated correctly with structural equations
• Dehejia & Wahba (1999, 2002) re-analyze CPS control group (n=15,991), using PSM
– Effects in range $1122-$1681, depends on settings
– “Best” setting effect: $1360
– Uses only 119 control group members (out of 15,991)
Tree for obs control group reveals…
[Tree figure annotations: unemployed in 1974 (u74=0) -> negative effect; outlier; eligibility issue!]
1. Unbalanced variables
2. Heterogeneous effect in u74
3. Outlier
4. Eligibility issue
Study 2: Impact of eGov Initiative (India)
Survey commissioned by Govt of India in 2006
• >9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• “Quasi-experimental, non-equivalent groups design”
• Equal number of offline and online users, matched by geography and demographics
[Tree figure: split on awareness of electronic services provided by Government of India; node-level outcomes: % bribe RPO, % use agent, % prefer online, % bribe police; annotation: Simpson’s Paradox]
1. Demographics properly balanced
2. Unbalanced variable (Aware)
3. Heterogeneous effects on various y’s + even Simpson’s paradox
Scaling Up to Big Data
• We inflated eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration
[Figure: runtime for tree; annotation: 20 sec]
Tree Approach
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (data dredging)
4. Nuanced insights
• Detect unbalanced variables
• Detect heterogeneous effect from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not remove record
8. Binary, multiple, continuous interventions
9. Post-analysis of experiments, observational studies
• Assumes selection on observables
• Need sufficient data
• Continuous variables – large tree
• Instability – use variable importance scores (forest)
Insights from tree-approach in the three applications
• Labor (LaLonde ’86): Heterogeneous effect: impact of training depends on high school diploma
• Contract Duration: First attempt to study effect of duration on contract performance
• Price Mechanism: Heterogeneous effect: fixed-price creates long-term market value (not productivity), but only in high-trust contracts
• eGov: Heterogeneous effect: impact of online system depends on user awareness
Impact of IT Outsourcing Contract Attributes
How does financial performance of outsourcing contracts vary with two attributes of the contract:
• Pricing mechanisms (6 options)
• Contract duration (continuous)
Observational Data
• >1400 contracts, implemented 1996-2008
• 374 vendors and 710 clients
• Obtained from IDC database, Lexis-Nexis, COMPUSTAT, etc.
T = Six Pricing Mechanisms (polytomous intervention)
Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture
[Figure labels: Fixed Price, Variable Price]
Pre-Intervention Variables (X):
• Task Type
• Bid Type
• Contract Value
• Uncertainty in business requirements
• Outsourcing Experience
• Firm Size (market value of equity)
Outcomes (Y):
• Announcement Returns
• Long Term Returns
• Median Income Efficiency
Six Pricing Mechanisms (polytomous intervention)
• Fixed Price: fixed payment per billing cycle
• Transactional: fixed payment per transaction per billing cycle
• Time and Materials: payment based on input time and materials used during billing cycle
• Incentive: payment based on output improvements against key performance indicators or any combination of indicators
• Combination: a combination of any of the above contract types, largely fixed price and time and materials
• Joint Venture: a separately incorporated entity, jointly owned by the client and the vendor, used to govern the outsourcing relationship
Six Price Mechanisms
Questions of interest:
1) Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value?
2) What types of complex outsourcing engagements create value for the client?
3) How do firms mitigate risks inherent to these engagements?
Impact Measures (Y)
Announcement Returns: firm-specific daily abnormal returns ($\hat{\varepsilon}_{it}$, for firm i on day t)
• Computed as $\hat{\varepsilon}_{it} = r_{it} - \hat{r}_{it}$, where $r_{it}$ is the daily return of firm i and $\hat{r}_{it}$ is the daily return estimated from the market model $r_{it} = \alpha_i + \beta_i r_{mt} + \varepsilon_{it}$, with $r_{mt}$ the daily return to the value-weighted S&P 500.
• Model used to predict daily returns for each firm over announcement period [-5, +5].
Long Term Returns: monthly abnormal returns
• Estimated from the Fama-French three-factor model as excess of that achieved by passive investments in systematic risk factors.
• Expected to be zero under the null hypothesis of market efficiency.
• Used to estimate the implied three-year abnormal return following the contract.
Median Income Efficiency: income efficiency is estimated as earnings before interest and taxes divided by number of employees.
• Median of income efficiency for the three-year period following contract implementation
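The announcement-returns measure can be sketched as follows: fit the market model on a pre-announcement estimation window, then take the difference between actual and predicted returns over the announcement period [-5, +5]. The return series are simulated, and the 200-day estimation window, the coefficient values and the injected day-0 jump are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical daily returns: 200 estimation days, then an 11-day window [-5, +5].
n_est, n_event = 200, 11
r_m = rng.normal(0.0005, 0.01, size=n_est + n_event)              # market return r_mt
r_i = 0.0002 + 1.2 * r_m + rng.normal(0, 0.005, size=r_m.size)    # firm return r_it
r_i[n_est + 5] += 0.03                                            # injected jump on day 0

# Fit the market model r_it = alpha_i + beta_i * r_mt + e_it on the estimation window only.
# np.polyfit returns coefficients highest degree first: (slope, intercept).
beta, alpha = np.polyfit(r_m[:n_est], r_i[:n_est], deg=1)

# Abnormal return = actual return minus market-model prediction, for each event day.
r_hat = alpha + beta * r_m[n_est:]
abnormal = r_i[n_est:] - r_hat

for day, ar in zip(range(-5, 6), abnormal):
    print(f"day {day:+d}: abnormal return {ar:+.4f}")
print("cumulative abnormal return:", round(float(abnormal.sum()), 4))
```

Estimating alpha and beta on days before the window keeps the announcement itself from contaminating the prediction it is compared against.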
Node-level Performance Analysis
Combination contracts create value for complex engagements
Custom IT (complex, high trust)
Contract Duration (Continuous Intervention)
T = Contract duration (months)
Pre-Intervention Variables (X):
• Task Type
• Bid Type
• Contract Value
• Uncertainty in business requirements
• Outsourcing Experience
• Firm Size (market value of equity)
Outcomes (Y):
• Announcement Returns
• Long Term Returns
• Median Income Efficiency
Contract duration has no impact on performance gains from outsourcing
Node-Level Performance Analysis (Reg)
Markets reward long-term for high-value contracts and low-value minimal scope contracts
contracts requiring specific or non-contractible investments (costs outweigh benefits)