ACCT 420: Logistic Regression for Corporate Fraud
Session 7
Dr. Richard M. Crowley

Front matter

Learning objectives
▪ Theory:
  ▪ Economics
  ▪ Psychology
▪ Application:
  ▪ Predicting fraud contained in annual reports
▪ Methodology:
  ▪ Logistic regression
  ▪ LASSO

Datacamp
▪ Explore on your own
▪ No specific required class this week
▪ We will start having some assigned chapters after the break
  ▪ I’ve posted them already, so you can work on them at your leisure

Corporate/Securities Fraud

Traditional accounting fraud
1. A company is underperforming
2. Management cooks up some scheme to increase earnings
  ▪ Worldcom (1999-2001)
    ▪ Fake revenue entries
    ▪ Capitalizing line costs (should be expensed)
  ▪ Olympus (late 1980s-2011): Hide losses in a separate entity
    ▪ “Tobashi scheme”
  ▪ Wells Fargo (2011-2018?)
    ▪ Fake/duplicate customers and transactions
3. Create accounting statements using the fake information

Reversing it
1. A company is overperforming
2. Management cooks up a scheme to “save up” excess performance for a rainy day
  ▪ Dell (2002-2007)
    ▪ Cookie jar reserve, from secret payments by Intel, made up to 76% of quarterly income
  ▪ Bristol-Myers Squibb (2000-2001)
3. Recognize revenue/earnings when needed in the future to hit earnings targets

Other accounting fraud types
▪ Apple (2001)
  ▪ Options backdating
▪ Commerce Group Corp (2003)
  ▪ Using an auditor that isn’t registered
▪ Cardiff International (2017)
  ▪ Releasing financial statements that were not reviewed by an auditor
▪ China North East Petroleum Holdings Limited
  ▪ Related party transactions (transferring funds to family members)
▪ Insufficient internal controls
  ▪ Citigroup (2008-2014) via Banamex
  ▪ Asia Pacific Breweries

Other accounting fraud types
▪ Suprema Specialties (1998-2001)
  ▪ Round-tripping: Transactions to inflate revenue that have no substance
▪ Bribery
  ▪ Keppel O&M (2001-2014): $55M USD in bribes to Brazilian officials for contracts
  ▪ Baker Hughes (2001, 2007): Payments to officials in Indonesia, and possibly to Brazil and India (2001) and to officials in Angola, Indonesia, Nigeria, Russia, and Uzbekistan (2007)
▪ ZZZZ Best (1982-1987): Fake the whole company, get funding from insurance fraud, theft, credit card fraud, and fake contracts
  ▪ Also faked a real project to get a clean audit to take the company public

Other securities fraud types
▪ Bernard Madoff: Ponzi scheme
  1. Get money from individuals for “investments”
  2. Pretend as though the money was invested
  3. Use new investors’ money to pay back anyone withdrawing their money
▪ Imaging Diagnostic Systems (2013)
  ▪ Material misstatements
  ▪ Material omissions (FDA applications, didn’t pay payroll taxes)
▪ Applied Wellness Corporation (2008)
  ▪ Failed to file annual and quarterly reports
▪ Capitol Distributing LLC
  ▪ Aiding another company’s fraud (Take Two, by parking 2 video games)
▪ Tesla (2018)
  ▪ Misleading statements on Twitter

Some of the more interesting cases
▪ AMD (1992-1993)
  ▪ Claimed it was developing processor microcode independently, when it actually provided Intel’s microcode to its engineers
▪ Am-Pac International (1997)
  ▪ Sham sale-leaseback of a bar to a corporate officer
▪ CVS (2000)
  ▪ Not using mark-to-market accounting to fair value stuffed animal inventories
▪ Countryland Wellness Resorts, Inc. (1997-2000)
  ▪ Gold reserves were actually… dirt.
▪ Keppel Club (2014)
  ▪ Employees created 1,280 fake memberships, sold them, and retained all profits ($37.5M)

What will we look at today?
Misstatements that affect firms’ accounting statements and were done seemingly intentionally by management or other employees at the firm.

How do misstatements come to light?
1. The company/management admits to it publicly
2. A government entity forces the company to disclose
  ▪ In more egregious cases, government agencies may disclose the fraud publicly as well
3. Investors sue the firm, forcing disclosure

Where are these disclosed?
In the US:
1. 10-K/A filings (/A means amendment)
  ▪ Note: not all 10-K/A filings are caused by fraud!
    ▪ Any benign correction or adjustment can also be filed as a 10-K/A
    ▪ Audit Analytics’ write-up on this for 2017
2. In a note inside a 10-K filing
  ▪ These are sometimes referred to as “little r” restatements
3. SEC AAERs: Accounting and Auditing Enforcement Releases
  ▪ Generally highlight larger or more important cases
  ▪ Written by the SEC, not the company

AAERs
▪ Today we will examine these AAERs
  ▪ Using a proprietary data set of >1,000 such releases
▪ To get a sense of the data we’re working with, read the Summary section (starting on page 2) of this AAER against Sanofi
  ▪ rmc.link/420class7
Why did the SEC release this AAER regarding Sanofi?

Predicting Fraud

Main question
How can we detect if a firm is involved in a major instance of misreporting?
▪ This is a pure forensic analytics question
▪ “Major instance of misreporting” will be implemented using AAERs

Approaches
▪ In these slides, I’ll walk through the primary detection methods since the 1990s, up to currently used methods
▪ 1990s: Financials and financial ratios
  ▪ Follow up in 2011
▪ Late 2000s/early 2010s: Characteristics of firm’s disclosures
▪ Mid 2010s: More holistic text-based measures of disclosures
  ▪ This will tie to next lesson where we will explore how to work with text
All of these are discussed in Brown, Crowley and Elliott (2018); I will refer to the paper as BCE for short

The data
▪ I have provided some preprocessed data, sanitized of AAER data (which is partially public, partially proprietary)
▪ It contains 399 variables
  ▪ From Compustat, CRSP, and the SEC (which I personally collected)
  ▪ Many precalculated measures including:
    ▪ Firm characteristics, such as auditor type (bigNaudit, midNaudit)
    ▪ Financial measures, such as total accruals (rsst_acc)
    ▪ Financial ratios, such as ROA (ni_at)
    ▪ Annual report characteristics, such as the mean sentence length (sentlen_u)
    ▪ Machine learning based content analysis (everything with Topic_ prepended)
Pulled from BCE’s working files

Training and Testing
▪ Already has testing and training set up in variable Test
  ▪ Training is annual reports released in 1999 through 2003
  ▪ Testing is annual reports released in 2004
What potential issues are there with our usual training and testing strategy?

Censoring
▪ Censoring training data helps to emulate historical situations
  ▪ Build an algorithm using only the data that was available at the time a decision would need to have been made
▪ Do not censor the testing data
  ▪ Testing emulates where we want to make an optimal choice in real life
  ▪ We want to find frauds regardless of how well hidden they are!
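To make the censoring idea concrete, here is a minimal sketch on a made-up data frame. The column names (`report_year`, `aaer_release`, `AAER_censored`) and the cutoff year are hypothetical illustrations, not variables from the course data: a training-set fraud only counts if its AAER had already been published by the time the model would have been built.

```r
# Hypothetical toy data: names and values are illustrative only
toy <- data.frame(
  firm         = c("A", "B", "C", "D"),
  report_year  = c(2003, 2004, 2006, 2008),
  AAER         = c(1, 1, 0, 1),            # fraud as known today
  aaer_release = c(2006, 2010, NA, 2012)   # year the AAER was published
)
cutoff <- 2007  # assumed last year of information available to the modeler

# Training data: only reports up to the cutoff
train <- toy[toy$report_year <= cutoff, ]

# Censor: keep a fraud label only if the AAER was public by the cutoff
train$AAER_censored <- as.integer(train$AAER == 1 &
                                    !is.na(train$aaer_release) &
                                    train$aaer_release <= cutoff)

# Testing data: labels are left uncensored
test <- toy[toy$report_year > cutoff, ]

train[, c("firm", "AAER", "AAER_censored")]
```

Firm B is fraudulent in hindsight, but its AAER was released after the cutoff, so a model trained "at the time" could not have known about it.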

Event frequency
▪ Very low event frequencies can make things tricky

df %>%
  group_by(year) %>%
  mutate(total_AAERS = sum(AAER), total_observations = n()) %>%
  slice(1) %>%
  ungroup() %>%
  select(year, total_AAERS, total_observations) %>%
  html_df

year  total_AAERS  total_observations
1999           46                2195
2000           50                2041
2001           43                2021
2002           50                2391
2003           57                2936
2004           49                2843

246 AAERs in the training data, 401 total variables…
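As a quick check on how rare the event is, the counts in the table can be turned into yearly AAER rates. This is a small sketch that retypes the six rows by hand; the `freq` data frame and the `pct_AAER` column are names introduced here for illustration.

```r
# Counts retyped from the event frequency table
freq <- data.frame(
  year               = 1999:2004,
  total_AAERS        = c(46, 50, 43, 50, 57, 49),
  total_observations = c(2195, 2041, 2021, 2391, 2936, 2843)
)

# Share of firm-years with an AAER, in percent
freq$pct_AAER <- round(100 * freq$total_AAERS / freq$total_observations, 2)
freq

# Number of AAERs in 1999 through 2003
sum(freq$total_AAERS[freq$year <= 2003])
```

Roughly 2% of observations in any year are AAERs, and the 1999 through 2003 rows sum to the 246 training events quoted on the slide.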

Dealing with infrequent events
▪ A few ways to handle this
  1. Very careful model selection (keep it sufficiently simple)
  2. Sophisticated degenerate variable identification criterion + simulation to implement complex models that are just barely simple enough
    ▪ The main method in BCE
  3. Automated methodologies for paring down models
    ▪ We’ll discuss using LASSO for this at the end of class
    ▪ Also implemented in BCE

1990s approach

The 1990s model
▪ Many financial measures and ratios can help to predict fraud
  ▪ EBIT
  ▪ Earnings / revenue
  ▪ ROA
  ▪ Log of liabilities
  ▪ Liabilities / equity
  ▪ Liabilities / assets
  ▪ Quick ratio
  ▪ Working capital / assets
  ▪ Inventory / revenue
  ▪ Inventory / assets
  ▪ Earnings / PP&E
  ▪ A/R / revenue
  ▪ Change in revenue
  ▪ Change in A/R + 1
  ▪ > 10% change in A/R
  ▪ Change in gross profit + 1
  ▪ > 10% change in gross profit
  ▪ Gross profit / assets
  ▪ Revenue minus gross profit
  ▪ Cash / assets
  ▪ Log of assets
  ▪ PP&E / assets
  ▪ Working capital

Approach
fit_1990s <- glm(AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at + lt_seq +
                   lt_at + act_lct + aq_lct + wcap_at + invt_revt + invt_at +
                   ni_ppent + rect_revt + revt_at + d_revt + b_rect + b_rect +
                   r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at +
                   ppent_at + wcap,
                 data=df[df$Test==0,],  # training observations only
                 family=binomial)
summary(fit_1990s)
##
## Call:
## glm(formula = AAER ~ ebit + ni_revt + ni_at + log_lt + ltl_at +
##     lt_seq + lt_at + act_lct + aq_lct + wcap_at + invt_revt +
##     invt_at + ni_ppent + rect_revt + revt_at + d_revt + b_rect +
##     b_rect + r_gp + b_gp + gp_at + revt_m_gp + ch_at + log_at +
##     ppent_at + wcap, family = binomial, data = df[df$Test ==
##     0, ])
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.1391  -0.2275  -0.1661  -0.1190   3.6236
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.660e+00  8.336e-01  -5.591 2.26e-08 ***
## ebit        -3.564e-04  1.094e-04  -3.257  0.00112 **
## ni_revt      3.664e-02  3.058e-02   1.198  0.23084
## ni_at       -3.196e-01  2.325e-01  -1.374  0.16932
## log_lt       1.494e-01  3.409e-01   0.438  0.66118
## ltl_at      -2.306e-01  7.072e-01  -0.326  0.74438

ROC
##     In sample AUC Out of sample AUC
##         0.7483132         0.7292981
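The slides report in-sample and out-of-sample AUC without showing how it is computed. The sketch below is one way to do it in base R, using the rank (Mann-Whitney) formula on simulated data; the helper `auc_rank()` and the toy data frame are assumptions made for this example, and the slides themselves list the ROCR package, which is presumably what produced the reported numbers.

```r
# AUC via the Mann-Whitney rank formula (base R only).
# auc_rank() is a hypothetical helper written for this sketch.
auc_rank <- function(score, label) {
  r  <- rank(score)            # average ranks handle ties
  n1 <- sum(label == 1)        # number of events
  n0 <- sum(label == 0)        # number of non-events
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated stand-in for the course data, with a Test indicator
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y    <- rbinom(200, 1, plogis(-1 + toy$x))
toy$Test <- rep(c(0, 1), times = c(150, 50))

# Fit on training rows only, then predict for every row
fit  <- glm(y ~ x, data = toy[toy$Test == 0, ], family = binomial)
pred <- predict(fit, toy, type = "response")

c("In sample AUC"     = auc_rank(pred[toy$Test == 0], toy$y[toy$Test == 0]),
  "Out of sample AUC" = auc_rank(pred[toy$Test == 1], toy$y[toy$Test == 1]))
```

AUC is the probability that a randomly chosen event receives a higher score than a randomly chosen non-event, so 0.5 is no better than chance and 1 is perfect ranking.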

The 2011 follow up

The 2011 model
▪ Log of assets
▪ Total accruals
▪ % change in A/R
▪ % change in inventory
▪ % soft assets
▪ % change in sales from cash
▪ % change in ROA
▪ Indicator for stock/bond issuance
▪ Indicator for operating leases
▪ BV equity / MV equity
▪ Lag of stock return minus value weighted market return
▪ Below are BCE’s additions
▪ Indicator for mergers
▪ Indicator for Big N auditor
▪ Indicator for medium size auditor
▪ Total financing raised
▪ Net amount of new capital raised
▪ Indicator for restructuring
Based on Dechow, Ge, Larson and Sloan (2011)

The model
fit_2011 <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
                  soft_assets + pct_chg_cashsales + chg_roa + issuance +
                  oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
                  midNaudit + cffin + exfin + restruct,
                data=df[df$Test==0,],  # training observations only
                family=binomial)
summary(fit_2011)
##
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
##     soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum +
##     book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin +
##     exfin + restruct, family = binomial, data = df[df$Test ==
##     0, ])
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.8434  -0.2291  -0.1658  -0.1196   3.2614
##
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -7.1474558  0.5337491 -13.391  < 2e-16 ***
## logtotasset        0.3214322  0.0355467   9.043  < 2e-16 ***
## rsst_acc          -0.2190095  0.3009287  -0.728   0.4667
## chg_recv           1.1020740  1.0590837   1.041   0.2981
## chg_inv            0.0389504  1.2507142   0.031   0.9752
## soft_assets        2.3094551  0.3325731   6.944 3.81e-12 ***
## pct_chg_cashsales -0.0006912  0.0108771  -0.064   0.9493
## chg_roa           -0.2697984  0.2554262  -1.056   0.2908
## issuance           0.1443841  0.3187606   0.453   0.6506

ROC
##     In sample AUC Out of sample AUC
##         0.7445378         0.6849225

Late 2000s/early 2010s approach

The late 2000s/early 2010s model
▪ Log of # of bullet points + 1
▪ # of characters in file header
▪ # of excess newlines
▪ Amount of html tags
▪ Length of cleaned file, characters
▪ Mean sentence length, words
▪ S.D. of word length
▪ S.D. of paragraph length (sentences)
▪ Word choice variation
▪ Readability
  ▪ Coleman Liau Index
  ▪ Fog Index
▪ % active voice sentences
▪ % passive voice sentences
▪ # of all cap words
▪ # of !
▪ # of ?
From a variety of papers

Theory
▪ Generally pulled from the communications literature
  ▪ Sometimes ad hoc
▪ The main idea:
  ▪ Companies that are misreporting probably write their annual report differently

The late 2000s/early 2010s model
fit_2000s <- glm(AAER ~ bullets + headerlen + newlines + alltags +
                   processedsize + sentlen_u + wordlen_s + paralen_s +
                   repetitious_p + sentlen_s + typetoken + clindex + fog +
                   active_p + passive_p + lm_negative_p + lm_positive_p +
                   allcaps + exclamationpoints + questionmarks,
                 data=df[df$Test==0,],  # training observations only
                 family=binomial)
summary(fit_2000s)
##
## Call:
## glm(formula = AAER ~ bullets + headerlen + newlines + alltags +
##     processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p +
##     sentlen_s + typetoken + clindex + fog + active_p + passive_p +
##     lm_negative_p + lm_positive_p + allcaps + exclamationpoints +
##     questionmarks, family = binomial, data = df[df$Test == 0,
##     ])
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.9604  -0.2244  -0.1984  -0.1749   3.2318
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -5.662e+00  3.143e+00  -1.801  0.07165 .
## bullets       -2.635e-05  2.625e-05  -1.004  0.31558
## headerlen     -2.943e-04  3.477e-04  -0.846  0.39733
## newlines      -4.821e-05  1.220e-04  -0.395  0.69271
## alltags        5.060e-08  2.567e-07   0.197  0.84376
## processedsize  5.709e-06  1.287e-06   4.435 9.19e-06 ***

ROC
##     In sample AUC Out of sample AUC
##         0.6377783         0.6295414

Combining the 2000s and 2011 models
▪ 2011 model: Parsimonious financial model
▪ 2000s model: Textual characteristics
Why is it appropriate to combine the 2011 model with the 2000s model?

The model
fit_2000f <- glm(AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
                   soft_assets + pct_chg_cashsales + chg_roa + issuance +
                   oplease_dum + book_mkt + lag_sdvol + merger + bigNaudit +
                   midNaudit + cffin + exfin + restruct + bullets + headerlen +
                   newlines + alltags + processedsize + sentlen_u + wordlen_s +
                   paralen_s + repetitious_p + sentlen_s + typetoken +
                   clindex + fog + active_p + passive_p + lm_negative_p +
                   lm_positive_p + allcaps + exclamationpoints + questionmarks,
                 data=df[df$Test==0,],  # training observations only
                 family=binomial)
summary(fit_2000f)
##
## Call:
## glm(formula = AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
##     soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum +
##     book_mkt + lag_sdvol + merger + bigNaudit + midNaudit + cffin +
##     exfin + restruct + bullets + headerlen + newlines + alltags +
##     processedsize + sentlen_u + wordlen_s + paralen_s + repetitious_p +
##     sentlen_s + typetoken + clindex + fog + active_p + passive_p +
##     lm_negative_p + lm_positive_p + allcaps + exclamationpoints +
##     questionmarks, family = binomial, data = df[df$Test == 0,
##     ])
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.9514  -0.2237  -0.1596  -0.1110   3.3882
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)

ROC
##     In sample AUC Out of sample AUC
##         0.7664115         0.7147021

The BCE model

The BCE approach
▪ Retain the variables from the other regressions
▪ Add in a machine-learning based measure quantifying how much documents talked about different topics common across all filings
  ▪ Learned on just the 1999-2003 filings

What the topics look like

Theory behind the BCE model
Why use document content?

The model
# Build the formula as a string: the 2011 and 2000s variables plus the 30 topic measures
BCE_eq = as.formula(paste("AAER ~ logtotasset + rsst_acc + chg_recv + chg_inv +
  soft_assets + pct_chg_cashsales + chg_roa + issuance + oplease_dum + book_mkt +
  lag_sdvol + merger + bigNaudit + midNaudit + cffin + exfin + restruct + bullets +
  headerlen + newlines + alltags + processedsize + sentlen_u + wordlen_s + paralen_s +
  repetitious_p + sentlen_s + typetoken + clindex + fog + active_p + passive_p +
  lm_negative_p + lm_positive_p + allcaps + exclamationpoints + questionmarks + ",
  paste(paste0("Topic_", 1:30, "_n_oI"), collapse=" + "), collapse=""))
fit_BCE <- glm(BCE_eq,
               data=df[df$Test==0,],  # training observations only
               family=binomial)
summary(fit_BCE)
##
## Call:
## glm(formula = BCE_eq, family = binomial, data = df[df$Test ==
##     0, ])
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0887  -0.2212  -0.1478  -0.0940   3.5401
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.032e+00  3.872e+00  -2.074  0.03806 *
## logtotasset  3.879e-01  4.554e-02   8.519  < 2e-16 ***
## rsst_acc    -1.938e-01  3.055e-01  -0.634  0.52593
## chg_recv     8.581e-01  1.071e+00   0.801  0.42296
## chg_inv     -2.607e-01  1.223e+00  -0.213  0.83119
## soft_assets  2.555e+00  3.796e-01   6.730  1.7e-11 ***
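The BCE formula is assembled from strings because typing thirty topic variable names by hand is error prone. A minimal sketch of the same pattern with only three topic variables and two financial variables follows; the shortened variable list and the object names `topic_vars`, `rhs`, and `eq` are chosen here purely for illustration.

```r
# Generate the repetitive variable names programmatically
topic_vars <- paste0("Topic_", 1:3, "_n_oI")
topic_vars

# Join all right-hand-side variables with " + "
rhs <- paste(c("logtotasset", "rsst_acc", topic_vars), collapse = " + ")

# Convert the finished string into a formula object usable by glm()
eq <- as.formula(paste("AAER ~", rhs))
eq

# Variables referenced by the formula, response first
all.vars(eq)
```

The resulting object behaves like a formula typed by hand, so it can be passed directly as the first argument of glm() or model.matrix().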

ROC
##     In sample AUC Out of sample AUC
##         0.7941841         0.7599594

Comparison across all models
##        1990s         2011        2000s 2000s + 2011          BCE
##    0.7483132    0.7445378    0.6377783    0.7664115    0.7941841

Simplifying models with LASSO

What is LASSO?
▪ Least Absolute Shrinkage and Selection Operator
  ▪ Least absolute: uses an error term like |ε|
  ▪ Shrinkage: it will make coefficients smaller
    ▪ Less sensitive → less overfitting issues
  ▪ Selection: it will completely remove some variables
    ▪ Fewer variables → less overfitting issues
▪ Sometimes called L1 regularization
  ▪ L1 means 1 dimensional distance, i.e., |ε|
▪ This is how we can, in theory, put more variables in our model than data points
Great if you have way too many inputs in your model

How does it work?
▪ Add an additional penalty term that is increasing in the absolute value of each β
  ▪ Incentivizes lower βs, shrinking them
▪ The selection part is explainable geometrically
$$\min_{\beta \in \mathbb{R}} \left\{ \frac{1}{N} \lVert \varepsilon \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$$
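To see both shrinkage and selection in the simplest possible setting, consider a single coefficient with unpenalized estimate b and the one-dimensional problem of minimizing (1/2)(b − β)² + λ|β|. Its solution is the soft-threshold rule sign(b)·max(|b| − λ, 0). This is a simplified stand-in for the full LASSO problem, not how glmnet is implemented, and `soft_threshold()` is a helper defined only for this sketch.

```r
# Soft-thresholding: closed-form solution of the one-dimensional problem
#   minimize over beta: 0.5 * (b - beta)^2 + lambda * abs(beta)
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b <- c(-2, -0.3, 0.1, 0.8, 3)       # unpenalized estimates
shrunk <- soft_threshold(b, lambda = 0.5)
shrunk                              # large values shrink, small ones become 0

# Numerical check for one coefficient using a generic optimizer
obj <- function(beta) 0.5 * (0.8 - beta)^2 + 0.5 * abs(beta)
optimize(obj, interval = c(-5, 5))$minimum
```

Coefficients with |b| below λ are set exactly to zero (selection), while the rest move toward zero by λ (shrinkage). A squared penalty would only shrink and never zero out.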

Package for LASSO
▪ glmnet
  1. For all regression commands, they expect a y vector and an x matrix instead of our usual y ~ x formula
    ▪ R has a helper function to convert a formula to a matrix: model.matrix()
      ▪ Supply it the right hand side of the equation, starting with ~, and your data
      ▪ It outputs the matrix x
    ▪ Alternatively, use as.matrix() on a data frame of your input variables
  2. Its family argument should be specified in quotes, i.e., "binomial" instead of binomial
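A small base R sketch of the formula-to-matrix step described above, on a made-up data frame (the names `toy`, `a`, and `g` are illustrative only). It shows why model.matrix() is convenient: factors are expanded into indicator columns automatically, and the intercept column can be dropped with [, -1].

```r
# Toy data with one numeric and one categorical input
toy <- data.frame(y = c(0, 1, 0, 1, 1),
                  a = c(1.2, 0.7, 3.1, 2.2, 0.5),
                  g = factor(c("u", "v", "u", "w", "v")))

# Right hand side only, starting with ~
x <- model.matrix(~ a + g, data = toy)[, -1]  # [, -1] drops the intercept
y <- toy$y

colnames(x)  # the factor g becomes indicator columns for levels v and w
dim(x)
```

The resulting x and y are in the shape that matrix-based fitting functions expect.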

What else can the package do?
Ridge regression
▪ Similar to LASSO, but with an L2 penalty (Euclidean norm)
Elastic net regression
▪ Hybrid of LASSO and Ridge
▪ Illustrated on the slide with an image by Jared Lander
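For reference, the three penalties can be written as one family. The sketch below uses the same loss notation as the LASSO objective on the earlier slide together with the mixing parameter α from glmnet's documented penalty; the exact scaling of the loss term inside glmnet depends on the family, so treat the first term as schematic.

```latex
\min_{\beta} \left\{ \frac{1}{N} \lVert \varepsilon \rVert_2^2
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right] \right\}
% alpha = 1:      penalty is lambda * ||beta||_1           (LASSO)
% alpha = 0:      penalty is (lambda / 2) * ||beta||_2^2   (ridge)
% 0 < alpha < 1:  a weighted mix of the two                (elastic net)
```

Only the L1 part can set coefficients exactly to zero, so ridge shrinks without selecting, while elastic net does some of both.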

How to run a LASSO
▪ To run a simple LASSO model, use glmnet()
▪ Let’s LASSO the BCE model
▪ Note: the model selection can be more elegantly done using the useful package, see here for an example

library(glmnet)
x <- model.matrix(BCE_eq, data=df[df$Test==0,])[,-1]  # [,-1] to remove intercept
y <- model.frame(BCE_eq, data=df[df$Test==0,])[,"AAER"]
fit_LASSO <- glmnet(x=x, y=y,
                    family="binomial",
                    alpha=1  # Specifies LASSO. alpha = 0 is ridge
                    )

Visualizing Lasso
plot(fit_LASSO)

What’s under the hood?
print(fit_LASSO)
##
## Call:  glmnet(x = x, y = y, family = "binomial", alpha = 1)
##
##       Df      %Dev    Lambda
##  [1,]  0 1.312e-13 1.433e-02
##  [2,]  1 8.060e-03 1.305e-02
##  [3,]  1 1.461e-02 1.189e-02
##  [4,]  1 1.995e-02 1.084e-02
##  [5,]  2 2.471e-02 9.874e-03
##  [6,]  2 3.219e-02 8.997e-03
##  [7,]  2 3.845e-02 8.197e-03
##  [8,]  2 4.371e-02 7.469e-03
##  [9,]  2 4.813e-02 6.806e-03
## [10,]  3 5.224e-02 6.201e-03
## [11,]  3 5.591e-02 5.650e-03
## [12,]  4 5.906e-02 5.148e-03
## [13,]  4 6.249e-02 4.691e-03
## [14,]  5 6.573e-02 4.274e-03
## [15,]  7 6.894e-02 3.894e-03
## [16,]  8 7.224e-02 3.548e-03
## [17,] 10 7.522e-02 3.233e-03
## [18,] 12 7.834e-02 2.946e-03
## [19,] 15 8.156e-02 2.684e-03
## [20,] 15 8.492e-02 2.446e-03
## [21,] 15 8.780e-02 2.229e-03
## [22,] 15 9.026e-02 2.031e-03
## [23,] 18 9.263e-02 1.850e-03
## [24,] 20 9.478e-02 1.686e-03
## [25,] 22 9.689e-02 1.536e-03

One of the 100 models
# coef(fit_LASSO, s=0.002031)
coefplot(fit_LASSO, lambda=0.002031, sort='magnitude')

How does this perform?
# na.pass has model.matrix retain NA values (so the # of rows is constant)
xp <- model.matrix(BCE_eq, data=df, na.action='na.pass')[,-1]
# s= specifies the version of the model to use
pred <- predict(fit_LASSO, xp, type="response", s=0.002031)
##     In sample AUC Out of sample AUC
##         0.7593828         0.7239785

Automating model selection
▪ LASSO seems nice, but picking between the 100 models is tough!
▪ It also contains a method of k-fold cross validation (default, k = 10)
  1. Randomly splits the data into k groups
  2. Runs the algorithm on 90% of the data (k − 1 groups)
  3. Determines the best model
  4. Repeat steps 2 and 3 k − 1 more times
  5. Uses the best overall model across all k hold out samples
▪ It gives 2 model options:
  ▪ "lambda.min": The best performing model
  ▪ "lambda.1se": The simplest model within 1 standard error of "lambda.min"
    ▪ This is the better choice if you are concerned about overfitting
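The k-fold procedure above can be sketched by hand in base R. This is a simplified illustration, not what cv.glmnet() does internally: the simulated data, the two candidate formulas (standing in for different values of lambda), and the helper `cv_deviance()` are all assumptions made for the example, and models are compared on average held-out deviance.

```r
set.seed(42)
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + toy$x1))

k    <- 10
fold <- sample(rep(1:k, length.out = n))  # step 1: random split into k groups

# Candidate models to choose between (stand-ins for different lambdas)
candidates <- list(small = y ~ x1,
                   large = y ~ x1 + x2 + I(x1 * x2))

# Average held-out binomial deviance for one candidate formula
cv_deviance <- function(f) {
  dev <- numeric(k)
  for (i in 1:k) {
    # step 2: fit on the other k - 1 groups
    fit <- glm(f, data = toy[fold != i, ], family = binomial)
    # step 3: score on the held-out group
    p   <- predict(fit, toy[fold == i, ], type = "response")
    yy  <- toy$y[fold == i]
    dev[i] <- -2 * mean(yy * log(p) + (1 - yy) * log(1 - p))
  }
  mean(dev)  # steps 4 and 5: combine results across all k hold out samples
}

scores <- sapply(candidates, cv_deviance)
scores
names(which.min(scores))  # the candidate with the lowest held-out deviance
```

Every observation is used for validation exactly once, which is what makes the comparison less sensitive to any single lucky or unlucky split.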

Running a cross validated model
# Cross validation
set.seed(697435)  # for reproducibility
cvfit = cv.glmnet(x=x, y=y, family="binomial", alpha=1, type.measure="auc")
plot(cvfit)
cvfit$lambda.min
## [1] 0.00139958
cvfit$lambda.1se
## [1] 0.002684268

Models
[Figure: two panels, lambda.min and lambda.1se]

CV LASSO performance
# s= specifies the version of the model to use
pred <- predict(cvfit, xp, type="response", s="lambda.min")
pred2 <- predict(cvfit, xp, type="response", s="lambda.1se")
##     In sample AUC, lambda.min Out of sample AUC, lambda.min
##                     0.7665463                     0.7330212
##     In sample AUC, lambda.1se Out of sample AUC, lambda.1se
##                     0.7509946                     0.7124231

Drawbacks of LASSO
1. No p-values on coefficients
  ▪ Simple solution: run the resulting model with glm()
  ▪ Solution only if using family="gaussian":
    ▪ Run the lasso using the lars package
      ▪ m <- lars(x=x, y=y, type="lasso")
    ▪ Then test coefficients using the covTest package
      ▪ covTest(m, x, y)
2. Generally worse in sample performance
3. Sometimes worse out of sample performance (short run)
  ▪ BUT: predictions will be more stable

Wrap up

Predicting fraud
What other data could we use to predict corporate fraud?
▪ What is the reason that this event or data would be useful for prediction?
  ▪ I.e., how does it fit into your mental model?
▪ What if we were…
  ▪ Auditors?
  ▪ Internal auditors?
  ▪ Regulators?
  ▪ Investors?

End matter

For next week
▪ Next week:
  ▪ Break week
▪ For two weeks from now:
  ▪ Third individual assignment
    ▪ On binary prediction
    ▪ Finish by the end of Thursday
    ▪ Can be done in pairs
    ▪ Submit on eLearn
  ▪ Datacamp
    ▪ Practice a bit more to keep up to date
      ▪ Using R more will make it more natural

Packages used for these slides
▪ coefplot
▪ glmnet
▪ kableExtra
▪ knitr
▪ magrittr
▪ revealjs
▪ ROCR
▪ tidyverse