acct 420: logistic regression for corporate fraud · 2. pretend as though the money was invested 3....

ACCT420:Logistic

RegressionforCorporate

Fraud

Session7

Dr.RichardM.Crowley

1

Frontmatter

2 . 1

▪ Theory:

▪ Economics

▪ Psychology

▪ Application:

▪ Predictingfraudcontained

inannualreports

▪ Methodology:

▪ Logisticregression

▪ LASSO

Learningobjectives

2 . 2

Datacamp

▪ Exploreonyourown

▪ Nospecificrequiredclassthisweek

▪ Wewillstarthavingsomeassignedchaptersafterthebreak

▪ I’vepostthemalready,soyoucanworkonthematyourleisure

2 . 3

Corporate/SecuritiesFraud

3 . 1

Traditionalaccountingfraud

1. Acompanyisunderperforming

2. Managementcooksupsomeschemetoincreaseearnings

▪ Worldcom(1999-2001)

▪ Fakerevenueentries

▪ Capitalizinglinecosts(shouldbeexpensed)

▪ Olympus(late1980s-2011):Hidelossesinaseparateentity

▪ “Tobashischeme”

▪ WellsFargo(2011-2018?)

▪ Fake/duplicatecustomersandtransactions

3. Createaccountingstatementsusingthefakeinformation

3 . 2

Reversingit

1. Acompanyisoverperforming

2. Managementcooksupaschemeto“saveup”excessperformancefor

arainyday

▪

▪ Cookiejarreserve,fromsecretpaymentsbyIntel,madeupto

76%ofquarterlyincome

▪

3. Recognizerevenue/earningswhenneededinthefuturetohitearnings

targets

Dell(2002-2007)

Brystol-MyersSquibb(2000-2001)

3 . 3

https://www.economist.com/newsbook/2010/07/23/taking-away-dells-cookie-jar

https://www.sec.gov/news/press/2004-105.htm

Otheraccountingfraudtypes

▪

▪ Optionsbackdating

▪

▪ Usinganauditorthatisn’tregistered

▪

▪ Releasingfinancialstatementsthatwerenotreviewedbyanauditor

▪

▪ Relatedpartytransactions(transferringfundstofamilymembers)

▪ Insufficientinternalcontrols

▪ viaBanamex

▪

Apple(2001)

CommerceGroupCorp(2003)

CardiffInternational(2017)

ChinaNorthEastPetroleumHoldingsLimited

Citigroup(2008-2014)

AsiaPacificBreweries

3 . 4

https://www.sec.gov/news/press/2007/2007-70.htm

https://dart.deloitte.com/USDART/resource/b44c3afb-3f7f-11e6-95db-51a9f8be3f47

https://www.sec.gov/litigation/admin/2018/34-84258.pdf

https://www.sec.gov/litigation/litreleases/2012/lr22552.htm


http://eresources.nlb.gov.sg/infopedia/articles/SIP_422_2005-01-25.html

Otheraccountingfraudtypes

▪

▪ Round-tripping:Transactionstoinflaterevenuethathaveno

substance

▪ Bribery

▪ :$55MUSDinbribestoBrazilianofficials

forcontracts

▪ BakerHughes( , ):PaymentstoofficialsinIndonesia,and

possiblytoBrazilandIndia(2001)andtoofficialsinAngola,

Indonesia,Nigeria,Russia,andUzbekistan(2007)

▪ :Fakethewholecompany,getfundingfrom

insurancefraud,theft,creditcardfraud,andfakecontracts

▪ Alsofakedarealprojecttogetacleanaudittotakethecompany

public

SupremaSpecialties(1998-2001)

KeppelO&M(2001-2014)

2001 2007

ZZZZBest(1982-1987)

3 . 5

https://www.sec.gov/news/press/2004-2.htm

https://www.channelnewsasia.com/news/business/keppel-o-m-bribery-case-what-you-need-to-know-9836154

https://www.sec.gov/litigation/admin/34-44784.htm

https://www.nytimes.com/2007/04/27/business/worldbusiness/27settle.html

https://www.nytimes.com/1990/02/25/books/nothing-but-zzzz-best.html

Othersecuritiesfraudtypes

▪ :Ponzischeme

1. Getmoneyfromindividualsfor“investments”

2. Pretendasthoughthemoneywasinvested

3. Usenewinvestors’moneytopaybackanyonewithdrawingtheir

money

▪

▪ Materialmisstatements

▪ Materialomissions(FDAapplications,didn’tpaypayrolltaxes)

▪

▪ Failedtofileannualandquarterlyreports

▪

▪ Aidinganothercompany’sfraud(TakeTwo,byparking2video

games)

▪

▪ MisleadingstatementsonTwitter

BernardMadoff

ImagingDiagnosticSystems(2013)

AppliedWellnessCorporation(2008)

CapitolDistributingLLC

Tesla(2018)

3 . 6

https://www.nytimes.com/2009/01/25/business/25bernie.html

https://www.sec.gov/litigation/litreleases/2013/lr22801.htm

https://www.sec.gov/litigation/admin/2010/34-61344a.pdf


https://www.sec.gov/news/press-release/2018-219

Someofthemoreinterestingcases

▪

▪ Claimeditwasdevelopingprocessormicrocodeindependently,

whenitactuallyprovidedIntel’smicrocodetoit’sengineers

▪

▪ Shamsale-leasebackofabartoacorporateofficer

▪

▪ Notusingmark-to-marketaccountingtofairvaluestuffedanimal

inventories

▪

▪ Goldreserveswereactually…dirt.

▪

▪ Employeescreated1,280fakememberships,soldthem,and

retainedallprofits($37.5M)

AMD(1992-1993)

Am-PacInternational(1997)

CVS(2000)

CountrylandWellnessResorts,Inc.(1997-2000)

KeppelClub(2014)

3 . 7

https://www.sec.gov/litigation/admin/3437730.txt

https://www.sec.gov/litigation/litreleases/lr17024.htm


https://www.sec.gov/litigation/litreleases/lr16732.htm

https://www.straitstimes.com/singapore/courts-crime/keppel-club-duo-convicted-for-37m-membership-scam

Whatwillwelookattoday?

Misstatementsthataffectfirms’accountingstatements

andweredoneseeminglyintentionallybymanagement

orotheremployeesatthefirm.

3 . 8

Howdomisstatementscometolight?

1. Thecompany/managementadmitstoitpublicly

2. Agovernmententityforcesthecompanytodisclose

▪ Inmoreegregiouscases,governmentagenciesmaydisclosethe

fraudpubliclyaswell

3. Investorssuethefirm,forcingdisclosure

3 . 9

Wherearethesedisclosed?

IntheUS:

1. 10-K/Afilings(/Ameansamendment)

▪ Note:notall10-K/Afilingsarecausedbyfraud!

▪ Anybenigncorrectionoradjustmentcanalsobefiledasa10-K/A

▪

2. Inanoteinsidea10-Kfiling

▪ Thesearesometimesreferredtoas“littler”restatements

3. :AccountingandAuditingEnforcementReleases

▪ Generallyhighlightlargerormoreimportantcases

▪ WrittenbytheSEC,notthecompany

AuditAnalytic’swrite-uponthisfor2017

SECAAERs

3 . 10

https://www.auditanalytics.com/blog/reasons-for-an-amended-10-k-2017/

https://www.sec.gov/divisions/enforce/friactions.shtml

AAERs

▪ TodaywewillexaminetheseAAERs

▪ Usingaproprietarydatasetof>1,000suchreleases

▪ Togetasenseofthedatawe’reworkingwith,readtheSummary

section(startingonpage2)ofthisAAERagainstSanofi

▪ rmc.link/420class7

WhydidtheSECreleasethisAAERregardingSanofi?

3 . 11


PredictingFraud

4 . 1

Mainquestion

▪ Thisisapureforensicanalyticsquestion

▪ “Majorinstanceofmisreporting”willbeimplementedusingAAERs

Howcanwedetectifafirmisinvolvedinamajorinstance

ofmissreporting?

4 . 2

Approaches

▪ Intheseslides,I’llwalkthroughtheprimarydetectionmethodssince

the1990s,uptocurrentlyusedmethods

▪ 1990s:Financialsandfinancialratios

▪ Followupin2011

▪ Late2000s/early2010s:Characteristicsoffirm’sdisclosures

▪ mid2010s:Moreholistictext-basedmeasuresofdisclosures

▪ Thiswilltietonextlessonwherewewillexplorehowtoworkwith

text

Allofthesearediscussedina

–IwillrefertothepaperasBCEforshort

Brown,CrowleyandElliott

(2018)

4 . 3

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803733

Thedata

▪ Ihaveprovidedsomepreprocesseddata,sanitizedofAAERdata

(whichispartiallypublic,partiallyproprietary)

▪ Itcontains399variables

▪ FromCompustat,CRSP,andtheSEC(whichIpersonallycollected)

▪ Manyprecalculatedmeasuresincluding:

▪ Firmcharacteristics,suchasauditortype(bigNaudit,midNaudit)

▪ Financialmeasures,suchastotalaccruals(rsst_acc)

▪ Financialratios,suchasROA(ni_at)

▪ Annualreportcharacteristics,suchasthemeansentencelength

(sentlen_u)

▪ Machinelearningbasedcontentanalysis(everythingwithTopic_

prepended)

PulledfromBCE’sworkingfiles

4 . 4

TrainingandTesting

▪ AlreadyhastestingandtrainingsetupinvariableTest

▪ Trainingisannualreportsreleasedin2003through2007

▪ Testingisannualreportsreleasedin2008

Whatpotentialissuesaretherewithourusualtrainingand

testingstrategy?

4 . 5

Censoring

▪ Censoringtrainingdatahelpstoemulatehistoricalsituations

▪ Buildanalgorithmusingonlythedatathatwasavailableatthetime

adecisionwouldneedtohavebeenmade

▪ Donotcensorthetestingdata

▪ Testingemulateswherewewanttomakeanoptimalchoiceinreal

life

▪ Wewanttofindfraudsregardlessofhowwellhiddentheyare!

4 . 6

Eventfrequency

▪ Veryloweventfrequenciescanmakethingstricky

year total_AAERS total_observations

1999 46 2195

2000 50 2041

2001 43 2021

2002 50 2391

2003 57 2936

2004 49 2843

df%>%

group_by(year)%>%

mutute(total_AAERS=sum(AAER),total_observations=n())%>%

slice(1)%>%

ungroup()%>%

select(year,total_AAERS,total_observations)%>%

html_df

246AAERsinthetrainingdata,401totalvariables…

4 . 7

Dealingwithinfrequentevents

▪ Afewwaystohandlethis

1. Verycarefulmodelselection(keepitsufficientlysimple)

2. Sophisticateddegeneratevariableidentificationcriterion+

simulationtoimplementcomplexmodelsthatarejustbarely

simpleenough

▪ ThemainmethodinBCE

3. Automatedmethodologiesforpairingdownmodels

▪ We’lldiscussusingLASSOforthisattheendofclass

▪ AlsoimplementedinBCE

4 . 8

1990sapproach

5 . 1

▪ EBIT

▪ Earnings/revenue

▪ ROA

▪ Logofliabilities

▪ liabilities/equity

▪ liabilities/assets

▪ quickratio

▪ Workingcapital/assets

▪ Inventory/revenue

▪ inventory/assets

▪ earnings/PP&E

▪ A/R/revenue

▪ Changeinrevenue

▪ ChangeinA/R+1

▪ > 10%changeinA/R

▪ Changeingrossprofit+1

▪ > 10%changeingross

profit

▪ Grossprofit/assets

▪ Revenueminusgrossprofit

▪ Cash/assets

▪ Logofassets

▪ PP&E/assets

▪ Workingcapital

The1990smodel

▪ Manyfinancialmeasuresandratioscanhelptopredictfraud

5 . 2

Approach

fit_1990s<-glm(AAER~ebit+ni_revt+ni_at+log_lt+ltl_at+lt_seq+

lt_at+act_lct+aq_lct+wcap_at+invt_revt+invt_at+

ni_ppent+rect_revt+revt_at+d_revt+b_rect+b_rect+

r_gp+b_gp+gp_at+revt_m_gp+ch_at+log_at+

ppent_at+wcap,

data=df[df$Test==0,],

family=binomial)summury(fit_1990s)

####Call:##glm(formula=AAER~ebit+ni_revt+ni_at+log_lt+ltl_at+##lt_seq+lt_at+act_lct+aq_lct+wcap_at+invt_revt+##invt_at+ni_ppent+rect_revt+revt_at+d_revt+b_rect+##b_rect+r_gp+b_gp+gp_at+revt_m_gp+ch_at+log_at+##ppent_at+wcap,family=binomial,data=df[df$Test==##0,])####DevianceResiduals:##Min1QMedian3QMax##-1.1391-0.2275-0.1661-0.11903.6236####Coefficients:##EstimateStd.ErrorzvaluePr(>|z|)##(Intercept)-4.660e+008.336e-01-5.5912.26e-08***##ebit-3.564e-041.094e-04-3.2570.00112**##ni_revt3.664e-023.058e-021.1980.23084##ni_at-3.196e-012.325e-01-1.3740.16932##log_lt1.494e-013.409e-010.4380.66118##ltl_at-2.306e-017.072e-01-0.3260.74438

5 . 3

ROC

##InsampleAUCOutofsampleAUC##0.74831320.7292981

5 . 4

The2011followup

6 . 1

▪ Logofassets

▪ Totalaccruals

▪ %changeinA/R

▪ %changeininventory

▪ %softassets

▪ %changeinsalesfromcash

▪ %changeinROA

▪ Indicatorforstock/bond

issuance

▪ Indicatorforoperatingleases

▪ BVequity/MVequity

▪ Lagofstockreturnminus

valueweightedmarketreturn

▪ BelowareBCE’sadditions

▪ Indicatorformergers

▪ IndicatorforBigNauditor

▪ Indicatorformediumsize

auditor

▪ Totalfinancingraised

▪ Netamountofnewcapital

raised

▪ Indicatorforrestructuring

The2011model

BasedonDechow,Ge,LarsonandSloan(2011)

6 . 2

https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1911-3846.2010.01041.x

Themodel

fit_2011<-glm(AAER~logtotasset+rsst_acc+chg_recv+chg_inv+

soft_assets+pct_chg_cashsales+chg_roa+issuance+

oplease_dum+book_mkt+lag_sdvol+merger+bigNaudit+

midNaudit+cffin+exfin+restruct,


family=binomial)summury(fit_2011)

####Call:##glm(formula=AAER~logtotasset+rsst_acc+chg_recv+chg_inv+##soft_assets+pct_chg_cashsales+chg_roa+issuance+oplease_dum+##book_mkt+lag_sdvol+merger+bigNaudit+midNaudit+cffin+##exfin+restruct,family=binomial,data=df[df$Test==##0,])####DevianceResiduals:##Min1QMedian3QMax##-0.8434-0.2291-0.1658-0.11963.2614####Coefficients:##EstimateStd.ErrorzvaluePr(>|z|)##(Intercept)-7.14745580.5337491-13.391<2e-16***##logtotasset0.32143220.03554679.043<2e-16***##rsst_acc-0.21900950.3009287-0.7280.4667##chg_recv1.10207401.05908371.0410.2981##chg_inv0.03895041.25071420.0310.9752##soft_assets2.30945510.33257316.9443.81e-12***##pct_chg_cashsales-0.00069120.0108771-0.0640.9493##chg_roa-0.26979840.2554262-1.0560.2908##issuance0.14438410.31876060.4530.6506

6 . 3

ROC


6 . 4

Late2000s/early2010sapproach

7 . 1

▪ Logof#ofbulletpoints+1

▪ #ofcharactersinfileheader

▪ #ofexcessnewlines

▪ Amountofhtmltags

▪ Lengthofcleanedfile,

characters

▪ Meansentencelength,words

▪ S.D.ofwordlength

▪ S.D.ofparagraphlength

(sentences)

▪ Wordchoicevariation

▪ Readability

▪ ColemanLiauIndex

▪ FogIndex

▪ %activevoicesentences

▪ %passivevoicesentences

▪ #ofallcapwords

▪ #of!

▪ #of?

Thelate2000s/early2010smodel

Fromavarietyofpapers

7 . 2

Theory

▪ Generallypulledfromthecommunicationsliterature

▪ Sometimesadhoc

▪ Themainidea:

▪ Companiesthataremisreportingprobablywritetheirannualreport

differently

7 . 3

Thelate2000s/early2010smodel

fit_2000s<-glm(AAER~bullets+headerlen+newlines+alltags+

processedsize+sentlen_u+wordlen_s+paralen_s+

repetitious_p+sentlen_s+typetoken+clindex+fog+

active_p+passive_p+lm_negative_p+lm_positive_p+

allcaps+exclamationpoints+questionmarks,


family=binomial)summury(fit_2000s)

####Call:##glm(formula=AAER~bullets+headerlen+newlines+alltags+##processedsize+sentlen_u+wordlen_s+paralen_s+repetitious_p+##sentlen_s+typetoken+clindex+fog+active_p+passive_p+##lm_negative_p+lm_positive_p+allcaps+exclamationpoints+##questionmarks,family=binomial,data=df[df$Test==0,##])####DevianceResiduals:##Min1QMedian3QMax##-0.9604-0.2244-0.1984-0.17493.2318####Coefficients:##EstimateStd.ErrorzvaluePr(>|z|)##(Intercept)-5.662e+003.143e+00-1.8010.07165.##bullets-2.635e-052.625e-05-1.0040.31558##headerlen-2.943e-043.477e-04-0.8460.39733##newlines-4.821e-051.220e-04-0.3950.69271##alltags5.060e-082.567e-070.1970.84376##processedsize5.709e-061.287e-064.4359.19e-06***

7 . 4

ROC


7 . 5

Combiningthe2000sand2011models

▪ 2011model:Parsimoniousfinancialmodel

▪ 2000smodel:Textualcharacteristics

Whyisitappropriatetocombinethe2011modelwiththe

2000smodel?

7 . 6

Themodel

fit_2000f<-glm(AAER~logtotasset+rsst_acc+chg_recv+chg_inv+

soft_assets+pct_chg_cashsales+chg_roa+issuance+

oplease_dum+book_mkt+lag_sdvol+merger+bigNaudit+

midNaudit+cffin+exfin+restruct+bullets+headerlen+

newlines+alltags+processedsize+sentlen_u+wordlen_s+

paralen_s+repetitious_p+sentlen_s+typetoken+

clindex+fog+active_p+passive_p+lm_negative_p+

lm_positive_p+allcaps+exclamationpoints+questionmarks,


family=binomial)summury(fit_2000f)

####Call:##glm(formula=AAER~logtotasset+rsst_acc+chg_recv+chg_inv+##soft_assets+pct_chg_cashsales+chg_roa+issuance+oplease_dum+##book_mkt+lag_sdvol+merger+bigNaudit+midNaudit+cffin+##exfin+restruct+bullets+headerlen+newlines+alltags+##processedsize+sentlen_u+wordlen_s+paralen_s+repetitious_p+##sentlen_s+typetoken+clindex+fog+active_p+passive_p+##lm_negative_p+lm_positive_p+allcaps+exclamationpoints+##questionmarks,family=binomial,data=df[df$Test==0,##])####DevianceResiduals:##Min1QMedian3QMax##-0.9514-0.2237-0.1596-0.11103.3882####Coefficients:##EstimateStd.ErrorzvaluePr(>|z|)

7 . 7

ROC


7 . 8

TheBCEmodel

8 . 1

TheBCEapproach

▪ Retainthevariablesfromtheotherregressions

▪ Addinamachine-learningbasedmeasurequantifyinghowmuch

documentstalkedaboutdifferenttopicscommonacrossallfilings

▪ Learnedonjustthe1999-2003filings

8 . 2

Whatthetopicslooklike

8 . 3

TheorybehindtheBCEmodel

Whyusedocumentcontent?

8 . 4

Themodel

BCE_eq=us.formulu(puste("AAER~logtotasset+rsst_acc+chg_recv+chg_inv+

soft_assets+pct_chg_cashsales+chg_roa+issuance+oplease_dum+book_mkt+lag_sdvol+merger+bigNaudit+midNaudit+cffin+exfin+restruct+bullets+headerlen+newlines+alltags+processedsize+sentlen_u+wordlen_s+paralen_s+repetitious_p+sentlen_s+typetoken+clindex+fog+active_p+passive_p+lm_negative_p+lm_positive_p+allcaps+exclamationpoints+questionmarks+",puste(puste0("Topic_",1:30,"_n_oI"),collapse="+"),collapse=""))

fit_BCE<-glm(BCE_eq,


family=binomial)summury(fit_BCE)

####Call:##glm(formula=BCE_eq,family=binomial,data=df[df$Test==##0,])####DevianceResiduals:##Min1QMedian3QMax##-1.0887-0.2212-0.1478-0.09403.5401####Coefficients:##EstimateStd.ErrorzvaluePr(>|z|)##(Intercept)-8.032e+003.872e+00-2.0740.03806*##logtotasset3.879e-014.554e-028.519<2e-16***##rsst_acc-1.938e-013.055e-01-0.6340.52593##chg_recv8.581e-011.071e+000.8010.42296##chg_inv-2.607e-011.223e+00-0.2130.83119##soft_assets2.555e+003.796e-016.7301.7e-11***

8 . 5

ROC


8 . 6

Comparisonacrossallmodels

##1990s20112000s2000s+2011BCE##0.74831320.74453780.63777830.76641150.7941841

8 . 7

SimplifyingmodelswithLASSO

9 . 1

WhatisLASSO?

▪ LeastAbsoluteShrinkageandSelectionOperator

▪ Leastabsolute:usesanerrortermlike ∣ε∣

▪ Shrinkage:itwillmakecoefficientssmaller

▪ Lesssensitive→lessoverfittingissues

▪ Selection:itwillcompletelyremovesomevariables

▪ Lessvariables→lessoverfittingissues

▪ Sometimescalled L regularization

▪ L means1dimensionaldistance,i.e., ∣ε∣

▪ Thisishowwecan,intheory,putmorevariablesinourmodelthan

datapoints

1

1

Greatifyouhavewaytoomanyinputsinyourmodel

9 . 2

▪ Addanadditionalpenalty

termthatisincreasinginthe

absolutevalueofeach β

▪ Incentivizeslower βs,

shrinkingthem

▪ Theselectionispartis

explainablegeometrically

Howdoesitwork?

ε + λ ββ∈R

min {N

1∣ ∣2

2 ∣ ∣1}

9 . 3

PackageforLASSO

▪

1. Forallregressioncommands,theyexpectayvectorandanxmatrix

insteadofourusualy~xformula

▪ Rhasahelperfunctiontoconvertaformulatoamatrix:

model.matrix()

▪ Supplyittherighthandsideoftheequation,startingwith~,and

yourdata

▪ Itoutputsthematrixx

▪ Alternatively,useas.matrix()onadataframeofyourinput

variables

2. It’sfamilyargumentshouldbespecifiedinquotes,i.e.,"binomial"

insteadofbinomial

glmnet

9 . 4

https://cran.r-project.org/web/packages/glmnet/index.html

Ridgeregression

▪ SimilartoLASSO,butwithan

L penalty(Euclideannorm)

Elasticnetregression

▪ HybridofLASSOandRidge

▪ Belowimageby

Whatelsecanthepackagedo?

2 JaredLander

9 . 5

https://jaredlander.com/content/2015/11/LassoForEveryone.html

HowtorunaLASSO

▪ TorunasimpleLASSOmodel,useglmnet()

▪ Let’sLASSOtheBCEmodel

▪ Note:themodelselectioncanbemoreelegantlydoneusingthe

package,

librury(glmnet)

x<-model.mutrie(BCE_eq,data=df[df$Test==0,])[,-1]#[,-1]toremoveintercept

y<-model.frume(BCE_eq,data=df[df$Test==0,])[,"AAER"]

fit_LASSO<-glmnet(x=x,y=y,

family="binomial",alpha=1#SpecifiesLASSO.alpha=0isridge)

useful

seehereforanexample

9 . 6

https://cran.r-project.org/web/packages/useful/index.html

https://www.jaredlander.com/2018/02/using-coefplot-with-glmnet/

VisualizingLasso

plot(fit_LASSO)

9 . 7

What’sunderthehood?

print(fit_LASSO)

####Call:glmnet(x=x,y=y,family="binomial",alpha=1)####Df%DevLambda##[1,]01.312e-131.433e-02##[2,]18.060e-031.305e-02##[3,]11.461e-021.189e-02##[4,]11.995e-021.084e-02##[5,]22.471e-029.874e-03##[6,]23.219e-028.997e-03##[7,]23.845e-028.197e-03##[8,]24.371e-027.469e-03##[9,]24.813e-026.806e-03##[10,]35.224e-026.201e-03##[11,]35.591e-025.650e-03##[12,]45.906e-025.148e-03##[13,]46.249e-024.691e-03##[14,]56.573e-024.274e-03##[15,]76.894e-023.894e-03##[16,]87.224e-023.548e-03##[17,]107.522e-023.233e-03##[18,]127.834e-022.946e-03##[19,]158.156e-022.684e-03##[20,]158.492e-022.446e-03##[21,]158.780e-022.229e-03##[22,]159.026e-022.031e-03##[23,]189.263e-021.850e-03##[24,]209.478e-021.686e-03##[25,]229.689e-021.536e-03

9 . 8

Oneofthe100models

#coef(fit_LASSO,s=0.002031)coefplot(fit_LASSO,lambda=0.002031,sort='magnitude')

9 . 9

Howdoesthisperform?

#na.passhasmodel.matrixretainNAvalues(sothe#ofrowsisconstant)xp<-model.mutrie(BCE_eq,data=df,na.action='na.pass')[,-1]

#s=specifiestheversionofthemodeltousepred<-predict(fit_LASSO,xp,type="response",s=0.002031)

##InsampleAUCOutofsampleAUC##0.75938280.7239785 9 . 10

Automatingmodelselection

▪ LASSOseemsnice,butpickingbetweenthe100modelsistough!

▪ Italsocontainsamethodof k-foldcrossvalidation(default, k = 10)

1. Randomlysplitsthedatainto kgroups

2. Runsthealgorithmon90%ofthedata( k − 1groups)

3. Determinesthebestmodel

4. Repeatsteps2and3 k − 1moretimes

5. Usesthebestoverallmodelacrossall kholdoutsamples

▪ Itgives2modeloptions:

▪ "lambda.min":Thebestperformingmodel

▪ "lambda.1se":Thesimplestmodelwithin1standarderrorof

"lambda.min"

▪ Thisisthebetterchoiceifyouareconcernedaboutoverfitting

9 . 11

Runningacrossvalidatedmodel

#Crossvalidationset.seed(697435)#forreproducibility

cvfit=cv.glmnet(x=x,y=y,family="binomial",alpha=1,type.measure="auc")

plot(cvfit) cvfit$lambda.min

##[1]0.00139958

cvfit$lambda.1se

##[1]0.002684268

9 . 12

lambda.min lambda.1se

Models

9 . 13

CVLASSOperformance

#s=specifiestheversionofthemodeltousepred<-predict(cvfit,xp,type="response",s="lambda.min")

pred2<-predict(cvfit,xp,type="response",s="lambda.1se")

##InsampleAUC,lambda.minOutofsampleAUC,lambda.min##0.76654630.7330212##InsampleAUC,lambda.1seOutofsampleAUC,lambda.1se##0.75099460.7124231

9 . 14

DrawbacksofLASSO

1. Nop-valuesoncoefficients

▪ Simplesolution–runtheresultingmodelwith

▪ Solutiononlyifusingfamily="gaussian":

▪ Runthelassousethe package

▪ m<-lars(x=x,y=y,type="lasso")

▪ Thentestcoefficientsusingthe package

▪ covTest(m,x,y)

2. Generallyworseinsampleperformance

3. Sometimesworseoutofsampleperformance(shortrun)

▪ BUT:predictionswillbemorestable

glm()

lars

covTest

9 . 15

https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/glm

https://cran.r-project.org/web/packages/lars/index.html

https://cran.r-project.org/web/packages/covTest/covTest.pdf

Wrapup

10 . 1

Predictingfraud

▪ Whatisthereasonthatthiseventordatawouldbeusefulfor

prediction?

▪ I.e.,howdoesitfitintoyourmentalmodel?

▪ Whatifwewere…

▪ Auditors?

▪ Internalauditors?

▪ Regulators?

▪ Investors?

Whatotherdatacouldweusetopredictcorporatefraud?

10 . 2

Endmatter

11 . 1

Fornextweek

▪ Nextweek:

▪ Breakweek

▪ Fortwoweeksfromnow:

▪ Thirdindividualassignment

▪ Onbinaryprediction

▪ FinishbytheendofThursday

▪ Canbedoneinpairs

▪ SubmitoneLearn

▪ Datacamp

▪ Practiceabitmoretokeepuptodate

▪ UsingRmorewillmakeitmorenatural

11 . 2

Packagesusedfortheseslides

▪

▪

▪

▪

▪

▪

▪

▪

coefplot

glmnet

kableExtra

knitr

magrittr

revealjs

ROCR

tidyverse

11 . 3

https://github.com/jaredlander/coefplot

https://cran.r-project.org/web/packages/glmnet/index.html

https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

https://yihui.name/knitr/

https://magrittr.tidyverse.org/

https://github.com/rstudio/revealjs

http://rocr.bioinf.mpi-sb.mpg.de/

https://www.tidyverse.org/

acct 420: logistic regression for corporate fraud · 2. pretend as though the money was invested 3....

Documents