Text analytics, NLP, and accounting research 2018 November 23 Dr. Richard M. Crowley [email protected] http://rmc.link/ 1


Foundations

2 . 1

What is text analytics?

Extracting meaningful information from text

▪ This could be as simple as extracting specific words/phrases/sentences
▪ This could be as complex as extracting latent (hidden) patterns or structures within text
  ▪ Sentiment
  ▪ Content
  ▪ Emotion
  ▪ Writer characteristics
  ▪ …
▪ Often called text mining (in CS) or textual analysis (in accounting)

2 . 2

What is NLP then?

NLP is a field devoted to understanding how to understand human language

▪ NLP stands for Natural Language Processing
▪ It is a very diverse field within CS
  ▪ Grammar/linguistics
  ▪ Conversations
  ▪ Conversion from audio, images
  ▪ Translation
  ▪ Dictation
  ▪ Generation

2 . 3

Why discuss NLP?

Consider the following situation:

You have a collection of 1 million sentences, and you want to know which are accounting relevant

▪ Without NLP:
  1. Hire an RA/Mechanical Turk army…
  2. Use a dictionary: Words/phrases like “earnings,” “profitability,” “net income” are likely to be in the sentences
▪ With NLP:
  1. We could associate sentences with outside data to build a classifier (supervised approach)
  2. We could ask an algorithm to learn the structure of all sentences, and then extract the useful part ex post (unsupervised)
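The dictionary route from the "Without NLP" list can be sketched in a few lines of Python. This is a minimal illustration, not a validated measure: the keyword list only reuses the three example terms from the slide, and the names `ACCOUNTING_TERMS` and `is_accounting_relevant` are hypothetical.

```python
import re

# Illustrative keyword list taken from the slide's examples; a real study
# would curate and validate its own list.
ACCOUNTING_TERMS = ["earnings", "profitability", "net income"]

# One case-insensitive pattern; \b avoids matching inside longer words.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in ACCOUNTING_TERMS) + r")\b",
    re.IGNORECASE,
)

def is_accounting_relevant(sentence: str) -> bool:
    """Return True if the sentence contains any keyword from the list."""
    return PATTERN.search(sentence) is not None

# Made-up sentences, for illustration only.
sentences = [
    "Net income rose 12% on higher earnings from services.",
    "The new office opens in March.",
    "Profitability improved despite supply constraints.",
]
flagged = [s for s in sentences if is_accounting_relevant(s)]
```

A filter like this scales to 1 million sentences easily, but it only finds what the list anticipates, which is the motivation for the supervised and unsupervised alternatives.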

2 . 4

Data that has been studied

▪ Firms
  ▪ Letters to shareholders
  ▪ Annual and quarterly reports
  ▪ 8-Ks
  ▪ Press releases
  ▪ Conference calls
  ▪ Firm websites
  ▪ Twitter posts
▪ Investors
  ▪ Blog posts
  ▪ Social media posts
▪ Intermediaries
  ▪ Newspaper articles
  ▪ Analyst reports
▪ Government
  ▪ FASB exposure drafts
  ▪ Comment letters
  ▪ IRS code
  ▪ Court cases

2 . 5

A brief history of text analytics in accounting research

3 . 1

1980s and 1990s

Manual content analysis

▪ Read through “small” amounts of text, record selected aspects

Indexes
▪ Ex.: Botosan (1997 TAR): For firms with low analyst following, more disclosure ⇒ Lower cost of equity
  ▪ Index of 35 aspects of 10-Ks
▪ Covered in detail in Cole and Jones (2004 JAL)
  ▪ Most use small samples
  ▪ Often use select industries

Readability
▪ Automated starting with Dorrell and Darsey (1991 JTWC) in accounting…
▪ At least 32 studies on this in the 1980s and early 1990s per Jones and Shoemaker (1994 JAL)
  ▪ Only 2 use full docs
  ▪ Only 2 use >100 docs

3 . 2

2000s

Automation

▪ With computer power increasing, two new avenues opened:
  1. Do the same methods as before, at scale
    ▪ Ex.: Li (2008 JAE): Readability, but with many documents instead of <100
  2. Implementing statistical techniques (often for tone/sentiment)
    ▪ For instance, sentiment classification with Naïve Bayes, SVM, or other statistical classifiers
    ▪ Antweiler and Frank (2005 JF)
    ▪ Das and Chen (2007 MS)
    ▪ Li (2010 JAR)

3 . 3

Early 2010s

Dictionaries take the helm

▪ Loughran and McDonald (2011 JF) points out the misspecification of using dictionaries from other contexts
  ▪ Also provides sets of positive, negative, modal strong/weak, litigious, and constraining words (available online)
▪ Subsequent work by the authors provides a critique:

Applying financial dictionaries “without modification to other media such as earnings calls and social media is likely to be problematic” (Loughran and McDonald 2016)

  ▪ A lot of papers ignore this critique, and are still at risk of misspecification
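A small sketch of why the choice of dictionary matters, using Python's `Counter`. The two word lists below are toy stand-ins made up for illustration; they are not the actual Harvard IV or Loughran and McDonald lists. The example follows the spirit of the 2011 paper's title: a "liability" in a 10-K is an accounting term, not bad news.

```python
import re
from collections import Counter

# Toy word lists for illustration only; NOT the real Harvard IV or
# Loughran-McDonald dictionaries.
GENERAL_NEGATIVE = {"loss", "decline", "liability", "tax", "cost"}
FINANCE_NEGATIVE = {"loss", "decline", "impairment", "restatement"}

def negative_share(text: str, negative_words: set) -> float:
    """Fraction of tokens in `text` that appear in `negative_words`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    hits = sum(counts[w] for w in negative_words)
    return hits / len(tokens) if tokens else 0.0

# A neutral, made-up accounting sentence.
text = "Deferred tax liability and cost of goods sold were unchanged."
general = negative_share(text, GENERAL_NEGATIVE)  # 3 of 10 tokens flagged
finance = negative_share(text, FINANCE_NEGATIVE)  # no tokens flagged
```

Under the general-purpose list the neutral sentence looks 30% negative; under the context-specific list it scores zero. The same logic is why a 10-K dictionary may itself misfire on earnings calls or social media.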

3 . 4

Late 2010s to present

Fragmentation and new methods

▪ Loughran and McDonald dictionaries frequently used
▪ Bog index is perhaps a new entrant in the Fog index vs document length debate
▪ LDA methods first published in Accounting/Finance in Bao and Datta (2014 MS), with a handful of other papers following suit.
▪ More methods on the horizon

3 . 5

Going forward

A lot of choices

▪ Why? Because accounting research has been behind the times, but seems to be catching up
▪ We can incorporate more than a year’s worth of innovation in NLP each year…

3 . 6

Useful methods for analytics

4 . 1

Content classification: Latent Dirichlet Allocation

▪ Latent Dirichlet Allocation, from Blei, Ng, and Jordan (2003)
▪ One of the most popular methods under the field of topic modeling
▪ LDA is a Bayesian method of assessing the content of a document
▪ LDA assumes there is a set of topics in each document, and that this set follows a Dirichlet prior for each document
  ▪ Words within topics also have a Dirichlet prior

More details from the creator
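The two Dirichlet assumptions above can be made concrete by simulating the generative story LDA assumes: draw a word distribution per topic, draw a topic mixture per document, then draw each word by first picking a topic. This is a sketch of the data-generating process only, not of model fitting; the function names and the values of `alpha` and `beta` are assumptions chosen for illustration.

```python
import random

def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len,
                    alpha=0.5, beta=0.1, seed=0):
    rng = random.Random(seed)
    # Each topic is a distribution over words, drawn from a Dirichlet prior.
    topic_word = [dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # Each document gets its own topic mixture, also from a Dirichlet prior.
        theta = dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topic_word[z])[0]    # pick a word
            words.append(w)
        docs.append(words)
        doc_topic.append(theta)
    return docs, doc_topic, topic_word

# Small illustrative vocabulary of word stems.
vocab = ["loan", "bank", "clinic", "fda", "hotel", "tenant"]
docs, doc_topic, topic_word = generate_corpus(vocab, n_topics=3, n_docs=4, doc_len=20)
```

Fitting LDA runs this story in reverse: given only `docs`, it infers plausible values for `doc_topic` and `topic_word`.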

4 . 2

Example: LDA, 10 topics, all 2014 10-Ks

# Topics generated using R's stm library
labelTopics(topics)

## Topic 1 Top Words:
##   Highest Prob: properti, oper, million, decemb, compani, interest, leas
##   FREX: ffo, efih, efh, tenant, hotel, casino, guc
##   Lift: aliansc, baluma, change-of-ownership, crj700s, directly-reimburs, escena, hhmk
##   Score: reit, hotel, game, ffo, tenant, casino, efih
## Topic 2 Top Words:
##   Highest Prob: compani, stock, share, common, financi, director, offic
##   FREX: prc, asher, shaanxi, wfoe, eit, hubei, yew
##   Lift: aagc, abramowitz, accello, akash, alix, alkam, almati
##   Score: prc, compani, penni, stock, share, rmb, director
## Topic 3 Top Words:
##   Highest Prob: product, develop, compani, clinic, market, includ, approv
##   FREX: dose, preclin, nda, vaccin, oncolog, anda, fdas
##   Lift: 1064nm, 12-001hr, 25-gaug, 2ml, 3shape, 503b, 600mg
##   Score: clinic, fda, preclin, dose, patent, nda, product
## Topic 4 Top Words:
##   Highest Prob: invest, fund, manag, market, asset, trade, interest
##   FREX: uscf, nfa, unl, uga, mlai, bno, dno
##   Lift: a-1t, aion, apx-endex, bessey, bolduc, broyhil, buran
##   Score: uscf, fhlbank, rmbs, uga, invest, mlai, ung
## Topic 5 Top Words:
##   Highest Prob: servic, report, file, program, provid, network, requir
##   FREX: echostar, fcc, fccs, telesat, ilec, starz, retransmiss
##   Lift: 1100-n, 2-usb, 2011-c1, 2012-ccre4, 2013-c9, aastra, accreditor
##   Score: entergi, fcc, echostar, wireless, broadcast, video, cabl
## Topic 6 Top Words:
##   Highest Prob: loan, bank, compani, financi, decemb, million, interest
##   FREX: nonaccru, oreo, tdrs, bancorp, fdic, charge-off, alll

4 . 3

Papers using LDA (or variants)

▪ Bao and Datta (2014 MS): Quantifying risk disclosures
▪ Bird, Karolyi, and Ma (2018 working): 8-K categorization mismatches
▪ Brown, Crowley, and Elliott (2018 working):
  ▪ Content based fraud detection
▪ Crowley (2016 working):
  ▪ Mismatch between 10-K and website disclosures
▪ Crowley, Huang, and Lu (2018 working):
  ▪ Financial disclosure on Twitter
▪ Crowley, Huang, Lu, and Luo (2018 working):
  ▪ CSR disclosure on Twitter
▪ Dyer, Lang, and Stice-Lawrence (2017 JAE):
  ▪ Changes in 10-Ks over time
▪ Hoberg and Lewis (2017 JCF): AAERs and 10-K MD&A content, ex post
▪ Huang, Lehavy, Zang, and Zheng (2018 MS):
  ▪ Analyst interpretation of conference calls

4 . 4

Sentiment: Varied

▪ General purpose word lists like Harvard IV
  ▪ Tetlock (2007 JF)
  ▪ Tetlock, Saar-Tsechansky, and Macskassy (2008 JF)
▪ Many recent papers use 10-K specific dictionaries from Loughran and McDonald (2011 JF)
▪ Some work using Naive Bayes and similar
  ▪ Antweiler and Frank (2005 JF), Das and Chen (2007 MS), Li (2010 JAR), Huang, Zang and Zheng (2014 TAR), Sprenger, Tumasjan, Sandner, and Welpe (2014 EFM)
▪ Some work using SVM
  ▪ Antweiler and Frank (2005 JF)
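For readers unfamiliar with the Naive Bayes approach mentioned above, here is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the implementation used in any of the cited papers; the class name, tokenizer, and the four hand-labeled training messages are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for w in tokenize(text):
                if w in self.vocab:  # skip words never seen in training
                    score += math.log(
                        (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                    )
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical hand-labeled messages, for illustration only.
train_texts = [
    "strong buy earnings beat expectations",
    "great quarter revenue growth",
    "sell now losses are mounting",
    "weak guidance and falling revenue",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
```

In practice one would more likely use a library classifier and a much larger labeled sample; the point here is only that the classifier learns word-to-label associations from labeled text rather than from a fixed dictionary.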

4 . 5

Sentiment: What is used in practice (CS side)

“The prevalence of polysemes in English – words that have multiple meanings – makes an absolute mapping of specific words into financial sentiment impossible.” – Loughran and McDonald (2011)

▪ Embeddings methods can make this possible
  ▪ Embeddings abstract away from words, converting words/phrases/sentences/paragraphs/documents to high dimensional vectors
    ▪ Used in Brown, Crowley, and Elliott (2018 working) (word level)
    ▪ Used in WIP by Crowley, Huang, and Lu (sentence/document level)
  ▪ Embeddings are passed to a supervised classifier to learn sentiment
▪ Other methods include weak supervision
  ▪ Such as the Joint Sentiment Topic model by Lin and He (2009 ACM) (used in Crowley (2016 working))
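A toy sketch of the embeddings-then-classifier pipeline. Everything here is an assumption made for illustration: the 3-dimensional word vectors are made up (real embeddings are learned from large corpora and have hundreds of dimensions), documents are embedded by simple averaging, and the supervised step is a nearest-centroid classifier standing in for whatever classifier a real study would use.

```python
import math

# Made-up 3-dimensional word vectors, for illustration only.
EMBEDDINGS = {
    "profit":  [0.9, 0.1, 0.0],
    "growth":  [0.8, 0.2, 0.1],
    "record":  [0.7, 0.0, 0.2],
    "loss":    [-0.9, 0.1, 0.0],
    "decline": [-0.8, 0.2, 0.1],
    "lawsuit": [-0.7, 0.0, 0.3],
}

def embed(text):
    """Average the vectors of known words to get one vector per document."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def fit_centroids(texts, labels):
    """Supervised step: average the document vectors of each labeled class."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(embed(text))
    return {lab: [sum(c) / len(v) for c in zip(*v)] for lab, v in by_label.items()}

def predict(text, centroids):
    """Assign the label whose centroid is most similar to the document vector."""
    vec = embed(text)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

# Hypothetical labeled examples.
centroids = fit_centroids(
    ["record profit", "profit growth", "loss decline", "lawsuit loss"],
    ["positive", "positive", "negative", "negative"],
)
```

Because the classifier works on vectors rather than word identities, "record growth" can be scored even though that exact pairing never appears in the labeled examples.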

4 . 6

Readability…

▪ 2008: Fog index kick-started this area in accounting
  ▪ Li (2008 JAE), a bunch of other papers
▪ 2014: File length captures complexity more accurately…
  ▪ Loughran and McDonald (2014 JF; 2016 JAR)
▪ 2017: Bog index
  ▪ Bonsall, Leone, Miller and Rennekamp (2017 JAE); Bonsall and Miller (2017 RAST)
  ▪ Subject to Loughran and McDonald’s critique of general purpose dictionaries

“[…] The use of word lists derived outside the context of business applications has the potential for errors that are not simply noise and can serve as unintended measures of industry, firm, or time period. The computational linguistics literature has long emphasized the importance of developing categorization procedures in the context of the problem being studied (e.g., Berelson [1952]).” – LM 2016
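For reference, the Fog index combines average sentence length with the share of "complex" words (three or more syllables): 0.4 × (words per sentence + 100 × complex words / words). A minimal sketch follows; the syllable counter is a rough vowel-group heuristic chosen for brevity, so its scores will differ somewhat from those of published implementations.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text: str) -> float:
    """Fog = 0.4 * (average sentence length + percent of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)

# Made-up example sentences.
simple = "We made money. Sales went up."
dense = ("Consolidated profitability deteriorated considerably following "
         "unanticipated regulatory developments.")
```

Note that by this formula any multisyllable business term counts as "complex" regardless of how familiar it is to financial readers, which relates to the critique that led to file length being proposed as an alternative.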

4 . 7

Readability…

“There are problems with the face validity of the accounting readability studies. Accounting researchers have, in general, assumed that the readability formulas measure not only readability but also understandability. Indeed, readability and understandability have often been used interchangeably, the assumption being they are synonymous. However, although these concepts are related, they do differ.” – Jones and Shoemaker (1994 JAL)

The literature has not addressed this.

4 . 8

Going forward

5 . 1

Going forward

▪ There are a lot of cool methods
▪ There are a lot of cool measures
▪ It is easy to get wrapped up in the technical details and achievements and lose sight of the purpose for using them

Tailor-made measures

▪ Tone dispersion (Allee and DeAngelis 2015 JAR)
▪ Disclosure “Scriptability” (Allee, DeAngelis, and Moon 2018 JAR)
▪ Content differences
  ▪ DeAngelis (2014 dissertation) – unique content
  ▪ Crowley (2016 working) – extent of content differences
▪ Industry classification
  ▪ Hoberg and Phillips (6 papers)

5 . 2

Recommended coding libraries

Python:
▪ Text parsing: spaCy
▪ LDA: gensim
▪ Sentiment: NLTK, spaCy, or hand code using Counter() (super fast)
▪ Classifiers: scikit-learn or keras or pytorch
▪ Other measures: NLTK, spaCy

R:
▪ LDA: stm + quanteda + convert(dfm, to='stm')
▪ Sentiment (dictionary): tidytext
▪ Classifiers: caret, e1071, or keras
▪ Other measures: Using Python is likely better

▪ Also useful: MALLET, Stanford NLP

5 . 3

References

▪ Allee, Kristian D., and Matthew D. DeAngelis. 2015. “The Structure of Voluntary Disclosure Narratives: Evidence from Tone Dispersion.” Journal of Accounting Research 53 (2): 241–74. https://doi.org/10.1111/1475-679X.12072
▪ Allee, Kristian D., Matthew D. DeAngelis, and James R. Moon. 2018. “Disclosure ‘Scriptability.’” Journal of Accounting Research 56 (2): 363–430. https://doi.org/10.1111/1475-679X.12203
▪ Antweiler, Werner, and Murray Z. Frank. 2005. “Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards.” The Journal of Finance 59 (3): 1259–94. https://doi.org/10.1111/j.1540-6261.2004.00662.x
▪ Bao, Y., and A. Datta. 2014. “Simultaneously Discovering and Quantifying Risk Types from Textual Disclosures.” Management Science 60 (6): 1371–1391.
▪ Bird, Andrew, Stephen A. Karolyi, and Paul Ma. 2018. “Strategic Disclosure Misclassification.” SSRN Scholarly Paper ID 2778805. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=2778805
▪ Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March): 993–1022.
▪ Bonsall, Samuel B., Andrew J. Leone, Brian P. Miller, and Kristina Rennekamp. 2017. “A Plain English Measure of Financial Reporting Readability.” Journal of Accounting and Economics 63 (2): 329–57. https://doi.org/10.1016/j.jacceco.2017.03.002
▪ Bonsall, Samuel B., and Brian P. Miller. 2017. “The Impact of Narrative Disclosure Readability on Bond Ratings and the Cost of Debt.” Review of Accounting Studies 22 (2): 608–43. https://doi.org/10.1007/s11142-017-9388-0
▪ Botosan, C. A. 1997. “Disclosure level and the cost of equity capital.” The Accounting Review 72 (3): 323–349.
▪ Brown, Nerissa C., Richard Crowley, and W. Brooke Elliott. 2018. “What Are You Saying? Using Topic to Detect Financial Misreporting.” SSRN Scholarly Paper ID 2803733. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=2803733
▪ Cole, C. J., and C. L. Jones. “Management Discussion and Analysis: A Review and Implications for Future Research.” Journal of Accounting Literature 24: 135–174.

5 . 4

References

▪ Crowley, Richard. 2016. “Disclosure through Multiple Disclosure Channels.” Dissertation, UIUC. http://hdl.handle.net/2142/90526
▪ Crowley, Richard, Wenli Huang, and Hai Lu. 2018. “Discretionary Disclosure on Twitter.” SSRN Scholarly Paper ID 3105847. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=3105847
▪ Crowley, Richard, Wenli Huang, Hai Lu, and Wei Luo. “Do Firms Tweet Social Responsibility? Evidence from Machine Learning Analysis.” Working paper, Singapore Management University.
▪ Das, Sanjiv R., and Mike Y. Chen. 2007. “Yahoo! For Amazon: Sentiment Extraction from Small Talk on the Web.” Management Science 53 (9): 1375–88. https://doi.org/10.1287/mnsc.1070.0704
▪ Dorrell, J. T., and N. S. Darsey. 1991. “An analysis of the readability and style of letters to stockholders.” Journal of Technical Writing and Communication 21: 73–83.
▪ Dyer, Travis, Mark Lang, and Lorien Stice-Lawrence. 2017. “The Evolution of 10-K Textual Disclosure: Evidence from Latent Dirichlet Allocation.” Journal of Accounting and Economics 64 (2): 221–45. https://doi.org/10.1016/j.jacceco.2017.07.002
▪ Hoberg, Gerard, and Craig Lewis. 2017. “Do Fraudulent Firms Produce Abnormal Disclosure?” Journal of Corporate Finance 43 (April): 58–85. https://doi.org/10.1016/j.jcorpfin.2016.12.007
▪ Huang, Allen H., Reuven Lehavy, Amy Y. Zang, and Rong Zheng. 2018. “Analyst Information Discovery and Interpretation Roles: A Topic Modeling Approach.” Management Science 64 (6): 2833–55. https://doi.org/10.1287/mnsc.2017.2751
▪ Huang, Allen H., Amy Y. Zang, and Rong Zheng. 2014. “Evidence on the Information Content of Text in Analyst Reports.” Accounting Review 89 (6): 2151–80. https://doi.org/10.2308/accr-50833
▪ Jones, M. J., and P. A. Shoemaker. “Accounting Narratives: A Review of Empirical Studies of Content and Readability.” Journal of Accounting Literature 13: 142.

5 . 5

References

▪ Li, Feng. 2008. “Annual Report Readability, Current Earnings, and Earnings Persistence.” Journal of Accounting and Economics, Economic Consequences of Alternative Accounting Standards and Regulation, 45 (2): 221–47. https://doi.org/10.1016/j.jacceco.2008.02.003
▪ Li, Feng. 2010a. “The Information Content of Forward-Looking Statements in Corporate Filings—A Naïve Bayesian Machine Learning Approach.” Journal of Accounting Research 48 (5): 1049–1102. https://doi.org/10.1111/j.1475-679X.2010.00382.x
▪ Li, Feng. 2010b.
▪ Lin, Chenghua, and Yulan He. 2009. “Joint Sentiment/Topic Model for Sentiment Analysis.” In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 375–384. CIKM ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1645953.1646003
▪ Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
▪ Loughran, Tim, and Bill McDonald. 2014. “Measuring Readability in Financial Disclosures.” The Journal of Finance 69 (4): 1643–71. https://doi.org/10.1111/jofi.12162
▪ Loughran, Tim, and Bill McDonald. 2016. “Textual Analysis in Accounting and Finance: A Survey.” Journal of Accounting Research 54 (4): 1187–1230. https://doi.org/10.1111/1475-679X.12123
▪ Sprenger, Timm O., Andranik Tumasjan, Philipp G. Sandner, and Isabell M. Welpe. 2014. “Tweets and Trades: The Information Content of Stock Microblogs.” European Financial Management 20 (5): 926–57. https://doi.org/10.1111/j.1468-036X.2013.12007.x
▪ Tetlock, Paul C. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” The Journal of Finance 62 (3): 1139–68. https://doi.org/10.1111/j.1540-6261.2007.01232.x
▪ Tetlock, Paul C., Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. “More Than Words: Quantifying Language to Measure Firms’ Fundamentals.” The Journal of Finance 63 (3): 1437–67. https://doi.org/10.1111/j.1540-6261.2008.01362.x

5 . 6