Text analytics, NLP, and accounting research
2018 November 23
Dr. Richard M. Crowley
http://rmc.link/

What is text analytics?
▪ This could be as simple as extracting specific words/phrases/sentences
▪ This could be as complex as extracting latent (hidden) patterns or structures within text
  ▪ Sentiment
  ▪ Content
  ▪ Emotion
  ▪ Writer characteristics
  ▪ …
▪ Often called text mining (in CS) or textual analysis (in accounting)
Extracting meaningful information from text

What is NLP then?
▪ NLP stands for Natural Language Processing
▪ It is a very diverse field within CS
  ▪ Grammar/linguistics
  ▪ Conversations
  ▪ Conversion from audio, images
  ▪ Translation
  ▪ Dictation
  ▪ Generation
NLP is a field devoted to understanding how to understand human language

Why discuss NLP?
Consider the following situation:
You have a collection of 1 million sentences, and you want to know which are accounting relevant
▪ Without NLP:
  1. Hire an RA/Mechanical Turk army…
  2. Use a dictionary: Words/phrases like “earnings,” “profitability,” “net income” are likely to be in the sentences
▪ With NLP:
  1. We could associate sentences with outside data to build a classifier (supervised approach)
  2. We could ask an algorithm to learn the structure of all sentences, and then extract the useful part ex post (unsupervised)
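
The dictionary option in the situation above can be sketched in a few lines of Python. This is only a minimal illustration: the term list holds just the three example phrases from the slide, and the sample sentences and the helper name `is_accounting_relevant` are made up for the example.

```python
import re

# Hypothetical keyword list, seeded with the three phrases from the slide.
ACCOUNTING_TERMS = ["earnings", "profitability", "net income"]

# One case-insensitive pattern; \b stops "earnings" matching inside "learnings".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in ACCOUNTING_TERMS) + r")\b",
    re.IGNORECASE,
)


def is_accounting_relevant(sentence: str) -> bool:
    """Return True if the sentence contains any dictionary term."""
    return PATTERN.search(sentence) is not None


# Made-up sentences standing in for the 1 million in the scenario.
sentences = [
    "Net income rose on higher sales.",
    "The CEO enjoys sailing on weekends.",
    "Profitability remains our main concern.",
]
relevant = [s for s in sentences if is_accounting_relevant(s)]
```

A filter like this is fast and transparent, but it only finds sentences that use the listed terms, which is the limitation the supervised and unsupervised options try to get around.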

Data that has been studied
▪ Firms
  ▪ Letters to shareholders
  ▪ Annual and quarterly reports
  ▪ 8-Ks
  ▪ Press releases
  ▪ Conference calls
  ▪ Firm websites
  ▪ Twitter posts
▪ Investors
  ▪ Blog posts
  ▪ Social media posts
▪ Intermediaries
  ▪ Newspaper articles
  ▪ Analyst reports
▪ Government
  ▪ FASB exposure drafts
  ▪ Comment letters
  ▪ IRS code
  ▪ Court cases

1980s and 1990s
Manual content analysis
▪ Read through “small” amounts of text, record selected aspects
Indexes
▪ Ex.: Botosan (1997 TAR): For firms with low analyst following, more disclosure ⇒ Lower cost of equity
  ▪ Index of 35 aspects of 10-Ks
▪ Covered in detail in Cole and Jones (2004 JAL)
▪ Most use small samples
▪ Often use select industries
Readability
▪ Automated starting with Dorrell and Darsey (1991 JTWC) in accounting…
▪ At least 32 studies on this in the 1980s and early 1990s per Jones and Shoemaker (1994 JAL)
  ▪ Only 2 use full docs
  ▪ Only 2 use >100 docs

2000s
Automation
▪ With computer power increasing, two new avenues opened:
  1. Do the same methods as before, at scale
    ▪ Ex.: Li (2008 JAE): Readability, but with many documents instead of <100
  2. Implement statistical techniques (often for tone/sentiment)
    ▪ For instance, sentiment classification with Naïve Bayes, SVM, or other statistical classifiers
      ▪ Antweiler and Frank (2005 JF)
      ▪ Das and Chen (2007 MS)
      ▪ Li (2010 JAR)

Early 2010s
Dictionaries take the helm
▪ Loughran and McDonald (2011 JF) points out the misspecification of using dictionaries from other contexts
  ▪ Also provides sets of positive, negative, modal strong/weak, litigious, and constraining words (available online)
▪ Subsequent work by the authors provides a critique:
Applying financial dictionaries “without modification to other media such as earnings calls and social media is likely to be problematic” (Loughran and McDonald 2016)
▪ A lot of papers ignore this critique, and are still at risk of misspecification

Late 2010s to present
Fragmentation and new methods
▪ Loughran and McDonald dictionaries frequently used
▪ Bog index is perhaps a new entrant in the Fog index vs document length debate
▪ LDA methods first published in Accounting/Finance in Bao and Datta (2014 MS), with a handful of other papers following suit.
▪ More methods on the horizon

Going forward
A lot of choices
▪ Why? Because accounting research has been behind the times, but seems to be catching up
▪ We can incorporate more than a year’s worth of innovation in NLP each year…

Content classification: Latent Dirichlet Allocation
▪ Latent Dirichlet Allocation, from Blei, Ng, and Jordan (2003)
▪ One of the most popular methods under the field of topic modeling
▪ LDA is a Bayesian method of assessing the content of a document
▪ LDA assumes there is a set of topics in each document, and that this set follows a Dirichlet prior for each document
  ▪ Words within topics also have a Dirichlet prior
More details from the creator
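
To make the two Dirichlet priors concrete, the sketch below simulates the generative story that LDA assumes: each topic is a Dirichlet-distributed mix of words, and each document is a Dirichlet-distributed mix of topics. It only generates data and does not fit a model. The vocabulary, the prior values `alpha` and `eta`, and the function names are illustrative assumptions, not taken from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, vocab, n_topics, alpha=0.5, eta=0.5, seed=0):
    """Simulate documents under LDA's assumptions (symmetric priors)."""
    rng = random.Random(seed)
    # Each topic is a distribution over words, drawn from a Dirichlet prior.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    docs, mixtures = [], []
    for _ in range(n_docs):
        # Each document gets its own topic mixture, also from a Dirichlet prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # pick a word from it
            words.append(w)
        docs.append(words)
        mixtures.append(theta)
    return docs, mixtures, topics


# Hypothetical six-word vocabulary of stemmed tokens.
vocab = ["loan", "bank", "clinic", "fda", "tenant", "hotel"]
docs, mixtures, topics = generate_corpus(n_docs=3, doc_len=8, vocab=vocab, n_topics=2)
```

Fitting LDA runs this story in reverse: given only the words in `docs`, it estimates `topics` and `mixtures`.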

Example: LDA, 10 topics, all 2014 10-Ks
# Topics generated using R's stm library
labelTopics(topics)
## Topic 1 Top Words:
##   Highest Prob: properti, oper, million, decemb, compani, interest, leas
##   FREX: ffo, efih, efh, tenant, hotel, casino, guc
##   Lift: aliansc, baluma, change-of-ownership, crj700s, directly-reimburs, escena, hhmk
##   Score: reit, hotel, game, ffo, tenant, casino, efih
## Topic 2 Top Words:
##   Highest Prob: compani, stock, share, common, financi, director, offic
##   FREX: prc, asher, shaanxi, wfoe, eit, hubei, yew
##   Lift: aagc, abramowitz, accello, akash, alix, alkam, almati
##   Score: prc, compani, penni, stock, share, rmb, director
## Topic 3 Top Words:
##   Highest Prob: product, develop, compani, clinic, market, includ, approv
##   FREX: dose, preclin, nda, vaccin, oncolog, anda, fdas
##   Lift: 1064nm, 12-001hr, 25-gaug, 2ml, 3shape, 503b, 600mg
##   Score: clinic, fda, preclin, dose, patent, nda, product
## Topic 4 Top Words:
##   Highest Prob: invest, fund, manag, market, asset, trade, interest
##   FREX: uscf, nfa, unl, uga, mlai, bno, dno
##   Lift: a-1t, aion, apx-endex, bessey, bolduc, broyhil, buran
##   Score: uscf, fhlbank, rmbs, uga, invest, mlai, ung
## Topic 5 Top Words:
##   Highest Prob: servic, report, file, program, provid, network, requir
##   FREX: echostar, fcc, fccs, telesat, ilec, starz, retransmiss
##   Lift: 1100-n, 2-usb, 2011-c1, 2012-ccre4, 2013-c9, aastra, accreditor
##   Score: entergi, fcc, echostar, wireless, broadcast, video, cabl
## Topic 6 Top Words:
##   Highest Prob: loan, bank, compani, financi, decemb, million, interest
##   FREX: nonaccru, oreo, tdrs, bancorp, fdic, charge-off, alll

Papers using LDA (or variants)
▪ Bao and Datta (2014 MS): Quantifying risk disclosures
▪ Bird, Karolyi, and Ma (2018 working): 8-K categorization mismatches
▪ Brown, Crowley, and Elliott (2018 working):
  ▪ Content based fraud detection
▪ Crowley (2016 working):
  ▪ Mismatch between 10-K and website disclosures
▪ Crowley, Huang, and Lu (2018 working):
  ▪ Financial disclosure on Twitter
▪ Crowley, Huang, Lu, and Luo (2018 working):
  ▪ CSR disclosure on Twitter
▪ Dyer, Lang, and Stice-Lawrence (2017 JAE):
  ▪ Changes in 10-Ks over time
▪ Hoberg and Lewis (2017 JCF): AAERs and 10-K MD&A content, ex post
▪ Huang, Lehavy, Zang, and Zheng (2018 MS):
  ▪ Analyst interpretation of conference calls

Sentiment: Varied
▪ General purpose word lists like Harvard IV
  ▪ Tetlock (2007 JF)
  ▪ Tetlock, Saar-Tsechansky, and Macskassy (2008 JF)
▪ Many recent papers use 10-K specific dictionaries from Loughran and McDonald (2011 JF)
▪ Some work using Naïve Bayes and similar
  ▪ Antweiler and Frank (2005 JF); Das and Chen (2007 MS); Li (2010 JAR); Huang, Zang and Zheng (2014 TAR); Sprenger, Tumasjan, Sandner, and Welpe (2014 EFM)
▪ Some work using SVM
  ▪ Antweiler and Frank (2005 JF)
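
As a rough illustration of the Naïve Bayes approach listed above, the sketch below trains a small multinomial classifier with add-one smoothing. It is hand-rolled for transparency and is not the implementation used in any of the cited papers. The training sentences and labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)          # log prior
            for w in doc.lower().split():
                if w in self.vocab:               # skip words never seen in training
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training data, for illustration only.
docs = [
    "revenue growth exceeded expectations",
    "strong demand and record profit",
    "impairment loss and weak demand",
    "litigation risk and declining revenue",
]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict("record profit growth")
```

Unlike a fixed dictionary, the word weights here are learned from labeled examples, so the quality of the labels drives the quality of the measure.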

Sentiment: What is used in practice (CS side)
“The prevalence of polysemes in English – words that have multiple meanings – makes an absolute mapping of specific words into financial sentiment impossible.” – Loughran and McDonald (2011)
▪ Embedding methods can make this possible
  ▪ Embeddings abstract away from words, converting words/phrases/sentences/paragraphs/documents to high dimensional vectors
    ▪ Used in Brown, Crowley, and Elliott (2018 working) (word level)
    ▪ Used in WIP by Crowley, Huang, and Lu (sentence/document level)
  ▪ Embeddings are passed to a supervised classifier to learn sentiment
▪ Other methods include weak supervision
  ▪ Such as the Joint Sentiment Topic model by Lin and He (2009 ACM) (used in Crowley (2016 working))

Readability…
▪ 2008: Fog index kick-started this area in accounting
  ▪ Li (2008 JAE), a bunch of other papers
▪ 2014: File length captures complexity more accurately…
  ▪ Loughran and McDonald (2014 JF; 2016 JAR)
▪ 2017: Bog index
  ▪ Bonsall, Leone, Miller and Rennekamp (2017 JAE); Bonsall and Miller (2017 RAST)
  ▪ Subject to Loughran and McDonald’s critique of general purpose dictionaries
“[…] The use of word lists derived outside the context of business applications has the potential for errors that are not simply noise and can serve as unintended measures of industry, firm, or time period. The computational linguistics literature has long emphasized the importance of developing categorization procedures in the context of the problem being studied (e.g., Berelson [1952]).” – LM 2016
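
For reference, the Fog index on this slide is usually computed as 0.4 × (average words per sentence + percentage of words with three or more syllables). The sketch below is a simplified version. The vowel-group syllable counter and the punctuation-based sentence splitter are crude stand-ins chosen for brevity, so scores will differ from those of a careful implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text: str) -> float:
    """Fog = 0.4 * (words per sentence + percent of words with 3+ syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)


simple = fog_index("The cat sat. The dog ran.")
dense = fog_index(
    "The firm reported earnings. Management anticipates considerable uncertainty."
)
```

Both inputs to the formula are surface features of the text, which is one reason the measure is contested as a proxy for how hard a filing is to understand.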

Readability…
“There are problems with the face validity of the accounting readability studies. Accounting researchers have, in general, assumed that the readability formulas measure not only readability but also understandability. Indeed, readability and understandability have often been used interchangeably, the assumption being they are synonymous. However, although these concepts are related, they do differ.” – Jones and Shoemaker (1994 JAL)
The literature has not addressed this.

Going forward
▪ There are a lot of cool methods
▪ There are a lot of cool measures
▪ It is easy to get wrapped up in the technical details and achievements and lose sight of the purpose for using them
Tailor-made measures
▪ Tone dispersion (Allee and DeAngelis 2015 JAR)
▪ Disclosure “Scriptability” (Allee, DeAngelis, and Moon 2018 JAR)
▪ Content differences
  ▪ DeAngelis (2014 dissertation) – unique content
  ▪ Crowley (2016 working) – extent of content differences
▪ Industry classification
  ▪ Hoberg and Phillips (6 papers)

Recommended coding libraries
Python:
▪ Text parsing: spaCy
▪ LDA: gensim
▪ Sentiment: NLTK, spaCy, or hand code using Counter() (super fast)
▪ Classifiers: scikit-learn or keras or pytorch
▪ Other measures: NLTK, spaCy
R:
▪ LDA: stm + quanteda + convert(dfm, to='stm')
▪ Sentiment (dictionary): tidytext
▪ Classifiers: caret, e1071, or keras
▪ Other measures: Using Python is likely better
▪ Also useful: MALLET, Stanford NLP
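
The "hand code using Counter()" option for sentiment might look like the sketch below. The two word sets are tiny hypothetical placeholders, not the Loughran and McDonald lists. The net-tone formula (positive minus negative, scaled by total words) is one common choice and is assumed here rather than prescribed by the slides.

```python
import re
from collections import Counter

# Hypothetical mini word lists for illustration only; real work would load a
# full finance-specific dictionary instead.
POSITIVE = {"gain", "improve", "strong"}
NEGATIVE = {"loss", "decline", "impairment"}


def tone(text: str) -> float:
    """Net tone: (positive - negative) / total words, or 0.0 for empty text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    pos = sum(counts[w] for w in POSITIVE)   # a Counter returns 0 for absent words
    neg = sum(counts[w] for w in NEGATIVE)
    return (pos - neg) / total


score = tone("Strong sales offset the impairment loss and the decline in margins.")
```

Counting every token once and then looking up only the dictionary words keeps the cost per document roughly linear in its length, which matters when scoring every 10-K in a year.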

References
▪ Allee, Kristian D., and Matthew D. DeAngelis. 2015. “The Structure of Voluntary Disclosure Narratives: Evidence from Tone Dispersion.” Journal of Accounting Research 53 (2): 241–74. https://doi.org/10.1111/1475-679X.12072
▪ Allee, Kristian D., Matthew D. DeAngelis, and James R. Moon. 2018. “Disclosure ‘Scriptability.’” Journal of Accounting Research 56 (2): 363–430. https://doi.org/10.1111/1475-679X.12203
▪ Antweiler, Werner, and Murray Z. Frank. 2005. “Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards.” The Journal of Finance 59 (3): 1259–94. https://doi.org/10.1111/j.1540-6261.2004.00662.x
▪ Bao, Y., and A. Datta. 2014. “Simultaneously Discovering and Quantifying Risk Types from Textual Disclosures.” Management Science 60 (6): 1371–1391.
▪ Bird, Andrew, Stephen A. Karolyi, and Paul Ma. 2018. “Strategic Disclosure Misclassification.” SSRN Scholarly Paper ID 2778805. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=2778805
▪ Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March): 993–1022.
▪ Bonsall, Samuel B., Andrew J. Leone, Brian P. Miller, and Kristina Rennekamp. 2017. “A Plain English Measure of Financial Reporting Readability.” Journal of Accounting and Economics 63 (2): 329–57. https://doi.org/10.1016/j.jacceco.2017.03.002
▪ Bonsall, Samuel B., and Brian P. Miller. 2017. “The Impact of Narrative Disclosure Readability on Bond Ratings and the Cost of Debt.” Review of Accounting Studies 22 (2): 608–43. http://dx.doi.org.libproxy.smu.edu.sg/10.1007/s11142-017-9388-0
▪ Botosan, C. A. 1997. “Disclosure level and the cost of equity capital.” The Accounting Review 72 (3): 323–349.
▪ Brown, Nerissa C., Richard Crowley, and W. Brooke Elliott. 2018. “What Are You Saying? Using Topic to Detect Financial Misreporting.” SSRN Scholarly Paper ID 2803733. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=2803733
▪ Cole, C. J., and C. L. Jones. “Management Discussion and Analysis: A Review and Implications for Future Research.” Journal of Accounting Literature 24: 135–174.
▪ Crowley, Richard. 2016. “Disclosure through Multiple Disclosure Channels.” Dissertation, UIUC. http://hdl.handle.net/2142/90526
▪ Crowley, Richard, Wenli Huang, and Hai Lu. 2018. “Discretionary Disclosure on Twitter.” SSRN Scholarly Paper ID 3105847. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=3105847
▪ Crowley, Richard, Wenli Huang, Hai Lu, and Wei Luo. “Do Firms Tweet Social Responsibility? Evidence from Machine Learning Analysis.” Working paper, Singapore Management University.
▪ Das, Sanjiv R., and Mike Y. Chen. 2007. “Yahoo! For Amazon: Sentiment Extraction from Small Talk on the Web.” Management Science 53 (9): 1375–88. https://doi.org/10.1287/mnsc.1070.0704
▪ Dorrell, J. T., and N. S. Darsey. 1991. “An analysis of the readability and style of letters to stockholders.” Journal of Technical Writing and Communication 21: 73–83.
▪ Dyer, Travis, Mark Lang, and Lorien Stice-Lawrence. 2017. “The Evolution of 10-K Textual Disclosure: Evidence from Latent Dirichlet Allocation.” Journal of Accounting and Economics 64 (2): 221–45. https://doi.org/10.1016/j.jacceco.2017.07.002
▪ Hoberg, Gerard, and Craig Lewis. 2017. “Do Fraudulent Firms Produce Abnormal Disclosure?” Journal of Corporate Finance 43 (April): 58–85. https://doi.org/10.1016/j.jcorpfin.2016.12.007
▪ Huang, Allen H., Reuven Lehavy, Amy Y. Zang, and Rong Zheng. 2018. “Analyst Information Discovery and Interpretation Roles: A Topic Modeling Approach.” Management Science 64 (6): 2833–55. https://doi.org/10.1287/mnsc.2017.2751
▪ Huang, Allen H., Amy Y. Zang, and Rong Zheng. 2014. “Evidence on the Information Content of Text in Analyst Reports.” The Accounting Review 89 (6): 2151–80. https://doi.org/10.2308/accr-50833
▪ Jones, M. J., and P. A. Shoemaker. 1994. “Accounting Narratives: A Review of Empirical Studies of Content and Readability.” Journal of Accounting Literature 13: 142.
▪ Li, Feng. 2008. “Annual Report Readability, Current Earnings, and Earnings Persistence.” Journal of Accounting and Economics, Economic Consequences of Alternative Accounting Standards and Regulation, 45 (2): 221–47. https://doi.org/10.1016/j.jacceco.2008.02.003
▪ Li, Feng. 2010a. “The Information Content of Forward-Looking Statements in Corporate Filings—A Naïve Bayesian Machine Learning Approach.” Journal of Accounting Research 48 (5): 1049–1102. https://doi.org/10.1111/j.1475-679X.2010.00382.x
▪ Li, Feng. 2010b.
▪ Lin, Chenghua, and Yulan He. 2009. “Joint Sentiment/Topic Model for Sentiment Analysis.” In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 375–384. CIKM ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1645953.1646003
▪ Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
▪ Loughran, Tim, and Bill McDonald. 2014. “Measuring Readability in Financial Disclosures.” The Journal of Finance 69 (4): 1643–71. https://doi.org/10.1111/jofi.12162
▪ Loughran, Tim, and Bill McDonald. 2016. “Textual Analysis in Accounting and Finance: A Survey.” Journal of Accounting Research 54 (4): 1187–1230. https://doi.org/10.1111/1475-679X.12123
▪ Sprenger, Timm O., Andranik Tumasjan, Philipp G. Sandner, and Isabell M. Welpe. 2014. “Tweets and Trades: The Information Content of Stock Microblogs.” European Financial Management 20 (5): 926–57. https://doi.org/10.1111/j.1468-036X.2013.12007.x
▪ Tetlock, Paul C. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” The Journal of Finance 62 (3): 1139–68. https://doi.org/10.1111/j.1540-6261.2007.01232.x
▪ Tetlock, Paul C., Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. “More Than Words: Quantifying Language to Measure Firms’ Fundamentals.” The Journal of Finance 63 (3): 1437–67. https://doi.org/10.1111/j.1540-6261.2008.01362.x