Corpus Driven Deception Detection in Financial Text
Abstract
Fraud costs the UK as much as £193 billion a year. In a bid to find new ways of detecting deception, a study
was undertaken to examine linguistic features in financial text. Financial fraud prediction models still rely
heavily on numerical data, yet financial text is far more abundant and could hold clues to deceit. The linguistic
correlates of deception are uncovered using a variety of tools and techniques in a newly built corpus of financial
text (6.3 million words). Chief among them is a new way to gauge readability, since reduced readability is a key
ploy used to obfuscate and thereby deceive. The extracted features are put through feature selection routines
and then passed to supervised machine learning classifiers. K-means clustering is also run over the features.
All results show that financial text can aid deception detection.
Results
Figure 7: Dimensionality reduction techniques applied to matrices
Figure 8: The Classifiers
Figure 9: Best Performing Classifiers on Document Representation Schemes – Peer Set
Figure 10: Best Performing Classifiers on Document Representation Schemes – Matched Pair
Figure 11: Application of word embeddings to historical corpora, showing how the meanings of some words have shifted [14]
Figure 12: Proposed application to check the quality of a narrative
Figure 13: A quality narrative should cover the above three areas (Companies Act 2013)
Figure 14: Capture of the compositional nature of the text
References
[1] B. Rutherford, "Genre Analysis of Corporate Annual Report Narratives," Journal of Business Communication, vol. 42, no. 4, 2005.
[2] E. Fitzpatrick and J. Bachenko, "Building a Data Collection for Deception Research," in Proceedings of the EACL 2012 Workshop on Computational Approaches to Deception Detection, France, 2012, pp. 31-38.
[3] Z. Rezaee and R. Riley, Financial Statement Fraud: Prevention and Detection, 2nd ed., Wiley, 2009.
[4] D. M. Merkl-Davies and N. Brennan, "Discretionary disclosure strategies in corporate narratives: incremental information or impression management," Journal of Accounting Literature, vol. 26, pp. 116-196, 2007.
[5] E. W. T. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun, "The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature," Decision Support Systems, vol. 50, pp. 559-569, 2011.
[6] A. Gepp and K. Kumar, "Predicting Financial Distress: A Comparison of Survival Analysis and Decision Tree Techniques," Procedia Computer Science, vol. 54, pp. 396-404, 2015.
[7] T. McEnery and A. Wilson, Corpus Linguistics: An Introduction, Edinburgh University Press, 2005.
[8] L. Zhou, J. Burgoon, J. Nunamaker, and D. Twitchell, "Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication," Group Decision and Negotiation, vol. 13, pp. 81-106, 2004.
[9] M. L. Newman, J. W. Pennebaker, D. S. Berry, and J. M. Richards, "Lying Words: Predicting Deception From Linguistic Styles," Personality and Social Psychology Bulletin, vol. 29, pp. 665-675, 2003.
[10] J. T. Hancock, L. E. Curry, S. Goorha, and M. Woodworth, "On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication," Discourse Processes, vol. 45, pp. 1-23, 2007.
[11] D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai, Automated Evaluation of Text and Discourse with Coh-Metrix, Cambridge University Press, New York, 2014.
[12] R. Bloomfield, "The 'Incomplete Revelation Hypothesis' and Financial Reporting," Accounting Horizons, vol. 16, p. 233, 2002.
[13] N. D. Duran, C. Hall, P. M. McCarthy, and D. S. McNamara, "The linguistic correlates of conversational deception: Comparing natural language processing technologies," Applied Psycholinguistics, vol. 31, pp. 439-462, 2010.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," Advances in Neural Information Processing Systems, 2013.
Introduction
In the arena of high-stakes deception, "there is a scarcity of ground truth verification for data collected from real
world sources" [2]. To address this short supply of "ground truth", a new corpus of 6.3 million words is constructed
that contains narratives of firms known to have committed financial statement fraud (FSF), juxtaposed with
narratives (10-K/Annual Reports) from similar non-fraud firms. FSF is known to be the costliest type of fraud [3].
The study focuses on FSF to demonstrate that deception can be detected from text. For the first time, an
extensive range of linguistic features is extracted from the corpus to look for significant discrepancies between
the narratives of fraud and non-fraud firms.
The literature search reveals that in financial reporting there are two main competing theories that seek to
explain management motivations and effects on stakeholders. This is depicted in figure 1. The literature search
also unveils the techniques used to build predictive fraud models, shown in figure 2.
Figure 1: Theoretical underpinnings in financial reporting [4]
Figure 2: The categories of financial fraud and 'intelligent' detection techniques [5]
Deceptive linguistic cue | Effect in text | Authors | Theory/Method
Word quantity | Could be higher or lower in deceptive text; generally higher quantities of verbs, nouns, modifiers and group references. | Zhou [8] | Interpersonal Deception Theory
Pronoun use | First-person singular pronouns less frequent; greater use of third-person pronouns. Known as a distancing strategy (reducing ownership of a statement). | Newman et al. [9], Zhou [8] | Interpersonal Deception Theory
Emotion words | Slightly more negativity; greater emotional expressiveness. | Newman et al. [9] | Leakage Theory
Markers of cognitive complexity | Fewer exclusive terms (e.g. but, except), negations (e.g. no, never), causation words (e.g. because, effect) and motion verbs, all of which require a deceiver to be more specific and precise. Repetitive phrasing and less diverse language are more marked in the language of liars. Also more mention of cognitive operations such as thinking, admitting, hoping. | Newman et al. [9], Hancock et al. [10] | Reality Monitoring
Modal verbs | Verbs such as would, should and could lower the level of commitment to facts. | Hancock et al. [10] | Interpersonal Deception Theory
Verbal non-immediacy | "Any indication through lexical choices, syntax and phraseology of separation, non-identity, attenuation of directness, or change in the intensity of interaction between the communicator and his referents." Results in the use of more informal, non-immediate language. | Zhou [8] | Interpersonal Deception Theory
Uncertainty | "Impenetrable sentence structures (syntactic ambiguity) or use of evasive and ambiguous language that introduces uncertainty (semantic ambiguity). Modifiers, modal verbs (e.g. should, could), and generalizing or 'allness' terms (e.g. everybody) increase uncertainty." | Zhou [8] | Interpersonal Deception Theory
Half-truths and equivocations | Increased inclusion of adjectives and adverbs that qualify the meaning of statements; sentences less cohesive and coherent, thereby reducing readability. | McNamara et al. [11], Bloomfield [12] | Management Obfuscation Hypothesis
Passive voice | Increased use; another distancing strategy (subject and object switched around). | Duran et al. [13] | Interpersonal Deception Theory
Relevance manipulations | Irrelevant details. | Duran et al. [13], Bloomfield [12] | Management Obfuscation Hypothesis
Sense-based words | Increased use of words such as see, touch, listen. | Hancock et al. [10] | Reality Monitoring
Table 1: Linguistic cues to deception
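Several of the cues in table 1 can be operationalised as simple per-1000-word rates. The sketch below is a minimal illustration using small hypothetical word lists; the study itself relied on richer resources such as LIWC 2015 and the Zhou [8] ratios.

```python
import re

# Hypothetical mini-lexicons for three cues from Table 1 (the real
# study used LIWC 2015 categories and Zhou's LBC ratios instead).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
THIRD_PERSON = {"he", "she", "they", "them", "his", "her", "their"}
MODALS = {"would", "should", "could", "might", "may"}
NEGATIONS = {"no", "not", "never", "none"}

def cue_rates(text):
    """Return per-1000-word rates for a few deception cues."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty text
    rate = lambda vocab: 1000 * sum(t in vocab for t in tokens) / n
    return {
        "first_person": rate(FIRST_PERSON),  # lower in deceptive text
        "third_person": rate(THIRD_PERSON),  # higher (distancing)
        "modals": rate(MODALS),              # lower commitment to facts
        "negations": rate(NEGATIONS),        # fewer in deceptive text
    }

sample = "They said results could improve. We never promised that."
print(cue_rates(sample))
```

Rates rather than raw counts keep narratives of different lengths comparable.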
Methodology
102 annual report/10-K narratives of fraud firms were gathered (from the 2 years before the FSF was uncovered).
These are matched by 306 similar narratives of non-fraud firms from the same time period and industry sector.
Figure 4 illustrates the processing undertaken to determine significant differences in language use. Once the
reports have been cleaned of all formatting, Corpus Linguistics methodology is applied: frequency inspection,
keyword analysis, collocations and concordances. According to McEnery [7], corpus data is a rich source of
linguistic utterances and a powerful tool rooted in the scientific method, open to objective verification of results.
A study by Rutherford [1] uncovered meaningful words used in financial reports. Multidimensional scaling was
applied to the keyword counts (those deemed significant by Principal Component Analysis, PCA). The results in
figure 6 show clear differences in the usage of these words between the two report categories.
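The keyword analysis step can be sketched with the standard log-likelihood (keyness) statistic used in Corpus Linguistics, which compares a word's frequency in the fraud sub-corpus against the non-fraud one. The counts below are hypothetical, purely for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning log-likelihood keyness score for one word across two
    corpora (e.g. fraud vs non-fraud narratives)."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:  # 0 * log(0) terms contribute nothing
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Hypothetical counts: a word appearing 120 times in a 50,000-word
# fraud sub-corpus and 80 times in a 50,000-word non-fraud one.
score = log_likelihood(120, 50_000, 80, 50_000)
print(round(score, 2))  # scores above ~6.63 are significant at p < 0.01
```

Words whose scores exceed the critical value become candidate keywords for the downstream PCA and multidimensional scaling.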
The reports were mapped to 10 different document representation schemes, depicted in figure 5. The linguistic
features for each scheme were extracted using Python and R, with tools such as Coh-Metrix providing more
in-depth features to gauge readability. Other resources used were LIWC 2015, word lists specific to the financial
domain, and WordNet, which were used to pick up further prominent words and synonyms. Ratios derived by
Zhou [8] (LBC), known to mark deceptive text, were also extracted. In a bid to capture 'what was said' as
opposed to 'how', Latent Dirichlet Allocation was also run to pick up 'topics'.
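As an illustration of one plausible representation scheme, the sketch below builds sparse tf-idf vectors over a toy corpus; it assumes tf-idf is among the 10 schemes in figure 5, which the text does not spell out.

```python
import math
from collections import Counter

# A toy corpus standing in for the report narratives.
docs = [
    "revenue growth exceeded expectations".split(),
    "revenue declined amid restructuring charges".split(),
    "growth strategy drove revenue higher".split(),
]

def tfidf(docs):
    """Map each document to a sparse tf-idf vector (a dict)."""
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

vecs = tfidf(docs)
# 'revenue' occurs in every document, so its idf (and weight) is zero.
print(vecs[0]["revenue"])
```

Words common to both fraud and non-fraud reports are down-weighted, letting discriminative vocabulary dominate the term-document matrices built later.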
The extracted features were then put through feature selection routines, as depicted in figure 7. The reduced
term-document matrices were set up using matched-pair and peer-set composition. The former matched one
fraud report to one non-fraud report; the latter matched one fraud report to three non-fraud reports (unbalanced,
to reflect a more realistic distribution). All matrices were put through 5 classifiers, shown in figure 8, with results
in figure 9.
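The five classifiers actually used are listed in figure 8; as a minimal stand-in, the sketch below trains a nearest-centroid classifier on toy two-feature vectors (hypothetical modal-verb rate and readability score), mirroring the matched-pair setup of one fraud report per non-fraud report.

```python
# Nearest-centroid classification: assign each report to the class
# whose mean feature vector (centroid) is closest.
def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(X, y):
    """Return one centroid per class label."""
    return {label: centroid([x for x, l in zip(X, y) if l == label])
            for label in set(y)}

def predict(model, x):
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda label: dist(model[label], x))

# Hypothetical feature vectors: (modal-verb rate, readability score).
X = [(8.0, 40.0), (7.5, 42.0), (3.0, 60.0), (2.5, 62.0)]
y = ["fraud", "fraud", "non-fraud", "non-fraud"]
model = train(X, y)
print(predict(model, (7.0, 45.0)))  # → fraud
```

With the 1:3 peer-set composition, the same code runs unchanged; only the class balance of `y` shifts, which is what drives the higher false-negative rate noted in the conclusion.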
Figure 4: Framework to decipher financial reports for deception detection
Figure 5: Document representation schemes
Figure 6: PCA selected Rutherford keywords
Conclusion
Based on the feature extraction (figure 5) and feature selection (figure 7) methods used and the 5 classifier
models built, the results (figures 9 and 10) indicate that the narratives of fraud and non-fraud firms are
discernibly distinct. The classifiers perform better when the data is balanced, as the unbalanced set yields a
higher degree of false negatives. The linguistic cues in table 1 therefore hold potential for building more accurate
fraud prediction models.
Future Work
To further highlight the differences in word usage between fraud and non-fraud firms, word embeddings can be
constructed using neural network models such as Word2vec, as illustrated in figure 11. These infer the meaning
of words from the contexts in which they appear.
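Word2vec learns dense vectors with a neural network; the stdlib sketch below illustrates the same underlying principle with raw co-occurrence counts and cosine similarity: words used in similar contexts end up with similar vectors. The sentences are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

sentences = [
    "profits rose strongly this year".split(),
    "revenues rose strongly this year".split(),
    "auditors resigned suddenly last month".split(),
]

# Count co-occurrences within a +/-2 word window.
cooc = defaultdict(Counter)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                cooc[w][sent[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda x: sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# 'profits' and 'revenues' share contexts; 'auditors' does not.
print(cosine(cooc["profits"], cooc["revenues"]))
print(cosine(cooc["profits"], cooc["auditors"]))
```

A trained Word2vec model replaces the sparse counts with low-dimensional dense vectors, which is what makes the historical meaning shifts in figure 11 [14] visible.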
The UK government updated the Companies Act in 2013 in a bid to boost transparency and improve the
information content of annual reports. Figure 13 highlights the areas where companies need to improve. From a
computational perspective, new ways need to be devised to determine the quality of the financial narrative
disclosed. Figure 14 shows that capturing the compositional meaning of sentences would help in that direction.
Figure 15 shows that lightweight ontologies and parsers are some of the practical steps that can be undertaken
to determine the quality of disclosure.
Corpus Driven Deception Detection in Financial Text
Saliha Minhas, Prof Amir Hussain