Corpus Driven Deception Detection in Financial Text



Abstract

The cost of fraud to the UK climbs as high as £193 billion a year. In a bid to find new ways to detect deception, a study was undertaken to examine linguistic features in financial text. There is still an over-reliance on building financial fraud prediction models from numerical data; however, financial text is far more abundant and could hold clues to deceit. The linguistic correlates of deception are uncovered using a variety of tools and techniques in a newly built corpus of financial text (6.3 million words). Primary amongst them is a new way to gauge readability, as reduced readability is a key ploy used to obfuscate and thereby deceive. The extracted features are put through feature selection routines and then passed to supervised machine-learning classifiers. K-means clustering is also executed over the features. All results show that financial text can aid deception detection.
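The poster does not specify its new readability measure. As a hedged illustration only, a standard baseline such as the Flesch reading ease score can be computed from sentence, word and syllable counts (the syllable counter here is a crude heuristic, not the study's method):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Split into sentences and words with simple regexes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch formula; lower scores mean harder text, which the
    # Management Obfuscation Hypothesis links to attempts to deceive.
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

hard = "Notwithstanding the aforementioned considerations, remediation initiatives continue."
easy = "We lost money this year. Sales fell."
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))  # simpler text scores higher
```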

Results

Figure 7: Dimensionality reduction techniques applied to matrices

Figure 8: The classifiers

Figure 9: Best performing classifiers on document representation schemes – peer set

Figure 10: Best performing classifiers on document representation schemes – matched pair

Figure 11: Application of word embeddings to historical corpora, showing how the meanings of some words have shifted [14]

Figure 12: Proposed application to check narrative quality

Figure 13: A quality narrative should cover the above three areas (Companies Act 2013)

Figure 14: Capture of the compositional nature of the text

References

[1] B. Rutherford, "Genre Analysis of Corporate Annual Report Narratives," Journal of Business Communication, vol. 42, no. 4, 2005.
[2] E. Fitzpatrick and J. Bachenko, "Building a Data Collection for Deception Research," in Proceedings of the EACL 2012 Workshop on Computational Approaches to Deception Detection, France, 2012, pp. 31-38.
[3] Z. Rezaee and R. Riley, Financial Statement Fraud: Prevention and Detection, 2nd ed. Wiley, 2009.
[4] D. M. Merkl-Davies and N. Brennan, "Discretionary disclosure strategies in corporate narratives: incremental information or impression management," Journal of Accounting Literature, vol. 26, pp. 116-196, 2007.
[5] E. W. T. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun, "The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature," Decision Support Systems, vol. 50, pp. 559-569, 2011.
[6] A. Gepp and K. Kumar, "Predicting Financial Distress: A Comparison of Survival Analysis and Decision Tree Techniques," Procedia Computer Science, vol. 54, pp. 396-404, 2015.
[7] T. McEnery and A. Wilson, Corpus Linguistics: An Introduction. Edinburgh University Press, 2005.
[8] L. Zhou, J. Burgoon, J. Nunamaker, and D. Twitchell, "Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication," Group Decision and Negotiation, vol. 13, pp. 81-106, 2004.
[9] M. L. Newman, J. W. Pennebaker, D. S. Berry, and J. M. Richards, "Lying Words: Predicting Deception From Linguistic Styles," Personality and Social Psychology Bulletin, vol. 29, pp. 665-675, 2003.
[10] J. T. Hancock, L. E. Curry, S. Goorha, and M. Woodworth, "On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication," Discourse Processes, vol. 45, pp. 1-23, 2007.
[11] D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai, Automated Evaluation of Text and Discourse with Coh-Metrix. New York: Cambridge University Press, 2014.
[12] R. Bloomfield, "The 'Incomplete Revelation Hypothesis' and Financial Reporting," Accounting Horizons, vol. 16, p. 233, 2002.
[13] N. D. Duran, C. Hall, P. M. McCarthy, and D. S. McNamara, "The linguistic correlates of conversational deception: Comparing natural language processing technologies," Applied Psycholinguistics, vol. 31, pp. 439-462, 2010.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," Advances in Neural Information Processing Systems, 2013.

Introduction

In the arena of high-stakes deception, “there is a scarcity of ground truth verification for data collected from real world sources” [2]. To address this short supply of “ground truth”, a new corpus of 6.3 million words is constructed that contains narratives of firms known to have committed financial statement fraud (FSF), juxtaposed with narratives (10-K/annual reports) from similar non-fraud firms. FSF is known to be the costliest type of fraud [3]. The study focuses on FSF to demonstrate that deception can be detected from text. For the first time, an extensive range of linguistic features is extracted from the corpus to look for significant discrepancies between the narratives of fraud and non-fraud firms.

The literature search reveals that in financial reporting there are two main competing theories that seek to explain management motivations and their effects on stakeholders, as depicted in figure 1. The literature search also unveils the techniques used to build predictive fraud models, shown in figure 2.

Figure 1: Theoretical underpinnings in financial reporting [4]

Figure 2: The categories of financial fraud and ‘intelligent’ detection techniques [5]

Deceptive linguistic cue | Effect in text | Authors | Theory/Method
Word quantity | Could be higher or lower in deceptive text; generally, higher quantities of verbs, nouns, modifiers and group references. | Zhou [8] | Interpersonal Deception Theory
Pronoun use | First-person singular pronouns less frequent, greater use of third-person pronouns. These are known as distancing strategies (reducing ownership of a statement). | Newman et al [9], Zhou [8] | Interpersonal Deception Theory
Emotion words | Slightly more negativity, greater emotional expressiveness. | Newman et al [9] | Leakage Theory
Markers of cognitive complexity | Fewer exclusive terms (e.g. but, except), negations (e.g. no, never), causation words (e.g. because, effect) and motion verbs, all of which require a deceiver to be more specific and precise. Repetitive phrasing and less diverse language are more marked in the language of liars, as is more mention of cognitive operations such as thinking, admitting, hoping. | Newman et al [9], Hancock et al [10] | Reality Monitoring
Modal verbs | Verbs such as would, should and could lower the level of commitment to facts. | Hancock et al [10] | Interpersonal Deception Theory
Verbal non-immediacy | “Any indication through lexical choices, syntax and phraseology of separation, non-identity, attenuation of directness, or change in the intensity of interaction between the communicator and his referents.” Results in the use of more informal, non-immediate language. | Zhou [8] | Interpersonal Deception Theory
Uncertainty | “Impenetrable sentence structures (syntactic ambiguity) or use of evasive and ambiguous language that introduces uncertainty (semantic ambiguity). Modifiers, modal verbs (e.g., should, could), and generalizing or ‘allness’ terms (e.g. ‘everybody’) increases uncertainty.” | Zhou [8] | Interpersonal Deception Theory
Half-truths and equivocations | Increased inclusion of adjectives and adverbs that qualify the meaning of statements. Sentences less cohesive and coherent, thereby reducing readability. | McNamara et al [11], Bloomfield [12] | Management Obfuscation Hypothesis
Passive voice | Increase in use; another distancing strategy that switches subject and object around. | Duran et al [13] | Interpersonal Deception Theory
Relevance manipulations | Irrelevant details. | Duran et al [13], Bloomfield [12] | Management Obfuscation Hypothesis
Sense-based words | Increased use of words such as see, touch, listen. | Hancock et al [10] | Reality Monitoring

Table 1: Linguistic cues to deception
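The cues in table 1 can be operationalised as length-normalised counts. A minimal sketch, assuming small illustrative word lists (the study itself drew on LIWC 2015 and Zhou's LBC ratios, which are far richer):

```python
import re

# Small illustrative cue lexicons; hypothetical stand-ins for the
# study's actual resources.
CUES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "modal_verbs": {"would", "should", "could", "may", "might"},
    "exclusive_terms": {"but", "except", "without", "although"},
    "negations": {"no", "not", "never", "none"},
}

def cue_counts(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    # Normalise by document length so reports of different sizes
    # are comparable.
    return {name: sum(t in words for t in tokens) / total
            for name, words in CUES.items()}

sample = "We could not have known, but results may improve without my intervention."
print(cue_counts(sample))
```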

Methodology

102 annual report/10-K narratives of fraud firms were gathered (from two years before the FSF was uncovered). These were matched by 306 similar narratives of non-fraud firms from the same time period and industry sector. Figure 4 illustrates the processing undertaken to determine significant differences in language use. Once the reports had been cleaned of all formatting, corpus linguistics methodology was applied. This entails frequency inspection, keyword analysis, collocations and concordances. According to McEnery [7], corpus data is a rich source of linguistic utterances and a powerful tool rooted in the scientific method, open to objective verification of results. A study conducted by Rutherford [1] uncovered meaningful words used in financial reports. Multidimensional scaling was applied to the keyword counts (those deemed significant by Principal Component Analysis, PCA). The results shown in figure 6 reveal clear differences in the usage of these words between the two report categories.
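The PCA-then-MDS step above can be sketched with scikit-learn on a synthetic keyword-count matrix (the real study used counts of Rutherford keywords from the 6.3M-word corpus; all numbers here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Synthetic keyword-count matrix: 20 reports x 8 keywords.
counts = rng.poisson(lam=5.0, size=(20, 8)).astype(float)

# PCA identifies the keyword combinations carrying most variance,
# a common way to judge which keywords are significant.
pca = PCA(n_components=2)
scores = pca.fit_transform(counts)

# MDS embeds the reports in 2-D so fraud/non-fraud groupings
# can be inspected visually, as in figure 6.
mds = MDS(n_components=2, random_state=0)
layout = mds.fit_transform(counts)

print(scores.shape, layout.shape)  # (20, 2) (20, 2)
```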

The reports were mapped to 10 different document representation schemes, depicted in figure 5. The linguistic features for each scheme were extracted using Python and R, with tools such as Coh-Metrix used to extract more in-depth features that gauge readability. Other resources used were LIWC 2015, word lists built specifically for the financial domain, and WordNet, which helped pick up further prominent words and synonyms. Ratios derived by Zhou [8] (linguistics-based cues, LBC) that are known to mark deceptive text were also extracted. In a bid to capture ‘what was said’ as opposed to ‘how’, Latent Dirichlet Allocation was also executed to pick up ‘topics’.
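The topic-modelling step can be sketched with scikit-learn's Latent Dirichlet Allocation; the four toy documents below are hypothetical stand-ins for report narratives, not corpus excerpts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for annual-report narratives.
docs = [
    "revenue growth strong sales performance markets",
    "impairment restatement losses weakened liquidity",
    "sales markets growth customers revenue",
    "losses liquidity restatement going concern",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Two topics: LDA groups documents by 'what was said' rather
# than by stylistic cues.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(doc_topics.shape)  # (4, 2); each row is a topic distribution
```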

The extracted features were then put through feature selection routines, as depicted in figure 7. Reduced term-document matrices were set up using matched-pair and peer-set compositions. The former paired one fraud report with one non-fraud report; the latter paired one fraud report with three non-fraud reports (unbalanced, to reflect a more realistic distribution). All matrices were put through five classifiers, shown in figure 8, with results in figure 9.
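The selection-then-classification pipeline can be sketched as follows. The data is synthetic, the 1:3 labels mimic the peer-set composition, and chi-squared selection with logistic regression stands in for whichever of the routines and five classifiers the study actually used:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Synthetic term-document matrix: 40 reports x 50 terms; labels follow
# the peer-set ratio of 1 fraud report to 3 non-fraud reports.
X = rng.poisson(2.0, size=(40, 50)).astype(float)
y = np.array(([1] + [0] * 3) * 10)
# Make a few terms genuinely more frequent in the fraud class.
X[y == 1, :5] += 3

# chi2 keeps the k terms most associated with the label before the
# classifier ever sees the matrix.
clf = make_pipeline(SelectKBest(chi2, k=10), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```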

Figure 4: Framework to decipher financial reports for deception detection

Figure 5: Document representation schemes

Figure 6: PCA-selected Rutherford keywords

Conclusion

Based on the feature extraction (figure 5) and feature selection (figure 7) methods used and the five classifier models built, the results (figures 9 and 10) indicate that the narratives of fraud and non-fraud firms are discernibly distinct. The classifiers perform better when the data is balanced, as the unbalanced set yields a higher degree of false negatives. The linguistic cues mentioned (table 1) therefore hold potential for building more accurate fraud prediction models.
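The false-negative concern can be made concrete: from a confusion matrix, the false-negative rate is FN / (FN + TP). A small sketch with hypothetical predictions (not the study's actual results) on a peer-set-style 1:3 split:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and predictions; label 1 = fraud.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# False-negative rate: the share of fraud reports the classifier missed.
fnr = fn / (fn + tp)
print(fnr)  # 0.5 here: one of the two fraud reports was missed
```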

Future Work

To further highlight the differences in word usage between fraud and non-fraud firms, word embeddings can be constructed using neural network models such as word2vec, as illustrated in figure 11. These derive the meaning of words from an analysis of their contexts.
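Word2vec itself is trained with a neural toolkit; as a dependency-light illustration of the same idea (deriving meaning from co-occurring context words), a count-based embedding via positive PMI and SVD can be sketched. This is an analogue of word2vec, not the algorithm itself, and the three-sentence corpus is purely illustrative:

```python
import numpy as np

# Tiny illustrative corpus; real embeddings need millions of tokens.
sentences = [
    "revenue rose sharply this quarter".split(),
    "revenue fell sharply this quarter".split(),
    "profit rose strongly this year".split(),
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
C = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - 2), min(len(s), i + 3)):
            if i != j:
                C[idx[w], idx[s[j]]] += 1

# Positive pointwise mutual information, then SVD: a standard
# count-based route to dense word vectors.
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pmi = np.log((C / total + 1e-12) / (pw * pw.T))
ppmi = np.maximum(pmi, 0)
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :2] * S[:2]  # one 2-D vector per vocabulary word

print(embeddings.shape)
```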

The UK government updated the Companies Act in 2013 in a bid to boost transparency and improve the information content of annual reports. Figure 13 highlights the areas where companies need to make improvements.

From a computational perspective, new ways need to be devised to determine the quality of the financial narrative disclosed. Figure 14 shows that capturing the compositional meaning of sentences would help in that direction. Figure 15 shows that lightweight ontologies and parsers are among the practical steps that can be undertaken to determine disclosure quality.

Corpus Driven Deception Detection in Financial Text

Saliha Minhas, Prof Amir Hussain