Confused but convinced:
Article complexity and publishing success over time
by
Marc Berninger1, Florian Kiesel#, 2, Dirk Schiereck3, Eduard Gaar4
April 18, 2018
____________________________________
* We thank Ruediger Fahlenbrach, Campbell Harvey, Timothy Loughran, Ian Marsh, Christoph
Merkelbach, and Steven Ongena for their helpful comments and suggestions on earlier drafts of this
paper. We are also grateful to Nico Gärtner, Felicia Müller, Till Nedderhut, and Wadan Wardak for
their valuable research assistance. All remaining errors are our own.
# Corresponding author.
1 Department of Business Administration, Economics and Law, Technische Universität Darmstadt,
64289 Darmstadt, Germany, phone +49 6151 16 - 24344, email: [email protected]
2 Department of Business Administration, Economics and Law, Technische Universität Darmstadt,
64289 Darmstadt, Germany, phone +49 6151 16 - 24294, email: [email protected]
3 Department of Business Administration, Economics and Law, Technische Universität Darmstadt,
64289 Darmstadt, Germany, phone +49 6151 16 - 24291, email: [email protected]
4 Department of Business Administration, Economics and Law, Technische Universität Darmstadt,
64289 Darmstadt, Germany, phone +49 6151 16 - 24292, email: [email protected]
Abstract
Using a sample of 4,160 articles published in the three leading finance journals between 2000
and 2016, we study the readability of academic publications and its impact on the number of
citations a paper receives. We use latent Dirichlet allocation (LDA) to cluster the papers into topics
and find that article complexity, measured by length and the Flesch-Kincaid index, increases
over time, while the distribution of article topics does not change considerably. The results reveal
a positive correlation between complexity and the number of times a paper is cited. This result does
not hold for 8,236 articles published in other finance journals. These patterns suggest that
scientists gain recognition by unintelligible writing.
Keywords: Academic writing, Readability, Textual Analysis, Finance Literature, Scientometrics,
Latent Dirichlet allocation
JEL classification: D83, G00, G10
1 Introduction
“Publish or perish” is a well-known principle in academia (e.g. Zivney & Bertin, 1992). Success for
researchers in finance is measured in two steps. First, for most researchers, publishing in a leading
journal, or at least a decent one, is a great achievement, as it is important for future promotion
and tenure decisions (Fishe, 1998). The second step relates to the attention a paper receives after
it is published. It is very common to examine citations when evaluating promotions or journal
standings. More citations indicate that the contribution provides novel insights in a relevant field
and that other researchers acknowledge the author’s findings. Moreover, other researchers use the
paper’s original contribution as a basis for their own research. Even the most famous scientists
depended on their predecessors, and citations are required to incorporate previous work. Aside from
citing properly in order to maintain academic honesty and to minimize the risk of plagiarism, there
are further reasons to cite articles. Authors can improve the quality of their own paper by using
citations. Citations (1) serve as a proxy for the attention to detail and the accuracy of the paper
and the author, (2) help to specify problems, and (3) help to define and use the jargon of the
respective field. Moreover, a longer reference list (4) shows the scientific knowledge of an author
and (5) increases the scholar’s credibility.
These five reasons indicate that authors have a personal interest in citing more papers, especially
well-known articles. However, citing a paper does not guarantee that it has been completely
understood or even read by the citing researcher. To ensure that academic papers are easy to
understand, authors are instructed in the peer-review process or on the journal’s website to use
plain English. Zimmerman (1989) suggests that researchers may increase the probability of being
cited by writing for the largest possible audience. If not only academic experts but also doctoral
students and practitioners understand the contribution, the probability of citations increases. Thus,
editors and authors should have the same intention in preparing articles.
In this paper, we analyze the readability of finance articles and its impact on the number of
citations a paper receives. We ask two questions: how readable are journal articles in the finance
discipline, and how does readability affect the number of citations an article receives?
To address this, we construct a comprehensive dataset consisting of 4,160 journal articles published
between 2000 and 2016 in the three leading finance journals: Journal of Financial Economics (JFE),
Review of Financial Studies (RFS), and The Journal of Finance (JOF). We
consider three proxy variables as measures of the complexity of an article: the Flesch-Kincaid
score, the number of words, and the number of complex words (words with more than two syllables,
excluding finance “jargon” words). We control for journal-, article-, and author-level
characteristics and apply latent Dirichlet allocation (LDA) to build topic clusters and to control
for topic effects. In addition, we construct a control sample consisting of 8,236 articles from
other academic finance journals.
First, we find that readability worsens over time. In our investigation period, the average
Flesch-Kincaid index increased from 16 in the years 2000 to 2004 to a peak of 18 in
the years 2011 to 2015. The Flesch-Kincaid value corresponds to a grade level of the
US educational system. This indicates that, within 15 years, the average reader needed two more
years of formal education to clearly understand a paper. We also find that the average finance
article increased by 33% in length, from 12,000 words to more than 16,000 words, and the
percentage of complex words increased by 49%. We find that this shift is not related to a change
in the topics addressed by the papers, as these are very stable over time. Second, and most
interestingly, the results of our empirical analyses show that articles which are complex to read
receive the highest number of citations. We find that all measures of readability show a strong
inverse relation between readability and citations. Combining these facts, we show that not only
has complexity increased over time, but also that the likelihood of a citation increases when an
article is harder to read. We compare the results with a control sample of 8,236 finance articles
from non-top-three journals and find that these articles are not cited more frequently if they are
more difficult to read.
These patterns suggest several insights. From an author’s perspective, the results lead to the
conclusion that clear communication is not always beneficial with regard to the (expected) number
of citations. This lends support to the “Doctor Fox phenomenon”. Doctor Fox, a trained
actor, gave a nonsensical lecture to specialists in the field and was able to bluff the audience with
his presentation style (Naftulin, Ware, & Donnelly, 1973). Our study provides first evidence that
the way a paper is written matters as well.
From a scholar’s perspective, researchers tend to cite more complex articles. A bibliography with
complex articles suggests that the scholar is widely read and well informed. Understanding how
readability affects the number of times an article is cited is therefore highly
relevant to the discipline. From a practitioner’s perspective, the value of academic journals as a
source of new finance knowledge declines when the success of an article is measured by
the number of citations. Especially for practitioners, a clear understanding of the different
characteristics underlying scholarly work in finance is relevant because it informs them about the
work’s relevance to the decision areas they face and the extent to which academic journals in finance
may provide good sources of new finance knowledge in the future. A paper’s impact
on society is much greater if it is easily understandable for the largest possible audience and
not only for experts. However, the trend we show in our analysis and the impact of readability
on the number of citations indicate that the gap between academics and practitioners is widening
over time.
2 Data and measures
2.1 Sample selection
Focusing on the finance discipline, we first sampled the three leading journals: JFE, RFS, and JOF.
We began by inventorying all articles published between 2000 and 2016 and collected a total of
4,922 articles for these three journals. After a very restrictive screening process that retained
only machine-readable, full-text original research articles with a valid digital object identifier
(DOI), we ended up with a total of 4,160 articles. To check whether our results remain
robust for other journals, we construct a control sample of 8,236 articles published in eleven
further finance journals. More details on the data are provided in the appendix.
After collecting the article and author information from the journals’ websites, we
extracted the citations and metadata from the CrossRef website. The metadata contain, among
other things, the number of references in the article and the year of publication. Finally, we
matched the data with the Google Scholar database.
2.2 Measures
Readability measures. In order to measure the readability of an article, we use three different
measures. The first measure is the Flesch-Kincaid formula (Kincaid et al., 1975). It is based on two
components, the number of syllables per word and the number of words per sentence. The
Flesch-Kincaid score is calculated as follows:
FleschKincaid = (11.8 × syllables per word) + (0.39 × words per sentence) - 15.59 (1)
The result is a number that corresponds to a grade level of the US educational system. The
variable FleschKincaidText is the Flesch-Kincaid score of the full-text paper, whereas
FleschKincaidAbstract is the Flesch-Kincaid score of the paper’s abstract. Two other popular
readability measures, the Fog Index (Gunning, 1952) and the Flesch Reading Ease Score, use
similar components, and all three measures are therefore highly correlated with each other.1 We
therefore use other measures for our robustness checks. Li (2008) and You and Zhang (2009) argue that
the length of a document, measured as the number of words, is a good proxy for its readability. The
information-extraction cost of longer documents is higher than that of shorter documents; longer
documents seem more deterring and more difficult to read (Li, 2008). Following this approach, we
use the number of words (omitting the average sentence length and syllable count) as an alternative
readability measure. The third measure combines the two previous measures. It builds on the Fog
Index, which uses the number of “complex words”, defined as words with more than two syllables, and
does not incorporate the average number of syllables per word. Loughran and McDonald (2014) criticize
this measure for business documents, as these frequently contain multisyllabic words to describe
operations. We control for this financial terminology and exclude all financial words appearing in
Campbell Harvey’s Hypertextual Finance Glossary.
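As an illustration, the Flesch-Kincaid score in Equation (1) and the jargon-adjusted count of complex words could be computed along the following lines. This is a minimal sketch and not necessarily the implementation used for the results in this paper: the syllable counter is a rough vowel-group heuristic, and the function names and the `jargon` argument (a stand-in for single-word glossary entries) are assumptions made for illustration.

```python
import re

WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_syllables(word):
    """Heuristic syllable count: vowel groups, minus a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid(text):
    """Flesch-Kincaid grade level as in Equation (1)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_PATTERN.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (11.8 * syllables / len(words)
            + 0.39 * len(words) / len(sentences)
            - 15.59)


def count_complex_words(text, jargon=frozenset()):
    """Words with more than two syllables, excluding (assumed) jargon terms."""
    words = WORD_PATTERN.findall(text)
    return sum(
        1 for w in words
        if count_syllables(w) > 2 and w.lower() not in jargon
    )
```

A dictionary-based syllable counter would be more accurate than the heuristic above; the sketch only shows how the two components of the formula enter the score and how glossary terms are removed from the complex-word count.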
Citation variables. We want to measure the impact of readability on the number of citations a
paper receives. We use two databases to obtain the number of citations, CrossRef and Google
Scholar. Articles published in the early years of the sample have had more time to accumulate
citations than recent publications. Therefore, we follow the approach of Chan, Chan, Tong, and Zhang
(2016) and normalize the citations of each article by the number of years since the article was
published.
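The normalization can be sketched as follows. The function name is hypothetical, and counting elapsed time in whole calendar years up to a measurement year of 2017 is an assumption based on the citation counts having been collected at the beginning of 2017.

```python
def citations_per_year(total_citations, publication_year, measurement_year=2017):
    """Citations divided by the number of years since publication.

    Assumption: elapsed time is counted in whole calendar years.
    """
    years = measurement_year - publication_year
    if years <= 0:
        raise ValueError("article must be published before the measurement year")
    return total_citations / years
```

Under this assumption, an article from 2000 with 170 citations and an article from 2016 with 5 citations would be assigned 10 and 5 citations per year, respectively.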
Controls. In order to control for other potential determinants of the number of citations a
paper receives, we furthermore analyzed the articles on a full-text basis to generate additional
article-related control variables. We run several linear regressions and incorporate these
article- and author-level control variables. The definitions of the variables can be found in the
appendix.
1 These measures are the most widely used in the literature (e.g. Bauerly, Johnson, & Singh, 2006; Hartley, Pennebaker, & Fox,
2003; Hartley, Sotto, & Pennebaker, 2002; Loveland, Whatley, Ray, & Reidy, 1973; Stremersch, Verniers, & Verhoef,
2007), but also in other contexts. Moreover, the former SEC Chairman Christopher Cox suggested that the Flesch-Kincaid
model can be used to gauge compliance with the SEC’s plain English initiatives (SEC, 2007).
3 Results
3.1 Descriptive statistics
Table 1 provides the summary statistics of the key variables we employ. The average number of
citations varies considerably between our two citation providers, CrossRef and Google Scholar. We
find that, on average, an article published in a top-three finance journal receives 7.3 citations
per year as measured by CrossRef, while Google Scholar reports 24.8 citations per article and year.
Google Scholar additionally indexes non-journal material2 and therefore reports a higher number of
citations. The average Flesch-Kincaid level is 17.5, indicating that readers need approximately 17.5
years of formal education to understand the text on first reading. The abstract shows a
Flesch-Kincaid level of only 15.7, showing that the abstract is easier to read than the full text of
the paper. This suggests that writing an abstract may differ from writing the full text of the paper.
Abstracts present the main contribution of the article in a very condensed way. Due to the word limit
of the abstract, authors have to describe their work precisely, and readability improves. Another
measure of article complexity is the number of words in the article (Li, 2008; You & Zhang,
2009). On average, an article consists of 14,217 words, of which 2,945 are
complex words.
We use LDA to identify notable changes in the topics of finance journals over our sample period and
to estimate the probability that an article belongs to each of our topics. Table 1 shows that, on
average, the probability of the best-fitting topic is 44%. The higher the percentage, the more
specific the article is. A lower percentage indicates that the article cannot be clearly assigned to
one topic and that its content is more diverse. The topic fit varies between 14% and 97%. In
addition, the descriptive statistics reveal that the average title of a finance paper consists of
8.5 words. The shortest titles have only one word (“Anomalies” and “Tipping”, both published in RFS,
and “Comovement”, published in JFE) and the longest has 23 (“A first look at the accuracy of the
CRSP Mutual Fund Database and a comparison of the CRSP and Morningstar Mutual Fund Databases”).
However, most article titles vary between 6 and 11 words. Moreover, published finance papers on
average contain 42 references, 6.6 tables, and 2.7 figures. Medoff (2003) shows that a lead article
receives a significantly higher number of citations in the first 5 years after publication. If an
article is listed first in the respective issue or volume, we consider it a lead article; 3.7% of
our articles are listed as the first article in the respective journal. Most journals have stopped
publishing issues and have started publishing volumes with more articles. In combination with the
lower dissemination of paper-based editions, this suggests that the importance of lead articles
will decrease over time.
2 Google does not explicitly state what is covered in its citations, but preprints/postprints from ArXiv and RePec,
conference proceedings, technical reports, books, and dissertations in addition to electronic journal articles from
traditional publishers have been found.
Table 1 also provides details on author-related variables. We use the Financial Times Global MBA
Ranking, define the ten highest-ranked business schools as top business schools, and control for
whether at least one author is affiliated with a top business school. We find that approximately 29%
of the articles published in the leading journals are written by researchers from a top business
school or are coauthored with academics from these faculties. Most papers are written with
coauthors; on average, 2 scholars publish a paper together. Joint work allows productivity gains due
to the division of labor (McDowell & Melvin, 1983). Karolyi (2011) and Hollis (2001) show that, for
individual economists, more coauthors are associated with higher research quality and a greater
frequency of publications.
In order to put our findings into perspective, we recalculate the variables for our control sample.
This sample includes 8,236 articles published in other finance journals. Table 1 additionally
summarizes the statistics of the control sample. Not surprisingly, these articles are cited
significantly less. They receive only 2 citations per year according to the CrossRef database and
7.2 citations per year according to the Google Scholar database, a difference of 5.3 and 17.6
citations per year, respectively. Interestingly, we find that these papers are easier to read than
their top-journal peers. The Flesch-Kincaid level is 16 on average, 1.4 points lower than for
articles published in the top journals. In addition, these papers are significantly shorter, as
measured by the total number of words, and contain fewer complex words. The article-related variables
seem to be similar. However, the top fit is 1.7% higher, suggesting that articles published in other
journals, such as the Journal of Financial Intermediation or the Journal of Financial Markets, are
more specialized. We also find that researchers from top business schools publish significantly less
in these journals: 29.2% of articles in the top journals are written with coauthors from a top
business school, but our control sample of 8,236 articles contains only 3.3% articles with coauthors
from premier business schools.
In summary, we find that articles published in the top-three journals show significantly different
characteristics than their peers. Less surprisingly, the number of citations is significantly higher
for top journals. More surprisingly, however, these articles are harder to read.
3.2 Readability over time
The average Flesch-Kincaid level for top-journal articles is 17.5. Panel A of Figure 1
illustrates the Flesch-Kincaid level for each year of our sample period. At the beginning, the
average Flesch-Kincaid level is 16 and even decreases in the following years. Since 2002, however,
the average Flesch-Kincaid level has been increasing, with a peak in 2015. In recent years, the
Flesch-Kincaid index is around 18 on average. The year 2016 is the only exception, with a
Flesch-Kincaid index of 16.5. In contrast, we find that the readability of the abstracts is
remarkably stable over time. The average Flesch-Kincaid index of the abstracts varies between 15.3
(year 2001) and 16.0 (year 2010). This result is surprising and suggests that authors care more
about clear communication within the abstract, while the readability of the full text is less
important to them. Panel A of Figure 1 also shows the average Flesch-Kincaid level of our control
sample. The index varies between 14.6 and 16.4 and is on average more than one level below that of
the top journals. Moreover, the figure illustrates that there is no sharp increase in the
readability measure for these articles, although it also rises over time. The readability of the
abstracts in the control sample shows values similar to those of the top-three sample. The variation
in readability is rather small, and the readability of the abstracts varies less over time.
In addition to the classical readability measure, we report the number of words in finance articles
over time. Panel B of Figure 1 shows the average number of words in the full text. We find a
steady increase over the past years. The average number of words was around 11,818 in the year 2000
but increased to over 16,617. The number of words is determined without
counting the words in the reference list or the appendix. It seems that articles need to be longer as
referees or editors require more evidence of statistical significance or further robustness tests.
The figure also illustrates that the number of complex words is increasing over time, rising from
1,287 words to 1,918 words in the year 2016. The findings indicate that finance journals are becoming
harder to read. The information-processing cost increases, and scholars have an incentive to
minimize this cost by not reading or just skimming the full text of the paper.
Again, we also show the development for our control sample. The number of words is almost
2,000 lower in all years. We also find an increasing trend in the number of words, but
the highest annual average in our control sample, 11,996 words, is still below
the average number of words for the top journals in the year 2000.
In summary, we find strong evidence that finance papers become more complex over time. This
is true for top finance publications as well as for articles in non-top journals. However, the
reading difficulty and the length of top-journal articles increased significantly over time. In the
next section, we analyze the topics of the articles and examine whether trends in the topics can be
found that could explain the differences in readability.
3.3 Topics over time
One critical issue in determining the readability and the number of citations a paper receives is the
topic of the article. In order to determine the topic of an article, two simple approaches can be
adopted: (1) reading the articles, or (2) using the keywords or the Journal of Economic Literature
(JEL) classification. However, both approaches have drawbacks. Reading the articles is
time-consuming and the outcome depends heavily on personal opinion. Keywords or JEL classifications
are not always provided, and keywords may be chosen to be catchy rather than descriptive. In order
to obtain an objective and replicable measure of the content of a paper, we apply an LDA procedure.
LDA was developed by Blei et al. (2003) and is a statistical, unsupervised Bayesian machine-learning
method that belongs to the family of topic models. In general, topic modelling describes methods
for the analysis of a large quantity of unlabelled data, such as a corpus of text documents. LDA is
generally considered the simplest topic model (Blei, 2012). It uses the probability of words
co-occurring within documents to identify sets of latent topics and their associated words and is
conceptually similar to factor analysis, except that the model produces topics instead of factors.
A topic thereby represents a probability distribution over a bundle of words (Reed, 2012). The
method is particularly advantageous for the analysis of qualitative aspects, in comparison to
quantitative approaches (Bellstam, Bhagat and Cookson, 2017). A further advantage over other
topic models is that LDA identifies a mix of topics contained in a large corpus of texts and within
each document instead of just allocating a whole document to one topic (Blei, Ng and Jordan,
2003). Furthermore, the model differs from other topic modelling methods in that it can process a
previously unseen data sample. Hence, there are no limitations imposed by a training data set (Blei,
Ng and Jordan, 2003; Bellstam, Bhagat and Cookson, 2017). The method has recently been introduced
to the finance literature (Ganglmair and
Wardlaw, 2015; Hoberg and Lewis, 2015; Goldsmith-Pinkham, Hirtle and Lucca, 2016;
Bellstam, Bhagat and Cookson, 2017). It suits our research question because it discovers the main
themes that pervade our large collection of papers, returning for each paper the probability of
belonging to each topic and for each topic the words with the highest share, and thus allows us to
control for content. For a detailed explanation of the method see the appendix.
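To make the mechanics concrete, the following minimal sketch estimates an LDA model on tokenized documents. It uses collapsed Gibbs sampling, a common inference scheme for LDA, which is not necessarily the estimation procedure behind the results in this paper; the function names, the hyperparameters `alpha` and `beta`, and their default values are illustrative assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (theta, topic_word): theta[d][k] is the estimated share of
    topic k in document d; topic_word[k] counts words assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


def top_words(topic_word, topic, n=10):
    """Most frequent words currently assigned to a topic."""
    return [w for w, c in topic_word[topic].most_common(n) if c > 0]
```

Each row of `theta` is a probability distribution over topics for one document, which corresponds to the mix of topics per document described above; `top_words` returns the kind of keyword list that characterizes a topic.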
Figure 2 illustrates the top 10 keywords for each cluster after applying LDA. A
disadvantage of LDA is that the clusters are not labelled automatically, but as we are only
interested in the comparison between clusters, labels are not needed for our analysis. However, the
top keywords already give a strong indication of possible labels for each cluster. For example,
Cluster 2 combines firm- and management-related keywords, whereas Cluster 10 is related to mergers and
acquisitions and Cluster 16 to IPOs. The figure shows that the LDA approach is a good
approximation for identifying the topics of finance articles. After defining the categories for each
word, we measure the fit of each article to each of the 20 clusters. The cluster with the highest
probability is selected as the topic of the article.
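The selection of the best-fitting topic can be sketched as follows, assuming each article is represented by a list of topic probabilities. The function name is hypothetical, and clusters are indexed from 0 here rather than from 1 as in the figures.

```python
def best_fit_topic(topic_probs):
    """Return (index, probability) of the most likely topic for one article."""
    if not topic_probs:
        raise ValueError("topic_probs must not be empty")
    best = max(range(len(topic_probs)), key=lambda k: topic_probs[k])
    return best, topic_probs[best]
```

The returned probability is the topic fit of the article: a value close to 1 means the article is concentrated in one topic, whereas a low value means its content is spread over several topics.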
Figure 3 plots the relative number of articles for each topic over time. In general, the pattern is
clear. Most topics have remained relatively constant over the years 2000 to 2016 and therefore do
not explain the overall decline in readability or the increase in length of the top-journal
articles. The notable exception is Cluster 3, which is related to banks. The share of this cluster
has increased since 2008, indicating that banking-related research has grown since the outbreak of
the financial crisis.
3.4 The impact of readability on citations
In this section, we analyze the impact of the readability of finance articles on the number of
citations a paper receives. To this end, we run several linear regressions.
The dependent variable is the number of citations received by article i, as reported by CrossRef
and Google Scholar at the beginning of 2017, divided by the number of years since the article
appeared. Our main variables of interest are the readability measures.
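A stripped-down version of such a regression can be sketched as follows. The sketch estimates ordinary least squares via the normal equations with one readability measure and one control; the dictionary keys (`flesch_kincaid_text`, `top_school`, `citations_per_year`) and function names are hypothetical, and the full set of controls and fixed effects is omitted.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are (nearly) collinear")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def fit_citation_model(articles):
    """Regress citations per year on readability and a top-school dummy.

    Returns [intercept, coefficient on Flesch-Kincaid, coefficient on dummy].
    """
    X = [
        [1.0, a["flesch_kincaid_text"], float(a["top_school"])]
        for a in articles
    ]
    y = [a["citations_per_year"] for a in articles]
    return ols(X, y)
```

Further controls, or year and topic dummies for fixed effects, would enter as additional columns of the design matrix; a statistics package would additionally report standard errors and significance levels, which this sketch does not compute.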
The results of these regression analyses are provided in Table 2. We find that all readability
measures have a significant effect at the 5% level. The Flesch-Kincaid level of the full text, the
length of the text, and the number of complex words have a significant positive effect on the average
number of citations, but the Flesch-Kincaid level of the abstract has a highly significant negative
effect on the number of citations. The results are significant for both citation databases, CrossRef
and Google Scholar. The findings indicate a strong relationship between readability and the number
of citations a paper receives. Interestingly, we find that a clearly written abstract has a positive
effect on the number of citations.
We repeat the analysis and incorporate article- and author-level control variables to take into
account that the title length (Jacques & Sebire, 2010; Jamali & Nikzad, 2011), the number of tables
and figures (Ayres & Vars, 2000; Stremersch et al., 2007), lead articles (Medoff, 2003), authors
from top business schools (Chung & Cox, 1990; Ederington, 1979; Heck, Cooley, & Hubbard, 1986;
Klemkosky & Tuttle, 1977; Niemi, 1987), the number of authors (Hollis, 2001; Karolyi, 2011), or
the surname of the first author (Einav & Yariv, 2006) may affect the number of citations. In
addition, we include topic fixed effects and year fixed effects. The inclusion of the control
variables and the fixed effects has a substantial effect on the adjusted R², but the significant
effect of our readability measures remains.
In Table 3, we repeat the analyses of Table 2 using our control sample of 8,236 finance
articles. Our readability measures now lack significance, and we cannot find an impact of
readability on the number of citations a paper receives. The only significant effect we find is for
the full-text Flesch-Kincaid level, and only for Google Scholar. We therefore conclude that
readability does not systematically influence the number of citations for non-top journals.
4 Conclusion
This paper analyzes how finance articles are written and how this affects the number of
citations. We use a sample of 12,396 published articles, divided into 4,160 articles
published in the three leading finance journals and 8,236 articles published in other finance
journals between 2000 and 2016. We provide evidence that the readability of articles is linked to
the number of citations. We find that a higher complexity of the full text is linked with a higher
number of citations, whereas an abstract that is hard to understand has a negative effect on the
number of citations.
This paper should not be read as a call to use unintelligible writing. In this paper, we consider
only one determinant of the number of citations, readability. However, focusing on that one
aspect, it does not appear that clear and concise writing is always rewarded by the academic
community. We find that papers with complex writing are cited more frequently. As long as people are
capable of being influenced, authors can gain an advantage by exploiting the Doctor Fox phenomenon
and increase their number of citations with complex writing. This is in line with the findings of
Spence (1973). He finds that scholars signal their understanding of difficult content
by citing complex articles.
From our point of view, our paper contributes to a small empirical literature on the determinants
of citations in finance. Chung and Cox (1990), Ederington (1979), Klemkosky and Tuttle (1977),
Heck et al. (1986) and Niemi (1987) analyze the relation between the author’s institutional
affiliation and publication success in finance and provide evidence that the production of
articles published in journals with high impact factors, such as the JOF, is concentrated at
relatively few institutions. Klemkosky and Tuttle (1977) show that six universities accounted for
approximately 25% of the total pages published in these journals. Ederington (1979) is among
the first to focus on publication success in the finance literature and its determinants. He
measures the success of articles by the number of times an article has been cited and analyzes to
what extent finance articles differ in their impact and how highly cited articles differ from
articles which are cited less. He focuses on author information, particularly the authors’
affiliations, and finds that an article from a top business school receives about 70% more citations
than a similar article from an unranked school. Moreover, he states that longer articles are cited
more frequently by leading journals. These articles are also cited more often by major journals in
other fields.
We also believe that our findings raise some practical questions for academic faculty and
editors of finance journals. For authors, the findings suggest that scholars are more convinced
of the quality of an article if its content is written in a complex style. Armstrong (1980)
suggested almost 40 years ago: “If you can’t convince them, confuse them!”. We have to
pass this advice on to authors, as our findings support it. For practitioners, we see the risk
that academic writing for leading journals becomes too complex to understand and therefore loses
its practical relevance. Especially for practitioners, the information-processing cost of
understanding an article published in a top-three journal is increasing. Journal articles should
inform practitioners about the work’s relevance to the decision areas they face. The risk is that
finance journals will no longer be seen as a good source of new finance knowledge in the future. For
editors and the academic finance community, the findings suggest that they should be wary of using
the journal and citations as the only measures of the quality of scholarly work. This is in line
with the conclusion of Hamermesh (2018). In a nutshell, from an author’s perspective, we encourage
everyone to emphasize clear writing to convince readers, unless you make it into a leading journal.
In this case, a touch of confusion might also be helpful.
References
Armstrong, J. S. (1980). Unintelligible management research and academic prestige. Interfaces, 10(2),
80-86. doi:10.1287/inte.10.2.80
Ayres, I., & Vars, F. E. (2000). Determinants of citations to articles in elite law reviews. The
Journal of Legal Studies, 29(S1), 427-450. doi:10.1086/468081
Bauerly, R. J., Johnson, D. T., & Singh, M. (2006). Readability and writing well. Marketing Management
Journal, 16, 216-227.
Bellstam, G., Bhagat, S., & Cookson, A. (2017). A Text-Based Analysis of Corporate Innovation.
Elsevier Academic Press. Retrieved October 6, 2017, from
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803232.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning
Research, 3, 993-1022.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
doi:10.1145/2133806.2133826
Chan, J. Y., Chan, K. C., Tong, J. Y., & Zhang, F. (2016). Using Google Scholar citations to rank
accounting programs: A global perspective. Review of Quantitative Finance and Accounting,
47(1), 29-55. doi:10.1007/s11156-014-0493-x
Chung, K. H., & Cox, R. A. K. (1990). Patterns of productivity in the finance literature: A study of the
bibliometric distributions. The Journal of Finance, 45(1), 301-309.
doi:10.1111/j.1540-6261.1990.tb05095.x
Ederington, L. H. (1979). Aspects of the production of significant financial research. The Journal of
Finance, 34(3), 777-786. doi:10.1111/j.1540-6261.1979.tb02142.x
Einav, L., & Yariv, L. (2006). What's in a surname? The effects of surname initials on academic success.
Journal of Economic Perspectives, 20(1), 175-187. doi:10.1257/089533006776526085
Fishe, R. P. H. (1998). What are the research standards for full professor of finance? The Journal of
Finance, 53(3), 1053-1079. doi:10.1111/0022-1082.00043
Ganglmair, B., & Wardlaw, M. (2015). Measuring Contract Completeness: A Text Based Analysis of
Loan Agreements. Unpublished manuscript.
Goldsmith-Pinkham, P., Hirtle, B., & Lucca, D. (2016). Parsing the content of bank supervision. Working
Paper.
Hamermesh, D. S. (2018). Citations in economics: Measurement, uses, and impacts. Journal of Economic
Literature, 56(1), 115-156. doi:10.1257/jel.20161326
Hartley, J., Pennebaker, J. W., & Fox, C. (2003). Abstracts, introductions and discussions: How far do
they differ in style? Scientometrics, 57(3), 389-398. doi:10.1023/a:1025008802657
Hartley, J., Sotto, E., & Pennebaker, J. (2002). Style and substance in psychology: Are influential articles
more readable than less influential ones? Social Studies of Science, 32(2), 321-334.
doi:10.1177/0306312702032002005
Heck, J. L., Cooley, P. L., & Hubbard, C. M. (1986). Contributing authors and institutions to The Journal
of Finance: 1946–1985. The Journal of Finance, 41(5), 1129-1140.
doi:10.1111/j.1540-6261.1986.tb02535.x
Hoberg, G., & Lewis, C. (2015). Do fraudulent firms produce abnormal disclosure? Vanderbilt Owen
Graduate School of Management Research Paper No. 2298302.
Hollis, A. (2001). Co-authorship and the output of academic economists. Labour Economics, 8(4), 503-
530. doi:10.1016/S0927-5371(01)00041-0
Jacques, T. S., & Sebire, N. J. (2010). The impact of article titles on citation hits: An analysis of general
and specialist medical journals. JRSM Short Reports, 1(1), 1-5. doi:10.1258/shorts.2009.100020
Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and
citations. Scientometrics, 88(2), 653-661. doi:10.1007/s11192-011-0412-z
Karolyi, G. A. (2011). The ultimate irrelevance proposition in finance? Financial Review, 46(4), 485-512.
doi:10.1111/j.1540-6288.2011.00309.x
Klemkosky, R. C., & Tuttle, D. L. (1977). The institutional source and concentration of financial
research. The Journal of Finance, 32(3), 901-907. doi:10.1111/j.1540-6261.1977.tb01996.x
Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting
and Economics, 45(2), 221-247. doi:10.1016/j.jacceco.2008.02.003
Loughran, T., & McDonald, B. (2014). Measuring readability in financial disclosures. The Journal
of Finance, 69(4), 1643-1671. doi:10.1111/jofi.12162
Loveland, J., Whatley, A., Ray, B., & Reidy, R. (1973). An analysis of the readability of selected
management journals. Academy of Management Journal, 16(3), 522-524. doi:10.2307/255014
McDowell, J. M., & Melvin, M. (1983). The determinants of co-authorship: An analysis of the economics
literature. The Review of Economics and Statistics, 65(1), 155-160. doi:10.2307/1924423
Medoff, M. H. (2003). Article placement and market signalling. Applied Economics Letters, 10(8), 479-
482. doi:10.1080/1350485032000095348
Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Doctor Fox lecture: A paradigm of
educational seduction. Journal of Medical Education, 48, 630-635.
Niemi, A. W. (1987). Institutional contributions to the leading finance journals, 1975 through 1986: A
note. The Journal of Finance, 42(5), 1389-1397. doi:10.1111/j.1540-6261.1987.tb04374.x
Reed, C. (2012). Latent Dirichlet Allocation: Towards a Deeper Understanding. Retrieved February 28,
2018, from http://obphio.us/pdfs/lda_tutorial.pdf.
SEC. (2007). Speech by SEC Chairman: Closing Remarks to the Second Annual Corporate Governance
Summit. Retrieved from https://www.sec.gov/news/speech/2007/spch032307cc.htm
Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355-374.
doi:10.2307/1882010
Stremersch, S., Verniers, I., & Verhoef, P. C. (2007). The quest for citations: Drivers of article impact.
Journal of Marketing, 71(3), 171-193. doi:10.1509/jmkg.71.3.171
You, H., & Zhang, X.-J. (2009). Financial reporting complexity and investor underreaction to 10-K
information. Review of Accounting Studies, 14(4), 559-586. doi:10.1007/s11142-008-9083-2
Zimmerman, J. L. (1989). Improving a manuscript's readability and likelihood of publication. Issues in
Accounting Education, 4(2), 458-466.
Zivney, T. L., & Bertin, W. J. (1992). Publish or perish: What the competition is really doing. The
Journal of Finance, 47(1), 295-329. doi:10.1111/j.1540-6261.1992.tb03987.x
Table 1: Summary statistics.
This table shows descriptive statistics for our sample of 4,160 articles published in the Journal of Financial Economics, the Review of
Financial Studies, and The Journal of Finance, and for a control sample of 8,236 articles published in eleven other finance journals. CrossRef and
GoogleScholar are the numbers of citations received as reported by CrossRef and Google Scholar, respectively, at the beginning of 2017, divided
by the number of years since the article appeared. FleschKincaidText is the Flesch-Kincaid score of the full paper and gives the number of years
of education that a reader hypothetically needs to understand the text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the
abstract of the paper. Length is the logarithm of the number of words in the full text. ComplexWords is the logarithm of the number of complex
words, defined as words that have more than two syllables, minus jargon words, defined as words listed in Professor Campbell Harvey’s finance
glossary. TopicFit is the value of the highest percentage of one of our two topics obtained from the Latent Dirichlet Allocation. TitleLength is the
number of words in the article’s title. References is the number of references in the article. Tables and Figures are the total numbers of tables and
figures in the article, respectively. LeadArticle indicates articles that are in the lead position in their respective issue. TopBusinessSchool is
set to 1 if at least one of the authors’ affiliations is listed in the Financial Times Global MBA ranking in the year prior to the year of
publication of the article. Authors is the number of authors of the article. Name is the surname initial of the first author coded on a numerical
scale from 1 to 26 (A=1, B=2, ..., Z=26). The equality of means and medians of the two samples is tested for statistical significance using the
two-sample t-test and the Mann-Whitney U test, respectively. *, **, *** denote statistical significance at the 5%, 1%, and 0.1% level, respectively.
                        Top journals (n=4,160)    Control sample (n=8,236)  Differences
                        Mean         Median       Mean         Median       Mean         Median
Citations
Crossref                7.260        4.500        1.973        1.000        5.287***     3.500***
GoogleScholar           24.843       12.854       7.282        3.571        17.561***    9.283***
Readability variables
FleschKincaidText       17.461       17.021       16.027       15.715       1.434***     1.306***
FleschKincaidAbstract   15.687       15.538       15.838       15.676       -0.151**     -0.138**
Length                  14,217       14,123       10,019       9,720        4,197***     4,403***
ComplexWords            2,945        2,910        2,059        1,987        886***       2,024***
Article related variables
TopicFit                44.022       41.939       45.715       43.490       -1.693***    -1.551***
TitleLength             8.543        8.000        9.820        9.000        -1.277***    -1.000***
References              41.519       41.000       37.735       35.000       3.783***     6.000***
Tables                  6.565        7.000        5.578        6.000        0.989***     1.000***
Figures                 2.710        2.000        2.138        1.000        0.572***     1.000***
LeadArticle             0.413        0.000        0.448        0.000        -0.003       0.000
Author related variables
TopBusinessSchool       0.292        0.000        0.033        0.000        0.259***     0.000
Authors                 2.315        2.000        2.259        2.000        0.053**      0.000***
Figure 1: Readability of finance articles over time
This figure shows the readability of our sample of 4,160 articles published in the top-three finance journals and of 8,236 articles published in
other finance journals. Panel A shows the readability measured by the Flesch-Kincaid score and Panel B shows the readability measured by the number of words and the number of complex words per article.
Panel A: Flesch-Kincaid readability measure over time
[Chart not reproduced. Horizontal axis: publication year, 2000 to 2016. Series: Top journals (Text), Control sample (Text), Top journals (Abstract), Control sample (Abstract).]
Panel B: Number of words and number of complex words over time
[Chart not reproduced. Horizontal axis: publication year, 2000 to 2016. Vertical axes: Total number (of words) and Complex words. Series: Top journals (Total words), Control sample (Total words), Top journals (Complex words), Control sample (Complex words).]
Figure 2: Top 10 keywords per category.
This figure shows the top 10 keywords for each category obtained using LDA.
Figure 3: Distribution of topics per year.
This figure shows the distribution of topics in the three finance journals The Journal of Finance, Review of Financial Studies, and Journal of
Financial Economics in the years 2000 to 2016, as estimated with LDA.
[Chart not reproduced. Horizontal axis: publication year, 2000 to 2016. Vertical axis: percentage, 0% to 100%. Legend: topics 1 to 20.]
Table 2: Results for the top-three finance journals.
This table shows the regression results for the sample of 4,160 articles published in the Journal of Financial Economics, the Review of Financial
Studies, and The Journal of Finance. The dependent variables are the standardized citations, using CrossRef and Google Scholar as sources for
citations. FleschKincaidText is the Flesch-Kincaid score of the full paper and gives the number of years of education that a reader hypothetically
needs to understand the text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Length is the natural
logarithm of the number of words in the full text and ComplexWords is the number of complex words, defined as words with more than two
syllables, minus all financial words appearing in Campbell Harvey’s Hypertextual Finance Glossary. *, **, *** denote statistical significance at the 5%, 1% and 0.1% level, respectively.
                        Crossref                               Google Scholar
                        I            II           III          IV           V            VI
FleschKincaidText       0.152**                                0.424*
                        (0.048)                                (0.195)
FleschKincaidAbstract   -0.214***                              -0.776***
                        (0.056)                                (0.195)
Length                               1.531**                                3.710*
                                     (0.507)                                (1.843)
ComplexWords                                      1.377***                               4.075*
                                                  (0.435)                                (1.703)
Article controls        YES          YES          YES          YES          YES          YES
Author controls         YES          YES          YES          YES          YES          YES
Year FE                 YES          YES          YES          YES          YES          YES
Topic FE                YES          YES          YES          YES          YES          YES
Observations            4,160        4,160        4,160        4,160        4,160        4,160
R2 / Adj. R2            .151/.141    .148/.138    .148/.138    .231/.222    .228/.219    .230/.221
Table 3: Results for the control sample.
This table shows the regression results for the control sample of 8,236 articles published in eleven other finance journals. The dependent
variables are the standardized citations, using CrossRef and Google Scholar as sources for citations. FleschKincaidText is the Flesch-Kincaid
score of the full paper and gives the number of years of education that a reader hypothetically
needs to understand the text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Length is the natural
logarithm of the number of words in the full text and ComplexWords is the number of complex words, defined as words with more than two syllables, minus all financial words appearing in Campbell Harvey’s Hypertextual Finance Glossary. *, **, *** denote statistical significance
at the 5%, 1% and 0.1% level, respectively.
                        Crossref                               Google Scholar
                        I            II           III          IV           V            VI
FleschKincaidText       0.020                                  0.096*
                        (0.011)                                (0.048)
FleschKincaidAbstract   0.007                                  0.036
                        (0.009)                                (0.044)
Length                               -0.031                                 0.437
                                     (0.109)                                (0.576)
ComplexWords                                      0.011                                  0.758
                                                  (0.113)                                (0.587)
Article controls        YES          YES          YES          YES          YES          YES
Author controls         YES          YES          YES          YES          YES          YES
Year FE                 YES          YES          YES          YES          YES          YES
Topic FE                YES          YES          YES          YES          YES          YES
Observations            8,236        8,236        8,236        8,236        8,236        8,236
R2 / Adj. R2            .136/.130    .136/.130    .136/.130    .108/.102    .108/.102    .108/.102
A. Appendix
A1 Sample
A1.1 Control sample construction
Our control sample is based on 8,236 journal articles published in finance-oriented journals from
January 2000 to December 2016. We used the Journal Citation Reports to determine a journal’s
ranking and the Journal Quality List provided by Harzing to determine the most important
finance-oriented journals published in English.
We excluded two journals, namely the Review of Finance and the Journal of Financial and
Quantitative Analysis, as they are considered leading journals in some cases but not in
others. In order to distinguish our control sample from the original sample, we incorporated the next
journals in the list. In total, we are able to include articles from 11 major finance-oriented journals,
namely European Financial Management, Financial Review, Journal of Banking & Finance,
Journal of Corporate Finance, Journal of Empirical Finance, Journal of Financial
Intermediation, Journal of Financial Markets, Journal of Financial Stability, Journal of Futures
Markets, Journal of International Financial Markets, Institutions and Money, and Journal of
Money, Credit and Banking. These journals are top-ranked quality journals that publish the full
spectrum of finance research. We exclusively focus on finance publications and exclude broader
management journals, such as Management Science, as the writing between disciplines may vary
and we are interested in analyzing the writing of finance articles.3
3 For an overview of academic writing across disciplines and the differences in the writing of academic articles see,
for example, Hyland (2002, 2008).
A1.2 Data cleaning
For each journal, we downloaded all articles in Portable Document Format (PDF) from the
respective journal's website that were published in the volumes and issues between January 2000
and December 2016. In total, we downloaded 16,091 PDF documents. In addition, we collected
the authors’ names, their affiliation, the paper's title, the abstract and the Digital Object Identifier
(DOI) for each article from the website. As the focus of the present study is on journal articles, we
drop all PDF documents that are not full-text original research articles. We followed a multi-step
screening procedure to exclude PDF documents that did not match this criterion. First, we
eliminated all PDF documents without a valid DOI as well as back and front matter. In a next
step, we also removed editorial board announcements and editor notes. We also excluded PDF
documents labeled “miscellanea”, content and issue information, discussions, and PDFs
containing exclusively author acknowledgements. This leaves us with a sample of 14,578 PDFs.
In addition, we also removed PDF documents without any author information on the website. This
elimination round covers forewords, specific announcements, and general journal information. We
also eliminated very short papers with fewer than three pages. The final step was to exclude all PDF
documents that are not machine readable. This leaves us with a final sample of 12,396 full-text
original research articles, which corresponds to approximately 77% of the initial PDF documents.
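The screening procedure described above can be read as a sequence of filters applied one after another. The following Python sketch only illustrates that logic. The record fields (has_valid_doi, doc_type, has_author_info, pages, machine_readable), the document-type labels, and the three toy records are hypothetical and are not taken from the data or code used for this paper.

```python
from dataclasses import dataclass

# Document types dropped during screening; the labels are hypothetical.
EXCLUDED_TYPES = {"front_matter", "back_matter", "editorial", "miscellanea",
                  "issue_information", "discussion", "acknowledgement"}


@dataclass
class PdfRecord:
    has_valid_doi: bool
    doc_type: str
    has_author_info: bool
    pages: int
    machine_readable: bool


# Each step is (description, condition a record must satisfy to be kept).
SCREENING_STEPS = [
    ("valid DOI", lambda r: r.has_valid_doi),
    ("original research article", lambda r: r.doc_type not in EXCLUDED_TYPES),
    ("author information available", lambda r: r.has_author_info),
    ("at least three pages", lambda r: r.pages >= 3),
    ("machine readable", lambda r: r.machine_readable),
]


def screen(records):
    """Apply the steps in order and record how many documents survive each one."""
    remaining = list(records)
    funnel = [("downloaded", len(remaining))]
    for description, keep in SCREENING_STEPS:
        remaining = [r for r in remaining if keep(r)]
        funnel.append((description, len(remaining)))
    return remaining, funnel


toy_records = [
    PdfRecord(True, "article", True, 25, True),
    PdfRecord(True, "editorial", False, 2, True),
    PdfRecord(True, "article", True, 30, False),
]
final_sample, funnel = screen(toy_records)

# Share of the initial downloads retained, using the counts reported in the text.
retained_share = 12396 / 16091
```

With the counts reported above, retained_share is roughly 0.77, which matches the stated share of about 77%.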
A1.3 Final sample
Table 4: Articles by journal and year.
This table shows the journal articles for the entire sample of 12,396 articles during the investigation period from January 2000 to December 2016.
The articles are divided by journal and year.
Articles from the Journal of Money, Credit and Banking are underrepresented because they have only been available online since 2007. The sample also includes
only 101 articles from the Journal of Financial Stability due to a large number of non-machine-readable PDF articles published since 2010.
Journal                               '00   '01   '02   '03   '04   '05   '06   '07   '08   '09   '10   '11   '12   '13   '14   '15   '16   Total
Panel A: Top Journals
Journal of Financial Economics        56    61    58    52    75    79    87    103   102   87    100   136   124   153   102   119   123   1,617
Review of Financial Studies           38    41    55    66    26    31    76    84    112   98    111   102   89    86    104   81    51    1,251
The Journal of Finance                88    79    88    75    90    85    86    84    81    78    69    60    50    67    71    70    71    1,292
Panel B: Control sample
European Financial Management         6     0     16    23    23    25    27    45    34    44    68    34    25    47    15    21    8     461
Financial Review                      31    32    31    30    25    26    26    25    24    26    52    28    21    29    33    22    18    479
Journal of Banking & Finance          71    20    83    81    114   53    86    132   110   186   238   256   247   382   295   273   115   2,742
Journal of Corporate Finance          17    18    21    28    32    45    35    47    48    38    47    95    75    74    46    108   120   894
Journal of Empirical Finance          0     23    26    26    30    29    12    10    7     13    10    12    12    62    79    66    101   518
Journal of Financial Intermediation   16    11    16    15    15    17    23    21    23    27    26    27    28    30    24    27          346
Journal of Financial Markets          15    14    18    22    17    16    18    15    17    32    20    23    18    27    46    21    29    368
Journal of Financial Stability        0     0     0     0     11    10    13    20    24    23                                              101
Journal of Futures Markets            0     34    62    43    51    50    50    50    54    88    50    57    50    51    56    46    17    809
Journal of International Financial
  Markets, Institutions and Money     25    23    23    27    29    26    29    28    40    61    36    43    69    86    109   49    43    746
Journal of Money, Credit and Banking  0     0     0     0     0     0     0     88    78    86    82    77    78    79    67    78    59    772
Total                                 363   356   497   488   538   492   568   752   754   887   909   950   886   1,173 1,047 981   755   12,396
A2 Methodology and variables
A2.1 Methodology
The aim of this paper is to analyze whether readability, article-specific, or author-specific variables
determine the number of citations a paper receives. In order to account for these factors, we run
several linear regressions. The ordinary least squares (OLS) regression takes the following form:
\[
\mathit{CITE}_i = \alpha + \beta \times \mathit{READABILITY}_{i,p} + \delta \times \mathit{ARTICLE}_{i,p} + \psi \times \mathit{AUTHOR}_{i,p} + \mathit{YEAR\ FE} + \mathit{TOPIC\ FE} + \varepsilon_i \tag{1}
\]
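As a rough illustration of how a regression of the form of equation (1) can be estimated, the following Python sketch fits an OLS model with indicator (dummy) variables for the year and topic fixed effects by solving the normal equations. It is a minimal sketch under simplifying assumptions: it uses a single readability regressor, omits the ARTICLE and AUTHOR vectors, and runs on invented toy data. The function names (solve_linear_system, ols, design_row) are hypothetical and do not describe the software used for the estimates reported in this paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear regressors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(design, y):
    """Return the OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def design_row(readability, year, topic, years, topics):
    """Intercept, readability, year dummies, topic dummies (first level of each omitted)."""
    return ([1.0, readability]
            + [1.0 if year == y else 0.0 for y in years[1:]]
            + [1.0 if topic == t else 0.0 for t in topics[1:]])


# Invented toy observations: (readability score, publication year, topic).
years, topics = [2000, 2001], [1, 2]
observations = [(1.0, 2000, 1), (2.0, 2000, 2), (3.0, 2001, 1),
                (4.0, 2001, 2), (5.0, 2000, 1), (7.0, 2001, 1)]
design = [design_row(x, yr, tp, years, topics) for x, yr, tp in observations]

# Noise-free outcome built from known coefficients, so OLS should recover them.
true_coefficients = [1.0, 2.0, 3.0, -1.0]
cite = [sum(c * v for c, v in zip(true_coefficients, row)) for row in design]
estimated = ols(design, cite)
```

Omitting the first year and the first topic avoids perfect collinearity between the dummy variables and the intercept.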
The dependent variable $\mathit{CITE}_i$ is the number of citations received by article i, as reported by
CrossRef and Google Scholar at the beginning of 2017, divided by the number of years since the
article appeared. Articles published in the early years of the sample have had more time to accumulate
citations than recent publications. Therefore, we follow the approach by Chan et al. (2016) and
normalize the citations of each article by the number of years since the article has been published.
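The normalization can be illustrated with a few lines of Python. This is a sketch under one explicit assumption: the number of years since publication is taken as the difference between the year in which citations were collected (2017) and the publication year. The function name and the example values are hypothetical.

```python
def citations_per_year(citations: int, publication_year: int, collection_year: int = 2017) -> float:
    """Citations divided by the number of years since publication.

    Assumption: years since publication = collection_year - publication_year.
    """
    years_since_publication = collection_year - publication_year
    if years_since_publication <= 0:
        raise ValueError("the article must be published before the collection year")
    return citations / years_since_publication


# Example: 9 citations for an article published in 2015, counted at the start of 2017.
example_rate = citations_per_year(9, 2015)
```

Under this assumption, an article from 2015 with 9 citations and an article from 2008 with 40.5 citations on average would both have a rate of 4.5 citations per year.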
The independent variables are divided into readability related variables, sentiment specific
variables, article specific variables, and author specific variables. $\mathit{READABILITY}_{i,p}$ is a vector that
includes three different measures of the readability of the article. The first measure is the
Flesch-Kincaid formula. It is based on two components, the number of syllables per word and the
number of words per sentence. The Flesch-Kincaid score is calculated as follows:
\[
\mathit{FleschKincaid} = (11.8 \times \text{syllables per word}) + (0.39 \times \text{words per sentence}) - 15.59 \tag{2}
\]
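The Flesch-Kincaid formula above can be sketched in Python as follows. The formula itself is taken from the text. The syllable counter is a crude heuristic introduced here for illustration only (it counts groups of consecutive vowels), and the sentence splitter is equally simple, so neither should be read as the procedure used to compute the scores in this paper.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid score from word, sentence, and syllable counts."""
    if n_words == 0 or n_sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    syllables_per_word = n_syllables / n_words
    words_per_sentence = n_words / n_sentences
    return 11.8 * syllables_per_word + 0.39 * words_per_sentence - 15.59


def flesch_kincaid_from_text(text: str) -> float:
    """Apply the formula to raw text using the simple tokenization above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)


# Worked example with counts given directly: 100 words, 5 sentences, 150 syllables.
example_score = flesch_kincaid_grade(100, 5, 150)
```

In the worked example, 1.5 syllables per word and 20 words per sentence give 17.7 + 7.8 - 15.59, that is a score of about 9.91.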
The Flesch-Kincaid score is a number that corresponds to a grade level of the U.S. educational system. The
variable FleschKincaidText is the Flesch-Kincaid score of the full paper, whereas
FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Two other popular
readability measures, the Fog-Index and the Flesch Reading Ease Score, use similar components
and all measures highly correlate with each other.4 Li (2008) and You and Zhang (2009) argue that
the length of the document, measured as the number of words, is a good proxy for readability.
The information-processing cost of longer documents is higher than that of shorter documents; longer
documents seem to be more deterring and more difficult to read (Li, 2008). Following this approach, we
use only the number of words (omitting the average sentence length and syllable count) as an
alternative readability measure. The third measure combines the two previous measures. The
Fog-Index uses the number of “complex words”, defined as words with more than two syllables, and
does not incorporate the average number of syllables per word. Loughran and McDonald (2014)
criticize this measure for business documents, as these contain multisyllabic words to describe
operations. We control for this financial terminology and subtract all financial words appearing in
Campbell Harvey’s Hypertextual Finance Glossary.
4 These measures are the most used in the literature (e.g. Bauerly, Johnson, & Singh, 2006; Hartley, Pennebaker, & Fox,
2003; Hartley, Sotto, & Pennebaker, 2002; Loveland, Whatley, Ray, & Reidy, 1973; Stremersch, Verniers, & Verhoef,
2007), but also in other cases. Moreover, the former SEC Chairman Christopher Cox suggested that the Fog-Index
can be used to gauge compliance with the SEC’s plain English initiatives (Loughran & McDonald, 2014).
$\mathit{ARTICLE}_{i,p}$ is an article-specific vector that includes the following variables. TopicFit is the value of the highest percentage
of one of our two topics obtained from the Latent Dirichlet Allocation. TitleLength is the number
of words in the article’s title. References is the number of references in the article. Tables and
Figures are the total numbers of tables and figures in the article, respectively. LeadArticle indicates
articles that are in the lead position in their respective issue.
$\mathit{AUTHOR}_{i,p}$ is a vector that includes author-specific variables. TopBusinessSchool is set to 1 if at
least one of the authors’ affiliations is listed in the top ten of the Financial Times Global MBA
ranking in the year prior to the year of publication of the article. Authors is defined as the
number of authors of the article. The initial of the surname of the first author is coded into integers
between 1 and 26 lexicographically (A=1, B=2, ..., Z=26) and defines the variable Name.
A2.2 Choice of variables
Our approach of incorporating article related variables follows Stremersch et al. (2007) and Ayres
and Vars (2000), who measure the clarity of a paper using the number of tables and the number of
figures. Moreover, Ayres and Vars (2000) find that the number of citations in elite law journals
increases with an article’s length. In addition, Medoff (2003) shows that a lead article receives a
significantly higher number of citations in the first five years after publication. If the article is
listed first in the respective issue or volume, we consider it a lead article.
The choice of the author related variables is as follows: Chung and Cox (1990), Ederington
(1979), Klemkosky and Tuttle (1977), Heck, Cooley, and Hubbard (1986) and Niemi (1987)
analyze the relation between the author’s institutional affiliation and publication success in
finance, measured as the quantitative contribution to leading finance journals or the number of
citations received. They provide evidence that the production of articles published in journals
with high impact factors, such as The Journal of Finance, is concentrated at relatively few
institutions. Klemkosky and Tuttle (1977) show that six universities accounted for approximately
25% of the total pages published in these journals. Ederington (1979) finds that articles published
by researchers from top business schools are cited more often. Furthermore, we also control for
the number of authors, as the number of papers with co-authors is increasing. Joint work allows
productivity gains due to the division of labour (McDowell & Melvin, 1983). Karolyi (2011) and
Hollis (2001) show that, for individual economists, more co-authors are associated with higher research
quality, greater length, and greater frequency of publications. Finally, we control for the surname
of the first author. Einav and Yariv (2006) analyze the impact of surname initials on professional
outcomes in the academic labor market for economists. Their data provide evidence that faculty with earlier
surname initials are significantly more likely to receive tenure. In analogy to Einav and Yariv
(2006), the initial of the surname is coded into integers between 1 and 26 lexicographically (A=1,
B=2, ..., Z=26).
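The coding of the variable Name can be sketched in a few lines of Python. The function name is hypothetical, and the handling of surnames that do not start with one of the 26 letters A to Z (here: raising an error) is an assumption, since the text does not specify it.

```python
def surname_initial_code(surname: str) -> int:
    """Map the first letter of a surname to an integer: A=1, B=2, ..., Z=26."""
    cleaned = surname.strip()
    if not cleaned:
        raise ValueError("surname is empty")
    initial = cleaned[0].upper()
    if not ("A" <= initial <= "Z"):
        # Assumption: initials outside A-Z are not covered by the coding scheme.
        raise ValueError(f"initial {initial!r} is outside A-Z")
    return ord(initial) - ord("A") + 1


example_codes = [surname_initial_code(name) for name in ("Armstrong", "Einav", "Zivney")]
```

For the three surnames in the example, the codes are 1, 5, and 26.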
A2.3 Latent Dirichlet Allocation (LDA)
An LDA model such as the one we use in this paper5 defines the basic terms “word”, “document” and
“corpus” to describe the underlying data sample. These definitions follow the original
definitions of Blei, Ng and Jordan (2003). The terms are defined as follows:
A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by
$\{1,\dots,V\}$. We represent words using unit-basis vectors that have a single component equal
to one and all other components equal to zero. Thus, using superscripts to denote
components, the $v$th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$
and $w^u = 0$ for $u \neq v$.
A document $d$ is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$th
word in the sequence.
A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$.
5 The LDA model we use in this paper is estimated with the VEM algorithm developed by Blei, Ng and Jordan (2003).
The latent Dirichlet allocation is a three-level hierarchical Bayesian model: first, $\alpha$ and $\beta$ are
corpus-level parameters set once per corpus; second, $\theta_d$ are document-level variables set once per document;
and third, $z_n$ and $w_n$ are word-level variables set once for each word of each document (Blei, Ng and
Jordan, 2003). The idea is that each item of a collection is modeled as a mixture of various
latent topics, whereas each topic is in turn modeled as a mixture over an underlying set of topic probabilities. Further, each
topic is represented by a distribution over words (Blei, Ng and Jordan, 2003). The following
steps describe the generative process for a document vector $\mathbf{w}$ within a corpus $D$. This
description is again taken from the original paper by Blei et al. (2003):
1. Choose $N \sim \mathrm{Poisson}(\xi)$.
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
   a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the
      topic $z_n$.
For simplicity, some assumptions are made. For instance, $\xi$ is a scalar that governs the
length, or more accurately the number of words, of a document through a Poisson distribution. The
dimensionality $k$ of the Dirichlet distribution is assumed to be known and fixed, and $\alpha$ is a predefined parameter
vector of the distribution of $\theta_d$. $\beta$ contains the per-topic-per-word probabilities $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$ and is
a $k \times V$ matrix, with $k$ as the number of topics and $V$ as the number of distinct
words. $\theta_d$ is a $k \times 1$ vector of Dirichlet random variables for every document $d$, where each entry
represents the per-document-per-topic probability. Hence, it can be interpreted as the proportion
of a topic within a document. Note that $N$ (the number of words per document) is
independent of $\theta_d$ and $z_n$. Further, a so-called “bag-of-words” assumption is made. That means
that the order of words can be neglected, that the documents are independent of each other,
and that the words within a document are assumed to be exchangeable (de Finetti, Machì and Smith,
1990).
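The generative process and the roles of the parameters can be illustrated with a small simulation. The following Python sketch uses only the standard library. The toy vocabulary, the values chosen for $\xi$, $\alpha$ and $\beta$, and all function names are hypothetical. It simulates how LDA assumes a document is generated, and it is not the estimation procedure (VEM) used in this paper.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, xi, alpha, beta, vocabulary):
    """Simulate one document following steps 1 to 3 of the generative process."""
    n_words = sample_poisson(rng, xi)                      # step 1: document length N
    theta = sample_dirichlet(rng, alpha)                   # step 2: topic proportions
    words = []
    for _ in range(n_words):                               # step 3: one draw per word
        topic = rng.choices(range(len(alpha)), weights=theta)[0]    # 3a: topic z_n
        word = rng.choices(vocabulary, weights=beta[topic])[0]      # 3b: word w_n
        words.append(word)
    return theta, words


# Hypothetical toy setting: V = 4 distinct words and k = 2 topics.
vocabulary = ["return", "risk", "bank", "loan"]
alpha = [0.5, 0.5]
beta = [[0.45, 0.45, 0.05, 0.05],   # topic 0: per-word probabilities
        [0.05, 0.05, 0.45, 0.45]]   # topic 1: per-word probabilities
rng = random.Random(42)
theta, document = generate_document(rng, xi=8.0, alpha=alpha, beta=beta, vocabulary=vocabulary)
```

Because of the bag-of-words assumption, only the counts of the generated words matter, not their order.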
The matrices $\beta$ ($k \times V$ matrix) and $\theta$ ($k \times M$ matrix) are needed if one wants to classify a corpus
of documents for further analysis. To estimate them, one can use the VEM algorithm developed by
Blei et al. (2003), which is available in R’s topic modelling package. Besides the corpus of
documents, only the number of topics $k$ is then needed. The LDA algorithm computes two
estimated outputs from the data sample: first, $\hat{\beta}$ with the estimated per-topic-per-word
probabilities $\hat{\beta}_{ij}$, and second, $\hat{\theta}$ with the estimated per-document-per-topic probabilities $\hat{\theta}_d$.
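The estimated per-document-per-topic probabilities are the basis for the variable TopicFit, which is defined as the highest topic percentage of an article. The following Python sketch shows one way to derive it. The function name and the toy values are hypothetical, and expressing TopicFit in percent is an assumption based on the magnitudes reported in the descriptive statistics.

```python
def topic_fit(theta_d):
    """Return (index of the dominant topic, TopicFit in percent) for one document."""
    dominant = max(range(len(theta_d)), key=lambda i: theta_d[i])
    return dominant, 100.0 * theta_d[dominant]


# Hypothetical estimated topic proportions for two documents and three topics.
theta_hat = [
    [0.25, 0.5, 0.25],
    [0.75, 0.125, 0.125],
]
fits = [topic_fit(theta_d) for theta_d in theta_hat]
```

In this toy example, the first document is assigned to topic index 1 with a TopicFit of 50, and the second to topic index 0 with a TopicFit of 75.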
A3 Descriptive statistics
Table 5: Descriptive sample statistics.
This table shows descriptive statistics for the sample of 4,160 articles published in the three top finance journals
during the investigation period from January 2000 to December 2016.
Variable                n       Mean      Median    Std. deviation  25% quantile  75% quantile
Dependent variables
Crossref                4,160   7.260     4.500     9.722           2.000         9.000
GoogleScholar           4,160   24.843    12.854    35.739          4.380         31.063
Readability variables
FleschKincaidText       4,160   17.461    17.021    3.529           15.187        19.029
FleschKincaidAbstract   4,160   15.687    15.538    2.464           13.996        17.164
Length                  4,160   9.525     9.556     0.295           9.386         9.704
ComplexWords            4,160   7.944     7.977     0.319           7.789         8.148
Article related variables
TopicFit                4,160   44.022    41.939    13.990          33.585        52.241
TitleLength             4,160   8.543     8.000     3.323           6.000         11.000
References              4,160   41.519    41.000    23.280          29.000        53.000
Tables                  4,160   6.565     7.000     3.724           4.000         9.000
Figures                 4,160   2.710     2.000     2.822           0.000         4.000
LeadArticle             4,160   0.413     0.000     0.199           0.000         0.000
JFE                     4,160   0.389     0.000     0.488           0.000         1.000
RFS                     4,160   0.301     0.000     0.459           0.000         1.000
Author related variables
TopBusinessSchool       4,160   0.292     0.000     0.455           0.000         1.000
Authors                 4,160   2.315     2.000     0.855           2.000         3.000
Name                    4,160   7.863     7.000     6.046           3.000         12.000
Table 3: Descriptive sample statistics for the 11 remaining journals.
This table shows descriptive statistics for the control sample of 8,236 articles published in the 11 remaining journals
during the investigation period from January 2000 to December 2016.
Variable                n       Mean      Median    Std. deviation  25% quantile  75% quantile
Dependent variables
Crossref                8,236   1.973     1.000     3.174           0.400         2.333
GoogleScholar           8,236   7.282     3.571     14.126          1.531         7.904
Readability variables
FleschKincaidText       8,236   16.027    15.715    2.933           14.401        17.130
FleschKincaidAbstract   8,236   15.838    15.676    2.965           14.083        17.415
Length                  8,236   9.146     9.182     0.386           8.946         9.404
ComplexWords            8,236   7.554     7.595     0.412           7.329         7.835
Article related variables
TopicFit                8,236   45.715    43.490    14.711          34.819        54.481
TitleLength             8,236   9.820     9.000     3.528           7.000         12.000
References              8,236   37.735    35.000    19.022          25.000        47.000
Tables                  8,236   5.578     6.000     3.547           3.000         8.000
Figures                 8,236   2.138     1.000     2.764           0.000         3.000
LeadArticle             8,236   0.448     0.000     0.207           0.000         0.000
Author related variables
TopBusinessSchool       8,236   0.033     0.000     0.179           0.000         0.000
Authors                 8,236   2.259     2.000     0.879           2.000         3.000
Name                    8,236   8.855     7.000     6.633           3.000         13.000
A4 Results
Table 4: Results for the three top finance journals
                        Crossref                            Google Scholar
                        I           II          III         IV          V           VI
Readability variables
FleschKincaidText       0.152**                             0.424*
                        (0.048)                             (0.195)
FleschKincaidAbstract   -0.214***                           -0.776***
                        (0.056)                             (0.195)
Length                              1.531**                             3.710*
                                    (0.507)                             (1.843)
ComplexWords                                    1.377***                            4.075*
                                                (0.435)                             (1.703)
Article related variables
Topic fit               0.045***    0.046***    0.047***    0.154***    0.156***    0.159***
                        (0.010)     (0.010)     (0.010)     (0.035)     (0.035)     (0.035)
Title length            -0.178***   -0.174***   -0.172***   -0.875***   -0.863***   -0.858***
                        (0.038)     (0.039)     (0.039)     (0.143)     (0.144)     (0.144)
References              0.047***    0.042***    0.041***    0.127***    0.113***    0.110***
                        (0.007)     (0.007)     (0.007)     (0.028)     (0.029)     (0.029)
Tables                  0.149**     0.119*      0.124**     0.388*      0.327       0.314
                        (0.045)     (0.046)     (0.047)     (0.179)     (0.187)     (0.189)
Figures                 0.172*      0.129       0.140       0.600**     0.498*      0.514*
                        (0.075)     (0.074)     (0.075)     (0.228)     (0.228)     (0.226)
Leadarticle             1.241       1.352       1.340*      2.244       2.547       2.522
                        (0.714)     (0.712)     (0.714)     (2.673)     (2.659)     (2.663)
JFE                     -0.396      -0.787*     -0.821*     -7.650***   -8.814***   -8.939***
                        (0.341)     (0.331)     (0.334)     (1.484)     (1.447)     (1.460)
RFS                     -2.317***   -2.116***   -2.202***   -32.380***  -31.880***  -32.126***
                        (0.354)     (0.344)     (0.343)     (1.136)     (1.091)     (1.099)
Author related variables
TopBusiness School      1.520***    1.437***    1.450***    6.867***    6.645***    6.665***
                        (0.320)     (0.322)     (0.321)     (1.181)     (1.189)     (1.189)
Authors                 0.626**     0.604**     0.605**     2.477***    2.408**     2.395**
                        (0.183)     (0.188)     (0.186)     (0.691)     (0.692)     (0.691)
Name                    0.000       0.001       0.001       -0.070      -0.068      -0.067
                        (0.025)     (0.025)     (0.025)     (0.083)     (0.083)     (0.083)
Intercept               1.560       -12.746***  -9.180**    21.549***   -16.037     -12.861
                        (1.505)     (4.703)     (3.358)     (6.054)     (16.442)    (12.572)
Year FE                 YES         YES         YES         YES         YES         YES
Topic FE                YES         YES         YES         YES         YES         YES
n                       4,160       4,160       4,160       4,160       4,160       4,160
R2 / Adj. R2            .151/.141   .148/.138   .148/.138   .231/.222   .228/.219   .230/.221
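Each coefficient in Table 4 is reported with its standard error in parentheses on the line beneath it. One common way to read such a cell is the ratio of the estimate to its standard error. The sketch below, in Python, computes that ratio for three cells copied from column I. The helper name `t_ratio` and the 1.96 cut-off (a two-sided 5% level under a normal approximation) are assumptions for illustration; this section does not state which significance levels the asterisks denote.

```python
def t_ratio(coefficient, standard_error):
    """Ratio of an estimate to its standard error."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    return coefficient / standard_error


# (coefficient, standard error) pairs copied from Table 4, column I.
cells = {
    "FleschKincaidText": (0.152, 0.048),
    "FleschKincaidAbstract": (-0.214, 0.056),
    "Name": (0.000, 0.025),
}

CRITICAL_5_PERCENT = 1.96  # assumed two-sided normal cut-off, not from the paper

for name, (coef, se) in cells.items():
    t = t_ratio(coef, se)
    exceeds = abs(t) > CRITICAL_5_PERCENT
    print(f"{name}: t = {t:.2f}, exceeds assumed cut-off: {exceeds}")
```

Because the table rounds to three decimals, ratios computed this way are approximate and can differ slightly from the statistics underlying the printed asterisks.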
Table 5: Results for 11 journals
                        Crossref                            Google Scholar
                        I           II          III         IV          V           VI
Readability variables
FleschKincaidText       0.020                               0.096*
                        (0.011)                             (0.048)
FleschKincaidAbstract   0.007                               0.036
                        (0.009)                             (0.044)
Length                              -0.031                              0.437
                                    (0.109)                             (0.576)
ComplexWords                                    0.011                               0.758
                                                (0.113)                             (0.587)
Article related variables
Topic fit               0.008***    0.008***    0.008***    0.026*      0.026*      0.027**
                        (0.002)     (0.002)     (0.002)     (0.010)     (0.010)     (0.010)
Title length            -0.030**    -0.030**    -0.030**    -0.180***   -0.179***   -0.179***
                        (0.009)     (0.009)     (0.009)     (0.042)     (0.042)     (0.042)
References              0.022***    0.023***    0.022***    0.093***    0.090***    0.087***
                        (0.003)     (0.003)     (0.003)     (0.014)     (0.015)     (0.016)
Tables                  0.012       0.016       0.014       0.008       0.002       -0.007
                        (0.011)     (0.012)     (0.012)     (0.059)     (0.063)     (0.062)
Figures                 0.041*      0.039       0.038       0.319***    0.297*      0.293*
                        (0.019)     (0.020)     (0.020)     (0.115)     (0.120)     (0.120)
Leadarticle             0.076       0.081       0.078       0.550       0.528       0.501
                        (0.144)     (0.143)     (0.144)     (0.649)     (0.646)     (0.647)
Author related variables
TopBusiness School      1.216**     1.211***    1.209**     6.634**     6.575***    6.567**
                        (0.421)     (0.421)     (0.421)     (2.087)     (2.087)     (2.088)
Authors                 0.198***    0.200***    0.199***    0.516**     0.515**     0.509**
                        (0.041)     (0.041)     (0.041)     (0.177)     (0.177)     (0.176)
Name                    -0.010      -0.010      -0.010      -0.069**    -0.070**    -0.069**
                        (0.005)     (0.005)     (0.005)     (0.026)     (0.026)     (0.026)
Intercept               0.224       0.917       0.574       -0.036      -1.750      -3.307
                        (0.336)     (0.965)     (0.833)     (1.536)     (5.007)     (4.209)
Year FE                 YES         YES         YES         YES         YES         YES
Topic FE                YES         YES         YES         YES         YES         YES
Journal FE              YES         YES         YES         YES         YES         YES
n                       8,236       8,236       8,236       8,236       8,236       8,236
R2 / Adj. R2            .136/.130   .136/.130   .136/.130   .108/.102   .108/.102   .108/.102
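The "Year FE", "Topic FE", and "Journal FE" rows indicate that the regressions include fixed effects. A common way to implement such effects is to add one indicator column per category, leave out one baseline category, and estimate by ordinary least squares. The Python sketch below illustrates this with a single journal effect on toy data. The function names `solve` and `ols`, the journal labels, and all numbers are hypothetical, and the paper does not state here how the authors estimated their models.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Toy data (hypothetical): (readability score, journal label, citations per year).
data = [
    (14.0, "A", 3.0), (16.0, "A", 4.0), (18.0, "A", 5.0),
    (14.0, "B", 5.0), (15.0, "B", 5.5), (17.0, "B", 6.5),
]
journals = sorted({j for _, j, _ in data})
baseline, others = journals[0], journals[1:]  # journal "A" is the omitted baseline

# Design matrix: intercept, regressor, one dummy per non-baseline journal.
X = [[1.0, x] + [1.0 if j == o else 0.0 for o in others] for x, j, _ in data]
y = [c for _, _, c in data]

intercept, slope, journal_b_effect = ols(X, y)
print(f"intercept={intercept:.3f} slope={slope:.3f} journal B effect={journal_b_effect:.3f}")
```

The toy data lie exactly on the line y = -4 + 0.5 x, shifted up by 2 for journal B, so the fit recovers those three numbers. With several fixed-effect dimensions (year, topic, journal), each dimension contributes its own set of dummy columns, each with one category omitted.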
A5 Further empirical results
Appendix 1: Distribution of topics per year (remaining journals). This figure shows the distribution of the topics identified with LDA in the eleven finance journals for the years 2000 to 2016.
[Figure: share of each topic per year. Vertical axis: 0% to 100%; horizontal axis: 2000 to 2016; legend: topics 1 to 20.]
A6 Alternative citation
Appendix 2: Results for the top journals but with the regression model
                        Crossref                            Google Scholar
                        I           II          III         IV          V           VI
Readability variables
FleschKincaidText       0.025***                            0.026***
                        (0.005)                             (0.005)
FleschKincaidAbstract   -0.030***                           -0.032***
                        (0.006)                             (0.007)
Length                              0.315***                            0.280***
                                    (0.070)                             (0.072)
ComplexWords                                    0.311***                            0.322***
                                                (0.064)                             (0.066)
Article related variables
Topic fit               0.005***    0.005***    0.005***    0.004***    0.004***    0.005***
                        (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)
Title length            -0.024***   -0.023***   -0.023***   -0.029***   -0.028***   -0.028***
                        (0.004)     (0.004)     (0.004)     (0.005)     (0.005)     (0.005)
References              0.007***    0.006***    0.006***    0.006***    0.005***    0.005***
                        (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)
Tables                  0.023***    0.016**     0.016**     0.019***    0.014*      0.012*
                        (0.005)     (0.006)     (0.006)     (0.005)     (0.006)     (0.006)
Figures                 0.018**     0.010       0.012*      0.024***    0.017**     0.018**
                        (0.006)     (0.006)     (0.006)     (0.006)     (0.006)     (0.006)
Leadarticle             0.205*      0.228**     0.226**     0.201*      0.225**     0.224**
                        (0.079)     (0.080)     (0.080)     (0.078)     (0.078)     (0.079)
JFE                     -0.113**    -0.176***   -0.186***   -0.359***   -0.425***   -0.436***
                        (0.037)     (0.036)     (0.036)     (0.038)     (0.037)     (0.038)
RFS                     -0.217***   -0.186***   -0.205***   -1.963***   -1.932***   -1.952***
                        (0.040)     (0.039)     (0.040)     (0.039)     (0.038)     (0.039)
Author related variables
TopBusiness School      0.229***    0.213***    0.215***    0.273***    0.258***    0.259***
                        (0.033)     (0.033)     (0.033)     (0.034)     (0.034)     (0.034)
Authors                 0.106***    0.103***    0.103***    0.079***    0.077***    0.075***
                        (0.018)     (0.018)     (0.018)     (0.020)     (0.020)     (0.020)
Name                    0.000       0.000       0.000       -0.003      -0.003      -0.003
                        (0.003)     (0.003)     (0.003)     (0.003)     (0.003)     (0.003)
Intercept               -0.448**    -3.305***   -2.772***   0.314       -2.251***   -2.113***
                        (0.172)     (0.649)     (0.503)     (0.174)     (0.668)     (0.513)
Year FE                 YES         YES         YES         YES         YES         YES
Topic FE                YES         YES         YES         YES         YES         YES
n                       4,015       4,015       4,015       4,112       4,112       4,112
R2 / Adj. R2            .202/.193   .199/.190   .199/.190   .467/.460   .463/.457   .458/.465