

Confused but convinced:

Article complexity and publishing success over time

by

Marc Berninger1, Florian Kiesel#, 2, Dirk Schiereck3, Eduard Gaar4

April 18, 2018

* We thank Ruediger Fahlenbrach, Campbell Harvey, Timothy Loughran, Ian Marsh, Christoph Merkelbach, and Steven Ongena for their helpful comments and suggestions on earlier drafts of this paper. We are also grateful to Nico Gärtner, Felicia Müller, Till Nedderhut, and Wadan Wardak for their valuable research assistance. All remaining errors are our own.

# Corresponding author.

1 Department of Business Administration, Economics and Law, Technische Universität Darmstadt, 64289 Darmstadt, Germany, phone +49 6151 16 - 24344, email: [email protected]

2 Department of Business Administration, Economics and Law, Technische Universität Darmstadt, 64289 Darmstadt, Germany, phone +49 6151 16 - 24292, email: [email protected]


Abstract

Using a sample of 4,160 finance articles published in the leading finance journals between 2000 and 2016, we study the readability of academic publications and its impact on the number of citations a paper receives. We use latent Dirichlet allocation (LDA) to cluster the papers into topics and find that article complexity, measured by length and the Flesch-Kincaid index, increases over time, while we do not find considerable changes in the article topics. The results reveal a positive correlation between article complexity and the number of times a paper is cited. We do not find that this result holds for 8,236 articles published in other finance journals. These patterns suggest that scientists gain recognition by unintelligible writing.

Keywords: Academic writing, Readability, Textual Analysis, Finance Literature, Scientometrics, Latent Dirichlet allocation

JEL classification: D83, G00, G10


1 Introduction

“Publish or perish” is a well-known principle in academia (e.g. Zivney & Bertin, 1992). That is because the success of researchers in finance is measured in two steps. First, for most researchers, publishing in a leading journal, or at least a decent journal, is a great achievement, as it is important for future promotion and tenure decisions (Fishe, 1998). The second step relates to the attention a paper receives after it is published. It is very common to examine citations when evaluating promotions or journal standings. More citations indicate that the contribution provides novel insights in a relevant field and that other researchers acknowledge the author’s findings. Moreover, other researchers use the paper’s original contribution as a basis for their own research. This indicates that even the most famous scientists depended on their predecessors, and it shows that citations are required to incorporate previous work. Aside from citing properly in order to maintain academic honesty and to minimize the risk of plagiarism, there are further reasons to cite articles. Authors can improve the quality of their own paper by using citations: citations serve (1) as a proxy for the detail orientation and accuracy of the paper and the author. They help (2) to state problems precisely and (3) to define and use the jargon of the respective field. Moreover, a longer reference list shows (4) the scientific knowledge of an author and (5) increases the scholar’s credibility.

These five reasons indicate that authors have a personal interest in citing more papers, especially well-known articles. However, citing a paper does not guarantee that it has been completely understood, or even read, by the citing researcher. To ensure that academic papers are easy to understand, authors are instructed during the peer-review process or on the journal’s website to use plain English. Zimmerman (1989) suggests that researchers may increase the probability of being cited by writing for the largest possible audience. If not only academic experts but also doctoral students and practitioners understand the contribution, the probability of citations increases. Thus, editors and authors should have the same intention in preparing articles.

In this paper, we analyze the readability of finance articles and its impact on the number of citations a paper receives. The focal questions are how readable journal articles in the finance discipline are and how readability affects the number of citations an article receives.

In order to answer these questions, we construct a unique, comprehensive dataset consisting of 4,160 journal articles published between 2000 and 2016 in the three leading finance journals: Journal of Financial Economics (JFE), Review of Financial Studies (RFS), and The Journal of Finance (JF). We consider three proxy variables as measures of the complexity of an article: the Flesch-Kincaid score, the number of words, and the number of complex words (words with more than two syllables, adjusted for “jargon” finance words). We include controls at the journal, article, and author level and apply latent Dirichlet allocation (LDA) to build topic clusters and to control for topic effects. In addition, we construct a control sample consisting of 8,236 further academic finance journal articles.

First, we find that readability worsens over time. In our investigation period, the average Flesch-Kincaid index increased from 16 in the years 2000 to 2004 to a peak of 18 in the years 2011 to 2015. The Flesch-Kincaid value corresponds to a grade level in the US educational system. This indicates that within 15 years, the average reader needed two more years of formal education to clearly understand a paper. We also find that the average finance article increased by 33% in length, from 12,000 words to more than 16,000 words, and the percentage of complex words increased by 49%. This shift is not related to a change in the topics addressed by the papers, as these are very stable over time. Second, and most interestingly, the results of our empirical analyses show that articles which are complex to read receive the highest number of citations. All measures of readability show a strong inverse relation between readability and citations. Combining these facts, we show not only that articles became harder to read over time, but also that the likelihood of a citation increases when an article is harder to read. We compare the results with a control sample of 8,236 finance articles from non-top-three journals and find that these articles are not cited more frequently if they are more difficult to read.

These patterns suggest several insights. From an author’s perspective, the results lead to the conclusion that clear communication is not always beneficial with regard to the (expected) number of citations. This provides strong support for the “Doctor Fox phenomenon”. Doctor Fox, a trained actor, gave a nonsensical lecture to specialists in the field and was able to bluff the audience with his presentation style (Naftulin, Ware, & Donnelly, 1973). Our study provides first evidence that how a report is written also matters.

From a scholar’s perspective, researchers tend to cite more complex articles. A bibliography with complex articles gives an indication that the scholar is widely read and well-informed. Gaining an understanding of the effects of readability on the number of times an article is cited is highly relevant to the discipline. From a practitioner’s perspective, the value of academic journals as a source of new finance knowledge declines when the success of an article is measured by the number of citations. Especially for practitioners, a clear understanding of the different characteristics underlying scholarly work in finance is relevant because it informs them about the work’s relevance to the decision areas they face and the extent to which academic journals in finance may provide good sources of new finance knowledge in the future. Therefore, a paper’s impact on society is much greater if it is easily understandable for the largest possible audience and not only for experts. However, the trend we show in our analysis and the impact of readability on the number of citations indicate that the gap between academics and practitioners widens over time.

2 Data and measures

2.1 Sample selection

Focusing on the finance discipline, we first sampled the three leading journals: JFE, RFS, and JF. We began by inventorying all articles published between 2000 and 2016 and collected a total of 4,922 articles for these three journals. After a very restrictive screening process, which retained only full-text original research articles, articles with a valid digital object identifier (DOI), and machine-readable articles, we ended up with a total of 4,160 articles. To check whether our results remain robust for other journals, we construct a control sample of 8,236 articles published in eleven further finance journals. More details on the data are provided in the appendix.

After we collected the information on the articles and the authors from the journals’ websites, we extracted the citations and metadata from the CrossRef website. The metadata contain, among other things, the number of references in the article and the year of publication. Finally, we matched the data with the Google Scholar database.

2.2 Measures

Readability measures. In order to measure the readability of an article, we use three different measures. The first measure is the Flesch-Kincaid formula (Kincaid et al., 1975). It is based on two components, the number of syllables per word and the number of words per sentence. The Flesch-Kincaid score is calculated as follows:

FleschKincaid = (11.8 × syllables per word) + (0.39 × words per sentence) - 15.59    (1)

The result is a number that corresponds to a grade level in the US educational system. The variable FleschKincaidText is the Flesch-Kincaid score of the full-text paper, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the paper’s abstract. Two other popular readability measures, the Fog-Index (Gunning, 1952) and the Flesch Reading Ease Score, use similar components, and therefore all of these measures are highly correlated with each other.1 We therefore use other measures for our robustness checks. Li (2008) and You and Zhang (2009) argue that the length of a document, measured as the number of words, is a good proxy for readability. The information-extraction cost of longer documents is higher than that of shorter documents, and longer documents seem to be more deterring and more difficult to read (Li, 2008). Following this approach, we use the number of words (omitting the average sentence length and syllable count) as an alternative readability measure. The third measure combines elements of the two previous measures. It builds on the Fog-Index, which uses the number of “complex words”, defined as words with more than two syllables, rather than the average number of syllables per word. Loughran and McDonald (2014) criticize this measure for business documents, as these frequently contain multisyllabic words to describe operations. We control for this financial terminology by subtracting all financial words that appear in Campbell Harvey’s Hypertextual Finance Glossary.
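The three measures described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the paper: the syllable counter is a rough vowel-group heuristic, the sentence splitter is naive, and the function names and the `jargon` argument (standing in for a finance glossary) are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_measures(text: str, jargon: frozenset = frozenset()) -> dict:
    """Return the Flesch-Kincaid grade, the word count, and the complex-word count."""
    # Naive sentence split on ., ! and ? (abbreviations are not handled).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    # Equation (1) from the text.
    flesch_kincaid = 11.8 * syllables_per_word + 0.39 * words_per_sentence - 15.59

    # Complex words: more than two syllables, excluding glossary ("jargon") terms.
    complex_words = [
        w for w in words
        if count_syllables(w) > 2 and w.lower() not in jargon
    ]
    return {
        "flesch_kincaid": flesch_kincaid,
        "n_words": len(words),
        "n_complex_words": len(complex_words),
    }
```

Passing a set of lower-cased glossary terms as `jargon` removes those terms from the complex-word count, which mirrors the adjustment for financial terminology described above.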

Citation variables. We want to measure the impact of readability on the number of citations a paper receives. We use two databases to obtain the number of citations, CrossRef and Google Scholar. Articles published in the early years of the sample have had more time to accumulate citations than recent publications. Therefore, we follow the approach of Chan, Chan, Tong, and Zhang (2016) and normalize the citations of each article by the number of years since the article was published.
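As a small illustration of this normalization, the following sketch divides a citation count by the number of years since publication. The function name is hypothetical, and counting the years as the observation year minus the publication year is an assumption; the default observation year of 2017 reflects that the citation counts were retrieved at the beginning of 2017.

```python
def citations_per_year(total_citations: int, publication_year: int,
                       observation_year: int = 2017) -> float:
    """Normalize a citation count by the article's age in years.

    Assumption: age = observation_year - publication_year, so an article
    published in 2016 and observed in 2017 has an age of one year.
    """
    years_since_publication = observation_year - publication_year
    if years_since_publication <= 0:
        raise ValueError("publication_year must be before observation_year")
    return total_citations / years_since_publication
```

Under these assumptions, an article published in 2007 with 73 citations would have 7.3 citations per year.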

Controls. In order to control for other potential determinants of the number of citations a paper receives, we furthermore analyzed the articles on a full-text basis to generate additional article-related control variables. We run several linear regressions and incorporate these article- and author-level control variables. The definitions of the variables can be found in the appendix.

1 These measures are the most widely used in the literature (e.g. Bauerly, Johnson, & Singh, 2006; Hartley, Pennebaker, & Fox, 2003; Hartley, Sotto, & Pennebaker, 2002; Loveland, Whatley, Ray, & Reidy, 1973; Stremersch, Verniers, & Verhoef, 2007), but also in other contexts. Moreover, the former SEC Chairman Christopher Cox suggested that the Flesch-Kincaid model can be used to gauge compliance with the SEC’s plain English initiatives (SEC, 2007).

3 Results

3.1 Descriptive statistics

Table 1 provides the summary statistics of the key variables we employ. The average citations vary heavily between our two citation providers, CrossRef and Google Scholar. On average, an article published in a top-three finance journal receives 7.3 citations per year as measured by CrossRef, while Google Scholar reports 24.8 citations per article and year. Google Scholar additionally indexes non-journal material2 and therefore its citation counts are higher. The average Flesch-Kincaid level is 17.5, indicating that readers need approximately 17.5 years of formal education to understand the text on first reading. The abstract shows a Flesch-Kincaid level of only 15.7, i.e. the abstract is easier to read than the full text of the paper. This suggests that writing an abstract may differ from writing the full text of the paper. Abstracts present the main contribution of the article in a very condensed way. Due to the word limit of the abstract, authors have to describe their work precisely, and the readability improves. Another measure of article complexity is the number of words in the article (Li, 2008; You & Zhang, 2009). On average, an article consists of 14,217 words, of which 2,945 are complex words.

We use LDA to identify notable changes in finance journals over our period and to estimate the probability that an article fits one of our topics. Table 1 shows that, on average, the probability of the most related category, selected as the best-fit topic, is 44%. The higher the percentage, the more specific the article is. A lower percentage indicates that assignment to a single topic is less clear-cut and that the article is more diverse. The topic fit varies between 14% and 97%. In addition, the descriptive statistics reveal that the average title of a finance paper consists of 8.5 words. The shortest titles have only one word (“Anomalies” and “Tipping”, both published in RFS, and “Comovement”, published in JFE) and the longest has 23 words (“A first look at the accuracy of the CRSP Mutual Fund Database and a comparison of the CRSP and Morningstar Mutual Fund Databases”). However, most article titles contain between 6 and 11 words. Moreover, published finance papers on average contain 42 references, 6.6 tables, and 2.7 figures. Medoff (2003) shows that a lead article receives a significantly higher number of citations in the first 5 years after publication. If the article is listed first in the respective issue or volume, we consider it a lead article. 3.7% of our articles are listed as the first article in the respective journal. Most journals have stopped publishing issues and started publishing volumes with more articles. In combination with the lower dissemination of paper-based editions, this means that the importance of lead articles will decrease over time.

2 Google does not explicitly state what is covered in its citations, but preprints/postprints from ArXiv and RePEc, conference proceedings, technical reports, books, and dissertations, in addition to electronic journal articles from traditional publishers, have been found.

Table 1 also provides details on author-related variables. We use the Financial Times Global MBA Ranking, define the first 10 business schools as top business schools, and control for whether at least one author is at a top business school. We find that approximately 29% of the articles published in the leading journals are written by researchers from a top business school or are coauthored with academics from these faculties. Most papers are written with coauthors; on average, two scholars publish a paper together. Joint work allows productivity gains due to the division of labor (McDowell & Melvin, 1983). Karolyi (2011) and Hollis (2001) show that, for individual economists, more coauthors are associated with higher research quality and greater frequency of publications.

In order to compare our findings, we recalculate the variables for our control sample. This sample includes 8,236 published articles from other finance journals. Table 1 additionally summarizes the statistics of the control sample. Not surprisingly, these articles are cited significantly less. They receive only 2 citations per year according to the CrossRef database and 7.2 citations per year according to the Google Scholar database, a difference of 5.3 and 17.6 citations per year, respectively. Interestingly, we find that these papers are easier to read than their top-journal peers. The Flesch-Kincaid level is 16 on average, 1.4 points lower than for articles published in the top journals. In addition, these papers are significantly shorter, measured by the total number of words, and contain fewer complex words. The article-related variables seem to be similar. However, the fit of the best-fit topic is 1.7% higher, suggesting that articles published in other journals, such as the Journal of Financial Intermediation or the Journal of Financial Markets, are more specialized. We also find that researchers from top business schools publish significantly less in these journals. While 29.2% of the articles in the top journals are written with coauthors from a top business school, only 3.3% of the 8,236 articles in our control sample have coauthors from premier business schools.


In summary, we find that articles published in the top-three journals show significantly different characteristics than their peers. Less surprisingly, the number of citations is significantly higher for top journals. More surprisingly, however, these articles are harder to read.

3.2 Readability over time

We find that the Flesch-Kincaid level for top-journal articles is 17.5 on average. Panel A of Figure 1 illustrates the Flesch-Kincaid level for each year of our sample period. At the beginning, the average Flesch-Kincaid level is 16, and it even decreases in the following years. Since 2002, however, the average has been increasing, with a peak in 2015. In recent years, the Flesch-Kincaid index is around 18 on average. The year 2016 is the only exception, with a Flesch-Kincaid index of 16.5. In contrast, we find that the readability of the abstracts is remarkably stable over time. The average Flesch-Kincaid index of the abstracts varies between 15.3 (year 2001) and 16.0 (year 2010). This result is surprising and suggests that authors care more about clear communication in the abstract, while the readability of the full text is less important to them. Panel A of Figure 1 also shows the average Flesch-Kincaid level of our control sample. The index varies between 14.6 and 16.4 and is on average more than one level below that of the top journals. Moreover, the figure illustrates that we do not find a sharp increase in the readability measure for these articles, although it also rises over time. The readability of the abstracts in the control sample shows values similar to those of the top-three sample. The variation in readability is rather small, and the readability of the abstracts varies less over time.

In addition to the classical readability measure, we report the number of words in finance articles over time. Panel B of Figure 1 shows the average number of words in the full text. We find a steady increase over the past years. The average number of words in the year 2000 was around 11,818 but increased to more than 16,600. The number of words is determined without counting the words in the reference list or the appendix. It seems that articles need to be longer because referees or editors require more evidence of statistical significance or additional robustness tests. The figure also illustrates that the number of complex words is increasing over time, from 1,287 words at the start of the sample to 1,918 words in the year 2016. The findings indicate that finance journals are becoming harder to read. The information-processing cost increases, and scholars have an incentive to minimize this cost by not reading, or merely skimming, the full text of the paper.


Again, we also report the development for our control sample. The number of words is almost 2,000 lower in every year. We also find an increasing trend in the number of words, but the highest average number of words for one year in our control sample, 11,996, is still below the average number of words for the top journals in the year 2000.

In summary, we find strong evidence that finance papers become more complex over time. This is true for top finance publications as well as for articles in non-top journals. However, the reading difficulty and the length of top-journal articles increased significantly over time. In the next section, we analyze the topics of the articles and examine whether trends in the topics can be found that could explain the differences in readability.

3.3 Topics over time

One critical issue in determining the readability and the number of citations a paper receives is the topic of the article. In order to determine the topic of an article, two simple methodologies can be adopted: (1) reading the articles, and (2) using the keywords or the Journal of Economic Literature (JEL) classification. However, both methodologies have drawbacks. Reading the articles is time-consuming, and the outcome depends heavily on personal opinion. Keywords or JEL classifications are not always provided, and the keywords may have been optimized with catchy words. In order to obtain an objective and replicable measure of the content of a paper, we apply an LDA procedure.

LDA was developed by Blei et al. (2003) and is a statistical, unsupervised Bayesian machine-learning procedure that belongs to the family of topic models. In general, topic modelling describes methods for the analysis of a large quantity of unlabelled data, such as a corpus of text documents. LDA is generally considered the simplest topic model (Blei, 2012). It uses the probability of words co-occurring within documents to identify sets of latent topics and their associated words, and it is conceptually similar to factor analysis, with the model producing topics instead of factors. A topic thereby represents a probability distribution over a bundle of words (Reed, 2012). The method is particularly advantageous for the analysis of qualitative aspects, in comparison to quantitative approaches (Bellstam, Bhagat and Cookson, 2017). A further advantage over other topic models is that LDA identifies a mix of topics contained in a large corpus of texts and within each document, instead of just allocating a whole document to one topic (Blei, Ng and Jordan, 2003). Furthermore, the model differs from other topic modelling methods in that it can process a previously unseen data sample. Hence, there are no limitations imposed by a training data set (Blei, Ng and Jordan, 2003; Bellstam, Bhagat and Cookson, 2017). The method has recently been introduced as a modelling tool in the finance literature (Ganglmair and Wardlaw, 2015; Hoberg and Lewis, 2015; Goldsmith-Pinkham, Hirtle and Lucca, 2016; Bellstam, Bhagat and Cookson, 2017). It suits our research question because it discovers the main themes that pervade our large collection of papers, expressed as the probability that a paper belongs to each topic and as the words with the highest share per topic, and thus allows us to control for content. For a detailed explanation of the method, see the appendix.

Figure 2 illustrates the top 10 keywords for each cluster after applying LDA. A disadvantage of LDA is that the clusters are not directly labelled, but as we are only interested in the comparison between clusters, labels are not needed for our analysis. However, the top keywords already give a strong indication of possible labels for each cluster. For example, Cluster 2 combines firm- and management-related keywords, whereas Cluster 10 is related to mergers and acquisitions and Cluster 16 to IPOs. The figure shows that the LDA approach is a good approximation for identifying the topics of finance articles. After defining the categories for each word, we measure the fit of each article to each of the 20 clusters. The cluster with the highest probability is selected as the topic of the article.
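The final assignment step described above, picking the cluster with the highest probability, can be sketched as follows. This assumes that a fitted LDA model has already produced one probability per cluster for an article; the function name, the 1-based cluster numbering, and the example probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def best_fit_topic(topic_probs: Sequence[float]) -> Tuple[int, float]:
    """Return (cluster number, topic fit) for one article.

    topic_probs holds the article's probability for each cluster, in cluster
    order. Clusters are numbered from 1 here to match labels such as
    "Cluster 2" (an assumption about the numbering).
    """
    if not topic_probs:
        raise ValueError("topic_probs must not be empty")
    best_index = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return best_index + 1, topic_probs[best_index]


# Made-up example with four clusters instead of 20:
cluster, fit = best_fit_topic([0.10, 0.44, 0.06, 0.40])
```

In this made-up example the article is assigned to cluster 2 with a topic fit of 0.44; a fit close to 1 would indicate a highly specific article, while a low fit indicates a more diverse one.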

Figure 3 plots the relative number of articles for each topic over time. In general, the pattern is clear. Most topics have remained relatively constant over the years 2000 to 2016 and therefore do not explain the overall increase in reading difficulty or length of the top-journal articles. The notable exception is Cluster 3, which is related to banks. The cluster has grown since 2008, indicating that banking-related research has increased since the outbreak of the financial crisis.

3.4 The impact of readability on citations

In this section, we analyze the impact of the readability of finance articles on the number of citations a paper receives. To do so, we run several linear regressions. The dependent variable is the number of citations received by article i, as reported by CrossRef and Google Scholar at the beginning of 2017, divided by the number of years since the article appeared. Our main variables of interest are the readability measures.

The results of these regression analyses are provided in Table 2. We find that all readability measures have a significant effect at the 5% level. The Flesch-Kincaid score of the full text, the length of the text, and the number of complex words have a significant positive effect on the average number of citations, but the Flesch-Kincaid score of the abstract has a highly significant negative effect on the number of citations. The results are significant for both citation databases, CrossRef and Google Scholar. The findings indicate a strong relationship between readability and the number of citations a paper receives. Interestingly, we find that a clearly written abstract has a positive effect on the number of citations.

We repeat the analysis and incorporate article- and author-level control variables to take into account that the title length (Jacques & Sebire, 2010; Jamali & Nikzad, 2011), the number of tables and figures (Ayres & Vars, 2000; Stremersch et al., 2007), lead articles (Medoff, 2003), authors from top business schools (Chung & Cox, 1990; Ederington, 1979; Heck, Cooley, & Hubbard, 1986; Klemkosky & Tuttle, 1977; Niemi, 1987), the number of authors (Hollis, 2001; Karolyi, 2011), or the surname of the first author (Einav & Yariv, 2006) may affect the number of citations. In addition, we include topic fixed effects and year fixed effects. The inclusion of the control variables and the fixed effects has a substantial effect on the adjusted R², but the significant effect of our readability measures remains.
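One way to write down a regression of the kind described above is sketched below. The notation is an assumption made for illustration only and is not taken from the paper's tables: Readability_i stands for one of the readability measures, X_i for the vector of article- and author-level controls, and the last two terms before the error for topic and year fixed effects.

```latex
\text{CitationsPerYear}_i
  = \alpha
  + \beta \,\text{Readability}_i
  + \gamma^{\top} X_i
  + \delta_{\text{topic}(i)}
  + \lambda_{\text{year}(i)}
  + \varepsilon_i
```

In this notation, a positive and significant β for the full-text Flesch-Kincaid score would correspond to the finding that harder-to-read articles receive more citations per year.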

In Table 3, we repeat the analyses of Table 2 using our control sample of 8,236 finance articles. Our readability measures now lack significance, and we cannot find an impact of readability on the number of citations a paper receives. The only significant effect we find is for the full-text Flesch-Kincaid score, and only for Google Scholar. We therefore conclude that readability does not influence the number of citations for non-top journals.

4 Conclusion

This paper analyzes how finance articles are written and how this affects the number of citations. We use a sample of 12,396 articles published between 2000 and 2016, divided into 4,160 articles published in the three leading finance journals and 8,236 articles published in other finance journals. We provide evidence that the readability of articles is linked to the number of citations. We find that a higher complexity of the full text is linked with a higher number of citations, whereas an abstract that is hard to understand has a negative effect on the number of citations.


This paper should not be read as a call to use unintelligible writing. In this paper, we consider only one determinant of the number of citations, namely readability. However, focusing on that one aspect, it does not appear that clear and concise writing is always rewarded by the faculty. We find that papers with complex writing are cited more frequently. As long as people are capable of being influenced, authors can gain an advantage by using the Doctor Fox phenomenon and improve their number of citations with complex writing. This is in line with the signaling argument of Spence (1973): scholars signal their understanding of difficult content by citing complex articles.

From our point of view, our paper contributes to a small empirical literature on the determinants of citations in finance. Chung and Cox (1990), Ederington (1979), Klemkosky and Tuttle (1977), Heck et al. (1986), and Niemi (1987) analyze the relation between the author’s institutional affiliation and publication success in finance and provide evidence that the production of articles published in journals with high impact factors, such as the JF, is concentrated at relatively few institutions. Klemkosky and Tuttle (1977) show that six universities accounted for approximately 25% of the total pages published in these journals. Ederington (1979) is among the first to focus on publication success in the finance literature and its determinants. He measures the success of articles by the number of times the article has been cited and analyzes to what extent finance articles differ in their impact and how highly cited articles differ from articles which are less cited. He focuses on author information, particularly the authors’ affiliations, and discovers that an article from a top business school receives about 70% more citations than a similar article from an unranked school. Moreover, he states that longer articles are cited more frequently by leading journals. These articles are also cited more often by major journals in other fields.

We also believe that our findings raise some practical questions for the academic faculty and editors of finance journals. For authors, the findings suggest that scholars are more convinced of the quality of an article if the content is written in a complex style. Armstrong (1980) suggested almost 40 years ago: “If you can’t convince them, confuse them!”. Our findings support this advice for authors. For practitioners, we see the risk that academic writing for leading journals becomes too complex to understand and therefore loses its practical relevance. Especially for practitioners, the information-processing cost of understanding an article published in a top-three journal is increasing. Journal articles should inform practitioners about the work’s relevance to the decision areas they face. The risk is that finance journals will no longer be seen as a good source of new finance knowledge in the future. For editors and the academic finance community, the findings suggest that they should be wary of using the journal and citations as the only measures of the quality of scholarly work. This is in line with the conclusion of Hamermesh (2018). In a nutshell, from an author’s perspective, we can encourage everyone to emphasize clear writing to convince the readers, unless you make it into a leading journal. In this case, a touch of confusion might also be helpful.

References

Armstrong, J. S. (1980). Unintelligible management research and academic prestige. Interfaces, 10(2),

80-86. doi:10.1287/inte.10.2.80

Ayres, I., & Vars, F. E. (2000). Determinants of citations to articles in elite law reviews. The

Journal of Legal Studies, 29(S1), 427-450. doi:10.1086/468081

Bauerly, R. J., Johnson, D. T., & Singh, M. (2006). Readability and writing well. Marketing Management

Journal, 16, 216-227.

Bellstam, G., Bhagat, S., & Cookson, A. (2017). A Text-Based Analysis of Corporate Innovation.

Elsevier Academic Press. Retrieved October 6, 2017, from

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2803232.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning

Research, 3, 993-1022.

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.

doi:10.1145/2133806.2133826

Chan, J. Y., Chan, K. C., Tong, J. Y., & Zhang, F. (2016). Using Google Scholar citations to rank

accounting programs: A global perspective. Review of Quantitative Finance and Accounting,

47(1), 29-55. doi:10.1007/s11156-014-0493-x

Chung, K. H., & Cox, R. A. K. (1990). Patterns of Productivity in the Finance Literature: A Study of the

Bibliometric Distributions. The Journal of Finance, 45(1), 301-309. doi:10.1111/j.1540-

6261.1990.tb05095.x

Ederington, L. H. (1979). Aspects of the production of significant financial research. The Journal of

Finance, 34(3), 777-786. doi:10.1111/j.1540-6261.1979.tb02142.x

Einav, L., & Yariv, L. (2006). What's in a surname? The effects of surname initials on academic success.

Journal of Economic Perspectives, 20(1), 175-187. doi:10.1257/089533006776526085

Fishe, R. P. H. (1998). What are the research standards for full professor of finance? The Journal of

Finance, 53(3), 1053-1079. doi:10.1111/0022-1082.00043

Ganglmair, B., & Wardlaw, M. (2015). Measuring Contract Completeness: A Text Based Analysis of

Loan Agreements. Unpublished manuscript.

Goldsmith-Pinkham, P., Hirtle, B., & Lucca, D. (2016). Parsing the content of bank supervision. Working

Paper.

Hamermesh, D. S. (2018). Citations in economics: Measurement, uses, and impacts. Journal of Economic

Literature, 56(1), 115-156. doi:10.1257/jel.20161326

Hartley, J., Pennebaker, J. W., & Fox, C. (2003). Abstracts, introductions and discussions: How far do

they differ in style? Scientometrics, 57(3), 389-398. doi:10.1023/a:1025008802657

Hartley, J., Sotto, E., & Pennebaker, J. (2002). Style and substance in psychology: Are influential articles

more readable than less influential ones? Social Studies of Science, 32(2), 321-334.

doi:10.1177/0306312702032002005


Heck, J. L., Cooley, P. L., & Hubbard, C. M. (1986). Contributing authors and institutions to The Journal

of Finance: 1946–1985. The Journal of Finance, 41(5), 1129-1140. doi:10.1111/j.1540-

6261.1986.tb02535.x

Hoberg, G., & Lewis, C. (2015). Do Fraudulent Firms Produce Abnormal Disclosure? Vanderbilt Owen

Graduate School of Management Research Paper No. 2298302.

Hollis, A. (2001). Co-authorship and the output of academic economists. Labour Economics, 8(4), 503-

530. doi:10.1016/S0927-5371(01)00041-0

Jacques, T. S., & Sebire, N. J. (2010). The impact of article titles on citation hits: An analysis of general

and specialist medical journals. JRSM Short Reports, 1(1), 1-5. doi:10.1258/shorts.2009.100020

Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and

citations. Scientometrics, 88(2), 653-661. doi:10.1007/s11192-011-0412-z

Karolyi, G. A. (2011). The ultimate irrelevance proposition in finance? Financial Review, 46(4), 485-512.

doi:10.1111/j.1540-6288.2011.00309.x

Klemkosky, R. C., & Tuttle, D. L. (1977). The institutional source and concentration of financial

research. The Journal of Finance, 32(3), 901-907. doi:10.1111/j.1540-6261.1977.tb01996.x

Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting

and Economics, 45(2), 221-247. doi:10.1016/j.jacceco.2008.02.003

Loughran, T., & McDonald, B. (2014). Measuring readability in financial disclosures. The Journal

of Finance, 69(4), 1643-1671. doi:10.1111/jofi.12162

Loveland, J., Whatley, A., Ray, B., & Reidy, R. (1973). An analysis of the readability of selected

management journals. Academy of Management Journal, 16(3), 522-524. doi:10.2307/255014

McDowell, J. M., & Melvin, M. (1983). The determinants of co-authorship: An analysis of the economics

literature. The Review of Economics and Statistics, 65(1), 155-160. doi:10.2307/1924423

Medoff, M. H. (2003). Article placement and market signalling. Applied Economics Letters, 10(8), 479-

482. doi:10.1080/1350485032000095348

Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Doctor Fox lecture: A paradigm of

educational seduction. Journal of Medical Education, 48, 630-635.

Niemi, A. W. (1987). Institutional contributions to the leading finance journals, 1975 through 1986: A

note. The Journal of Finance, 42(5), 1389-1397. doi:10.1111/j.1540-6261.1987.tb04374.x

Reed, C. (2012). Latent Dirichlet Allocation: Towards a Deeper Understanding. Retrieved February 28,

2018, from http://obphio.us/pdfs/lda_tutorial.pdf.

SEC. (2007). Speech by SEC Chairman: Closing Remarks to the Second Annual Corporate Governance

Summit. Retrieved from https://www.sec.gov/news/speech/2007/spch032307cc.htm

Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355-374.

doi:10.2307/1882010

Stremersch, S., Verniers, I., & Verhoef, P. C. (2007). The quest for citations: Drivers of article impact.

Journal of Marketing, 71(3), 171-193. doi:10.1509/jmkg.71.3.171

You, H., & Zhang, X.-J. (2009). Financial reporting complexity and investor underreaction to 10-K

information. Review of Accounting Studies, 14(4), 559-586. doi:10.1007/s11142-008-9083-2

Zimmerman, J. L. (1989). Improving a manuscript’s readability and likelihood of publication. Issues in

Accounting Education, 4(2), 458-466.

Zivney, T. L., & Bertin, W. J. (1992). Publish or perish: What the competition is really doing. The

Journal of Finance, 47(1), 295-329. doi:10.1111/j.1540-6261.1992.tb03987.x


Table 1: Summary statistics.

This table shows descriptive statistics for our sample of 4,160 articles published in the Journal of

Financial Economics, the Review of Financial Studies and The Journal of Finance, and for a control sample of 8,236 articles from eleven other finance journals. CrossRef and GoogleScholar are the numbers of citations received as reported by CrossRef and Google Scholar, respectively, at the beginning of 2017, divided

by the number of years since the article has appeared. FleschKincaidText is the Flesch-Kincaid score of the full paper and gives the number of years

of education that a reader hypothetically needs to understand the paragraph or text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Length is the logarithm of the number of words in the full text. ComplexWords is the logarithm of the number of complex

words, defined as those that have more than two syllables, minus jargon words, defined as words listed in Professor Campbell Harvey’s finance

glossary. TopicFit is the value of the highest percentage of one of our two topics obtained from the Latent Dirichlet Allocation. TitleLength is the number of words in the article’s title. References is the number of references in the article. Tables and Figures are the total numbers of tables and

figures in the article, respectively. LeadArticle indicates articles that are in the lead position in their respective issue. TopBusinessSchool is

defined as 1 if at least one of the authors’ affiliations is listed in the Financial Times Global MBA ranking in the year prior to the year of the publication of the article. Authors is the number of authors of the article. Name is the surname initial of the first author on a 26-point numerical scale

(A=1, B=2,. . . ,Z=26). The equality of means and medians of the two samples is tested for statistical significance using the two-sample t-test and

the Mann-Whitney U test. ∗, ∗∗, ∗∗∗ denote statistical significance at the 5%, 1%, and 0.1% level, respectively.

                          Top journals (n=4,160)   Control sample (n=8,236)   Differences
Variable                  Mean        Median       Mean        Median         Mean         Median

Citations
Crossref                  7.260       4.500        1.973       1.000          5.287***     3.500***
GoogleScholar             24.843      12.854       7.282       3.571          17.561***    9.283***

Readability variables
FleschKincaidText         17.461      17.021       16.027      15.715         1.434***     1.306***
FleschKincaidAbstract     15.687      15.538       15.838      15.676         -0.151**     -0.138**
Length                    14,217      14,123       10,019      9,720          4,197***     4,403***
ComplexWords              2,945       2,910        2,059       1,987          886***       2,024***

Article related variables
TopicFit                  44.022      41.939       45.715      43.490         -1.693***    -1.551***
TitleLength               8.543       8.000        9.820       9.000          -1.277***    -1.000***
References                41.519      41.000       37.735      35.000         3.783***     6.000***
Tables                    6.565       7.000        5.578       6.000          0.989***     1.000***
Figures                   2.710       2.000        2.138       1.000          0.572***     1.000***
LeadArticle               0.413       0.000        0.448       0.000          -0.003       0.000

Author related variables
TopBusinessSchool         0.292       0.000        0.033       0.000          0.259***     0.000
Authors                   2.315       2.000        2.259       2.000          0.053**      0.000***

Figure 1: Readability of finance articles over time

This figure shows the readability of our sample of 4,160 articles published in the top-three finance journals and 8,236 articles published in

other finance journals. Panel A shows the readability measured as Flesch-Kincaid and Panel B shows the readability using the number of words and number of complex words per article.

Panel A: Flesch-Kincaid readability measure over time

[Chart, 2000 to 2016. Series: Top journals (Text), Control sample (Text), Top journals (Abstract), Control sample (Abstract).]

Panel B: Number of words and number of complex words over time

[Chart, 2000 to 2016. Axes: Total number; Complex words. Series: Top journals (Total words), Control sample (Total words), Top journals (Complex words), Control sample (Complex words).]

Figure 2: Top 10 keywords per category.

This figure shows the top 10 keywords for each category using LDA.

Figure 3: Distribution of topics per year.

This figure shows the distribution of topics in the three finance journals The Journal of Finance, Review of Financial Studies and Journal of

Financial Economics in the years 2000 to 2016, as estimated with LDA.

[Chart of the yearly distribution of topics 1 to 20. Vertical axis: 0% to 100%. Horizontal axis: years 2000 to 2016.]

Table 2: Results for the top-three finance journals.

This table shows the regression results for the sample of the 4,160 articles published in the Journal of Financial Economics, Review of Financial

Studies, and The Journal of Finance. The dependent variables are the standardized citation, using CrossRef and Google Scholar as source for

citations. FleschKincaidText is the Flesch-Kincaid score of the full paper and gives the number of years of education that a reader hypothetically

needs to understand the paragraph or text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Length is the natural

logarithm of the number of words in the full text and ComplexWords is the number of complex words, defined as words with more than two

syllables minus all financial words appearing in Campbell Harvey’s Hypertextual Finance Glossary. *, **, *** denote statistical significance at the 5%, 1% and 0.1% level, respectively.

Crossref Google Scholar

I II III IV V VI

FleschKincaidText 0.152** 0.424*

(0.048) (0.195)

FleschKincaidAbstract -0.214*** -0.776***

(0.056) (0.195)

Length 1.531** 3.710*

(0.507) (1.843)

ComplexWords 1.377*** 4.075*

(0.435) (1.703)

Article controls YES YES YES YES YES YES

Author controls YES YES YES YES YES YES

Year FE YES YES YES YES YES YES

Topic FE YES YES YES YES YES YES

Observation 4,160 4,160 4,160 4,160 4,160 4,160

R2 / Adj. R2 .151/.141 .148/.138 .148/.138 .231/.222 .228/.219 .230/.221


Table 3: Results for the control sample.

This table shows the regression results for the control sample of 8,236 articles published in eleven other finance journals.

The dependent variables are the standardized citations, using CrossRef and Google Scholar as sources for citations.

FleschKincaidText is the Flesch-Kincaid score of the full paper and gives the number of years of education that a reader hypothetically

needs to understand the paragraph or text, whereas FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Length is the natural

logarithm of the number of words in the full text and ComplexWords is the number of complex words, defined as words with more than two syllables minus all financial words appearing in Campbell Harvey’s Hypertextual Finance Glossary. *, **, *** denote statistical significance

at the 5%, 1% and 0.1% level, respectively.


I II III IV V VI

FleschKincaidText 0.020 0.096*

(0.011) (0.048)

FleschKincaidAbstract 0.007 0.036

(0.009) (0.044)

Length -0.031 0.437

(0.109) (0.576)

ComplexWords 0.011 0.758

(0.113) (0.587)

Article controls YES YES YES YES YES YES

Author controls YES YES YES YES YES YES



Observation 8,236 8,236 8,236 8,236 8,236 8,236

R2 / Adj. R2 .136/.130 .136/.130 .136/.130 .108/.102 .108/.102 .108/.102


A. Appendix

A1 Sample

A1.1 Control sample construction

Our control sample is based on 8,236 journal articles published in finance-oriented journals from

January 2000 to December 2016. We used the Journal Citation Reports to determine a journal’s

ranking and the Journal Quality List provided by Harzing to determine the most important finance-oriented

journals published in the English language.

We excluded two journals, namely the Review of Finance and the Journal of Financial and

Quantitative Analysis, as they are sometimes considered leading journals and sometimes

not. To keep the control sample clearly distinct from the original sample, we included the next

journals in the list. In total, we are able to include articles from 11 major finance-oriented journals,

namely European Financial Management, Financial Review, Journal of Banking & Finance,

Journal of Corporate Finance, Journal of Empirical Finance, Journal of Financial

Intermediation, Journal of Financial Markets, Journal of Financial Stability, Journal of Futures

Markets, Journal of International Financial Markets, Institutions and Money, and Journal of

Money, Credit and Banking. These journals are top-ranked quality journals that publish the full

spectrum of finance research. We exclusively focus on finance publications and exclude broader

management journals, such as Management Science, as the writing between disciplines may vary

and we are interested in analyzing the writing of finance articles.3

3 For an overview of academic writing across disciplines and the differences in the writing of academic articles see,

for example, Hyland (2002, 2008).


A1.2 Data cleaning

For each journal, we downloaded all articles in Portable Document Format (PDF) from the

respective journal's website that were published in the volumes and issues between January 2000

and December 2016. In total, we downloaded 16,091 PDF documents. In addition, we collected

the authors’ names, their affiliation, the paper's title, the abstract and the Digital Object Identifier

(DOI) for each article from the website. As the focus of the present study is on journal articles, we

drop all PDF documents that are not full-text original research articles. We followed a multi-step

screening procedure to exclude PDF documents that did not match these criteria. First, we

eliminated all PDF documents without a valid DOI as well as back and front matter. In the next

step, we also removed editorial board announcements and editor notes. We also excluded PDF

documents labeled “miscellanea”, content and issue information, discussions, and PDFs

containing exclusively author acknowledgements. This leaves us with a sample of 14,578 PDFs.

In addition, we also removed PDF documents without any author information on the website. This

elimination round covers forewords, specific announcements, and general journal information. We

also eliminated very short papers with fewer than three pages. The final step was to exclude all PDF

documents that are not machine readable. This leaves us with a final sample of 12,396 full-text

original research articles, which is approximately 77% of the initial PDF documents.
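
The retention figures above can be checked with a few lines of arithmetic. The sketch below only uses the three document counts reported in this section; the stage labels and the helper name `retention_share` are illustrative and not part of the original procedure.

```python
# Document counts reported in the text for each screening stage.
stages = [
    ("downloaded PDF documents", 16_091),
    ("after removing non-article material", 14_578),
    ("final sample of research articles", 12_396),
]

initial = stages[0][1]


def retention_share(count: int, base: int = initial) -> float:
    """Share of the initial downloads that survives a screening stage."""
    return count / base


for label, count in stages:
    print(f"{label}: {count:,} ({retention_share(count):.1%})")
```

The final count also equals the sum of the two subsamples used in the paper (4,160 top-journal articles and 8,236 control-sample articles).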


A1.3 Final sample

Table 4: Articles by journal and year.

This table shows the journal articles for the entire sample of 12,396 articles during the investigation period from January 2000 to December 2016.

The articles are divided by journal and year.

Articles from the Journal of Money, Credit and Banking are underrepresented because they are only available online since 2007. The sample also includes a

limited number of 101 articles from the Journal of Financial Stability due to a large amount of non-machine-readable PDF articles published since 2010.

Journal ’00 ’01 ’02 ’03 ’04 ’05 ’06 ’07 ’08 ’09 ’10 ’11 ’12 ’13 ’14 ’15 ’16 Total

Panel A: Top Journals

Journal of Financial Economics 56 61 58 52 75 79 87 103 102 87 100 136 124 153 102 119 123 1,617

Rev. Financ. Stud. 38 41 55 66 26 31 76 84 112 98 111 102 89 86 104 81 51 1,251

The Journal of Finance 88 79 88 75 90 85 86 84 81 78 69 60 50 67 71 70 71 1,292

Panel B: Control sample

European Financial Management 6 0 16 23 23 25 27 45 34 44 68 34 25 47 15 21 8 461

Financial Review 31 32 31 30 25 26 26 25 24 26 52 28 21 29 33 22 18 479

Journal of Banking & Finance 71 20 83 81 114 53 86 132 110 186 238 256 247 382 295 273 115 2,742

Journal of Corporate Finance 17 18 21 28 32 45 35 47 48 38 47 95 75 74 46 108 120 894

Journal of Empirical Finance 0 23 26 26 30 29 12 10 7 13 10 12 12 62 79 66 101 518

Journal of Financial Intermediation 16 11 16 15 15 17 23 21 23 27 26 27 28 30 24 27 346

Journal of Financial Markets 15 14 18 22 17 16 18 15 17 32 20 23 18 27 46 21 29 368

Journal of Financial Stability 0 0 0 0 11 10 13 20 24 23 101

Journal of Futures Markets 0 34 62 43 51 50 50 50 54 88 50 57 50 51 56 46 17 809

Journal of International Financial Markets, Institutions and Money 25 23 23 27 29 26 29 28 40 61 36 43 69 86 109 49 43 746

Journal of Money, Credit and Banking 0 0 0 0 0 0 0 88 78 86 82 77 78 79 67 78 59 772

Total 363 356 497 488 538 492 568 752 754 887 909 950 886 1,173 1,047 981 755 12,396


A2 Methodology and variables

A2.1 Methodology

The aim of this paper is to analyze whether readability, article-specific or author-specific variables

determine the number of citations a paper receives. In order to account for these factors, we run

several linear regressions. The ordinary least squares (OLS) regression takes the following form:

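$$\mathit{CITE}_i = \alpha + \beta \times \mathit{READABILITY}_{i,p} + \delta \times \mathit{ARTICLE}_{i,p} + \psi \times \mathit{AUTHOR}_{i,p} + \mathit{YEAR\ FE} + \mathit{TOPIC\ FE} + \varepsilon_i \qquad (1)$$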

The dependent variable CITEi is the number of citations received by article i, as reported by

CrossRef and Google Scholar at the beginning of 2017, divided by the number of years since the

article has appeared. Articles published in the early years of the sample should have received more

citations than recent publications. Therefore, we follow the approach by Chan et al. (2016) and

normalize the citations of each article by the number of years since the article has been published.
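
As a minimal sketch of this normalization, the function below divides a citation count by the years since publication. The exact year-counting convention is not spelled out in the text, so the rule used here (count year minus publication year, with a floor of one year) is an assumption, as is the function name.

```python
def citations_per_year(citations: int, pub_year: int, count_year: int = 2017) -> float:
    """Citations divided by the number of years since publication.

    Assumption: an article published in `pub_year` has been available for
    `count_year - pub_year` years, with a floor of one year so that the most
    recent articles never cause a division by zero.
    """
    years = max(count_year - pub_year, 1)
    return citations / years


# Hypothetical article published in 2007 with 90 citations at the start of 2017.
print(citations_per_year(90, 2007))
```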

The independent variables are divided into readability related variables, sentiment specific

variables, article specific variables, and author specific variables. READABILITYi,p is a vector that

includes three different measures of the readability of the article.

The first measure is the Flesch-Kincaid formula. It is based on

two components, the syllables per word and the number of words per sentence. The Flesch-Kincaid

score is calculated as follows:

$$\text{FleschKincaid} = (11.8 \times \text{syllables per word}) + (0.39 \times \text{words per sentence}) - 15.59 \qquad (2)$$
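
To illustrate how the Flesch-Kincaid formula is applied to a text, here is a minimal sketch. The syllable counter (groups of consecutive vowels) and the sentence splitter (split on `.`, `!`, `?`) are simplifying assumptions for illustration and are not necessarily the tokenization used in the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def flesch_kincaid(text: str) -> float:
    """Flesch-Kincaid grade level using the coefficients stated in the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 11.8 * syllables_per_word + 0.39 * words_per_sentence - 15.59


simple = "The cat sat. The dog ran."
dense = "Heterogeneous institutional investors systematically misinterpret complicated disclosures."
print(flesch_kincaid(simple), flesch_kincaid(dense))
```

Longer words and longer sentences both raise the score, which is why the dense sentence is graded well above the simple one.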

The result is a number that corresponds with a grade level of the U.S. educational system. The

variable FleschKincaidText is the Flesch-Kincaid score of the full paper, whereas

FleschKincaidAbstract is the Flesch-Kincaid score of the abstract of the paper. Two other popular

readability measures, the Fog-Index and the Flesch Reading Ease Score, use similar components

and all measures highly correlate with each other.4 Li (2008) and You and Zhang (2009) argue that

4 These measures are most used in literature (e.g. Bauerly, Johnson, & Singh, 2006; Hartley, Pennebaker, & Fox,

2003; Hartley, Sotto, & Pennebaker, 2002; Loveland, Whatley, Ray, & Reidy, 1973; Stremersch, Verniers, & Verhoef,


the length of the document, measured as the number of words, is a good proxy for readability.

The information-processing cost of longer documents is higher than that of shorter documents; longer

documents seem to be more deterring and more difficult to read (Li, 2008). Following this approach, we

use only the number of words (omitting the average sentence length and syllable count) as an

alternative readability measure. The third measure combines the two previous measures. The Fog-Index

uses the number of “complex words”, defined as words with more than two syllables, and

does not incorporate the average number of syllables per word. Loughran and McDonald (2014)

criticize this measure for business documents, as they contain multisyllable words to describe

operations. We control for this financial terminology and subtract all financial words appearing in

Campbell Harvey’s Hypertextual Finance Glossary.
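
The jargon adjustment can be sketched as follows. The three-word `jargon` set is a hypothetical stand-in for the glossary, and the vowel-group syllable counter is an assumed heuristic rather than the counting rule used in the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def complex_word_count(text: str, jargon=frozenset()) -> int:
    """Count words with more than two syllables that are not in the jargon list."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if count_syllables(w) > 2 and w not in jargon)


# Hypothetical stand-in for the finance glossary.
jargon = frozenset({"liquidity", "volatility", "arbitrage"})
sentence = "Liquidity and volatility complicate empirical identification."

print(complex_word_count(sentence))          # all words with more than two syllables
print(complex_word_count(sentence, jargon))  # the same count after removing jargon
```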

ARTICLEi,p is a vector that includes article-specific variables. TopicFit is the value of the highest percentage

of one of our two topics obtained from the Latent Dirichlet Allocation. TitleLength is the number

of words in the article’s title. References is the number of references in the article. Tables and

Figures are the total numbers of tables and figures in the article, respectively. LeadArticle indicates

articles that are in the lead position in their respective issue.

AUTHORi,p is a vector that includes author-specific variables. TopBusinessSchool is set to 1 if at

least one of the authors’ affiliations is listed in the top ten of the Financial Times Global MBA

ranking in the year prior to the year of the publication of the article. Authors is defined as the

number of authors of the article. The initial of the surname of the first author is coded into integers

between 1 and 26 lexicographically (A=1, B=2,…, Z=26) and defines the variable Name.
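
The coding of the Name variable can be written out in a few lines. This is a sketch only: the function name is illustrative, and rejecting initials outside A to Z is an assumption, since the text does not say how such surnames are handled.

```python
def name_code(surname: str) -> int:
    """Map the first letter of a surname to an integer from 1 to 26 (A=1, ..., Z=26)."""
    cleaned = surname.strip()
    if not cleaned:
        raise ValueError("surname must not be empty")
    initial = cleaned[0].upper()
    if not "A" <= initial <= "Z":
        # Assumption: initials outside A-Z are rejected rather than transliterated.
        raise ValueError(f"unsupported initial: {initial!r}")
    return ord(initial) - ord("A") + 1


print(name_code("Armstrong"), name_code("Blei"), name_code("Zivney"))
```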

A2.2 Choice of variables

Our approach incorporating article related variables follows Stremersch et al. (2007) and Ayres

and Vars (2000) who measure the clarity of a paper using the number of tables and the number of

figures. Moreover, Ayres and Vars (2000) find that the number of citations to articles in elite law reviews

increases with an article’s length. In addition, Medoff (2003) shows that a lead article receives a

2007), but also in other cases. Moreover, the former SEC Chairman Christopher Cox suggested that the Fog-Index

can be used to gauge compliance with the SEC’s plain English initiatives (Loughran & McDonald, 2014).

significantly higher number of citations in the first five years after publication. If the article is

listed first in the respective issue or volume, we consider it as a lead article.

The choice for the author related variables is as follows: Chung and Cox (1990), Ederington

(1979), Klemkosky and Tuttle (1977), Heck, Cooley, and Hubbard (1986) and Niemi (1987)

analyze the relation between the author’s institutional affiliation and the publication success in

finance, measured as the quantitative contribution to leading finance journals or the number of

citations they receive. They provide evidence that the production of articles published in journals

with high impact factors, such as The Journal of Finance, is concentrated at relatively few

institutions. Klemkosky and Tuttle (1977) show that six universities accounted for approximately

25% of the total pages published in these journals. Ederington (1979) finds that articles published

by researchers from top business schools are cited more often. Furthermore, we also control for

the number of authors, as the number of papers with co-authors is increasing. Joint work allows

productivity gains due to the division of labour (McDowell & Melvin, 1983). Karolyi (2011) and

Hollis (2001) show that, for individual economists, more co-authorship is associated with higher research

quality, greater length, and greater frequency of publications. Finally, we control for the surname

of the first author. Einav and Yariv (2006) analyze the impact of surname initials on professional

outcomes in the academic labor market for economists. Their data provide evidence that faculty with earlier

surname initials are significantly more likely to receive tenure. In analogy to Einav and Yariv

(2006), the initial of the surname is coded into integers between 1 and 26 lexicographically (A=1,

B=2,…, Z=26).

A2.3 Latent Dirichlet Allocation (LDA)

The LDA model we use in this paper5 builds on the basic terms “word”, “document” and

“corpus” to describe the underlying data sample. These definitions follow the original

definitions of Blei, Ng and Jordan (2003):

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1,…,V}. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the vth word in the vocabulary is represented by a V-vector w such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.

A document d is a sequence of N words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the nth word in the sequence.

A corpus is a collection of M documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.

5 The LDA model we use in this paper is estimated with the VEM algorithm developed by Blei, Ng and Jordan (2003).

Latent Dirichlet allocation is a three-level hierarchical Bayesian model. First, $\alpha$ and $\beta$ are corpus-level

parameters set once per corpus; second, $\theta_d$ are document-level variables set once per document;

and third, $z_n$ and $w_n$ are word-level variables set once for each word of each document (Blei, Ng and

Jordan, 2003). The idea is that each item of a collection is modeled as a mixture of various

latent topics, whereas each topic is in turn modeled as a mixture over an underlying set of topic probabilities. Further, each

topic is represented by a distribution over words (Blei, Ng and Jordan, 2003). The following

steps describe the generative process for a document vector $\mathbf{w}$ within a corpus D. This

description is again taken from the original paper by Blei et al. (2003):

1. Choose $N \sim \text{Poisson}(\xi)$.

2. Choose $\theta \sim \text{Dir}(\alpha)$.

3. For each of the N words $w_n$:

a. Choose a topic $z_n \sim \text{Multinomial}(\theta)$.

b. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
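
The generative steps listed above can be simulated directly. The sketch below uses only the Python standard library; the four-word vocabulary, the two topics, and all parameter values are invented for illustration, and the Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_dirichlet(rng: random.Random, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, alpha, beta, vocab, xi):
    """Generate one document; beta[i][j] is the probability of word j under topic i."""
    n_words = sample_poisson(rng, xi)                    # step 1: N ~ Poisson(xi)
    theta = sample_dirichlet(rng, alpha)                 # step 2: theta ~ Dir(alpha)
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):                             # step 3: for each word
        z = rng.choices(topic_ids, weights=theta)[0]     # 3a: z_n ~ Multinomial(theta)
        words.append(rng.choices(vocab, weights=beta[z])[0])  # 3b: w_n ~ p(w | z_n, beta)
    return theta, words


# Illustrative (made-up) parameters: two topics over a four-word vocabulary.
vocab = ["return", "risk", "bank", "loan"]
beta = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document(random.Random(0), [0.5, 0.5], beta, vocab, xi=8)
print(theta, words)
```

Estimation runs in the opposite direction: given only the observed words, the algorithm infers the topic shares and the per-topic word probabilities.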

For simplicity, some assumptions are made. For instance, $\xi$ is a scalar that governs the

length, or more accurately the number of words, of a document through a Poisson distribution. The

dimensionality k of the Dirichlet distribution is assumed to be known and fixed, and $\alpha$ is a predefined parameter

vector governing the distribution of $\theta_d$. $\beta$ contains the per-topic-per-word probabilities $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$ and

is a $k \times V$ matrix, with k as the number of topics and V as the number of distinct

words. $\theta_d$ is a $k \times 1$ vector of Dirichlet random variables for every document d, where each entry

represents the per-document-per-topic probability. Hence, it can be interpreted as the proportion

of a topic within a document. In addition, N (the number of words per document) is assumed to be

independent of $\theta_d$ and $z_n$. Further, a so-called “bag-of-words” assumption is made. That means

that the order of words can be neglected: the documents are independent of each other,

and within a document the words are assumed to be exchangeable (Finettei, Machi and Smith,

1990).

The matrices $\beta$ ($k \times V$ matrix) and $\theta$ ($k \times M$ matrix) are needed if one wants to classify a corpus

of documents for further analysis. To estimate them, one can use the VEM algorithm

developed by Blei et al. (2003), as implemented in R’s topic modelling package. Besides the corpus of

documents, only the number of topics k is then needed. The LDA algorithm calculates two

estimated outputs from the data sample: first, $\hat{\beta}$ with the estimated per-topic-per-word

probabilities $\hat{\beta}_{ij}$, and second, $\hat{\theta}$ with the estimated per-document-per-topic probabilities $\hat{\theta}_d$.
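
As an illustration of how the estimated per-document topic probabilities feed into the article-level variables, the sketch below derives a TopicFit value (the highest topic share, in percent) from one document's vector. Using the dominant topic as the basis for the topic fixed effects is an assumption made here for illustration, as are the function names and the example vector.

```python
import math


def topic_fit(theta_hat_d) -> float:
    """Highest estimated topic share of one document, expressed in percent."""
    if not math.isclose(sum(theta_hat_d), 1.0, abs_tol=1e-6):
        raise ValueError("topic probabilities of a document must sum to one")
    return 100.0 * max(theta_hat_d)


def dominant_topic(theta_hat_d) -> int:
    """1-based index of the topic with the highest share (assumed basis for topic FE)."""
    return max(range(len(theta_hat_d)), key=lambda i: theta_hat_d[i]) + 1


# Hypothetical estimated topic shares for a single article.
theta_hat_d = [0.10, 0.44, 0.26, 0.20]
print(topic_fit(theta_hat_d), dominant_topic(theta_hat_d))
```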


A3 Descriptive statistics

Table 5: Descriptive sample statistics.

This table shows descriptive statistics for the sample of 4,160 articles published in the top-three

finance journals during the investigation period from January 2000 to December 2016.

Variable n Mean Median Std. deviation 25% quantile 75% quantile

Dependent variables

Crossref 4,160 7.260 4.500 9.722 2.000 9.000

GoogleScholar 4,160 24.843 12.854 35.739 4.380 31.063


FleschKincaidText 4,160 17.461 17.021 3.529 15.187 19.029

FleschKincaidAbstract 4,160 15.687 15.538 2.464 13.996 17.164

Length 4,160 9.525 9.556 0.295 9.386 9.704

ComplexWords 4,160 7.944 7.977 0.319 7.789 8.148


Topic fit 4,160 44.022 41.939 13.990 33.585 52.241

Title length 4,160 8.543 8.000 3.323 6.000 11.000

References 4,160 41.519 41.000 23.280 29.000 53.000

Tables 4,160 6.565 7.000 3.724 4.000 9.000

Figures 4,160 2.710 2.000 2.822 0.000 4.000

Leadarticle 4,160 0.413 0.000 0.199 0.000 0.000

JFE 4,160 0.389 0.000 0.488 0.000 1.000

RFS 4,160 0.301 0.000 0.459 0.000 1.000


TopBusiness School 4,160 0.292 0.000 0.455 0.000 1.000

Authors 4,160 2.315 2.000 0.855 2.000 3.000

Name 4,160 7.863 7.000 6.046 3.000 12.000


Table 3: Descriptive sample statistics for the 11 remaining journals.

This table shows descriptive statistics for the control sample of 8,236 articles published in the eleven

remaining journals during the investigation period from January 2000 to December 2016.

Variable n Mean Median Std. deviation 25% quantile 75% quantile

Dependent variables

Crossref 8,236 1.973 1.000 3.174 0.400 2.333

GoogleScholar 8,236 7.282 3.571 14.126 1.531 7.904


FleschKincaidText 8,236 16.027 15.715 2.933 14.401 17.130

FleschKincaidAbstract 8,236 15.838 15.676 2.965 14.083 17.415

Length 8,236 9.146 9.182 0.386 8.946 9.404

ComplexWords 8,236 7.554 7.595 0.412 7.329 7.835


Topic fit 8,236 45.715 43.490 14.711 34.819 54.481

Title length 8,236 9.820 9.000 3.528 7.000 12.000

References 8,236 37.735 35.000 19.022 25.000 47.000

Tables 8,236 5.578 6.000 3.547 3.000 8.000

Figures 8,236 2.138 1.000 2.764 0.000 3.000

Leadarticle 8,236 0.448 0.000 0.207 0.000 0.000


TopBusiness School 8,236 0.033 0.000 0.179 0.000 0.000

Authors 8,236 2.259 2.000 0.879 2.000 3.000

Name 8,236 8.855 7.000 6.633 3.000 13.000


A4 Results

Table 4: Results for the three top finance journals


I II III IV V VI


FleschKincaidText 0.152** 0.424*

(0.048) (0.195)


(0.056) (0.195)

Length 1.531** 3.710*

(0.507) (1.843)

ComplexWords 1.377*** 4.075*

(0.435) (1.703)


Topic fit 0.045*** 0.046*** 0.047*** 0.154*** 0.156*** 0.159***

(0.010) (0.010) (0.010) (0.035) (0.035) (0.035)

Title length -0.178*** -0.174*** -0.172*** -0.875*** -0.863*** -0.858***

(0.038) (0.039) (0.039) (0.143) (0.144) (0.144)

References 0.047*** 0.042*** 0.041*** 0.127*** 0.113*** 0.110***

(0.007) (0.007) (0.007) (0.028) (0.029) (0.029)

Tables 0.149** 0.119* 0.124** 0.388* 0.327 0.314

(0.045) (0.046) (0.047) (0.179) (0.187) (0.189)

Figures 0.172* 0.129 0.140 0.600** 0.498* 0.514*

(0.075) (0.074) (0.075) (0.228) (0.228) (0.226)

Leadarticle 1.241 1.352 1.340* 2.244 2.547 2.522

(0.714) (0.712) (0.714) (2.673) (2.659) (2.663)

JFE -0.396 -0.787* -0.821* -7.650*** -8.814*** -8.939***

(0.341) (0.331) (0.334) (1.484) (1.447) (1.460)

RFS -2.317*** -2.116*** -2.202*** -32.380*** -31.880*** -32.126***

(0.354) (0.344) (0.343) (1.136) (1.091) (1.099)


TopBusiness School 1.520*** 1.437*** 1.450*** 6.867*** 6.645*** 6.665***

(0.320) (0.322) (0.321) (1.181) (1.189) (1.189)

Authors 0.626** 0.604** 0.605** 2.477*** 2.408** 2.395**

(0.183) (0.188) (0.186) (0.691) (0.692) (0.691)

Name 0.000 0.001 0.001 -0.070 -0.068 -0.067

(0.025) (0.025) (0.025) (0.083) (0.083) (0.083)

Intercept 1.560 -12.746*** -9.180** 21.549*** -16.037 -12.861

(1.505) (4.703) (3.358) (6.054) (16.442) (12.572)



n 4,160 4,160 4,160 4,160 4,160 4,160

R2 / Adj. R2 .151/.141 .148/.138 .148/.138 .231/.222 .228/.219 .230/.221
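The last row of each results table reports R2 next to adjusted R2. The following is a minimal sketch of how the two are related, assuming the textbook OLS definition. The function name `adjusted_r2` and the example inputs are hypothetical, and the number of regressors k (including any fixed effects) is not reported in this table, so the sketch does not attempt to reproduce the paper's values.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for a linear model with n observations and k regressors
    (intercept not counted), assuming the textbook definition
    1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs, not taken from the paper's regressions.
example = adjusted_r2(r2=0.5, n=101, k=10)
```

Because (n - 1) / (n - k - 1) is at least 1, the adjusted value cannot exceed the unadjusted one, and the gap widens as regressors are added relative to the sample size.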


Table 5: Results for 11 journals


I II III IV V VI


FleschKincaidText 0.020 0.096*

(0.011) (0.048)

FleschKincaidAbstract 0.007 0.036

(0.009) (0.044)

Length -0.031 0.437

(0.109) (0.576)

ComplexWords 0.011 0.758

(0.113) (0.587)


Topic fit 0.008*** 0.008*** 0.008*** 0.026* 0.026* 0.027**

(0.002) (0.002) (0.002) (0.010) (0.010) (0.010)

Title length -0.030** -0.030** -0.030** -0.180*** -0.179*** -0.179***

(0.009) (0.009) (0.009) (0.042) (0.042) (0.042)

References 0.022*** 0.023*** 0.022*** 0.093*** 0.090*** 0.087***

(0.003) (0.003) (0.003) (0.014) (0.015) (0.016)

Tables 0.012 0.016 0.014 0.008 0.002 -0.007

(0.011) (0.012) (0.012) (0.059) (0.063) (0.062)

Figures 0.041* 0.039 0.038 0.319*** 0.297* 0.293*

(0.019) (0.020) (0.020) (0.115) (0.120) (0.120)

Leadarticle 0.076 0.081 0.078 0.550 0.528 0.501

(0.144) (0.143) (0.144) (0.649) (0.646) (0.647)


TopBusiness School 1.216** 1.211*** 1.209** 6.634** 6.575*** 6.567**

(0.421) (0.421) (0.421) (2.087) (2.087) (2.088)

Authors 0.198*** 0.200*** 0.199*** 0.516** 0.515** 0.509**

(0.041) (0.041) (0.041) (0.177) (0.177) (0.176)

Name -0.010 -0.010 -0.010 -0.069** -0.070** -0.069**

(0.005) (0.005) (0.005) (0.026) (0.026) (0.026)

Intercept 0.224 0.917 0.574 -0.036 -1.750 -3.307

(0.336) (0.965) (0.833) (1.536) (5.007) (4.209)



Journal FE YES YES YES YES YES YES

n 8,236 8,236 8,236 8,236 8,236 8,236

R2 / Adj. R2 .136/.130 .136/.130 .136/.130 .108/.102 .108/.102 .108/.102


A5 Further empirical results

Appendix 1: Distribution of topics per year (remaining journals). This appendix shows the distribution of topics in the eleven finance journals in the years 2000 to 2016, as identified by LDA.

[Figure: share of each LDA topic (1 to 20) per year, 2000 to 2016; vertical axis from 0% to 100%.]


A6 Alternative citation

Appendix 2: Results for the top journals but with the regression model


I II III IV V VI


FleschKincaidText 0.025*** 0.026***

(0.005) (0.005)


(0.006) (0.007)

Length 0.315*** 0.280***

(0.070) (0.072)

ComplexWords 0.311*** 0.322***

(0.064) (0.066)


Topic fit 0.005*** 0.005*** 0.005*** 0.004*** 0.004*** 0.005***

(0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

Title length -0.024*** -0.023*** -0.023*** -0.029*** -0.028*** -0.028***

(0.004) (0.004) (0.004) (0.005) (0.005) (0.005)

References 0.007*** 0.006*** 0.006*** 0.006*** 0.005*** 0.005***

(0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

Tables 0.023*** 0.016** 0.016** 0.019*** 0.014* 0.012*

(0.005) (0.006) (0.006) (0.005) (0.006) (0.006)

Figures 0.018** 0.010 0.012* 0.024*** 0.017** 0.018**

(0.006) (0.006) (0.006) (0.006) (0.006) (0.006)

Leadarticle 0.205* 0.228** 0.226** 0.201* 0.225** 0.224**

(0.079) (0.080) (0.080) (0.078) (0.078) (0.079)

JFE -0.113** -0.176*** -0.186*** -0.359*** -0.425*** -0.436***

(0.037) (0.036) (0.036) (0.038) (0.037) (0.038)

RFS -0.217*** -0.186*** -0.205*** -1.963*** -1.932*** -1.952***

(0.040) (0.039) (0.040) (0.039) (0.038) (0.039)


TopBusiness School 0.229*** 0.213*** 0.215*** 0.273*** 0.258*** 0.259***

(0.033) (0.033) (0.033) (0.034) (0.034) (0.034)

Authors 0.106*** 0.103*** 0.103*** 0.079*** 0.077*** 0.075***

(0.018) (0.018) (0.018) (0.020) (0.020) (0.020)

Name 0.000 0.000 0.000 -0.003 -0.003 -0.003

(0.003) (0.003) (0.003) (0.003) (0.003) (0.003)

Intercept -0.448** -3.305*** -2.772*** 0.314 -2.251*** -2.113***

(0.172) (0.649) (0.503) (0.174) (0.668) (0.513)



n 4,015 4,015 4,015 4,112 4,112 4,112

R2 / Adj. R2 .202/.193 .199/.190 .199/.190 .467/.460 .463/.457 .465/.458
