
The Performance of Online Communities –

An Empirical Investigation of Wikipedia

Diploma Thesis

at the Institute of Entrepreneurship & Innovation

Vienna University of Economics and Business Administration

Degree Program: Business Administration

Submitted by:

Roman Pickl

Degree Program Identification No.: J151

Student Enrolment No.: h0451691

Advisor: Univ. Prof. Dr. Nikolaus Franke

Assistant Advisor: Dr. Philipp Türtscher

Vienna, 17 June 2009


Roman Pickl | The Performance of Online Communities | I |

Abstract

Online communities have been thriving in recent years and have not only drawn the attention of researchers and professionals but also influence our daily lives. Their success factors, however, are still rather unclear. This paper sheds light on this topic by analyzing the relationships between the characteristics of online communities and their performance. To this end, 5000 communities around individual articles in Wikipedia were analyzed. The results demonstrate that community characteristics are significantly linked to the quantity and quality of the output created by online communities. The number of users is by far the most influential force that drives content creation. When it comes to the quality of the output, however, characteristics of community members and how they collaborate are as important as the sheer number of contributors. These findings can be utilized by community operators who want to foster the development of their online communities and by firms that look for promising communities to scan for innovative ideas and users.

Acknowledgments

I would like to express my gratitude to my advisor Univ. Prof. Dr. Nikolaus Franke and my assistant advisor Dr. Philipp Türtscher for their support, encouragement and time to listen to and discuss little problems and roadblocks. Special thanks also go to Matthias Pickl and Daniel Winzer who provided thoughtful comments on this thesis. Furthermore, I want to thank the IT Department of the Vienna University of Economics and Business Administration, especially Franz Schäfer, for helping with the means to handle the huge amount of data and thus making this research project possible. Finally, I thank my family and my girlfriend Julia for their continuous support and encouragement.


Table of contents

1 Introduction
    1.1 Objective
    1.2 Structure
2 Literature Review and Hypotheses Development
    2.1 Online Communities
    2.2 Performance of Online Communities
        2.2.1 Information-Quantity
        2.2.2 Article-Quality
    2.3 Links between Community Characteristics and Performance
        2.3.1 Community-Centered Perspective
        2.3.2 User-Centered Perspective
        2.3.3 Collaboration-Centered Perspective
3 Research Method
    3.1 Research Site
    3.2 Study Design
    3.3 Data Collection and Cleansing
    3.4 Measures
        3.4.1 Operationalization of Performance Indicators
        3.4.2 Operationalization of Community Characteristics
4 Results
    4.1 Descriptive Statistics
    4.2 Inferential Statistics
        4.2.1 Results related to Information-Quantity
        4.2.2 Results related to Article-Quality
5 Discussion and Implications
    5.1 Implications for Theory
    5.2 Implications for Methods
    5.3 Implications for Practice
    5.4 Limitations
    5.5 Directions for Future Research
6 References
7 Appendix
    7.1 Figures and Examples
    7.2 Data Collection
        7.2.1 Setting up the Database and Drawing the Sample
        7.2.2 Parsing the Data and Calculating all Variables
    7.3 References used in the Appendix


Figures and Tables

Figure 1: Characteristics of online communities (original illustration)

Figure 2: Excerpt of the revision history of a Wikipedia article (Wikipedia 2009h)

Figure 3: Research model (original illustration)

Figure 4: Sampling process (original illustration)

Table 1: Results of the conducted OLS-regressions

Table 2: Summary of results for hypotheses H1A-H6A (Information-Quantity)

Table 3: Summary of results for hypotheses H1B-H6B (Article-Quality)

Table 4: Significant effects of community characteristics on performance


1 Introduction

With the increasing ubiquity of the Internet, recent years have seen a surge in the number of online communities¹ (Kozinets 1999, p. 253; Rashid et al. 2006, p. 955). These communities of interest are known for attracting innovative users with a high level of domain-specific knowledge and are hence an important source of innovation (Füller, Jawecki & Mühlbacher 2007, p. 60). Even though open source software projects like Linux and Apache are among the first examples that come to mind when thinking about successful online communities, this phenomenon is not limited to the software sector. In fact, communities have shown astonishing performance in numerous diverse industries (Lakhani & Panetta 2007, p. 98).

Wikipedia, the online encyclopedia that allows everyone to edit or create new articles, is a case in point. Initiated in 2001, it has grown to more than 11 million articles in more than 260 languages written by more than 14 million users (Wikimedia 2008d). As of today, it is one of the most visited pages on the Internet (Alexa.com 2008) and, even though it is often scrutinized due to its open source principle, it is generally known for its notably high quality (Giles 2005, p. 900).

The success of Wikipedia and various other online communities has not only drawn the attention of researchers and professionals but also influenced how society functions in terms of, amongst others, production, learning, communication and commerce (Cothrel & Williams 1999, p. 54; Tapscott & Williams 2008, p. 20; Wanga & Fesenmaier 2004, p. 709). Consequently, many firms across industries have tried to embrace online communities, in particular to harness their creative potential for developing new products and services (Füller, Matzler & Hoppe 2008, p. 609; Nambisan 2002, p. 393). Many of those communities, however, fail (Cothrel & Williams 1999, p. 54) and success factors still remain rather unclear (Leimeister, Sidiras & Krcmar 2006, p. 281).

¹ The terms “online community”, “virtual community”, “computer-mediated community”, “cyber community”, “net community” or “e-community” are often used interchangeably (Döring 2001; Wanga & Fesenmaier 2004, p. 709).


1.1 Objective

Even though online communities have been studied from a variety of perspectives and several authors have derived recommendations for operating online communities, characteristics of successful communities have hardly been substantiated empirically (Leimeister, Sidiras & Krcmar 2006, p. 281). This thesis aims to close this research gap by introducing a framework to combine several of these perspectives and empirically analyze the link between characteristics of online communities and their performance.

Businesses that want to involve online communities in their innovation process generally have two distinct options: they can either try to find and utilize already existing communities or attempt to build their own (Franke 2005, p. 708). Analyzing the link between community characteristics and performance yields valuable insights for both operators who want to proactively encourage the development of successful online communities and firms that look for promising communities to scan for innovative ideas and users. Due to the rise of a new paradigm, often called “Web 2.0” (O'Reilly 2005), which causes growing interest in user-generated content and user participation today (Tapscott & Williams 2008, p. 38), it is even more important to shed light on this topic.

Given that Wikipedia “can be viewed as a massive experiment in collective action” (Viégas et al. 2007, p. 2), observing communities in this online environment allows the examination of numerous communities with different characteristics. In this study, a random sample of 5000 articles, created in 2007, is retrieved from Wikipedia and the characteristics of the community of users collaborating on each article are analyzed. These characteristics are then linked to the quantity and quality of the output created by these diverse online communities to answer the following research question:

How is the performance of an online community related to its characteristics?


1.2 Structure

This thesis is structured as follows: chapter 2 provides a summary of the results of an in-depth literature review and deals with the current state of research. Furthermore, the research question is formulated in more detail and several hypotheses are developed. While chapter 3 provides a detailed explanation of the methodology used in this thesis, results are presented in chapter 4. Chapter 5 concludes with a discussion of the results, implications for theory, methods and practice as well as an outlook for promising future research directions. Additionally, several examples along with an in-depth explanation of the data collection and analysis method used in this study can be found in the appendix.


2 Literature Review and Hypotheses Development

This chapter defines key terms and provides an insight into the theoretical background of this study. After introducing the concept of online communities and factors indicating community performance, the relationships between the characteristics of online communities and their performance are explored from various perspectives.

2.1 Online Communities

The idea of geographically separated people meeting online to talk about common areas of interest and to build online communities is older than the Internet itself (Licklider & Taylor 1968, p. 38). In fact, one of the first services of the Arpanet, file transfer, was soon diverted from its intended use and employed for sending messages (Barabási 2003, p. 149). As a result, huge mailing lists emerged and, due to the rise of the Arpanet, other networks and eventually the Internet, more and more people were able to gather online and exchange their thoughts on various topics (Cothrel & Williams 1999, p. 54; Koch 2002, p. 327). Drawing on his positive experiences of support and friendship on the bulletin board system “The WELL”, Rheingold coined the term “Virtual Community” to describe this social phenomenon (Döring 2001; Rheingold 1993). Consequently, a scholarly discussion began about whether the term “community” should be used in this context at all, as virtual communities lack face-to-face contact and various other characteristics of traditional communities (Döring 2001; Jones 1997; Preece & Maloney-Krichmar 2005).

Even though scientists nowadays generally agree that virtual communities are “real” communities (Döring 2001) and the strict distinction between online and offline activities is losing importance (Preece, Maloney-Krichmar & Abras 2003, p. 8), online communities are still a vague concept with no widely accepted definition (Leimeister, Sidiras & Krcmar 2006, p. 278; Preece 2001, p. 347). It is, however, not the intent of this paper to comprehensively review and analyze each single argument. Rather, online communities are understood as a category with fuzzy boundaries (Bruckman 2006, p. 618) and are discussed in a broad context.


Online communities have existed in various forms for approximately three decades (Ridings, Gefen & Arinze 2002, p. 272). In recent years several researchers stressed the high innovation potential of online and offline user communities and examined the specific ways in which their users cooperate and support each other (Franke & Shah 2003, p. 159; von Hippel 2001, pp. 84-85). They argue that users, not producers, are responsible for a considerable number of major innovations in various industries (Franke & Shah 2003, p. 157), on some occasions even without the need for a manufacturer (von Hippel 2001, p. 86). As a result, more and more businesses are nowadays trying to harness the enormous creative potential of online communities (Füller, Matzler & Hoppe 2008, p. 609). However, many of them fail (Cothrel & Williams 1999, p. 54) and success factors are still quite unclear (Leimeister, Sidiras & Krcmar 2006, p. 281).

This paper aims to shed light on this topic by examining the relationships between community characteristics and performance. To this end, communities of users collaborating on individual articles in Wikipedia are investigated. One may argue that observing the community around an article amounts to taking it out of context, similar to merely tracking changes in a single file of an open source project without considering dependencies. Mateos Garcia & Steinmueller (2003), however, call attention to an important difference between open source software systems like Linux and open content collections like Wikipedia. While contributions to open source software (e.g. a new software module) carry a high need for integration due to their “cumulative dependency”, articles in a collection are far less dependent on each other (p. 17) and valuable as standalone items (Wales 2005a, 1:30). In line with these findings, Voss highlights that “most likely one can determine subcommunities” in Wikipedia (2005, p. 8). Consequently, it seems plausible to consider the communities collaborating on every article of Wikipedia as independent online communities. This makes it possible not only to utilize Wikipedia as a massive experiment but also to examine the produced output and performance of numerous communities with different characteristics.


2.2 Performance of Online Communities

Because there are numerous kinds of online communities, it is of utmost importance to define clear and measurable objectives to assess their performance (Cothrel 2000, pp. 17-18). What is more, online communities can be examined from different perspectives and consequently various performance indicators are discussed in the literature (Leimeister, Sidiras & Krcmar 2006, p. 279; Preece 2001, p. 354). While many of them, however, are very general measures, Stvilia et al. suggest a specific set of metrics to assess Wikipedia articles which seems especially relevant in the context of this paper (2005, p. 3). Applicable and relevant factors of these studies were adapted to the research setting and enriched with a number of additional metrics. Thus, this paper applies a direct approach when measuring the performance of different communities in Wikipedia by analyzing their output, whereas previous research often focused on participation as a proxy of value creation (Cothrel & Williams 1999, p. 55; Preece 2001, pp. 350-351).

Cothrel & Williams point out that a successful online community is “one that achieves its purpose” (1999, p. 55). Asked about the purpose of Wikipedia, co-founder Jimmy Wales once remarked:

“Wikipedia is first and foremost an effort to create and distribute a free encyclopedia of the highest possible quality to every single person on the planet in their own language. […] the entire purpose of the community is precisely this goal” (2005b).

Even though the success of online communities is often a question of perspective (Leimeister, Sidiras & Krcmar 2006, p. 279; Preece 2001, p. 354), non-commercial operators, due to their intrinsic motivation, tend to agree on success factors with community members (Leimeister, Sidiras & Krcmar 2006, p. 292). Thus in the case of the non-profit project Wikipedia, the purpose for members and operators is crystal clear: create the largest high-quality reference work for free. Consequently, the performance of communities in Wikipedia needs to be evaluated in terms of the quantity of information and the quality of the articles produced.


2.2.1 Information-Quantity

When thinking about quantifying the information included in Wikipedia articles, the length of each article is the first measure that comes to mind. However, if counting the number of words were the only measure applied, articles that repeat the same information over and over again would score far too high. Therefore, to increase its validity, the analysis was complemented with an examination of the vocabulary used in each article, a proxy of the information content contained. Furthermore, the number of links was assessed to take the amount of external information referred to into account. The following paragraphs provide additional information on these measures.

Number of words: The length of articles in Wikipedia and hence the quantity of information produced by different communities varies sharply (Stvilia et al. 2005, p. 7). To assess the length of every article the total number of words was counted. This measure was preferred to using the mere number of characters to take different topics and hence varying word-lengths of their vocabularies into account.

Vocabulary: This measure depicts the number of unique words and hence the word pool used in the community effort. A short test revealed that this easy-to-understand, yet easy-to-calculate metric correlates very well (correlation of 0.985) with the zipped file size of the articles, which is known to be a good proxy of entropy (Voss 2005, p. 9).
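
The word-count and vocabulary measures, together with a compressed-size proxy, can be illustrated with a minimal Python sketch. It is not the parser used in this study: the tokenization rule (lowercased runs of word characters), the function names and the use of zlib as a stand-in for the "zipped file size" are all assumptions made for illustration.

```python
import re
import zlib


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (assumed rule: runs of word characters)."""
    return re.findall(r"\w+", text.lower())


def quantity_measures(text: str) -> dict[str, int]:
    """Return word count, vocabulary size and compressed size for one article text."""
    words = tokenize(text)
    return {
        # Total number of words, i.e. the length of the article.
        "number_of_words": len(words),
        # Number of unique words, i.e. the word pool used.
        "vocabulary": len(set(words)),
        # zlib (DEFLATE) size in bytes, an assumed stand-in for the zipped file size.
        "compressed_bytes": len(zlib.compress(text.encode("utf-8"))),
    }


if __name__ == "__main__":
    repetitive = "same words " * 50
    print(quantity_measures(repetitive))
```

A text that repeats itself scores high on the number of words but low on vocabulary, which is the weakness of pure length that the vocabulary measure is meant to offset.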

Number of links: Buriol et al. found that the average number of outgoing links per Wikipedia article increased over the last few years (2006, p. 5). This measure not only indicates the amount of external information incorporated, but also the number of additional references provided and is hence considered an important dimension of information quantity.
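
How outgoing links might be counted in raw wiki markup is sketched below. This is a simplified illustration under stated assumptions, not the procedure used in this study: it assumes links appear as [[Target]] or [[Target|label]] for internal links and as [http://... label] for external links, and it assumes that category, file and image tags should not count as links. The function name and the patterns are hypothetical.

```python
import re

# [[Target]] or [[Target|label]]; captures the target up to '|', '#' or ']'.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")
# [http://example.org label]; the lookbehind skips the second '[' of '[['.
EXTERNAL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)")
# Namespaces assumed NOT to count as ordinary links.
EXCLUDED_PREFIXES = ("category:", "file:", "image:")


def count_links(wikitext: str) -> dict[str, int]:
    """Count internal and external links in a piece of wiki markup."""
    internal = [
        target.strip()
        for target in WIKILINK.findall(wikitext)
        if not target.strip().lower().startswith(EXCLUDED_PREFIXES)
    ]
    external = EXTERNAL.findall(wikitext)
    return {
        "internal": len(internal),
        "external": len(external),
        "total": len(internal) + len(external),
    }


if __name__ == "__main__":
    sample = (
        "[[Vienna]] is the capital of [[Austria|the Republic of Austria]]. "
        "See [http://example.org site]. [[Category:Capitals]]"
    )
    print(count_links(sample))
```

Whether internal links, external links or both enter the measure is a design choice; the sketch reports them separately so that either reading can be applied.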


2.2.2 Article-Quality

The results of an expert-led investigation carried out by Nature indicate that Wikipedia articles have a similar quality to those in the Encyclopaedia Britannica (Giles 2005, p. 900). This study has attracted much attention and has been criticized for the selection and comparison of articles (Encyclopædia Britannica 2006). Because only a small number of articles (42) was reviewed, it additionally lacks validity (den Besten, Loubser & Dalle 2008, p. 1). To overcome these weaknesses several researchers have tried to automatically assess the quality of Wikipedia articles (p. 8). One approach is to calculate readability scores as a measure of quality (p. 8), a metric also applied in this study. To take the level of integration of each article into account this analysis was complemented with an examination of the number of categories each article is placed into. Further details are provided in the following paragraphs.

Readability: Readability metrics have been used by several researchers to assess the quality of Wikipedia articles (den Besten, Loubser & Dalle 2008, p. 8; Stvilia et al. 2005, p. 9). Stvilia et al., for example, found that featured articles, i.e. a selection of the best articles determined by Wikipedia’s editors (Wikipedia 2009e), show higher Flesch readability scores than articles in a random set (2005, p. 7). The Flesch readability formula is a very popular function of the number of words per sentence and the number of syllables per word used in a text, which yields a number between 0 (very difficult) and 100 (very easy) to assess its readability (den Besten, Loubser & Dalle 2008, p. 10). While very easy to compute, it allows assessing the readability of texts with considerable accuracy (p. 11).
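
The following minimal Python sketch shows how such a score can be computed. It uses the standard Flesch Reading Ease constants (206.835, 1.015 and 84.6); the sentence splitter and the vowel-group syllable counter are crude assumptions made for illustration and are not the tools used in this study. The raw formula is not clamped, so very simple texts can score above 100.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)."""
    # Assumed sentence rule: split on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Institutional heterogeneity complicates organizational collaboration considerably."
    print(round(flesch_reading_ease(easy), 1), round(flesch_reading_ease(hard), 1))
```

Short sentences built from short words push the score up, while long sentences and polysyllabic vocabulary pull it down.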

Number of categories: The categorization of articles, which is based on a special form of social tagging, plays an important role in structuring the content of Wikipedia (Voss 2005, p. 10). According to the editing guidelines every article should not only be placed in at least one category but in all categories to which it logically belongs (Wikipedia 2009d). Thus, the number of categories an article is placed into can be used as a proxy for measuring how well the output is structured as well as the level of integration within the encyclopedia.
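
Counting the categories of an article from its wiki markup could look like the sketch below. It assumes that category membership is expressed by [[Category:Name]] tags (optionally with a sort key after a pipe) and that a tag written with a leading colon is only a link to the category page; the function name and the pattern are hypothetical and not taken from the study's parser.

```python
import re

# [[Category:Name]] or [[Category:Name|sort key]]; a leading ':' is not matched.
CATEGORY_TAG = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)


def categories(wikitext: str) -> list[str]:
    """Return the distinct category names an article is placed into, in order."""
    seen: list[str] = []
    for name in CATEGORY_TAG.findall(wikitext):
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "Article text ...\n"
        "[[Category:Capitals in Europe]]\n"
        "[[Category:Vienna| ]]\n"
    )
    print(len(categories(sample)), categories(sample))
```

Duplicated tags are counted once, so the result reflects the number of distinct categories rather than the number of tags.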


2.3 Links between Community Characteristics and Performance

While several researchers have examined the relationships between particular characteristics of online communities and their performance, this thesis aims to combine these attempts into a holistic framework to analyze online communities. To this end, the following sections examine characteristics of online communities from different perspectives (see figure 1).

Figure 1: Characteristics of online communities (original illustration)

The community-centered perspective, which deals with the effects of size and heterogeneity of the communities on the created output, is followed by the application of a more user-specific view based on the activity and focus of the average community member. Last but not least, the characteristics and effects of the users’ collaboration are reviewed in terms of interactivity and dynamics.


2.3.1 Community-Centered Perspective

With an increasing number of members, communities generally have access to more resources and better information (Butler 2001, p. 348). However, the size of a community is not the only factor influencing its capability to perform well. Another important factor often discussed in the literature is a group’s composition and hence the heterogeneity of its members (Horwitz & Horwitz 2007, p. 988). The community-centered perspective applied in this section therefore examines the links between the sizes of the analyzed communities, the heterogeneity of their members and the quantity as well as quality of their outputs.

Size: Counting the number of members is the first thing that comes to mind when thinking about the characteristics of an online community. As there is still a lack of research on the relatively new field of collaborative content creation platforms like Wikipedia, findings from the free and open source software movement can be of help, as these two fields share considerably similar philosophies (Ortega, Gonzalez-Barahona & Robles 2007, p. 47; Stvilia et al. 2005, p. 1). Thus Wikipedia, similarly to other communities, takes advantage of the collective knowledge of its users (Stvilia et al. 2005, p. 6), thereby obeying Linus’s Law: “Given enough eyeballs all bugs are shallow” (Raymond 2000). The number of eyeballs, however, is hard to estimate, as the number of lurkers, i.e. users who read but do not participate, is hard to grasp (Nonnecke & Preece 1999, p. 123). In their analyses Stvilia et al. focus on people who bother to make a change, noting that this number is “obviously much smaller and probably more interesting and maybe correlating with the real number of eyeballs” (2005, p. 6). Strictly speaking, only people who identify themselves with the community and as a result participate actively are understood as members of the communities analyzed in this thesis.

Fernandez-Ramil, Izquierdo-Cortazar & Mens show that the number of unique contributors to open source projects has a positive impact on a software’s total lines of code (2008, p. 4). A similar relationship could be found by Ortega, Gonzalez-Barahona & Robles in Wikipedia between the number of unique authors and an article’s size (2007, p. 52). Consequently, the following hypothesis can be derived from these findings:


Hypothesis H1A: The quantity of information produced by an online community increases with the number of participants.

Large groups often come up with better solutions than individual experts (Surowiecki 2005, p. XVII; Tapscott & Williams 2008, p. 41). Similarly, open source projects benefit from peer reviews by a large base of users (Senyard & Michlmayr 2004, p. 2). As Butler puts it, “in larger social structures it is more likely that there is a member who knows the needed information” (2001, p. 348). In the context of Wikipedia, Lih reasons that “with more editors, there are more voices and different points of view for a given subject” (2004, p. 8) and Wilkinson & Huberman (2007, p. 4) point out that there is a strong link between the number of editors, edits and article quality. While large groups traditionally often failed to take advantage of this fact due to logistical problems and various other adverse effects, today the utilization of computer-mediated communication systems has the potential to significantly reduce if not eradicate these problems (Butler 2001, pp. 349-350; Surowiecki 2005, pp. 275-277):

Hypothesis H1B: The quality of articles produced by an online community increases with the number of participants.

To sum up, even though the performance of large groups traditionally suffered due to problems inherent in their structure, modern information and communication technology helps to harness their wealth of resources.

Heterogeneity: As online communities often gather around shared interests (Cosley, Ludford & Terveen 2003, p. 8) and even carry the risk of “balkanization”, i.e. an ongoing separation into special interest groups (Van Alstyne & Brynjolfsson 2005, p. 851), it is interesting to analyze how the heterogeneity of community members influences the outcome of their joint effort. In a meta-analysis of effects of team diversity on team performance Horwitz & Horwitz found a significant positive influence on both quantity and quality of output (2007, p. 1000). Surowiecki highlights the importance of diversity and notes that it is especially important in small groups as they are very prone to groupthink (2005, pp. 29, 36). He concludes that while “homogeneous groups are great at doing what they do well, […] they become progressively less able to investigate alternatives” (2005, p. 31). Katz and Te’eni note that even though different perspectives are essential, they can also increase misunderstandings and require additional effort to “overcome the gap” (2007, p. 262). Ludford et al. found that groups consisting of dissimilar users contribute more to an online community than groups of similar users (2004, pp. 5, 7), while Cosley, Ludford & Terveen did not find any evidence that similarity of members influences the outcome of a task (2003, p. 6). Interestingly enough, neither of them got the results they expected. While Ludford et al. anticipated that similar groups would contribute more due to the attraction of interacting with similar others (2004, p. 7), Cosley, Ludford & Terveen were looking for positive influences comparable to those experienced by traditional diverse teams (2003, p. 2). These mixed results highlight the possibility of a more complex, curvilinear relationship between the heterogeneity of community members and the output of the online community. Indeed, several researchers have proposed an inverted U-shaped curvilinear relationship between heterogeneity and outcome (Horwitz & Horwitz 2007, p. 1008):

Hypothesis H2A: The quantity of information produced by an online community follows an

inverted U-shaped curvilinear relationship with the level of its heterogeneity.

Hypothesis H2B: The quality of articles produced by an online community follows an

inverted U-shaped curvilinear relationship with the level of its heterogeneity.

2.3.2 User-Centered Perspective

This section takes a closer look at the characteristics of the individual members of the analyzed communities. Several researchers have examined the activity of community members to measure how engaged they are with the community (Cothrel & Williams 1999, p. 56; Preece 2001, pp. 350-351). This paper not only sheds light on the often analyzed activity and participation of community members but also examines how focused their effort on the analyzed community is.

Activity: Several researchers found that a small group of users can be credited with the majority of contributions in online community efforts (Füller, Jawecki & Mühlbacher 2007, p. 64; Mockus, Fielding & Herbsleb 2000, p. 5; Stewart 2005, p. 829). Similarly, in 2005 Wales said that in the English Wikipedia a core group of 0.7% (524) of all users is responsible for 50% of all edits, with 2% (1400) contributing 73.4% of all edits (2005a, 22:20). What is more, even though the number of edits does not say anything about the changed content, Wales implied that Wikipedia is written by this core group of users (2005c). In contrast, Swartz claimed in a blog post that most new content is added by outsiders who contribute rarely (2006). Consequently, Kittur et al. analyzed this issue and found not only that the proportion of edits by “elite users” of Wikipedia has declined in recent years due to the enormous growth of a group of low-edit users (2007, pp. 3-4), but also that more than 70% of all words are now changed by editors with fewer than 10000 edits. However, they also found that experienced, more active users tend to add more content (1.81 words for every word removed) than novice users (0.86 words for every word removed) (2007, p. 6).

Hypothesis H3A: The quantity of information produced by an online community increases

with the activity of its contributors.

Kittur et al. note that even though novice users tend to delete more words than they add, they may still increase the quality of the output (2007, p. 6). Indeed, Anthony, Smith & Williamson found that when it comes to the quality of contributions, low-edit anonymous users (“Good Samaritans”) play an equally important role as committed registered Wikipedians (“Zealots”) (2007, p. 15). While the quality of edits by anonymous users decreases with the number of their overall edits, the quality of contributions by registered users points in the opposite direction (p. 16). As this study does not distinguish between registered and anonymous editors, it is expected that these effects cancel each other out and that there is no significant relationship between the quality of articles produced and the activity of users.

Hypothesis H3B: The quality of articles produced by an online community is independent

of the activity of its contributors.

Focus: With the increasing complexity and size of open content projects, specialization is becoming more and more important. Indeed, contributors in open source projects are often found to specialize in specific elements of the developed software to apply their domain-specific knowledge (von Krogh, Spaeth & Lakhani 2003, p. 1230). Similarly, editors in Wikipedia often choose to contribute to articles where they have specific knowledge or personal interest (Wikipedia 2009m). It is therefore posited that this specialization and focus on specific articles not only yields an increase in the quantity of the created content but also leads to considerable quality improvements.

Hypothesis H4A: The quantity of information produced by an online community increases

with the focus of its contributors.

Hypothesis H4B: The quality of articles produced by an online community increases with

the focus of its contributors.

2.3.3 Collaboration-Centered Perspective

It has been noted that collaboration and social interaction between users is an important requisite for the success of user communities (Füller, Jawecki & Mühlbacher 2007, p. 61) and “not an issue that can be ignored” (Kollock 1996, 23rd paragraph). Similarly, Tapscott & Williams point out that the success of Wikipedia “is built on the premise that collaboration among users will improve content over time, in the way that the open source community steadily improved Linus Torvalds’s first version of Linux” (2008, p. 71). Due to the comprehensive record of activities in Wikipedia it is not only possible to analyze the “behaviour of information producers” (Almeida, Mozafari & Cho 2007, p. 1) but also its effects. To address the interesting topic of collaboration, this section deals with the interactions between contributors and the dynamics of contributions.

Interactivity: The development of ideas and innovations is often not a solitary process but benefits from the assistance of other community members. Franke & Shah, for example, found that members of user communities do not innovate in isolation but rather receive crucial advice and assistance from other members (2003, p. 158). Similarly, von Hippel notes that “innovation communities also tend to behave in a collaborative manner” (2005, p. 105). Given these findings, it comes as no real surprise that researchers have pointed out the essential need for interactivity in online communities (Jones 1997; Schoberth, Preece & Heinzl 2003, p. 3). What is more, researchers found that a critical mass of users is needed “to initiate a sustainable interactive discourse” (Schoberth, Preece & Heinzl 2003, p. 3), probably because the number of possible interactions increases considerably with the size of the group (Butler 2001, p. 348). In the context of Wikipedia, Buriol et al. identified a high number of interactions between editors in articles (2006, p. 4).

The impact of interactivity on the quantity of information in Wikipedia articles may be diluted by “edit wars”, i.e. “interactions where two people or groups alternate between versions of the page”, which are not restricted to controversial topics (Viégas, Wattenberg & Dave 2004, p. 579). However, the number of edit wars has dropped significantly in the last few years (Viégas et al. 2007, p. 3). As a result, it is anticipated that the positive impacts highlighted above prevail and that not only the quantity but also the quality of the produced articles increases with the level of interactivity:

Hypothesis H5A: The quantity of information produced by an online community increases

with the level of interactivity.

Hypothesis H5B: The quality of articles produced by an online community increases with

the level of interactivity.

Dynamics: In order to thrive, communities have to be dynamic (Mynatt et al. 1998, p. 128). This study examines the dynamics within online communities by analyzing the distribution of contributions over time. Members of the analyzed communities can either contribute occasionally or collaborate intensively on the Wikipedia article within a short period of time. It is expected that the momentum gained in the latter case has positive effects on both the quantity and the quality of the output produced by the online communities:

Hypothesis H6A: The quantity of information produced by an online community increases

with the level of dynamics.

Hypothesis H6B: The quality of articles produced by an online community increases with

the level of dynamics.

3 Research Method

This chapter explains the research method applied in this study in greater detail. A short introduction of the research site is followed by sections dealing with the study design and the data collection process. The chapter concludes with information on which measures were used to operationalize each variable.

3.1 Research Site

Wikipedia, “the free encyclopedia that anyone can edit” (Wikipedia 2009l), is one of the most successful examples of massive collaborative content development (Ortega, Gonzalez-Barahona & Robles 2008, p. 304) and the largest encyclopedia in the world (Tapscott & Williams 2008, p. 71). It applies the “wiki” concept, invented by Cunningham, to allow users to easily edit articles, while saving all changes and revisions in its database (Holloway, Bozicevic & Börner 2007, p. 30). This history of each page provides a “design trace” of how the article evolved (Garud, Jain & Tuertscher 2008, p. 361) and offers valuable information on the editor, the time of the edit, and the changes committed (see figure 2 for an example).

Figure 2: Excerpt of the revision history of a Wikipedia article (Wikipedia 2009h)

Due to these comprehensive records of participation and the availability of complete database dumps (Wikimedia 2009), Wikipedia is a unique source of data (den Besten, Loubser & Dalle 2008, p. 8).

3.2 Study Design

As the aim of this study is to analyze the relationship between characteristics of online communities and the quantity and quality of output they create, this paper utilizes Wikipedia as a natural experiment to analyze a large number of communities with diverse characteristics. Owing to Wikipedia’s increasing popularity, its article base has grown significantly over the last few years (Viégas et al. 2007, p. 5). Consequently, complete dumps of the English Wikipedia have not only reached enormous file sizes that make them hard to analyze (den Besten, Loubser & Dalle 2008, p. 8) but have even failed or been corrupted recently (Wikimedia 2008b, 2008c). Due to these disturbances and its more manageable size, this paper focuses on the German-language Wikipedia, which is, following the English version, the second biggest of all language editions (Wikimedia 2008d). Given enough computing time, however, all the analyses conducted can easily be performed on the English version of Wikipedia as well as on an even larger sample.

To minimize biases due to changes in Wikipedia’s popularity and user base, only revisions of articles created in 2007 were analyzed. What is more, all articles edited by only one user were excluded, as they do not qualify as a community effort. Of the more than 160,000 remaining articles, redirects to other articles were removed and a random sample of 5000 articles was drawn.

The relationships between community-centered, user-centered and collaboration-centered characteristics on the one hand and the output dimensions Information-Quantity and Article-Quality on the other hand, as discussed in chapter 2, are examined in this natural experiment based on the last version of each article in 2007. Additionally, the age of the analyzed articles (time passed since the first edit) was controlled for to rule out the possible alternative explanation that older communities had more time to create extensive and high quality content. Figure 3 provides an overview of the assumed relationships that are tested in this study.

Figure 3: Research model (original illustration)

3.3 Data Collection and Cleansing

Complete database dumps of Wikipedia and its sister projects are provided online by the Wikimedia Foundation Inc. (Wikimedia 2009). Even though dumps including all pages with complete revision history are available, given the huge amount of data the “stub-meta-history.xml.gz” dump was used, which does not include any page text but contains the complete revision metadata. The dump from June 7th of the German Wikipedia (Wikimedia 2008a) was downloaded in August 2008 and imported into a MySQL database using the MWDumper tool (MediaWiki.org 2009).

Almeida et al. mention that Wikipedia dumps are often incomplete due to errors that occurred during their generation (2007, p. 2). Similar problems were found in the dump analyzed in this paper, where the table containing all pages was out of sync with the pages included in the table storing all revisions. Consequently, the distinct pages in the revision table were used as a basis and, where necessary, missing values were queried from the Wikipedia API (Wikipedia 2009g).

The Wikipedia database dump consists of more than two million pages. However, as the Wikipedia database comprises not only Wikipedia articles but also various other pages (e.g. talk pages, user pages and image pages; see Wikipedia 2009p), all pages not belonging to the main namespace, which consists of all articles ever written, were removed from the database. What is more, redirects to other articles were removed as well. A random sample of 5000 articles created in 2007 with more than one user was then drawn from the remaining database with standard SQL statements (see figure 4 for an overview of the sampling process).
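The filtering and sampling steps described above can be sketched as a single query. This is only an illustration under stated assumptions: it uses an in-memory SQLite database instead of the MySQL database used in the study, and the table and column names (page, revision, namespace, is_redirect, editor, ts) as well as the function draw_sample are hypothetical simplifications, not the actual MediaWiki schema or the statements used by the author.

```python
import sqlite3

# Hypothetical, simplified schema (assumption; not the real MediaWiki tables).
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, namespace INTEGER, is_redirect INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, page_id INTEGER, editor TEXT, ts TEXT);
"""

# Timestamps are assumed to be ISO-formatted strings, so string comparison
# orders them chronologically.
SAMPLE_QUERY = """
SELECT p.page_id
FROM page AS p
JOIN revision AS r ON r.page_id = p.page_id
WHERE p.namespace = 0                -- main namespace (articles) only
  AND p.is_redirect = 0              -- redirects are dropped
GROUP BY p.page_id
HAVING MIN(r.ts) >= '2007-01-01'     -- first edit, i.e. creation, in 2007
   AND MIN(r.ts) <  '2008-01-01'
   AND COUNT(DISTINCT CASE WHEN r.ts < '2008-01-01' THEN r.editor END) > 1
ORDER BY RANDOM()                    -- random order, then keep the first rows
LIMIT ?
"""


def draw_sample(conn: sqlite3.Connection, size: int) -> list:
    """Return up to `size` randomly chosen page ids that pass all filters."""
    return [row[0] for row in conn.execute(SAMPLE_QUERY, (size,))]
```

Counting only editors with revisions before 2008 is a design choice of this sketch, made because the dump also contains later revisions; the text does not spell out how this was handled.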

Figure 4: Sampling process (original illustration)

To examine the output of each community in greater detail, the last revision of each article in the year 2007 was downloaded using a Python script (Gude 2008), which was adapted to the German version of Wikipedia. The resulting XML files include, among other things, information about the article as well as the author, time and text (including wiki markup; see Wikipedia 2009v) of each revision (examples can be found in the appendix).

To calculate the readability of an article, a plain text version of the article content was required. Instead of removing the wikitext markup, however, it became evident that web scraping the content from the Wikipedia website was easier to accomplish. If an article is accessed online, the wikitext markup is parsed into formatted HTML text. What is more, the content of an article is preceded and followed by specific HTML comments (<!--start content -->, <!--end content -->) in the webpage’s source code. Hence, with the help of regular expressions an article’s content can easily be extracted and the HTML markup removed to yield the plain text version needed.
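A minimal sketch of this extraction step might look as follows. It assumes the two marker comments appear exactly once per page and tolerates variable whitespace inside them; the names CONTENT_RE, TAG_RE and extract_plain_text are illustrative and do not come from the parser described in the appendix.

```python
import html
import re

# Marker comments as named in the text; optional whitespace is tolerated.
CONTENT_RE = re.compile(
    r"<!--\s*start content\s*-->(.*?)<!--\s*end content\s*-->", re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_plain_text(page_source: str) -> str:
    """Return the article body between the two markers with HTML tags removed."""
    match = CONTENT_RE.search(page_source)
    if match is None:
        return ""                              # markers not found
    body = TAG_RE.sub(" ", match.group(1))     # replace every tag by a space
    body = html.unescape(body)                 # decode entities such as &auml;
    return " ".join(body.split())              # collapse runs of whitespace
```

Replacing tags by a space rather than by nothing keeps words in adjacent elements from being glued together, at the cost of an occasional stray space before punctuation.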

Due to the large number of articles under study, a parser was developed in the Python programming language to automatically obtain and analyze the files discussed above (an in-depth explanation of this method can be found in the appendix). In this process, edits by bots, i.e. “automated or semi-automated tools that carry out repetitive and mundane tasks” (Wikipedia 2009c), were identified on the basis of a recent user-group assignment list (Wikipedia 2009b) and omitted from the analyses. It is important to note, however, that these assignments are not static and may have changed since 2007, so that some bots may not have been recognized correctly by the parser.

Vandalism is another topic which needs to be addressed in this context. Due to its low entry barrier, Wikipedia is quite vulnerable to vandalism. However, because all revisions are stored in the database, malicious edits can be fixed easily and quickly. Indeed, Wikipedians do a very good job, as flawed articles are often amended within minutes (Viégas, Wattenberg & Dave 2004, p. 579). Vandalism can occur in various forms and is often hard to detect automatically, as there is no crystal clear definition of vandalism in Wikipedia (pp. 578-579). To reduce the number of false positives, only two largely unambiguous cases were marked as vandalism in this analysis:

• Mass deletion of all content on a page

• More than 90% of the content was deleted, the remaining text has fewer than 500 characters and no meaningful comment was given (Wikipedia 2009w)
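The two rules above can be sketched as a small predicate. The sketch measures content in characters and treats any non-empty edit comment as "meaningful"; both are assumptions made for illustration, as is the function name, since the text does not specify how these details were implemented.

```python
def is_flagged_as_vandalism(old_text: str, new_text: str, comment: str) -> bool:
    """Apply the two conservative rules listed above to a single revision."""
    old_len, new_len = len(old_text), len(new_text)
    if old_len == 0:
        return False                       # nothing could have been deleted
    if new_len == 0:
        return True                        # rule 1: the page was blanked
    deleted_share = (old_len - new_len) / old_len
    has_comment = bool(comment.strip())    # assumption: non-empty means meaningful
    # rule 2: large deletion, short remainder, no explanation
    return deleted_share > 0.9 and new_len < 500 and not has_comment
```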

However, the majority of articles examined by hand showed low levels of vandalism, probably because vandals tend to “specialize” in very popular articles (Wikipedia 2009u), which are often older than the articles analyzed in this study. Edits identified by the measures explained above were omitted, and vandals were excluded from further analysis of the characteristics of the specific community.

3.4 Measures

This section deals with the way the discussed factors of performance and community characteristics were operationalized and explains the applied metrics.

3.4.1 Operationalization of Performance Indicators

As already discussed in chapter 2.2 the constructs Information-Quantity and Article-Quality

consist of several variables.

3.4.1.1 Information-Quantity

The following three variables were standardized and averaged to build the Information-Quantity construct:

Number of words: To quantify the length of an article, the words included in the article were counted. To this end, the markup was stripped from the HTML version of each article to yield a plain text version. This text was split into individual words at every white-space character with the help of regular expressions.

Vocabulary: The number of unique words was calculated accordingly. For simplicity, no stemming was conducted and stop words were not removed.

Number of links: Regular expressions were used to determine outgoing links on every analyzed article page. Duplicate links and page-internal links were omitted.
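The three counts can be illustrated with a few lines of Python. This is a sketch under assumptions: words are compared case-sensitively, links are read from href attributes in the rendered HTML, and a link counts as page-internal when its target starts with "#". None of these details is stated in the text, and the function names are illustrative.

```python
import re

HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"', re.IGNORECASE)


def tokens(plain_text: str) -> list:
    """Split the text at white space; no stemming, stop words are kept."""
    return re.findall(r"\S+", plain_text)


def number_of_words(plain_text: str) -> int:
    return len(tokens(plain_text))


def vocabulary(plain_text: str) -> int:
    """Number of unique words (case-sensitive comparison is an assumption)."""
    return len(set(tokens(plain_text)))


def number_of_links(html_body: str) -> int:
    """Number of distinct link targets, ignoring page-internal anchors."""
    targets = {href for href in HREF_RE.findall(html_body) if not href.startswith("#")}
    return len(targets)
```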

3.4.1.2 Article-Quality

The Article-Quality construct was created by standardizing and averaging the following two

variables:

Readability: The Flesch reading ease is a function of the average sentence length (ASL; words per sentence) and the average number of syllables per word (ASW) (den Besten, Loubser & Dalle 2008, p. 10):

$$\mathrm{FRE}_{\mathrm{English}} = 206.835 - 1.015 \cdot \mathrm{ASL} - 84.6 \cdot \mathrm{ASW}$$

This formula yields a number between 0 (very difficult) and 100 (very easy), with standard English texts usually scoring between 60 and 70 (den Besten, Loubser & Dalle 2008, p. 11). Flesch readability scores were calculated for the plain text version of each article’s last revision in 2007 using an online tool (stilversprechend.de 2009a), which applies an adapted version of the formula for the German language (stilversprechend.de 2009b):

$$\mathrm{FRE}_{\mathrm{German}} = 180 - \mathrm{ASL} - 58.5 \cdot \mathrm{ASW}$$
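Both formulas translate directly into code. The sketch below takes ASL and ASW as given inputs, because counting sentences and syllables reliably is a separate problem that the study delegated to an online tool; the function names are illustrative.

```python
def flesch_english(asl: float, asw: float) -> float:
    """Flesch reading ease for English: 206.835 - 1.015 * ASL - 84.6 * ASW."""
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_german(asl: float, asw: float) -> float:
    """Adapted formula for German: 180 - ASL - 58.5 * ASW."""
    return 180.0 - asl - 58.5 * asw
```

For a text with 15 words per sentence, the English formula gives about 64.7 at 1.5 syllables per word, and the German formula gives about 59.7 at 1.8 syllables per word.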

Number of categories: In Wikipedia an article can be placed into a category by adding a specific category tag (“[[Kategorie:Category name]]” in the German language version, Wikipedia 2009f) to the page. Occurrences of these tags were counted in the last revision of 2007 to calculate the number of categories for each analyzed article.
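Counting these tags in the wikitext of a revision could be done as sketched below. The pattern is an assumption about what counts as a tag: it accepts optional whitespace and a sort key after a pipe, and it deliberately skips links to a category page, which start with a colon.

```python
import re

# Matches [[Kategorie:Name]] and [[Kategorie:Name|sort key]], but not
# [[:Kategorie:Name]], which merely links to the category page.
CATEGORY_RE = re.compile(r"\[\[\s*Kategorie\s*:[^\]]+\]\]", re.IGNORECASE)


def number_of_categories(wikitext: str) -> int:
    """Count category tags in the wikitext of one revision."""
    return len(CATEGORY_RE.findall(wikitext))
```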

3.4.2 Operationalization of Community Characteristics

In the following paragraphs the operationalization of the community-centered, user-centered and collaboration-centered variables is outlined.

3.4.2.1 Community-Centered Perspective

Size: Users who want to contribute to an article in Wikipedia have two options: they can either sign up to Wikipedia or choose to remain anonymous. Whereas in the former case their username is associated with their revisions, their IP address is stored in the latter case. It has been argued that anonymous users play an important role in the creation of content (Anthony, Smith & Williamson 2007, p. 15; Kittur et al. 2007, p. 6; Viégas, Wattenberg & Dave 2004, p. 580). Consequently, this paper examines the number of distinct users regardless of whether they sign up/in or choose to stay anonymous, even though Stvilia et al. point out that this number can only be an approximation of the actual number of distinct editors, as an individual may, for example, make edits with more than one username (2005, p. 5).

Heterogeneity: On average, users in the analyzed sample edited 265 articles in 2007. Consequently, the 265 most important articles (i.e. the articles most members of the community contributed to during the year 2007) were queried from the created tables with standard SQL statements when analyzing the heterogeneity of the members of a community. In the next step, a vector was created for each community user depicting his/her editing pattern across those articles: the number of edits for each article he/she co-authored and “0” for each article he/she did not edit. These vectors of edited articles can be understood as areas of “common interest” (Korfiatis, Poulos & Bokos 2006, p. 256), “interest profiles” (Cosley, Ludford & Terveen 2003, p. 2) or “knowledge profiles” (Van Alystyne & Brynjolfsson 2005, p. 854). The similarity of each user to every other user was then computed by calculating the cosine of each knowledge profile pair (Manning & Schütze 2003, p. 300):

$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$

This measure, often called cosine similarity, is the cosine of the angle between two vectors and has already been used in other studies to analyze the similarity of community members (for example in: Cosley, Ludford & Terveen 2003, p. 4; Van Alystyne & Brynjolfsson 2005, p. 854). Similarly to Van Alystyne & Brynjolfsson’s approach, groups of users are compared in this paper by the average similarity of their profiles (2005, p. 854). The heterogeneity was then calculated by subtracting a community’s average similarity (a number between 0 and 1) from 1:

$$\mathrm{Heterogeneity} = 1 - \mathrm{Average\ Similarity}$$
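The heterogeneity measure defined above can be sketched in a few lines. Two points are assumptions of this sketch rather than statements from the text: the average is taken over all unordered pairs of community members, and a profile consisting only of zeros is given a similarity of 0 to every other profile.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(x: list, y: list) -> float:
    """Cosine of the angle between two edit-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # assumption: empty profile is dissimilar
    return dot / (norm_x * norm_y)


def heterogeneity(profiles: list) -> float:
    """1 minus the average pairwise cosine similarity of the members' profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("at least two profiles are required")
    average = sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)
    return 1.0 - average
```

Two members who edited entirely different articles yield a heterogeneity of 1, while two members whose edit counts are proportional to each other yield 0.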

3.4.2.2 User-Centered Perspective

Activity: To measure the general activity of community members, the number of edits in all

articles in Wikipedia in the year 2007 per user was queried from the database and averaged

for each community. Let us suppose a community consists of two contributors. A made 100

edits in Wikipedia articles in 2007, whereas B contributed 200 times. The activity of users in

this community hence amounts to 150 edits.

Focus: To assess the level of commitment in each community, the average proportion of its members’ activity that falls within the analyzed community was calculated. To continue the previous example: A and B each made 10 edits in the analyzed article. The focus of users in this community hence amounts to 0.075 (the mean of A: 10/100 and B: 10/200).
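The two user-centered measures and the worked example with contributors A and B can be reproduced with the following sketch. The dictionaries mapping user names to edit counts are an illustrative data layout, not the structure used in the study.

```python
def activity(total_edits: dict) -> float:
    """Average number of Wikipedia-wide edits in 2007 per community member."""
    return sum(total_edits.values()) / len(total_edits)


def focus(edits_in_article: dict, total_edits: dict) -> float:
    """Average share of each member's edits that went into the analyzed article."""
    shares = [edits_in_article[user] / total_edits[user] for user in edits_in_article]
    return sum(shares) / len(shares)
```

With total edits of 100 for A and 200 for B, and 10 edits each in the analyzed article, activity is 150 and focus is (0.1 + 0.05) / 2 = 0.075.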

3.4.2.3 Collaboration-Centered Perspective

Interactivity: All edits in 2007 of each article in the sample were analyzed in this paper. In a first step, the number of interactive edits was counted, i.e. the first edit and all edits that were preceded by an edit of another community member. The level of interactivity was then calculated as the ratio between the number of interactive edits minus the number of distinct authors and the total number of edits:

$$\mathrm{Interactivity} = \frac{\text{Interactive Edits} - \text{Users}}{\text{Total Edits}}$$

Let us suppose that four users (A, B, C, D) created an article and the revision history reveals

the following eight edits: A B C D A B C D. The interactivity level of this example amounts

to 0.5 as all edits by those four contributors are interactive edits ([8-4]/8).
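The definition and the example can be checked with a short sketch. The input is assumed to be the chronologically ordered list of editor names of one article; the function name is illustrative.

```python
def interactivity(editors: list) -> float:
    """(interactive edits - distinct users) / total edits for one article."""
    if not editors:
        raise ValueError("the revision history must contain at least one edit")
    interactive = 1                              # the first edit counts by definition
    for previous, current in zip(editors, editors[1:]):
        if current != previous:                  # preceded by another member's edit
            interactive += 1
    return (interactive - len(set(editors))) / len(editors)
```

For the sequence A B C D A B C D this returns (8 - 4) / 8 = 0.5, whereas a history such as A A B B, in which each user simply edits in one block, returns 0.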

Dynamics: To assess the dynamics within communities, the median time between edits was calculated for each article. This metric makes it possible to analyze whether community members intensively edited the article in a short period of time or whether their efforts were distributed over the whole year 2007. As less dynamic communities exhibit higher median times between edits, the resulting figure was multiplied by -1 to ease interpretation.
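A minimal sketch of this measure is given below. Expressing the gaps in days is an assumption made here for readability; the text does not state the unit used in the regressions.

```python
from datetime import datetime
from statistics import median


def dynamics(timestamps: list) -> float:
    """Negative median time (in days) between consecutive edits of one article."""
    ordered = sorted(timestamps)
    gaps = [
        (later - earlier).total_seconds() / 86400.0
        for earlier, later in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("at least two edits are required")
    return -median(gaps)                 # sign flipped: higher means more dynamic
```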

4 Results

The following chapter presents the results of this study in two sections. While the first section, Descriptive Statistics, provides an in-depth descriptive analysis of the articles and communities in the sample, the second section, Inferential Statistics, contains the results of two ordinary least squares (OLS) regressions used to statistically test the developed hypotheses on the relationships between characteristics of online communities and their performance.

4.1 Descriptive Statistics

The random sample drawn from the Wikipedia database consists of 5000 articles that were created in 2007 and edited by more than one user. Due to these sampling criteria it is no surprise that the article age is slightly skewed towards older articles, which had more time to attract enough contributors. It averages 195.81 days (standard deviation: 105.55) with a minimum of 0.43 days and a maximum of 364.94 days. Furthermore, as Wikipedia is the largest encyclopedia in the world (Tapscott & Williams 2008, p. 71) and is still growing (Wikipedia 2009i), it is plausible that articles created in 2007, as evident from the sample, often cover very specific niche topics or recent events.

Community-centered perspective: Online communities in the sample generally show a moderate number of contributors with an average of 5.57 users (s.d.: 5.67). As predetermined by the sampling criteria, the minimum number of users found in the sample is 2. However, there are also a number of outliers. The most extreme outlier, with 230 contributors, is the community that collaborated on the article about Knut, the famous polar bear born in the Berlin Zoo. Concerning the heterogeneity of community members, considerable diversity was found. The average level of heterogeneity amounts to 0.8252 (s.d.: 0.1272) with a minimum of 0.00 and a maximum of 0.9998.

User-centered perspective: On average, users in a community show an activity of 5135.79 contributions (s.d.: 4305.81). The minimum activity amounts to 2 edits, while the maximum is 38029. Furthermore, community members tend to work on more than one article, as the average focus only amounts to 9.04 percent (s.d.: 13.26) with a minimum of 0.00 percent and a maximum of 91.34 percent.

Collaboration-centered perspective: The average interactivity level in the sampled communities amounts to 0.0908 (s.d.: 0.1079) with a minimum of 0 and a maximum of 0.6. The median time between edits averages 8.77 days (s.d.: 24.58) with a minimum of 21 seconds and a maximum of 283.93 days.

Information-Quantity: The average article in the sample consists of 382.77 words (s.d.: 543.64). There is, however, also an article without any words in the sample, as its last revision of the year 2007 did not contain any content. The longest article is an in-depth description of the course of the NHL season 2007 (11784 words). What is more, 220.06 unique words (s.d.: 216.78) are used on average in each article. While the minimum is again 0, stemming from the empty article discussed above, the article with the most unique words contains a table of Chinese Unicode characters (3738 words). The average number of unique links amounts to 33.2 (s.d.: 37.23) per article. The minimum number of links is once more 0 due to the empty article, while the article with the highest number of unique links lists all female Olympic medalists in athletics (789 links).

Article-Quality: The average Flesch readability score of articles in the sample amounts to 56.97 (s.d.: 11.42), indicating reasonable readability (stilversprechend.de 2009b). For six articles, however, no valid readability values could be determined due to the insufficient length of the articles’ content. What is more, looking at the most extreme outliers (min: 5; max: 100) reveals that articles consisting of mere tables and lists cannot be assessed well with the help of the Flesch readability function. On average, articles in the sample are placed into 3.05 categories (s.d.: 2.15). While there are several articles in the sample which are not part of any category, an article dealing with the achievements of a German silviculture scientist shows the highest number of categories (18).

4.2 Inferential Statistics

The hypotheses developed in chapter 2 were statistically tested with the help of two OLS regressions. While the first regression deals with the influences of community characteristics on the quantity of information produced, the second regression examines their effects on the articles’ quality. Table 1 summarizes the test results derived from these two regressions:

                          Dependent Variables

                          Information-Quantity¹         Article-Quality¹
                          Hypothesis   β                Hypothesis   β

Community-Centered
  # Users                 H1A (+)       0.290***        H1B (+)       0.080***
  Heterogeneity           H2A (∩)      -0.090***        H2B (∩)      -0.077***
  Heterogeneity²          H2A (∩)      -0.067**         H2B (∩)      -0.083***

User-Centered
  Activity                H3A (+)      -0.027†          H3B (0)       0.014
  Focus                   H4A (+)      -0.054***        H4B (+)      -0.102***

Collaboration-Centered
  Interactivity           H5A (+)       0.089***        H5B (+)       0.020
  Dynamics                H6A (+)       0.021           H6B (+)       0.076***

Article age                            -0.059***                     -0.008

R² (R² adjusted)                        0.105 (0.103)                 0.025 (0.023)
F-Value                                 74.175                        15.967
p-Value                                 0.000                         0.000

† p < .10, * p < .05, ** p < .01, *** p < .001 (all two-tailed tests); articles: n=5000; ¹ values are standardized coefficients (β-values); predictors were standardized before entry

Table 1: Results of the conducted OLS-regressions
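To make the structure of such a regression concrete, the following sketch standardizes a predictor, adds its square, and estimates the coefficients by ordinary least squares using only the standard library. It is not the software used in the study. The data are invented for illustration, and two modelling details are assumptions: the population standard deviation is used for standardization, and the quadratic term is the square of the standardized predictor.

```python
from statistics import mean, pstdev


def standardize(values: list) -> list:
    """z-scores: subtract the mean, divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a: list, b: list) -> list:
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns: list, y: list) -> list:
    """Least-squares coefficients (intercept first) from X'X b = X'y."""
    cols = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)


# Invented example data: an outcome that depends on a predictor and its square.
het = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
z = standardize(het)
z_squared = [v * v for v in z]
outcome = [1.0 + 0.5 * a - 0.3 * b for a, b in zip(z, z_squared)]
coefficients = ols([z, z_squared], outcome)   # intercept, linear term, quadratic term
```

Because the invented outcome contains no noise, the estimated coefficients recover the values used to generate it; a negative coefficient on the squared term is what signals an inverted U.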

4.2.1 Results related to Information-Quantity

The fit indices for the OLS-regression on Information-Quantity indicate a good fit of the

model with an adjusted R-square of 0.103 (F-Value: 74.175; p-Value: 0.000).

The test of H1A revealed that the quantity of information produced by an online community,

as predicted, increases with the number of contributors (β= 0.290***).

Furthermore, the coefficient for the quadratic term of heterogeneity shows the expected nega-

tive sign, which is indicative of an inverted U-shaped relationship between Information-

Quantity and heterogeneity (Aiken, West & Reno 1991, p. 65). Using differential calculus, the

maximum point of the inverted U can easily be calculated (Aiken, West & Reno 1991, p. 65;

Eisinga, Scheepers & van Snippenburg 1991, p. 113) and is located at a heterogeneity level of

0.58. In order to be able to compare the effect of heterogeneity with the standardized regres-

sion coefficients of other predictors, the method outlined in Eisinga, Scheepers & van Snip-

penburg (1991, p. 109) was used to obtain a composite effect of the linear and quadratic term.

The standardized regression coefficient of this combined effect amounts to 0.064 and is

highly significant (p<0.001). Note that even though the sign of this coefficient is a technical

artifice (Eisinga, Scheepers & van Snippenburg 1991, p. 110) and is hence not related to the

sign of the relationship between independent and dependent variable, its size allows investi-

gating the relative importance of heterogeneity for the explanation of the dependent variable

Information-Quantity. These results support H2A.
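The two calculations described above can be sketched as follows. Setting the first derivative of b0 + b1·x + b2·x² to zero gives the turning point x* = -b1 / (2·b2), which is a maximum when b2 is negative. For the combined effect, the sketch forms a compound variable b1·x + b2·x² and divides its standard deviation by that of the dependent variable. This is one common way to obtain a standardized effect for a set of polynomial terms and is assumed here to correspond to the cited method; it may differ in detail. The unstandardized coefficients behind the reported maximum of 0.58 are not given in this section, so the data and coefficients below are synthetic and the function names are hypothetical.

```python
import numpy as np


def turning_point(b_linear, b_quadratic):
    """Vertex of b0 + b_linear*x + b_quadratic*x**2 (from dy/dx = 0)."""
    if b_quadratic == 0:
        raise ValueError("no turning point without a quadratic term")
    return -b_linear / (2.0 * b_quadratic)


def composite_effect(x, y):
    """Return (standardized effect of the compound x and x**2, turning point)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x, x ** 2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    b_linear, b_quadratic = coef[1], coef[2]
    # compound variable: weighted sum of the linear and the quadratic term
    compound = b_linear * x + b_quadratic * x ** 2
    # magnitude of the combined effect; it is non-negative by construction,
    # so its sign carries no information about the direction of the effect
    beta = compound.std(ddof=1) / y.std(ddof=1)
    return beta, turning_point(b_linear, b_quadratic)


# Synthetic data with an assumed peak at 1.16 / (2 * 1.0) = 0.58
rng = np.random.default_rng(0)
heterogeneity = rng.uniform(0.0, 1.0, size=5000)
output = (1.16 * heterogeneity - 1.0 * heterogeneity ** 2
          + rng.normal(0.0, 0.05, size=5000))

beta, peak = composite_effect(heterogeneity, output)
print(round(peak, 2), round(beta, 2))
```

In a full model with further predictors, the same compound would be built from the coefficients estimated alongside those predictors rather than from a regression on heterogeneity alone.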

Even though positive impacts of the activity and focus of community members on the quantity

of output were predicted, the analysis revealed negative relationships (activity: β= -0.027†;

focus: β= -0.054***). Consequently, H3A and H4A had to be rejected.

The analysis provides support for H5A, as the level of interactivity within the examined

communities has a highly significant positive influence on the information quantity (β=

0.089***).

H6A, however, which predicted a positive influence of collaboration dynamics, did not find

empirical support, as the relationship was not significant (β= 0.021; p= 0.139).

The control variable article age exerts a negative influence on the quantity of information (β=

-0.059***).

4.2.2 Results related to Article-Quality

The indicators of how well the model fits the data point out a moderate fit with an adjusted R-

square of 0.023 (F-Value: 15.967; p-Value: 0.000).

The analysis reveals that the quality of the articles produced, as predicted, increases with the

number of contributors (β= 0.080***). Hence, H1B was supported by the data.

Again, the coefficient for the quadratic term of heterogeneity shows the expected negative

sign, suggesting an inverted U-shaped relationship between Article-Quality and heterogeneity.

The maximum point of the inverted U is located at a heterogeneity level of 0.66. What is

more, the standardized regression coefficient of the combined effect amounts to 0.062 and is

highly significant (p<0.001). These results support H2B.

As predicted in H3B, no evidence of a relation between the activity of community members

and the quality of the output could be found (β= 0.014; p= 0.375).

H4B, which predicted a positive influence of focus of community members, had to be rejected

as the impact turned out to be negative (β= -0.102***).

H5B, positing a positive influence of interactivity, did not find empirical support in the data.

Even though the sign is as expected, the effect is not significant (β= 0.020; p= 0.153).

The expected positive influence of the collaboration dynamic (H6B) found support in the data

(β= 0.076***).

Article age was controlled for but showed no significant impact on the quality of produced

articles (β= -0.008; p= 0.601).

5 Discussion and Implications

The aim of this study was to investigate the relationship between the characteristics of online

communities and their performance. Therefore the output and characteristics of 5000 commu-

nities gathering around Wikipedia articles were analyzed from different perspectives. The

results demonstrate that the number of users is by far the most influential force that drives

content creation. When it comes to the quality of the created output, however, characteristics

of community members and how they collaborate are as important as the sheer number of

contributors. The following paragraphs review and discuss these and other findings in greater

detail.

Among other things, the study revealed that the quantity of information created by an online

community is related to a number of community characteristics. Table 2 summarizes those

findings:

Perspective | Hypotheses: The quantity of information produced by an online community … | Supported
Community-Centered | H1A: … increases with the number of participants. | YES
Community-Centered | H2A: … follows an inverted U-shaped curvilinear relationship with the level of its heterogeneity. | YES
User-Centered | H3A: … increases with the activity of its contributors. | NO
User-Centered | H4A: … increases with the focus of its contributors. | NO
Collaboration-Centered | H5A: … increases with the level of interactivity. | YES
Collaboration-Centered | H6A: … increases with the level of dynamics. | NO

Table 2: Summary of results for hypotheses H1A-H6A (Information-Quantity)

Both hypotheses regarding community-centered characteristics found support in the data. The

analysis showed that the size of the community has by far the biggest influence among all

factors, with larger communities tending to create more output. Furthermore, the output of

community users, as predicted, follows an inverted U-shaped curvilinear relationship concern-

ing their heterogeneity, probably the result of disagreement over what should or should not be

part of an article.

When it comes to the hypotheses concerning user-centered characteristics, neither the expected

positive influence of activity nor the posited positive influence of focus was supported

by the data as both influences turned out to be negative. In their analyses Kittur et al. found

that more active and experienced users tend to add more content than novice users (2007, p.

6). They, however, calculated these numbers over the whole time Wikipedia had been in exis-

tence and this reported trend may have shifted over recent years, especially in newly created

articles that, as already discussed above, nowadays often cover very specific, niche topics.

What is more, as the activity of community members in this study was measured as the num-

ber of contributions in 2007, this finding may be diluted by the fact that experienced users

that were very active in previous years and curbed their activity in 2007 were counted as oc-

casional contributors. Regarding the focus of community members on specific articles it was

expected that specialization leads to an increase in the output created. However, it turned out

that online communities benefit if their members are not too focused on a task. It seems as if

not only experts in a field but also novice users can contribute considerably to an open-

content project. Nevertheless, further research is needed to clarify the impact of activity and

focus on the quantity of content created.

What is more, whereas evidence of the positive influence of interactivity on the quantity of

created content was found, the impact of dynamics was positive but not significant. Thus, it

could be shown that the output increases as community members do not work in solitary con-

ditions but assist each other and collaborate interactively.

If anything, a positive impact of the control variable article age was expected. The negative

influence found, however, indicates that even young communities can be very productive. Some

of the articles examined may not only have grown over time, but also might have been short-

ened again. These shrinkages can happen if text is deleted or more dramatically if an article is

split and large sections of it are moved to a more specific page (Viégas, Wattenberg & Dave

2004, p. 580).

The second OLS-regression conducted showed the links between the quality of the output

created by an online community and a number of community characteristics. Table 3 summa-

rizes those findings:

Perspective | Hypotheses: The quality of articles produced by an online community … | Supported
Community-Centered | H1B: … increases with the number of participants. | YES
Community-Centered | H2B: … follows an inverted U-shaped curvilinear relationship with the level of its heterogeneity. | YES
User-Centered | H3B: … is independent of the activity of its contributors. | YES
User-Centered | H4B: … increases with the focus of its contributors. | NO
Collaboration-Centered | H5B: … increases with the level of interactivity. | NO
Collaboration-Centered | H6B: … increases with the level of dynamics. | YES

Table 3: Summary of results for hypotheses H1B-H6B (Article-Quality)

The analysis provided support for both hypotheses regarding the impact of community-

centered characteristics. Larger communities tend to create output of higher quality. In con-

trast to the results on Information-Quantity, however, community size is not the most impor-

tant factor. Again, evidence for an inverted U-shaped curvilinear relationship between the

heterogeneity of community members and the quality of their output was found. This finding

highlights the importance of a moderate level of heterogeneity in an online community.

Regarding the user-centered characteristics, as predicted, the average activity had no signifi-

cant influence on the quality of the communities’ output. Furthermore, it became evident that

the posited positive influence of focus is in fact negative and the most important of all factors

influencing an article’s quality. Concerning the influence of activity additional research is

needed to test whether a distinction between user-groups can replicate the findings of An-

thony, Smith & Williamson who found that the quality of edits by anonymous users decreases

with the number of their overall edits while the quality of contributions by registered users

points in the opposite direction (2007, p. 16). Similarly to the findings on Information-

Quantity, the results regarding the influence of focus on the quality of the created output im-

ply that communities benefit significantly from contributions by users who are not specialized

on an article but work on a variety of topics, or as Wikipedians put it:

“It turns out that in some ways, analytic skills and neutrality often play a greater role

than specialisation; editors who have worked for a time on a variety of articles usually

become quite capable of making good quality editorial decisions regarding specialist

material, even on unfamiliar technical subjects” (Wikipedia 2009m).

However, every article needs some experts that watch for and correct errors (Wikipedia

2009m). In line with these findings, Williams & Cothrel (2000, p. 90) stress the importance of

maintaining a balance between experts and novice users. Nevertheless, further research is

needed to clarify these connections in greater detail.

Finally, the effects of both collaboration-centered characteristics, interactivity and dynamics, show

the expected positive trend. The effect of interactivity, however, is not significant and hence

needs further clarification. The importance of intensive collaboration in online communities is

further stressed by the evident positive impact of its dynamics on the quality of the produced

output.

Following this discussion of the link between characteristics of online communities and their

performance, it is of interest to examine which communities generate both extensive and high

quality content. Therefore, the significant effects of community characteristics on both per-

formance indicators Information-Quantity and Article-Quality found in this study are summa-

rized in table 4.

                       | Information-Quantity | Article-Quality
Community-Centered
# Users                | +                    | +
Heterogeneity          | ∩                    | ∩
User-Centered
Activity               | -                    |
Focus                  | -                    | -
Collaboration-Centered
Interactivity          | +                    |
Dynamics               |                      | +

Table 4: Significant effects of community characteristics on performance

Looking at this table it becomes evident that when taking both quantity and quality into ac-

count, those communities perform best that consist of a large number of users with a moderate

level of heterogeneity and a fair share of occasional and novice contributors who operate in a

variety of fields and collaborate interactively and dynamically.
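The statements about relative importance made in this discussion can be retraced by ranking the significant effects by the absolute size of their standardized coefficients. The sketch below copies the β-values from Table 1 and uses the combined linear and quadratic effect reported in the text for heterogeneity (0.064 and 0.062). The dictionary layout and the function name `rank_by_magnitude` are illustrative assumptions; effects marked † are treated as significant here, as in Table 4.

```python
# predictor -> (standardized coefficient, significant at p < .10 or better)
QUANTITY = {
    "# Users": (0.290, True),
    "Heterogeneity (combined)": (0.064, True),
    "Activity": (-0.027, True),  # only marginally significant (†)
    "Focus": (-0.054, True),
    "Interactivity": (0.089, True),
    "Dynamics": (0.021, False),
    "Article age": (-0.059, True),
}

QUALITY = {
    "# Users": (0.080, True),
    "Heterogeneity (combined)": (0.062, True),
    "Activity": (0.014, False),
    "Focus": (-0.102, True),
    "Interactivity": (0.020, False),
    "Dynamics": (0.076, True),
    "Article age": (-0.008, False),
}


def rank_by_magnitude(effects):
    """Names of the significant predictors, largest absolute beta first."""
    significant = {name: beta for name, (beta, sig) in effects.items() if sig}
    return sorted(significant, key=lambda name: abs(significant[name]),
                  reverse=True)


print(rank_by_magnitude(QUANTITY))
print(rank_by_magnitude(QUALITY))
```

For Information-Quantity the number of users leads by a wide margin, whereas for Article-Quality focus ranks first, closely followed by the number of users and dynamics.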

5.1 Implications for Theory

While previous approaches often examined particular aspects of online communities, this

study introduced a framework to combine several of these perspectives to analyze the link

between community characteristics and performance. Furthermore, it extended the current

literature on online communities by utilizing Wikipedia as a massive experiment to analyze

5000 diverse communities and thereby empirically testing and substantiating these relation-

ships.

Even though additional research is needed to further clarify certain findings, it was shown that

not only general characteristics of online communities but also user specific characteristics

play an important role in the creation of extensive and high quality content. What is more, the

results also stress the importance of taking the specific way members of online communities

collaborate into account.

Future research may build on these findings and the approach applied in this study to develop

an even more detailed framework for analyzing the link between characteristics of online

communities and their performance.

5.2 Implications for Methods

In this study, a scalable approach was introduced to automatically analyze diverse communities

in online environments. Thanks to the availability of and easy access to its database,

Wikipedia is a unique source of data that proved to be a good research site for natural experi-

ments and yielded considerable insights into the link between characteristics and performance

of online communities.

The most accurate analysis of communities in Wikipedia could be gained from analyzing all

the available data. However, the databases of all popular language editions have grown to

enormous sizes and hence working with a sample seems to be the best way to proceed. Due to

limited computing resources and the large size of the English Wikipedia database it was de-

cided in this study to analyze a sample of 5000 communities in the German version of

Wikipedia. Given more computing time, however, the analyses conducted can easily be extended

to a bigger sample as well as to different language editions in order to compare and validate

results, thanks to the efficient and scalable approach applied in this study. What is more, the boundaries

of analyzed communities can be enlarged to not only examine communities gathering around

individual articles but larger communities e.g. in WikiProjects, which are collections of arti-

cles that deal with specific topics (Wikipedia 2009t).

As this study analyzed the output of communities based on the latest version of the year 2007,

advancing the introduced method to see how the output changes and evolves over time and in

years to come may yield additional insights. This information can be easily extracted from the

collected data by the developed parser.

Even though the data available on Wikipedia and its sister projects (e.g. Wikibooks, Wiktion-

ary, Wikinews etc.) still promise various ways of analyses and research, the parser developed

can also easily be expanded and adapted to analyze other online environments to increase the

generalizability of the results.

5.3 Implications for Practice

The results of this study show that the performance of online communities depends not only on

general community characteristics like size and heterogeneity but also on more user specific

characteristics such as activity and focus. What is more, how these users collaborate plays an

important role in influencing content quantity and quality. These findings suggest that com-

munity operators can pro-actively influence the performance of online communities by pro-

viding favorable conditions.

As mentioned before, Wikipedia is one of the most successful examples of mass-collaboration,

most likely due to the favorable conditions provided by its operators and the

software used. While the low entry barriers for contributors for example allow novice users to

contribute without going through a lengthy sign-up process often found in other online environments,

the applied “wiki”-concept ensures that they cannot do any real harm. What is

more, several tools like watch lists and revision histories support contributors in collaborating

interactively and dynamically.

Community operators can learn from the presented results and Wikipedia, as a best-practice

example, to apply appropriate strategies and tools in their effort to influence the performance

of communities.

When involving online communities in their innovation process, businesses generally have

two distinct options: they can either try to find and harness an already existing community or

attempt to build their own (Franke 2005, p. 708). Either way, results of this study imply that

they should aim for the following characteristics of online communities to foster the creation

of content which is both extensive and of good quality:

• A large community consisting of moderately heterogeneous users

• Low entry barriers that allow both novice users and experts to collaborate on various tasks and topics

• An environment which not only supports but fosters interactive and dynamic collaboration

5.4 Limitations

The methods employed in this study have a number of inherent limitations and involve a

number of assumptions that are challenged and discussed in the following paragraphs.

This study used cross-sectional data to examine the link between several community charac-

teristics and the performance of online communities. Even if most of the developed hypothe-

ses were supported by the data and several meaningful correlations could be found, it could

still be that this study mixed up cause and effect. Longer articles for example may attract

more contributors than shorter articles and not the other way around. Longitudinal analyses

may allow stronger causal claims than the approach applied.

Due to the fact that this analysis was not a controlled experiment in a laboratory setting but

rather a natural experiment, not all variables could be controlled. External influences can

hence not be ruled out and unmeasured variables may have had a significant impact on the

results. Especially the low fit of the OLS-regression on Article-Quality highlights that some

important variables may have been omitted.

To keep the research design concise and easy to understand only the main effects of inde-

pendent variables were analyzed in this study. However, during the analyses it became evi-

dent that there could be significant interactions between several community characteristics

discussed in this paper. Including interactions between these variables in the analyses may

yield a more complex, yet more comprehensive model and increase its fit with the data.
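How an interaction between two community characteristics could be added to such a model is sketched below. The example mean-centers two predictors, multiplies them to form an interaction term, and compares the R² of the models with and without that term. Which characteristics actually interact is not reported in this study, so the data, the effect sizes and the function names `add_interaction` and `r_squared` are purely hypothetical.

```python
import numpy as np


def add_interaction(X, i, j):
    """Return X with one extra column: product of mean-centered columns i, j."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # centering before multiplying reduces collinearity with the main effects
    product = centered[:, i] * centered[:, j]
    return np.column_stack([X, product])


def r_squared(X, y):
    """R² of an OLS fit of y on X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    total = y - y.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)


# Synthetic data in which two predictors interact (assumed, not thesis data)
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + rng.normal(size=5000)

r2_main = r_squared(X, y)
r2_interaction = r_squared(add_interaction(X, 0, 1), y)
print(round(r2_main, 3), round(r2_interaction, 3))
```

If the interaction is real, the second R² is noticeably larger than the first; a formal comparison would additionally test whether the increase is statistically significant.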

Last but not least, the communities analyzed in the context of Wikipedia may have systematic

differences to communities in other online environments. The generalizability of the findings

of this study is therefore limited and results should be interpreted with caution.

5.5 Directions for Future Research

Social scientists often complain about how difficult it is to obtain data for research. For

Wikipedia, the opposite is true: full dumps of Wikipedia and its sister

projects, and hence comprehensive records of collaboration, are available, but the enormous

amount of data is hard to handle. The scalable approach introduced in this study can be

enhanced and applied to several interesting research questions.

Wikipedians have recently started a project to assess every article in Wikipedia (Wikipedia

2009k). While this scheme has not yet been adopted in the German language version

(Wikipedia 2009a) and could hence not be used in this study, future studies can draw upon

this valuable resource to better quantify the performance of online communities.

Furthermore, more and more Wikipedians gather around WikiProjects, collections of articles

that deal with specific topics (Wikipedia 2009t). Analyzing these large communities of inter-

est in combination with the widely used article assessments may yield additional insights.

Regarding the used measures, future research could dig deeper e.g. by analyzing the access

levels of contributing users (Wikipedia 2009r), the number of barnstars (Wikipedia 2009o)

they have received and the type of comments on their user pages (Wikipedia 2009s) to de-

scribe the characteristics of community users in greater detail.

What is more, each article in Wikipedia has a talk page that is used for editorial coordination

(Wikipedia 2009q). These talk pages are another valuable resource to analyze collaboration

characteristics. Further research may relate discussions on talk pages to the creation of content

in the article to gain valuable insights.

Even though Wikipedia has already been the subject of many studies, the growth of other

Wikimedia projects (Wikipedia 2009n) as well as the availability of extensive data dumps

(Wikimedia 2009), various statistics (Wikipedia 2009j), traffic data (stats.grok.se 2009) and

external quality indicators like Google’s page rank promise considerable possibilities for fur-

ther interesting research.

6 References

Aiken, L.S., West, S.G. & Reno, R.R. 1991, Multiple regression: testing and interpreting

interactions, SAGE Publications Newbury Park, CA.

Almeida, R.B., Mozafari, B. & Cho, J. 2007, 'On the Evolution of Wikipedia', International

Conference on Weblogs and Social Media, Boulder, Colorado, USA,

<http://www.icwsm.org/papers/2--Almeida-Mozafari-Cho.pdf>.

Anthony, D., Smith, S.W. & Williamson, T. 2007, The Quality of Open Source Production:

Zealots and Good Samaritans in the Case of Wikipedia,

<http://www.cs.dartmouth.edu/reports/TR2007-606.pdf>.

Barabási, A.-L. 2003, Linked, Plume, New York.

Bruckman, A. 2006, 'A New Perspective on “Community” and its Implications for Computer-Mediated

Communication Systems', paper presented to the CHI 2006, Montréal, Québec, Canada,

<http://www.cc.gatech.edu/~asb/papers/bruckman-community-chi06.pdf>.

Buriol, L.S., Castillo, C., Donato, D., Leonardi, S. & Millozzi, S. 2006, 'Temporal Analysis of

the Wikigraph', paper presented to the 2006 IEEE/WIC/ACM International Conference

on Web Intelligence, Hong Kong,

<http://www.inf.ufrgs.br/~buriol/papers/buriol_2006_temporal_analysis_wikigraph.pdf>.

Butler, B.S. 2001, 'Membership Size, Communication Activity, and Sustainability: A Re-

source-Based Model of Online Social Structures', Information Systems Research, vol.

12, no. 4, pp. 346-362.

Cosley, D., Ludford, P. & Terveen, L. 2003, 'Studying the Effect of Similarity in Online

Task-Focused Interactions', 2003 international ACM SIGGROUP conference on Supporting

group work, Sanibel Island, Florida, USA, pp. 321-329,

<http://www.grouplens.org/papers/pdf/simex-group2003.pdf>.

Cothrel, J. & Williams, R.L. 1999, 'On-line communities: helping them form and grow', Jour-

nal of Knowledge Management, vol. 3, no. 1, pp. 54-60.

Cothrel, J.P. 2000, 'Measuring the success of an online community', Strategy & Leadership,

vol. 28, no. 2, pp. 17-21.

den Besten, M., Loubser, M. & Dalle, J.-M. 2008, Wikipedia as a Distributed Problem-

Solving Network,

<http://www.oii.ox.ac.uk/downloads/index.cfm?File=research/dpsn/Wikipedia_full.pdf>.

Eisinga, R., Scheepers, P. & van Snippenburg, L. 1991, 'The standardized effect of a com-

pound of dummy variables or polynomial terms', Quality & Quantity, vol. 25, pp. 103-

114.

Fernandez-Ramil, J., Izquierdo-Cortazar, D. & Mens, T. 2008, 'Relationship between Size,

Effort, Duration and Number of Contributors in Large FLOSS projects', BENEVOL

2008, Eindhoven,

<ftp://ftp.umh.ac.be/pub/ftp_infofs/2008/Benevol2008RamilEtAl.pdf>.

Franke, N. 2005, 'Open Source & Co.: Innovative User-Netzwerke', in S. Albers & O. Gass-

mann (eds), Handbuch Technologie- und Innovationsmanagement, Gabler, Wiesba-

den, pp. 695-712.

Franke, N. & Shah, S. 2003, 'How communities support innovative activities: an exploration

of assistance and sharing among end-users', Research Policy, vol. 32, no. 1, pp. 157-

178.

Füller, J., Jawecki, G. & Mühlbacher, H. 2007, 'Innovation creation by online basketball

communities', Journal of Business Research, vol. 60, no. 1, pp. 60-71.

Füller, J., Matzler, K. & Hoppe, M. 2008, 'Brand Community Members as a Source of Innovation',

Journal of Product Innovation Management, vol. 25, no. 6, pp. 609-619.

Garud, R., Jain, S. & Tuertscher, P. 2008, 'Incomplete by Design and Designing for Incom-

pleteness', Organization Studies, vol. 29, pp. 351-371.

Giles, J. 2005, 'Internet encyclopaedias go head to head', Nature, vol. 438, pp. 900-901.

Holloway, T., Bozicevic, M. & Börner, K. 2007, 'Analyzing and visualizing the semantic

coverage of Wikipedia and its authors', Complexity, vol. 12, no. 3, pp. 30-40.

Horwitz, S.K. & Horwitz, I.B. 2007, 'The Effects of Team Diversity on Team Outcomes: A

Meta-Analytic Review of Team Demography', Journal of Management, vol. 33, no. 6,

pp. 987-1015.

Jones, Q. 1997, 'Virtual-Communities, Virtual Settlements & Cyber-Archaeology: A Theo-

retical Outline', Journal of Computer-Mediated Communication, vol. 3, no. 3.

Katz, A. & Te'eni, D. 2007, 'The Contingent Impact of Contextualization on Computer-

Mediated Collaboration', Organization Science, vol. 18, no. 2, pp. 261-279.

Kittur, A., Chi, E., Pendleton, B.A., Suh, B. & Mytkowicz, T. 2007, 'Power of the Few vs.

Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie', CHI 2007, San

Jose, CA, <http://www.parc.com/research/publications/files/5904.pdf>.

Koch, M. 2002, 'Requirements for community support systems - modularization, integration

and ubiquitous user interfaces', Behaviour & Information Technology, vol. 21, no. 5,

pp. 327-332.

Kollock, P. 1996, 'Design Principles for Online Communities', First International Harvard

Conference on the Internet and Society, Boston, USA,

<http://www.sscnet.ucla.edu/soc/faculty/kollock/papers/design.htm>.

Korfiatis, N.T., Poulos, M. & Bokos, G. 2006, 'Evaluating authoritative sources using social

networks: an insight from Wikipedia', Online Information Review, vol. 30, no. 3, pp.

252-262.

Kozinets, R.V. 1999, 'E-Tribalized Marketing?: The Strategic Implications of Virtual Com-

munities of Consumption', European Management Journal, vol. 17, no. 3, pp. 252–

264.

Lakhani, K.R. & Panetta, J.A. 2007, 'The Principles of Distributed Innovation', Innovations,

vol. 2, no. 3, pp. 97-112.

Leimeister, J.M., Sidiras, P. & Krcmar, H. 2006, 'Exploring Success Factors of Virtual Com-

munities: The Perspectives of Members and Operators', Journal of Organizational

Computing and Electronic Commerce, vol. 16, no. 3&4, pp. 277–298.

Licklider, J.C.R. & Taylor, R.W. 1968, 'The Computer as a Communication Device', Science

and Technology, pp. 21-41.

Lih, A. 2004, 'Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluat-

ing collaborative media as a news resource', paper presented to the 5th International

Symposium on Online Journalism, University of Texas at Austin, USA, April 16-17,

2004.

Ludford, P.J., Cosley, D., Frankowski, D. & Terveen, L. 2004, 'Think different: increasing

online community participation using uniqueness and group dissimilarity', SIGCHI

conference on Human factors in computing systems, ACM, Vienna, Austria, pp. 631-

638, <http://grouplens.org/papers/pdf/thinkdifferent-chi2004.pdf>.

Manning, C.D. & Schütze, H. 2003, Foundations of Statistical Natural Language Processing,

MIT Press, Cambridge, MA.

Mateos Garcia, J. & Steinmueller, W.E. 2003, 'Applying the open source development model

to knowledge work.' INK Open Source Research Working Paper No. 2,

<http://www.sussex.ac.uk/Units/spru/publications/imprint/sewps/sewp94/sewp94.pdf>

Mockus, A., Fielding, R.T. & Herbsleb, J. 2000, 'A Case Study of Open Source Software Development:

The Apache Server', The 22nd International Conference on Software Engineering,

Limerick, Ireland, <http://mockus.us/papers/apache.pdf>.

Mynatt, E.D., O'Day, V.L., Adler, A. & Ito, M. 1998, 'Network Communities: Something

Old, Something New, Something Borrowed . . .' Computer Supported Cooperative

Work (CSCW), vol. 7, no. 1-2, pp. 123-156.

Nambisan, S. 2002, 'Designing Virtual Customer Environments for New Product Develop-

ment: Toward a Theory', Academy of Management Review, vol. 27, no. 3, pp. 392-

413.

Nonnecke, B. & Preece, J. 1999, 'Shedding Light on Lurkers in Online Communities', Ethno-

graphic Studies in Real and Virtual Environments: Inhabited Information Spaces and

Connected Communities, Edinburgh, pp. 123-128,

<http://www.ifsm.umbc.edu/~preece/paper/16%20Shedding%20Light.final.pdf>.

Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2007, 'The Top Ten Wikipedias: A Quantitative

Analysis Using WikiXRay', ICSOFT, Barcelona, Spain, pp. 46-53,

<http://libresoft.es/oldsite/downloads/C4_159_Ortega.pdf>.

Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2008, 'On the Inequality of Contributions

to Wikipedia', 41st Annual Hawaii International Conference on System Sciences

Honolulu, Hawaii, p. 304, <http://libresoft.es/downloads/Ineq_Wikipedia.pdf>.

Preece, J. 2001, 'Sociability and usability in online communities: determining and measuring

success', Behaviour & Information Technology, vol. 20, no. 5, pp. 347-356.

Preece, J. & Maloney-Krichmar, D. 2005, 'Online Communities: Design, Theory, and Prac-

tice', Journal of Computer-Mediated Communication, vol. 10, no. 4, p. article 1.

Preece, J., Maloney-Krichmar, D. & Abras, C. 2003, History and emergence of online com-

munities, Berkshire Publishing Group, Sage,

<http://www.ifsm.umbc.edu/~preece/paper/6%20Final%20Enc%20preece%20et%20al.pdf>.

Rashid, A.M., Ling, K., Tassone, R.D., Resnick, P., Kraut, R. & Riedl, J. 2006, 'Motivating

Participation by Displaying the Value of Contribution', CHI 2006, ACM, Montréal,

Québec, Canada, pp. 955-958,

<http://www.si.umich.edu/~presnick/papers/CHI06/rashidAl.pdf>.

Raymond, E.S. 2000, The Cathedral and the Bazaar (Electronic Version), viewed 07.12.2008,

<http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s04.html>.

Rheingold, H. 1993, The Virtual Community (Electronic Version), viewed 07.11.2008

<http://www.rheingold.com/vc/book/>.

Ridings, C.M., Gefen, D. & Arinze, B. 2002, 'Some antecedents and effects of trust in virtual

communities', Journal of Strategic Information Systems, vol. 11, pp. 271-295.

Schoberth, T., Preece, J. & Heinzl, A. 2003, 'Online Communities: A Longitudinal Analysis

of Communication Activities', 36th Annual Hawaii International Conference on System

Sciences, Big Island, Hawaii,

<http://www.ifsm.umbc.edu/~preece/paper/9%20HICSSNOCD06v2.pdf>.

Senyard, A. & Michlmayr, M. 2004, 'How to Have a Successful Free Software Project', 11th

Asia-Pacific Software Engineering Conference (APSEC’04), Busan, Korea,

<http://kb.cospa-project.org/retrieve/2450/senyardmichlmay.pdf>.

Stewart, D. 2005, 'Social Status in an Open-Source Community', American Sociological Review,

vol. 70, no. 5, pp. 823-842.

Stvilia, B., Twidale, M.B., Smith, L.C. & Gasser, L. 2005, 'Assessing information quality of a

community-based encyclopedia', International Conference on Information Quality,

Cambridge, England, pp. 442-454,

<http://www.isrl.uiuc.edu/~stvilia/papers/quantWiki.pdf>.

Surowiecki, J. 2005, The Wisdom of Crowds, Anchor Books, New York.

Tapscott, D. & Williams, A.D. 2008, Wikinomics: How Mass Collaboration Changes Everything,

Penguin Group, New York.

Van Alstyne, M. & Brynjolfsson, E. 2005, 'Global Village or Cyber-Balkans? Modeling and

Measuring the Integration of Electronic Communities', Management Science, vol. 51,

no. 6, pp. 851-868.

Viégas, F.B., Wattenberg, M. & Dave, K. 2004, 'Studying Cooperation and Conflict between

Authors with history flow Visualizations', SIGCHI conference on Human factors in

computing systems, vol. 6, ACM, Vienna, Austria, pp. 575-582,

<http://alumni.media.mit.edu/~fviegas/papers/history_flow.pdf>.

Viégas, F.B., Wattenberg, M., Kriss, J. & Ham, F.v. 2007, 'Talk Before You Type: Coordination

in Wikipedia', 40th Hawaii International Conference on System Sciences, Honolulu,

Hawaii, USA,

<http://www.research.ibm.com/visual/papers/wikipedia_coordination_final.pdf>.

von Hippel, E. 2001, 'Innovation by User Communities: Learning from Open-Source Software',

MIT Sloan Management Review, vol. 42, no. 4, pp. 82-86.

von Hippel, E. 2005, Democratizing Innovation (Electronic Version), viewed 07.11.2008,

<http://web.mit.edu/evhippel/www/books/DI/DemocInn.pdf>.

von Krogh, G., Spaeth, S. & Lakhani, K.R. 2003, 'Community, joining, and specialization in

open source software innovation: a case study', Research Policy, vol. 32, pp. 1217-1241.

Voss, J. 2005, 'Measuring Wikipedia', paper presented to the 10th International Conference of the

International Society for Scientometrics and Informetrics, Stockholm (Sweden),

24-28 July 2005, <http://eprints.rclis.org/3610/1/MeasuringWikipedia2005.pdf>.

Wang, Y. & Fesenmaier, D.R. 2004, 'Towards understanding members’ general participation

in and active contribution to an online travel community', Tourism Management, vol. 25,

pp. 709–722.

Wilkinson, D.M. & Huberman, B.A. 2007, 'Assessing the value of cooperation in Wikipedia',

First Monday, vol. 12, no. 4.

Williams, R.L. & Cothrel, J. 2000, 'Four Smart Ways to Run Online Communities', Sloan

Management Review, vol. 41, no. 4, pp. 81-91.

Internet Sources:

Alexa.com 2008, Traffic Details - wikipedia.org, viewed 10.11.2008

<http://www.alexa.com/data/details/traffic_details/wikipedia.org>.

Döring, N. 2001, Virtuelle Gemeinschaften als Lerngemeinschaften!?, viewed 07.11.2008

<http://www.die-frankfurt.de/zeitschrift/32001/positionen4.htm>.

Encyclopædia Britannica, I. 2006, Fatally Flawed - Refuting the recent study on encyclopedic

accuracy by the journal Nature, viewed 16.04.2009

<http://corporate.britannica.com/britannica_nature_response.pdf>.


Gude, A. 2008, wikipedia-article-exporter, viewed 16.10.2008

<http://code.google.com/p/wikipedia-article-exporter/>.

MediaWiki.org 2009, MWDumper, viewed 16.04.2009

<http://www.mediawiki.org/w/index.php?title=MWDumper&oldid=242629>.

O'Reilly, T. 2005, What Is Web 2.0, viewed 16.03.2009

<http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html>.

stats.grok.se 2009, Wikipedia article traffic statistics, viewed 28.03.2009

<http://stats.grok.se/>.

stilversprechend.de 2009a, stilversprechend, viewed 17.04.2009

<http://www.stilversprechend.de/stil/index.html>.

stilversprechend.de 2009b, Was ist der Flesch-Wert, viewed 17.04.2009

<http://www.stilversprechend.de/stil/fleschwert.html>.

Swartz, A. 2006, Raw Thought: Who Writes Wikipedia?, viewed 27.03.2009

<http://www.aaronsw.com/weblog/whowriteswikipedia>.

Wales, J. 2005a, The Intelligence of Wikipedia, Oxford Internet Institute, viewed 27.03.2009

<http://webcast.oii.ox.ac.uk/?ID=20050711_76&view=Webcast>.

Wales, J. 2005b, Wikipedia is an encyclopedia, viewed 12.03.2009

<http://lists.wikimedia.org/pipermail/wikipedia-l/2005-March/020469.html>.

Wales, J. 2005c, Wikipedia, Emergence, and The Wisdom of Crowds, viewed 27.03.2009

<http://lists.wikimedia.org/pipermail/wikipedia-l/2005-May/021764.html>.

Wikimedia 2008a, dewiki dump progress on 20080607, viewed 03.08.2008

<http://download.wikimedia.org/dewiki/20080607/>.

Wikimedia 2008b, enwiki dump progress on 20080312, viewed 28.03.2009

<http://download.wikimedia.org/enwiki/20080312/>.

Wikimedia 2008c, enwiki dump progress on 20080524, viewed 28.03.2009

<http://download.wikimedia.org/enwiki/20080524/>.


Wikimedia 2008d, List of Wikipedias, viewed 10.11.2008

<http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias&oldid=1267871>.

Wikimedia 2009, Wikimedia Downloads, viewed 27.03.2009

<http://download.wikimedia.org/>.

Wikipedia 2009a, Archiv/WP 1.0, viewed 29.04.2009

<http://de.wikipedia.org/w/index.php?title=Wikipedia:Archiv/WP_1.0&oldid=30312694>.

Wikipedia 2009b, Benutzerverzeichnis, viewed 16.10.2008

<http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&group=bot>.

Wikipedia 2009c, Bots, viewed 16.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Bots&oldid=283931567>.

Wikipedia 2009d, Categorization, viewed 12.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Categorization&oldid=282819565#Categorizing_pages>.

Wikipedia 2009e, Featured articles, viewed 12.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Featured_articles&oldid=283429397>.

Wikipedia 2009f, Hilfe:Kategorien, viewed 17.04.2009

<http://de.wikipedia.org/w/index.php?title=Hilfe:Kategorien&oldid=58900978>.

Wikipedia 2009g, Mediawiki API documentation page, viewed 28.03.2009

<http://de.wikipedia.org/w/api.php>.

Wikipedia 2009h, Revision history of Virtual community, viewed 23.04.2009

<http://en.wikipedia.org/w/index.php?title=Virtual_community&action=history>.

Wikipedia 2009i, Size of Wikipedia, viewed 20.05.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Size_of_Wikipedia&oldid=291163424>.

Wikipedia 2009j, Statistics, viewed 28.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Statistics&oldid=282205700>.

Wikipedia 2009k, Version 1.0 Editorial Team/Assessment, viewed 29.04.2009

<http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment>.

Wikipedia 2009l, Welcome to Wikipedia, viewed 14.04.2009

<http://en.wikipedia.org/w/index.php?title=Main_Page&oldid=273421236>.

Wikipedia 2009m, Who is responsible for these pages, viewed 14.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Editorial_oversight_and_control&oldid=283640875#User_collaborative_knowledge-building>.

Wikipedia 2009n, Wikimedia projects, viewed 27.03.2009

<http://en.wikipedia.org/w/index.php?title=Wikimedia_Foundation&oldid=280045683#Wikimedia_projects>.

Wikipedia 2009o, Wikipedia:Barnstars, viewed 28.03.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Barnstars&oldid=279034552>.

Wikipedia 2009p, Wikipedia:Namespace, viewed 16.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Namespace&oldid=275699788>.

Wikipedia 2009q, Wikipedia:Talk page, viewed 27.03.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:Talk_page&oldid=277750073>.

Wikipedia 2009r, Wikipedia:User access levels, viewed 27.03.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:User_access_levels&oldid=279622212>.

Wikipedia 2009s, Wikipedia:User page, viewed 27.03.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:User_page&oldid=279760325>.

Wikipedia 2009t, WikiProject, viewed 28.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject&oldid=286400532>.

Wikipedia 2009u, WikiProject Vandalism studies, viewed 20.05.2009

<http://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_Vandalism_studies&oldid=291203033>.

Wikipedia 2009v, Wikitext, viewed 16.04.2009

<http://en.wikipedia.org/w/index.php?title=Wikitext&oldid=283256384>.

Wikipedia 2009w, Zusammenfassung und Quellen, viewed 16.04.2009

<http://de.wikipedia.org/w/index.php?title=Hilfe:Zusammenfassung_und_Quellen&oldid=58724557#Auto-Zusammenfassung>.

7 Appendix

7.1 Figures and Examples

XML representation of a Wikipedia article [1]

Online representation of a Wikipedia article [2]


HTML representation of a Wikipedia article [2]

7.2 Data Collection

7.2.1 Setting up the Database and Drawing the Sample

Instead of installing the complete MediaWiki software, the required database schema was set up

using the tables.sql file, which can be found in the MediaWiki repository [3].

The XML dump of the German Wikipedia from 7 June 2008 was downloaded [4] and converted

into an SQL file using the MWDumper tool [5]:

java -jar mwdumper-2008-04-13.jar --output=file:dump.sql --format=sql:1.5 dewiki-20080607-stub-meta-history.xml.gz

The resulting file was then imported into a MySQL database:

mysql -u username -ppassword --database=dbname --force --default-character-set=utf8 < dump.sql

Two tables in the dump seemed especially important for this study: The page-table including

all pages in Wikipedia and the revision-table including every single revision of each page. As

these tables were out of sync (the revision table included more distinct pages than the page

table), it was decided to use the revision table as a basis for the analysis.
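
The mismatch between the two tables can be checked with a single outer join. The following is a minimal sketch in Python 3, assuming an in-memory SQLite database as a stand-in for the MySQL database used in this study; the column names page_id and rev_page follow the MediaWiki schema, while the function names and the toy rows are hypothetical.

```python
import sqlite3


def orphaned_page_ids(conn):
    """Return page ids that occur in `revision` but have no row in `page`."""
    rows = conn.execute(
        "SELECT DISTINCT r.rev_page "
        "FROM revision AS r "
        "LEFT JOIN page AS p ON p.page_id = r.rev_page "
        "WHERE p.page_id IS NULL "
        "ORDER BY r.rev_page"
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny example database (invented data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
    conn.executemany("INSERT INTO page VALUES (?, ?)", [(1, "A"), (2, "B")])
    # Page 3 has revisions but is missing from the page table.
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?)",
        [(10, 1), (11, 2), (12, 3), (13, 3)],
    )
    return conn
```

Calling orphaned_page_ids(demo_connection()) lists the pages that would be lost if the page table were used as the basis of the analysis.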

As this study aimed to analyze articles created in 2007, all revisions in 2007 were extracted

from the revision table in a first step and indices were added to speed up queries from this

table.

create table revision2007 as select * from revision where extract(year from rev_timestamp)=2007;
alter table revision2007 add index (rev_page);
alter table revision2007 add index (rev_user_text);

In a next step a table was created to depict which pages were edited by which user in 2007

and how often. Again, several indices were created to improve query performance:

create table userpages2007 as select rev_page,rev_user_text,count(*) from revision2007 group by rev_page,rev_user_text order by null;
alter table userpages2007 add index (rev_user_text);
alter table userpages2007 add index (rev_page);
alter table userpages2007 add index (rev_page,rev_user_text);

To flag bots a column was added to this table…:

alter table userpages2007 add column bot boolean;

…an up-to-date user-group assignment list was retrieved from [6], and the column was updated

with the help of the following short Python script:

import urllib
import re

class AppURLopener(urllib.FancyURLopener): #set user agent to firefox
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

#get bot list
AppURLopenerinstance=AppURLopener()
#adapt to wiki version; make sure that all bots are included in this query
botlist=AppURLopenerinstance.open("http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&username=&group=bot&limit=5000")
global_botlistcontent=botlist.read()
botlist.close()
bots=re.findall(""">([^<]*)</a> \xe2\x80\x8e\(<a href="/wiki/Wikipedia:Bots""",global_botlistcontent)

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
for bot in bots:
    print bot
    cursor.execute ("update userpages2007 set bot=1 where rev_user_text=%s",(bot,))
    result = cursor.fetchall()

Because of special characters in its name, one bot had to be flagged by hand. In addition, all

other bot values were set to 0 and a ‘bots’ table listing all bots was created:

update userpages2007 set bot=1 where rev_user_text="L&K-Bot";
update userpages2007 set bot=0 where bot IS NULL;
create table bots as select distinct(rev_user_text) from userpages2007 where bot=1;

After that, columns for the page title and page namespace were added to the table…

alter table userpages2007 add column page_title varchar(255);
alter table userpages2007 add column page_namespace int(11);

…and filled with values from the page-table:

update userpages2007,page set userpages2007.page_title=page.page_title where userpages2007.rev_page=page.page_id;
update userpages2007,page set userpages2007.page_namespace=page.page_namespace where userpages2007.rev_page=page.page_id;

Missing values were queried from the Wikipedia API and missing pages flagged with the

following Python script:

import urllib
import re
import xml.etree.cElementTree as cElementTree
import time

class AppURLopener(urllib.FancyURLopener): #set user agent to firefox
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
AppURLopenerinstance=AppURLopener()
numberofmissingvalues=100
for i in range (0,numberofmissingvalues/50): #it is only possible to query 50 values from the api without bot status at once
    cursor.execute ("select distinct(rev_page) from userpages where page_title is NULL limit 50") #create batches of 50
    resultpage = cursor.fetchall()
    print resultpage
    print
    stringliste=[] #create query string for api
    for page in resultpage:
        stringliste.append(str(page['rev_page']))
    fertigerstring="|".join(stringliste)
    print fertigerstring
    ##get api content
    liste=AppURLopenerinstance.open("http://de.wikipedia.org/w/api.php?action=query&pageids="+fertigerstring+"&format=xml") #get api results
    for event, elem in cElementTree.iterparse(liste):
        if elem.tag=='page':
            if elem.attrib.has_key('missing'):
                print "missing"
                cursor.execute ("update userpages set page_title=%s, page_namespace=%s where rev_page=%s",("!missing","999",elem.attrib['pageid'])) #flag missing pages
            else:
                print "---"
                print elem.attrib['pageid']
                print elem.attrib['ns']
                print elem.attrib['title']
                print
                cursor.execute ("update userpages set page_title=%s, page_namespace=%s where rev_page=%s",(elem.attrib['title'].encode("latin-1"),elem.attrib['ns'],elem.attrib['pageid'])) #changed from utf-8
                cursor.execute ("select * from userpages where rev_page=%s limit 1",(elem.attrib['pageid'],))
                resultpage = cursor.fetchall()
                print resultpage
    time.sleep(5) # sleep 5 seconds to reduce load on server
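
The batching of page IDs into groups of 50 can be illustrated without a database or network access. The following is a minimal sketch in Python 3 (unlike the Python 2 listings in this appendix); the helper names batched and api_query_url are hypothetical, and only the URL pattern and the limit of 50 IDs per request are taken from the script above.

```python
def batched(ids, size=50):
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def api_query_url(page_ids, base="http://de.wikipedia.org/w/api.php"):
    """Build the query URL for one batch of page ids (joined with '|')."""
    joined = "|".join(str(page_id) for page_id in page_ids)
    return base + "?action=query&pageids=" + joined + "&format=xml"


def query_urls(ids, size=50):
    """Return one URL per batch, covering every id exactly once."""
    return [api_query_url(batch) for batch in batched(ids, size)]
```

Slicing the full list of IDs in this way also covers a final batch with fewer than 50 entries, so the number of requests does not have to be fixed in advance.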

Then a table was created containing all articles in the main namespace (namespace 0; see [7]

for more details) that were edited in 2007 and not exclusively by bots:

create table pagelist2007 as select distinct(rev_page),page_title,page_namespace from userpages2007 where page_namespace=0 and bot=0;
alter table pagelist2007 add index (rev_page);

It turned out that an easier way of removing all pages outside namespace 0 would probably

have been to use the --filter option of the MWDumper tool (see [5] for more details).

Next, a column depicting the date of creation of each article was created…

alter table pagelist2007 add column creationdate binary(14);

… and populated with the help of the following Python script:

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
cursor.execute ("select * from pagelist2007 where creationdate is Null")
pages = cursor.fetchall()
for page in pages:
    cursor.execute ("select min(rev_timestamp) from revision where rev_page=%s;",(page["rev_page"],)) # get date of first edit=creation
    result = cursor.fetchone()
    cursor.execute ("UPDATE pagelist2007 SET creationdate = %s where rev_page=%s",(result["min(rev_timestamp)"],page["rev_page"],))

Subsequently a column for the number of contributors in 2007 was added and…

alter table pagelist2007 add column user2007 int(11);

…the number of users in 2007 without bots calculated:

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
cursor.execute ("select rev_page from pagelist2007 where user2007 is Null")
pages = cursor.fetchall()
for page in pages:
    cursor.execute ("select count(distinct rev_user_text) as contributors from revision2007 where rev_page=%s and rev_user_text not in(select rev_user_text from bots)",(page["rev_page"],)) #select users 2007 without bots
    result = cursor.fetchone()
    cursor.execute ("UPDATE pagelist2007 SET user2007 = %s where rev_page=%s",(result["contributors"],page["rev_page"],))
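
Both derived values, the creation date and the number of non-bot contributors in 2007, could in principle be obtained with one grouped query instead of one query per page. The following is a minimal sketch in Python 3 under several assumptions: SQLite stands in for MySQL, timestamps are stored as 14-character strings (YYYYMMDDHHMMSS), and the function names and toy rows are hypothetical. It is not the procedure used in this study.

```python
import sqlite3


def page_summaries(conn):
    """Return (rev_page, creationdate, user2007) for every page.

    creationdate is the earliest revision timestamp; user2007 counts the
    distinct non-bot contributors with at least one revision in 2007.
    """
    query = """
        SELECT r.rev_page,
               MIN(r.rev_timestamp) AS creationdate,
               COUNT(DISTINCT CASE
                   WHEN r.rev_timestamp LIKE '2007%'
                        AND r.rev_user_text NOT IN (SELECT rev_user_text FROM bots)
                   THEN r.rev_user_text
               END) AS user2007
        FROM revision AS r
        GROUP BY r.rev_page
        ORDER BY r.rev_page
    """
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a tiny example database (invented data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision (rev_page INTEGER, rev_user_text TEXT, rev_timestamp TEXT)"
    )
    conn.execute("CREATE TABLE bots (rev_user_text TEXT)")
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?)",
        [
            (1, "Alice", "20061231235959"),
            (1, "Bob", "20070105120000"),
            (1, "ExampleBot", "20070106120000"),
            (1, "Erin", "20070201080000"),
            (1, "Bob", "20070301080000"),
            (2, "Carol", "20070301000000"),
            (2, "Dave", "20080101000000"),
        ],
    )
    conn.execute("INSERT INTO bots VALUES ('ExampleBot')")
    return conn
```

In the toy data, page 1 was created in 2006 and therefore would not qualify for the sample, although it had two human contributors in 2007.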

A random sample of 5000 articles with more than 1 user created in 2007 was drawn from this

table and stored in the samplepagescreated2007 table.

create table samplepagescreated2007 as select * from pagelist2007 where extract(year from creationdate)=2007 and user2007>1 order by rand() limit 5000;
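
A draw with order by rand() cannot be repeated exactly. The following is a minimal sketch in Python 3 of the same selection rule (created in 2007, more than one contributor) with an optional seed, so that a draw can be reproduced; the function name, the dictionary layout and the seed parameter are hypothetical and were not part of this study.

```python
import random


def draw_sample(pages, size=5000, seed=None):
    """Draw up to `size` pages created in 2007 with more than one contributor.

    `pages` is a list of dicts with the keys 'creationdate' (a string of the
    form YYYYMMDDHHMMSS) and 'user2007' (number of non-bot contributors).
    """
    eligible = [
        page for page in pages
        if page["creationdate"].startswith("2007") and page["user2007"] > 1
    ]
    if len(eligible) <= size:
        return eligible
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(eligible, size)
```

With a fixed seed, two calls on the same list return the same articles in the same order.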

Finally, columns for each variable and a column indicating whether the article was already

analyzed were added to the table:

alter table samplepagescreated2007 add column words int(10) unsigned; # for nr. of words
alter table samplepagescreated2007 add column wordpool int(10) unsigned; # for vocabulary
alter table samplepagescreated2007 add column uniquesumlinks int(10) unsigned; # for unique total links
alter table samplepagescreated2007 add column fleschd int(10) unsigned; # for flesch readability score
alter table samplepagescreated2007 add column categoriescalc int(10) unsigned; # for nr. of categories
alter table samplepagescreated2007 add column userscalc int(10) unsigned; # for community-size and interactivity
alter table samplepagescreated2007 add column heterogeneitycalc double unsigned; # for heterogeneity
alter table samplepagescreated2007 add column avgactivity double unsigned; # for activity
alter table samplepagescreated2007 add column avgfocus double unsigned; # for focus
alter table samplepagescreated2007 add column interactionscalc int(10) unsigned; # for calculating interactivity
alter table samplepagescreated2007 add column editscalc int(10) unsigned; # for calculating interactivity
alter table samplepagescreated2007 add column mediantimebetweenedits double unsigned; # for calculating dynamics
alter table samplepagescreated2007 add column analysed boolean; # set to 1 if article was already analyzed

7.2.2 Parsing the Data and Calculating all Variables

The following sections explain the process of calculating all variables from the following

three sources:

• XML file of each article

• HTML file of each article

• The MySQL database created

7.2.2.1 Retrieval and Analyses of the XML Representation of each Article

The following script queries the title of each article from the Wikipedia API and downloads

the article’s XML file from Wikipedia with the help of an adapted version of the getwiki

script by Gude [9] (all links to the English Wikipedia were replaced by the respective links to

the German Wikipedia). The file is stored to a folder and parsed. In a first step vandals are

flagged. After that the XML file is parsed again to calculate the number of users (excluding

bots and vandals), categories, interactions and edits. In case of any problems, the article is

flagged with a problem code (1: article is a redirect, 2: rev_page id of downloaded article file

does not match rev_page in database, 3: redirect & ids do not match, 4: article not found via

API) and excluded from further analysis.

# -*- coding: cp1252 -*-
#### utf-8
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import degetwiki #importing getwiki by Alexander Gude Version 1.1, adapted to the German Wikipedia
import os
import pylab
import pickle
import math
import MySQLdb

conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

AppURLopenerinstance=AppURLopener()

def get_botlist():
    if os.path.exists("botlist.txt"):
        #print "botfile found"
        botfile = file('botlist.txt', 'r')
        botlist=pickle.load(botfile) #read from file
    else:
        botlist=[] #pickled and unpickled
        conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
        cursor = conn.cursor (MySQLdb.cursors.DictCursor)
        cursor.execute ("select * from bots;") #get bots from database
        bots = cursor.fetchall()
        for bot in bots:
            botlist.append(bot['rev_user_text'])
        #save to file
        botfile = file('botlist.txt', 'w')
        pickle.dump(botlist,botfile) #store botfile
    botfile.close() #close botfile in both cases
    return botlist

def analysearticle(doc,rev_page):
    conn = MySQLdb.connect (host = "127.0.0.1", user = "username", passwd = "password", db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    redirect=False #is article a redirect?
    problem=0 #is there a problem?
    filecelement=open(doc, "r") #open xml file
    revcounter=0 #revision counter
    newerthan2007=False
    articleidfound=False # there are several id fields (article,revision,user)
    vandalism=set()
    vandalismuser=set()
    upperlimit=datetime.datetime(2008, 1, 1)
    wrongarticle=False
    #get botlist
    bots=get_botlist()
    #print bots
    for event, elem in ElementTree.iterparse(filecelement): #first run, flag vandalism
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip":
            currentusernameorip=elem.text
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
                newerthan2007=True
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}title":
            articletitle=elem.text
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if not articleidfound:
                articleid=elem.text
                if str(articleid)!=rev_page: #ids in database and id in downloaded article don't match
                    wrongarticle=True
                articleidfound=True
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}comment":
            if not newerthan2007:
                if elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Der Seiteninhalt wurde durch einen anderen Text ersetzt." or elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Die Seite wurde geleert.":
                    #potential vandalism detected see http://de.wikipedia.org/wiki/Hilfe:Zusammenfassung_und_Quelle
                    vandalism.add(revcounter)
                    vandalismuser.add("'"+currentusernameorip+"'")
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
            if not newerthan2007:
                if revcounter not in vandalism:
                    if elem.text:
                        if "#redirect[[" in elem.text.lower() or "#redirect [[" in elem.text.lower() or "#weiterleitung[[" in elem.text.lower() or "#weiterleitung [[" in elem.text.lower(): #redirect?
                            redirect=True #last revision includes redirect
                        else:
                            redirect=False
                    else: #if everything is deleted there's no text in the text element
                        vandalism.add(revcounter)
                        vandalismuser.add("'"+currentusernameorip+"'")
            revcounter+=1
    if wrongarticle==False and redirect==False: #ids fit,no redirect
        newerthan2007=False
        revcounter=0
        userset=set()
        filecelement =open(doc, "r")
        olduser=""
        interactions=0
        edits=0
        for event, elem in ElementTree.iterparse(filecelement): #calculate vectors
            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip":
                currentusernameorip=elem.text #store username or ip
            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
                currentzeit=elem.text
                if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
                    newerthan2007=True #flag revisions after 2007
            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
                if not newerthan2007:
                    if revcounter not in vandalism and currentusernameorip not in bots:
                        edits+=1
                        userset.add(currentusernameorip)
                        categories=len(re.findall("\[\[Kategorie:(.*)]]",elem.text))
                        if not currentusernameorip == olduser:
                            interactions+=1
                        olduser=currentusernameorip
                revcounter+=1
        userscalc=len(userset)
        cursor.execute ("UPDATE samplepagescreated2007 SET interactionscalc=%s,userscalc=%s,categoriescalc=%s, editscalc=%s WHERE rev_page = %s",(interactions,userscalc,categories,edits,rev_page,))
    else: #ids don't fit! or redirect
        if redirect==True and wrongarticle==True: #redirect and wrongid
            problem=3
        if wrongarticle==True and redirect==False: #wrongid
            problem=2
        if redirect==True and wrongarticle==False: #redirect
            problem=1
    cursor.execute ("UPDATE samplepagescreated2007 SET analysed = 1,problem=%s WHERE rev_page = %s",(problem,rev_page,)) #do for all
    cursor.close ()
    conn.close ()

def main(article,offset,rev_page):
    wikiheader="""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.14alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
"""
    #download article
    if os.path.exists(rev_page+".xml"):
        print "article xml file found, using this version. Delete old version to trigger download"
    else:
        if offset !=1:
            degetwiki.downloadArticles(articlename=article,filename=rev_page+"_temp.xml",verbose=True,split=False,offset=offset)
            #create new file, add header, add content
            filehandler = open(rev_page+'.xml', 'w')
            filehandler.write(wikiheader) #add header to file
            filehandler.write(open(rev_page+"_temp.xml").read())
            filehandler.close()
        else: #download whole article
            degetwiki.downloadArticles(articlename=article,filename=rev_page+".xml",verbose=True,split=False,offset=offset)
        os.remove(rev_page+"_temp.xml") #remove temporary file
    analysearticle(doc=rev_page+".xml",rev_page=rev_page) #analyze article

pagequery=cursor.execute ("select * from samplepagescreated2007 where analysed is Null order by rev_page") #get all pages not analyzed yet
pages = cursor.fetchall()
for page in pages:
    missing=0
    article=""
    if not os.path.exists(str(page['rev_page'])+".xml"): #don't query if article xml exists
        apilist=AppURLopenerinstance.open("http://de.wikipedia.org/w/api.php?action=query&pageids="+str(page['rev_page'])+"&format=xml") #get name from api
        #print apilist
        for event, elem in ElementTree.iterparse(apilist):
            if elem.tag=='page':
                if elem.attrib.has_key('missing'):
                    print "missing"
                    missing=1
                else:
                    print "---"
                    article=elem.attrib['title'].encode("utf-8")
                    print article
    if missing ==0: #if article found
        main(article,rev_page=str(page['rev_page']),offset="2007-01-01T00:00:00Z")
    else: #if article was not found, set problem to 4 and analyzed to 1
        cursor.execute ("UPDATE samplepagescreated2007 SET analysed = 1,problem=%s WHERE rev_page = %s",(4,page['rev_page'],)) #do for all
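
The counting rule for edits, interactions and users can be isolated from the XML parsing. The following is a minimal sketch in Python 3 of one reading of the script above: every revision that is neither excluded as a bot or vandal counts as an edit, and an interaction is counted whenever the editor differs from the editor of the previously counted revision. The function name and the simplified input (a plain list of editor names in chronological order) are hypothetical.

```python
def count_edits_and_interactions(editors, excluded=frozenset()):
    """Return (edits, interactions, users) for a chronological editor list.

    Revisions by editors in `excluded` (for example bots) are skipped.
    The first counted revision always counts as one interaction.
    """
    edits = 0
    interactions = 0
    users = set()
    previous = None
    for editor in editors:
        if editor in excluded:
            continue
        edits += 1
        users.add(editor)
        if editor != previous:
            interactions += 1
        previous = editor
    return edits, interactions, len(users)
```

For the sequence A, A, Bot, B, A with Bot excluded, this yields four edits, three interactions and two users.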

Due to a number of missing pages and redirects in the sample a second draw was necessary:

create table samplepagescreated2nddraw as select * from resultspagescreated2007 where user2007>1 order by rand() limit 200; #draw additional articles
delete from samplepagescreated2nddraw where rev_page in (select rev_page from samplepagescreated2007); #delete duplicates
insert into samplepagescreated2007(rev_page,page_title,page_namespace,creationdate,user2007,user,edits2007,edits) select rev_page,page_title,page_namespace,creationdate,user2007,user,edits2007,edits from samplepagescreated2nddraw; #insert into sample table
drop table samplepagescreated2nddraw; #drop the table
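
The redirects and missing pages that made the second draw necessary are identified by the problem codes described in section 7.2.2.1. The following is a minimal sketch in Python 3 of that classification; the function names are hypothetical, and the regular expression is an assumed compact equivalent of the four substring checks in the analysis script.

```python
import re

# Matches "#redirect[[", "#redirect [[", "#weiterleitung[[" and
# "#weiterleitung [[" in any capitalisation, anywhere in the text.
REDIRECT_PATTERN = re.compile(r"#(redirect|weiterleitung) ?\[\[", re.IGNORECASE)


def is_redirect(wikitext):
    """Return True if the wikitext of a revision contains a redirect marker."""
    return bool(wikitext) and REDIRECT_PATTERN.search(wikitext) is not None


def problem_code(redirect, ids_match, found_via_api=True):
    """Map the checks to the problem codes 0 to 4.

    0: no problem, 1: redirect, 2: ids do not match,
    3: redirect and ids do not match, 4: article not found via the API.
    """
    if not found_via_api:
        return 4
    if redirect and not ids_match:
        return 3
    if not ids_match:
        return 2
    if redirect:
        return 1
    return 0
```

Only articles with problem code 0 enter the further analysis.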


7.2.2.2 Retrieval and Analyses of the HTML Representation of each Article

To calculate the number of words and the Flesch readability score, another Python script was

developed. The function getlastid examines the XML representation of an article and returns

the revision number of the last revision in 2007. Subsequently, the HTML version of this revision

is obtained from Wikipedia using specific parameters explained in [8]. The article content

is extracted, some clean-ups are conducted, HTML markup is stripped and the plain text is sent

to [10] for analysis. The Flesch readability scores are extracted from the results and inserted

into the MySQL database.
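
How a Flesch score follows from the counts of sentences, words and syllables can be sketched as follows. This Python 3 example assumes Amstad's German adaptation of the Flesch Reading Ease, 180 − ASL − 58.5 × ASW, where ASL is the average sentence length in words and ASW the average number of syllables per word. Whether the service at [10] applies exactly this formula is an assumption, and the function name is hypothetical.

```python
def flesch_german(sentences, words, syllables):
    """Return the Flesch Reading Ease for German text (Amstad's adaptation).

    Higher scores indicate text that is easier to read.
    """
    if sentences <= 0 or words <= 0:
        raise ValueError("sentences and words must be positive")
    average_sentence_length = words / sentences
    average_syllables_per_word = syllables / words
    return 180.0 - average_sentence_length - 58.5 * average_syllables_per_word
```

A text with 10 sentences, 150 words and 270 syllables has an ASL of 15 and an ASW of 1.8, which gives a score of about 59.7.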

# -*- coding: cp1252 -*-
import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time

def strip_tags(value):
    "Return the given HTML with all tags stripped."
    return re.sub(r'<[^>]*?>', '', value)

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
headers = { 'User-Agent' : user_agent }

def getlastid(doc,rev_page): #find out id of last revision in 2007
    filecelement =open(doc, "r") #assumption doc already exists!
    newerthan2007=False
    articleidfound=False # there are several id fields (article,revision,user)
    upperlimit=datetime.datetime(2008, 1, 1)
    inrevision=False
    revisionid=0
    for event, elem in ElementTree.iterparse(filecelement):
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
            inrevision=True
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
                newerthan2007=True
            else:
                lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if inrevision==True:
                revisionid=elem.text
                inrevision=False
    print lastrevisionid
    return lastrevisionid

conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor) #connect to db
pagequery=cursor.execute ("select rev_page from samplepagescreated2007 where problem=0 and words is Null")
pages = cursor.fetchall()
for page in pages:
    rev_page=str(page['rev_page'])
    lastrevisionid=getlastid(doc=rev_page+".xml",rev_page=rev_page)
    url ="http://de.wikipedia.org/w/index.php"
    values = {'title' : "", "curid" : rev_page, "oldid" : lastrevisionid}
    data = urllib.urlencode(values)
    print data
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read() #get last revision with specific id
    #do some cleanups:
    content=the_page[the_page.index("<!-- start content -->"):the_page.index("<!-- end content -->")] #extract article content
    content=re.sub("""<div class="printfooter">\n.*</div>""","",content)
    content=content.replace("""<script type="text/javascript">\n//<![CDATA[\n if (window.showTocToggle) { var tocShowText = "Anzeigen"; var tocHideText = "Verbergen"; showTocToggle(); } \n//]]>\n</script>\n""","")
    stripped_content=strip_tags(content.decode("utf-8").replace("&#160;"," ")) #strip tags
    stripped_content=re.sub("Eine gesichtete Version dieser Seite.*, basiert auf dieser Version.","",stripped_content)
    stripped_content=stripped_content.replace("[Bearbeiten]","")
    stripped_content=stripped_content.replace("[Verbergen]","")
    stripped_content=stripped_content.replace(u"&#32;"," ")
    stripped_content=stripped_content.replace(u"&amp;","&")
    #connect to stilversprechend.de
    url2 = 'http://www.stilversprechend.de/stil/bericht.html'#index.html'
    values2 = {'text' : stripped_content.encode("cp1252","ignore")}
    data2 = urllib.urlencode(values2)
    req2 = urllib2.Request(url2, data2, headers)
    response2 = urllib2.urlopen(req2)
    the_page2 = response2.read()
    #extract number of sentences, words, syllables, characters
    x2=re.search("""Ihr Text besteht aus <b>(?P<Saetze>\d*) </b> Sätzen, <b>(?P<Woerter>\d*)</b> W&ouml;rtern, <b>(?P<Silben>\d*)</b> Silben und <b>(?P<Zeichen>\d*)</b> Zeichen.""",the_page2)
    print "Sätze: ",x2.group("Saetze")
    print "Wörter: ",x2.group("Woerter")
    print "Silben: ",x2.group("Silben")
    print "Zeichen: ",x2.group("Zeichen")
    #extract flesch readability score
    x3=re.search("""Der <a href="/stil/fleschwert.html;jsessionid=\S*">Flesch-Wert</a> liegt bei <b> (?P<FleschWert>\d*)</b>.""",the_page2)
    try:
        #important: if there are no sentences, there is no Flesch value and x3 is None
        print "Flesch-ValueGerman",x3.group("FleschWert")
        cursor.execute ("UPDATE samplepagescreated2007 SET fleschd =%s WHERE rev_page = %s",( x3.group("FleschWert"),rev_page,)) #update all
    except AttributeError: #no flesch value
        print "no flesch"
    print 10*"-"
    time.sleep(5) #sleep to reduce load on server
    #save plain text to txt file
    savetextfilehandler=open(str(rev_page)+"txt.txt","w")
    savetextfilehandler.write(stripped_content.encode("cp1252","ignore"))
    savetextfilehandler.close()
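The script above leaves the readability calculation to the external service [10] and only extracts the reported counts and score. For orientation, the following Python 3 sketch shows how a Flesch-type score can be derived from the same three counts (sentences, words, syllables). It assumes the Amstad adaptation of the Flesch Reading Ease for German texts; whether [10] uses exactly this variant is an assumption, and flesch_german is a hypothetical helper name that is not part of the thesis scripts.

```python
def flesch_german(sentences, words, syllables):
    """Flesch Reading Ease in the Amstad adaptation for German (assumed variant)."""
    if sentences == 0 or words == 0:
        return None  # without sentences or words the score is undefined
    asl = words / sentences   # average sentence length in words
    asw = syllables / words   # average number of syllables per word
    return 180.0 - asl - 58.5 * asw
```

Returning None for texts without sentences mirrors the "no flesch" branch of the script, where no value is written to the database.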

The following script was used to calculate the number of words and the number of unique words (vocabulary) in each article. The plain text version of each article stored before is loaded and split into words. The number of words and unique words is counted and stored in the database:

import os.path
import re
import MySQLdb

FOLDER = "..."#path to plain text files saved before
i=0
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
for filename in os.listdir (FOLDER): #iterate over files in folder
    if "txt.txt" in filename: #plain text representations of articles were saved as *txt.txt files
        y=open(FOLDER+filename,"r").read()
        temptextsplit=re.findall('\w+\S+|[^\w\s]+',y.lower()) #split text at whitespaces, convert to lower case
        tempvektordict={} #create dictionary with word frequencies
        tempvektordicthandler=tempvektordict.get
        for item in temptextsplit:
            tempvektordict[item] = tempvektordicthandler(item, 0) + 1
        print len(temptextsplit) #words without spaces
        print len(tempvektordict) #unique words
        cursor.execute ("UPDATE samplepagescreated2007 SET words=%s, wordpool =%s WHERE rev_page = %s",(len(temptextsplit),len(tempvektordict),filename.replace("txt.txt",""),)) #update in db
        print 5*"-"
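The counting step can be illustrated in isolation. The following Python 3 sketch is a simplified stand-in, not the script used in the thesis: it tokenizes with the plain pattern \w+ (an assumption that differs from the pattern in the listing above, which also keeps punctuation attached to words), and count_words is a hypothetical name.

```python
import re
from collections import Counter

def count_words(text):
    """Return (number of words, number of unique words) of a plain text."""
    tokens = re.findall(r"\w+", text.lower())  # lower-cased word tokens
    frequencies = Counter(tokens)              # word -> number of occurrences
    return len(tokens), len(frequencies)

# Example: five words, three of them distinct
print(count_words("Der Hund und der Hund."))
```

The two returned values correspond to the words and wordpool columns updated by the script.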


The total number of unique links in each article was calculated by the following script. Again,

the last revision of each article is determined from the XML file and the respective HTML file

is obtained from Wikipedia. The content of the article is extracted, some clean-ups conducted

and all unique links are counted (page internal links are omitted):

# -*- coding: cp1252 -*-
import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
headers = { 'User-Agent' : user_agent }

def getlastid(doc,rev_page): #find out id of last revision in 2007
    filecelement =open(doc, "r") #assumption doc already exists!
    newerthan2007=False
    articleidfound=False # there are several id fields (article,revision,user)
    upperlimit=datetime.datetime(2008, 1, 1)
    inrevision=False
    revisionid=0
    for event, elem in ElementTree.iterparse(filecelement):
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
            inrevision=True
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
                newerthan2007=True
            else:
                lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here
        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if inrevision==True:
                revisionid=elem.text
                inrevision=False
    print lastrevisionid
    return lastrevisionid

conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor) #connect to db
pagequery=cursor.execute ("select rev_page from samplepagescreated2007 where problem=0 and uniquesumlinks is NULL")
pages = cursor.fetchall()
i=0
problem=0
for page in pages:
    if problem ==0: #stop if there is a problem
        i+=1
        print i," artikel"
        print page['rev_page']
        rev_page=str(page['rev_page'])
        lastrevisionid=getlastid(doc=rev_page+".xml",rev_page=rev_page)
        url ="http://de.wikipedia.org/w/index.php"
        values = {'title' : "", "curid" : rev_page, "oldid" : lastrevisionid}
        data = urllib.urlencode(values)
        print "http://de.wikipedia.org/wiki/index.php?"+data+'""'
        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        the_page = response.read()
        #extract content
        content=the_page[the_page.index("<!-- start content -->"):the_page.index("<!-- end content -->")]
        #do some clean-ups
        content=re.sub("""<div class="printfooter">\n.*</div>""","",content)
        content=content.replace("""<script type="text/javascript">\n//<![CDATA[\n if (window.showTocToggle) { var tocShowText = "Anzeigen"; var tocHideText = "Verbergen"; showTocToggle(); } \n//]]>\n</script>\n""","") #delete toggle link
        content=re.sub("Eine <.*?>gesichtete Version</a> dieser Seite, <.*?>freigegeben</a> am <i>.*?</i>, basiert auf dieser Version.","",content)
        content=re.sub("<span class=\"editsection\">\[<a href=.*?Bearbeiten</a>]</span>","",content) #delete edit links
        content=re.sub(re.compile("<table class=\"metadata\".*?</table>",re.DOTALL),"",content) #delete metadata links not visible
        content=re.sub(re.compile("<span class=\"metadata\".*?</span>",re.DOTALL),"",content) #delete metadata links not visible
        alllinks=re.findall("<a href=\"([^\"]*)\"",content)
        extandintlinks=re.findall("<a href=\"([^#][^\"]*)\"",content) #all links except page internal ones
        pageintlinks=re.findall("<a href=\"(#[^\"]*)\"",content)
        extlinks=re.findall("<a href=\"(http://[^\"]*)\"",content)
        #add https links
        extlinks.extend(re.findall("<a href=\"(https://[^\"]*)\"",content))
        #add ftp links
        extlinks.extend(re.findall("<a href=\"(ftp://[^\"]*)\"",content))
        #add newsgroup links
        extlinks.extend(re.findall("<a href=\"(news://[^\"]*)\"",content))
        #add mail links
        extlinks.extend(re.findall("<a href=\"(mailto:[^\"]*)\"",content))
        relintlinks=re.findall("<a href=\"(/[^\"]*)\"",content)
        print "extandintlinks",len(set(extandintlinks)) #all unique links
        cursor.execute ("UPDATE samplepagescreated2007 SET uniquesumlinks=%s WHERE rev_page = %s",(len(set(extandintlinks)),rev_page,)) #update all
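The link count stored in uniquesumlinks rests on a single regular expression and a set. The following Python 3 sketch isolates that step on a small HTML fragment; count_unique_links and the sample markup are hypothetical, and the pattern is the one used for extandintlinks above.

```python
import re

def count_unique_links(html):
    """Count distinct link targets, omitting page internal links (href="#...")."""
    links = re.findall(r'<a href="([^#][^"]*)"', html)
    return len(set(links))  # the set removes repeated targets

sample = ('<a href="/wiki/A">A</a> <a href="#top">top</a> '
          '<a href="/wiki/A">A again</a> <a href="http://example.org/">ext</a>')
print(count_unique_links(sample))
```

In the sample, the anchor link #top is skipped and the repeated target /wiki/A is counted once.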

7.2.2.3 Analyses based on the Database

For calculating activity and focus in the analyzed communities, another table was created containing all user-page combinations in the sample and the number of edits by each user. Due to the fact that vandalism was detected in less than 1% of all analyzed communities, potential vandals were not removed from these analyses, as doing so would have complicated database queries significantly. Again, bots were excluded:

create table characuser as
    select up1.rev_page,rev_user_text,`count(*)`
    from userpages2007 as up1
    inner join samplepagescreated2007 as up2 on up1.rev_page=up2.rev_page
    where up2.problem=0 and up2.userscalc>1 and up1.bot=0;
alter table characuser add column edits2007 int unsigned; #edits by each user in 2007 in Wikipedia
alter table characuser add column percentofedits2007 double unsigned; #used to calculate ratio (edits in article/edits2007)

In a first step, the number of edits by each user in the sample in 2007 (activity) was calculated with the following function:

# -*- coding: cp1252 -*-
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_editsuser2007(): #activity
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #all users
    cursor.execute ("select distinct rev_user_text from characuser where edits2007 is Null")
    resultuser=cursor.fetchall()
    for item in resultuser:
        cursor.execute ("select sum(`count(*)`) from userpages2007 where rev_user_text=%s and page_namespace=0",(item["rev_user_text"],))
        result = cursor.fetchone()
        print item["rev_user_text"],result["sum(`count(*)`)"]
        cursor.execute ("UPDATE characuser SET edits2007= %s WHERE rev_user_text = %s",(result["sum(`count(*)`)"],item["rev_user_text"],)) #do for all

In a next step, the ratio of edits in the analyzed article and in other Wikipedia articles in 2007 was calculated for each user-page combination:

# -*- coding: cp1252 -*-
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_percentofedits2007(): #focus
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #calculate for every user/rev_page combination
    cursor.execute ("select rev_page,rev_user_text,`count(*)`/edits2007 as ratio from characuser where percentofedits2007 is Null;")
    resultcombinations=cursor.fetchall()
    for item in resultcombinations:
        cursor.execute ("UPDATE characuser SET percentofedits2007= %s WHERE rev_user_text = %s and rev_page=%s",(item["ratio"],item["rev_user_text"],item["rev_page"])) #do for all

The average activity and focus of each community was then stored in a temporary table and updated in the samplepagescreated2007 table:

create temporary table avgfocus_activitytemp as
    select rev_page, avg(edits2007), avg(percentofedits2007)
    from characuser
    group by rev_page
    order by rev_page asc;
update samplepagescreated2007 as t1, avgfocus_activitytemp as t2
    set t1.avgactivity=t2.`avg(edits2007)`, t1.avgfocus=t2.`avg(percentofedits2007)`
    where t1.rev_page=t2.rev_page;
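The two measures can be traced on a toy example. The following Python 3 sketch works on in-memory dictionaries instead of the database tables; the function name, the data layout and the numbers are hypothetical and only illustrate the arithmetic: activity is a user's total number of article edits in 2007, focus is the share of those edits made in the analyzed article, and both are averaged over the users of each article.

```python
from collections import defaultdict

def average_activity_and_focus(article_edits, total_edits):
    """article_edits: {(page, user): edits in that article in 2007}
    total_edits:   {user: edits in all articles in 2007}
    Returns {page: (average activity, average focus)}."""
    per_page = defaultdict(list)
    for (page, user), edits in article_edits.items():
        activity = total_edits[user]   # activity of this user
        focus = edits / activity       # share of the user's edits in this article
        per_page[page].append((activity, focus))
    result = {}
    for page, rows in per_page.items():
        n = len(rows)
        result[page] = (sum(a for a, _ in rows) / n,
                        sum(f for _, f in rows) / n)
    return result

article_edits = {(1, "a"): 2, (1, "b"): 5, (2, "a"): 8}
total_edits = {"a": 10, "b": 50}
print(average_activity_and_focus(article_edits, total_edits))
```

For article 1, the users have 10 and 50 edits overall (average activity 30) and devote 20% and 10% of them to the article (average focus 0.15).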

To calculate the heterogeneity of an online community’s users, the number of articles each user edited in 2007 was determined. In a first step, a column was added to the characuser table:

alter table characuser add column articles2007 int unsigned; # nr. of articles edited by each user in 2007

These fields were populated with the help of the following Python script …

# -*- coding: cp1252 -*-
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_articlesuser2007():
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #get all users
    cursor.execute ("select distinct rev_user_text from characuser where articles2007 is NULL")
    resultuser=cursor.fetchall()
    for item in resultuser:
        cursor.execute ("select count(*) from userpages2007 where rev_user_text=%s and page_namespace=0;",(item["rev_user_text"],)) #get number of articles
        result = cursor.fetchone()
        print item["rev_user_text"],result["count(*)"]
        cursor.execute ("UPDATE characuser SET articles2007= %s WHERE rev_user_text = %s",(result["count(*)"],item["rev_user_text"],)) #update table

…and the average number of articles users in the sample contributed to was calculated.

create temporary table nrofarticles2007 as
    select distinct rev_user_text,articles2007 from characuser;
select avg(articles2007) from nrofarticles2007;

The resulting figure (265) was used as the number of articles on which the users of each community were compared (the most important articles edited by the community's users), and the average heterogeneity of each community was calculated with the help of the cosine similarity function:

import numpy
import MySQLdb

def calc_heterogeneity(rev_page,vandalismuser):
    pagevektor=[]
    uservektordict={}
    if len(vandalismuser)==0:
        string="''"
    else:
        string=",".join(vandalismuser) #join vandalismuser string (vandals identified)
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    dimensions=265 #compare community users on the 265 most important articles of each community
    cursor.execute ("select count(*),up1.rev_page,up1.page_title from userpages2007 as up1 inner join userpages2007 as up2 on up1.rev_user_text=up2.rev_user_text where up2.rev_page=%s and up2.bot=0 and up1.page_namespace=0 and up2.rev_user_text not in ("+string+") group by up1.rev_page Order by `count(*)` desc limit %s;",(rev_page,dimensions))
    resultpages = cursor.fetchall()
    for page2 in resultpages:
        pagevektor.append(page2['rev_page'])
    sortedpagevektor=sorted(pagevektor)
    stringmostcommonsites=",".join([str(el) for el in sortedpagevektor]) #create a vector of the most important articles
    cursor.execute ("select distinct rev_user_text from userpages2007 where bot=0 and page_namespace=0 and rev_page=%s and rev_user_text not in ("+string+");",(rev_page,))
    articleusers = cursor.fetchall() #users of the article without vandals
    usernumber=0
    if len(articleusers)>1:
        for user in articleusers:
            uservektor=numpy.zeros(len(sortedpagevektor),dtype=int) #create vector for each user
            i=0
            usersites=cursor.execute("select * from userpages2007 where rev_user_text=%s and page_namespace=0 and rev_page in ("+stringmostcommonsites+");",(user['rev_user_text'],))
            resultnumber=cursor.fetchall()
            dictforuser={}
            for item in resultnumber:
                dictforuser[item['rev_page']]=item['count(*)']
            #populate vector with numbers of edits per article
            for page in sortedpagevektor:
                if dictforuser.has_key(page):
                    uservektor[i]=dictforuser[page]
                i+=1
            uservektordict[usernumber]=uservektor
            usernumber+=1
        mat=numpy.zeros((len(articleusers),len(articleusers))) #create matrix
        usernumber=0
        for user in articleusers:
            usernumber2=0
            for user2 in articleusers:
                #compare each user pair (cosine similarity)
                mat[usernumber,usernumber2]=mat[usernumber2,usernumber]=float(numpy.dot(uservektordict[usernumber],uservektordict[usernumber2])) / (numpy.linalg.norm(uservektordict[usernumber]) * numpy.linalg.norm(uservektordict[usernumber2]))
                usernumber2+=1
            usernumber+=1
        #calculate average heterogeneity (1-average similarity); the diagonal (self-similarity) is excluded
        heterogeneity=1-(mat.sum()-len(articleusers))/(len(articleusers)*(len(articleusers)-1))
    else:
        heterogeneity=0 #if there is only one user, heterogeneity=0
    return heterogeneity
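The heterogeneity measure computed above is one minus the average pairwise cosine similarity of the users' edit vectors. The following Python 3 sketch reproduces this calculation without a database or numpy on small hand-made vectors; the function names and the example vectors are hypothetical. One deliberate deviation is labeled in the code: a user whose vector contains only zeros is given a similarity of 0 to everyone, which is an assumption made here to avoid a division by zero.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two edit-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: all-zero vectors count as dissimilar

def heterogeneity(user_vectors):
    """1 - average cosine similarity over all pairs of distinct users."""
    if len(user_vectors) < 2:
        return 0.0  # a single user: heterogeneity is set to 0, as in the listing
    pairs = list(combinations(user_vectors, 2))
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity

# Two users with identical interests and one with disjoint interests
print(heterogeneity([[1, 0], [1, 0], [0, 1]]))
```

Averaging over unordered pairs gives the same value as the matrix formula in the listing, which sums the full symmetric matrix, subtracts the diagonal and divides by n(n-1).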

To assess the dynamics of collaboration within each article the time difference between each

revision and the following revision was determined and the median calculated.

# -*- coding: cp1252 -*-
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_mediantimebetweeneditsarticle2007(): #dynamics
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #all pages:
    cursor.execute ("select distinct rev_page from samplepagescreated2007 where problem=0 and userscalc>1 and mediantimebetweenedits is NULL")
    resultpages=cursor.fetchall()
    i=0
    for item in resultpages:
        timediffs=[]
        cursor.execute("SELECT @seq:=0;")
        cursor.execute("drop table if exists revisiontemp2007;")
        #consecutively number revisions in a new table, omit bots
        cursor.execute("create table revisiontemp2007 as SELECT @seq:=@seq+1 as Rank, t1.rev_id, t1.rev_page,t1.rev_user_text,t1.rev_timestamp from revision2007 as t1 where t1.rev_page=%s and t1.rev_user_text not in(select rev_user_text from bots) order by t1.rev_timestamp",(item["rev_page"],))
        #join each revision with its predecessor and compute the time difference in seconds
        cursor.execute("SELECT t1.rank,t2.rank,timestamp(t1.rev_timestamp),timestamp(t2.rev_timestamp),TIMESTAMPDIFF(second,t2.rev_timestamp,t1.rev_timestamp) AS timedif FROM revisiontemp2007 AS t1, revisiontemp2007 AS t2 WHERE t1.rank = t2.rank+1;")
        results=cursor.fetchall()
        for timeitem in results:
            timediffs.append(timeitem["timedif"])
        print item["rev_page"],";",numpy.median(timediffs) #calculate median
        cursor.execute ("UPDATE samplepagescreated2007 SET mediantimebetweenedits= %s where rev_page=%s",(numpy.median(timediffs),item["rev_page"],)) #do for all
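The same measure can be sketched without the temporary ranking table. The following Python 3 example takes a list of revision timestamps, sorts them, computes the differences between consecutive revisions and returns the median in seconds. The function name is hypothetical, and the timestamp format is assumed to be the one of the XML export parsed in getlastid; the format stored in the revision2007 table may differ.

```python
from datetime import datetime
from statistics import median

def median_seconds_between_edits(timestamps):
    """Median time in seconds between consecutive revisions of one article."""
    times = sorted(datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ") for t in timestamps)
    diffs = [(later - earlier).total_seconds()
             for earlier, later in zip(times, times[1:])]
    return median(diffs) if diffs else None  # fewer than two revisions: undefined

revisions = ["2007-01-01T00:00:00Z", "2007-01-01T00:01:00Z",
             "2007-01-01T00:04:00Z", "2007-01-01T00:05:00Z"]
print(median_seconds_between_edits(revisions))
```

The example revisions are 60, 180 and 60 seconds apart, so the median is 60 seconds; the single long gap does not affect the result, which is the reason for preferring the median over the mean here.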

Finally, the control variable article age in seconds was calculated with a standard SQL statement subtracting the creation date of an article from 01.01.2008 00:00:00.

select rev_page, TIMESTAMPDIFF(second,creationdate,20080101000000) from samplepagescreated2007 where problem=0 and userscalc>1;
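The equivalent calculation outside the database is a plain date subtraction. The following Python 3 sketch assumes that the creation date is available as a datetime object; article_age_seconds and REFERENCE_DATE are hypothetical names.

```python
from datetime import datetime

REFERENCE_DATE = datetime(2008, 1, 1, 0, 0, 0)  # end of the observation period

def article_age_seconds(creation_date):
    """Age of an article in seconds on 01.01.2008 00:00:00."""
    return int((REFERENCE_DATE - creation_date).total_seconds())

# An article created exactly one day before the reference date
print(article_age_seconds(datetime(2007, 12, 31)))
```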


7.3 References used in the Appendix:

[1] Wikipedia 2009, Special:Export, viewed 23.04.2009, <http://en.wikipedia.org/wiki/Special:Export>.

[2] Wikipedia 2009, Virtual community, viewed 23.04.2009, <http://en.wikipedia.org/wiki/Virtual_community>.

[3] Mediawiki Repository 2009, viewed 16.04.2009, <http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql>.

[4] Wikimedia 2008, dewiki dump progress on 20080607, viewed 16.04.2009, <http://download.wikimedia.org/dewiki/20080607/>.

[5] Mediawiki 2009, MWDumper, viewed 16.04.2009, <http://www.mediawiki.org/w/index.php?title=MWDumper&oldid=242629>.

[6] Wikipedia 2009, Benutzerverzeichnis, viewed 16.04.2009, <http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&username=&group=bot&limit=5000>.

[7] Wikipedia 2009, Namespace, viewed 16.04.2009, <http://en.wikipedia.org/w/index.php?title=Wikipedia:Namespace&oldid=275699788>.

[8] Mediawiki 2009, Identifying a page or revision, viewed 16.04.2009, <http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php#Identifying_a_page_or_revision>.

[9] Gude 2008, wikipedia-article-exporter, viewed 16.10.2008, <http://code.google.com/p/wikipedia-article-exporter/>.

[10] stilversprechend.de 2009, viewed 16.04.2009, <http://www.stilversprechend.de>.