
Information Processing and Management 44 (2008) 702–737

www.elsevier.com/locate/infoproman

Zero, single, or multi? Genre of web pages through the users’ perspective

Marina Santini

University of Brighton, Lewes Road, Brighton, UK

Received 7 February 2007; received in revised form 21 May 2007; accepted 26 May 2007
Available online 20 July 2007

Abstract

The goal of the study presented in this article is to investigate to what extent the classification of a web page by a single genre matches the users’ perspective. The extent of agreement on a single genre label for a web page can help understand whether there is a need for a different classification scheme that overrides the single-genre labelling. My hypothesis is that a single genre label does not account for the users’ perspective. In order to test this hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects) asking them to assign only a single genre label to each of the web pages. Users could choose from a list of 21 genre labels, or select one of the two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale was to observe the level of agreement on a single genre label per web page, and draw some conclusions about the appropriateness of limiting the assignment to only a single label when doing genre classification of web pages. Results show that users largely disagree on the label to be assigned to a web page.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: User study; Web users; Genre classification; Web pages; Web genres; User warrant

1. Introduction

The goal of the study¹ presented in this article is to investigate to what extent the classification of a web page by a single genre matches the users’ perspective. The extent of agreement on a single genre label for a web page can help understand whether there is a need for a different classification scheme that overrides the single-genre labelling.

Although it has already been pointed out that users often disagree in assigning a genre label to a web page, and that genre classification using a single genre label does not reflect the users’ perspective (e.g. cf. Rosso, 2005, p. 116), the single genre classification scheme is still largely used in many fields, such as corpus studies (e.g. the genre annotation of the British National Corpus by David Lee²), automatic genre classification (e.g. Lim, Lee, & Kim, 2005; Meyer zu Eissen & Stein, 2004), or document retrieval evaluation (e.g. TREC HARD 2003 and 2004, or TREC-2006 Blog Track³).

0306-4573/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

doi:10.1016/j.ipm.2007.05.011

E-mail address: [email protected]

1 The study presented in this article was carried out within my Ph.D. research on automatic identification of genre in web pages at the University of Brighton (UK), and is based on volunteering web users from several countries. I am very grateful to these volunteers and warmly thank them.

Genre is a notoriously complex concept to pin down in its entirety. Even if we limit our investigations to the genre of written documents, definitions abound. Many interpretations have been proposed since Aristotle’s Poetics, and recently definitions of genre have been adapted to the new digital environments, like corporate intranets or the web (e.g. cf. Erickson, 1999; Yates & Orlikowski, 1992).

The lack of a unified view of what genre is constitutes only one of the hurdles in genre classification. An additional problem is represented by the loose boundaries between the term ‘genre’ and other neighbouring terms, such as ‘register’, ‘domain’, ‘topic’, ‘text types’, ‘style’, or ‘folksonomies’. For this reason, some scholars and researchers working with genre have proposed a multi-facetted classification. The multi-facetted classification has its roots in the works of S.R. Ranganathan, an Indian scholar, who posited that any complex entity could be viewed from a number of perspectives or facets. The multi-facetted classification is a multi-dimensional view that highlights different aspects in a document, not necessarily different genres. It was adopted by Kessler, Nunberg, and Schütze (1997) in automatic genre classification of text documents,⁴ and it was suggested as a viable solution to handle classification problems by Tyrvainen and Paivarinta (1999) for document management in a corporate environment, and by Crowston and Kwasnik (2004) in information studies.

In this article I would like to focus on the genre dimension. More specifically, I will explore the users’ perspectives on genres of web pages (e.g. FAQS, ESHOPS, HOW-TOS, or BLOGS). My unit of analysis is then the individual web page. Genres on the web can also be studied using other units of analysis; for example, they can be analysed at website level, as preferred by Shepherd and Watters (1999), Rehm (2005), or Mehler and Gleim (2006).

Broadly speaking, genres are textual categories that streamline communication relying on acknowledged conventions and raising predictable expectations. For instance, the conventions underlying the FAQs genre are represented by a sequence of questions about recurrent issues accompanied by related answers. When browsing FAQs web pages, users’ expectations are to find information or instructions to solve common problems. The majority of websites – from banking to travel, to sport, to Do-It-Yourself, to computing – include FAQs web pages.

Optimistically, Karlgren, one of the leading scholars in automatic genre identification, states that the term ‘genre’ is established and generally understood, at least intuitively (Karlgren, 2004). However, difficulties arise when the task is to find agreement on the genre label to be assigned to a document. This task is even more arduous when the document to be classified by genre is the individual web page (e.g. see the small-scale user study in Santini, 2005). Web pages are often more unpredictable and difficult to sort into a single genre than documents in other media, like paper printing, where social rigidity, work practices or stable settings favour more controlled and standardized text production (Yates & Sumner, 1997).

A web page often appears to be composite, with a visual organization of the space, where different communicative purposes and several functions are included at the same time. Textual complexity is not an exclusive trait of web pages. For instance, the interweaving of visual and verbal is common in magazine covers, as highlighted by Held (2005). What is new is the widespread use of composite and complex web pages. For example, the space in a web page is often divided into different sections, organized by lists of links – mainly isolated noun structures or verbal elements, as reported by Haas and Grams (2000, pp. 186–187) – and snippets of text scattered around the main body of the document – like navigational buttons, menus, ads, and search boxes – that are visually dislocated in different areas of a single page. The detection of regularities in the visual form, or “shape”, of digital documents, and the formation of emerging cognitive models play an important role in recent research on digital genres (e.g. cf. Dillon, 2000; Dillon & Vaughan, 1997; Toms & Campbell, 1999). Additionally, hyperlinking (studied by Crowston & Williams, 1999; Haas & Grams, 1998; Jucker, 2002), interactivity and multi-functionality (analysed by Shepherd & Watters, 1999) affect the textuality of web pages, which also heavily rely on the use of images and other graphical elements. Although the use of fonts of different types, sizes, and colours, as well as the use of formatting devices – like columns, lines separating different sections of a document, pictures, etc. – has a long-standing tradition (see Waller, 1987), a NEWSPAPER ARTICLE organized in columns and headlines does not lose its specific linguistic and textual characteristics when included in a corpus such as the British National Corpus. The same is not true for many web pages, since the visual structure of a web page incorporating a NEWSPAPER ARTICLE in most cases cannot be flattened out or ignored without losing important information, as noted by Watters and Shepherd (1997) in their interpretation of the digital broadsheet as an evolving genre. As highlighted by Furuta and Marshall (1996), the complexity of web pages can be explained by the flexibility provided by HTML and the simplicity of such a language that allows the creation of more complex texts without much effort or expertise. Hence, a web page can be considered as a sort of container of multiple texts – so much so that in coding the pages of their sample, Haas and Grams (2000) repeatedly encountered pages that could be interpreted as comprising several textual components. Solutions such as the artificial separation of what is considered the main body from the rest would be an arbitrary operation and would not make sense in many cases, for instance in a search page similar to webpage_05 or webpage_06 (see Appendix). In brief, in a web page, not all the elements necessarily belong together, but they all contribute to form a “unified whole”, even without any linear progression, as explained by Johnsen (2000) through the concepts of “rhetorical clusters” and “perceptual cohesion”.

2 See Lee (2001).

3 The goal of TREC HARD was to achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context. In particular, in TREC HARD 2003, genre was included among the metadata in the following form:
item = GENRE represents the type of material the searcher is interested in.
value = OVERVIEW means the searcher is interested in general news related to the topic.
value = REACTION indicates the searcher is looking for news commentary on the topic.
value = I-REACTION is like REACTION but is specifically about non-U.S. news commentary.
value = ADMINISTRATIVE means the searcher is interested in official US government documents.
value = ANY indicates that any genre is acceptable or none was indicated.
Instead, in TREC HARD 2004 genre had values of news-report, opinion-editorial, other, or any. In TREC HARD 2005 the genre attribute was not included. The HARD track last ran in TREC 2005. The Blog track was introduced in TREC 2006. The purpose of the single-genre Blog track is to explore information seeking behaviour in the blogosphere (cf. Ounis et al., 2006).

4 Kessler et al. (1997) proposed three facets – brow, narrative, and genre – that relate respectively to the kind of language used in the text, the rhetorical typology and the genre itself. Such an approach returns three independent and unrelated classifications, where the classification by genre corresponds to the traditional single-genre labelling.

The textual complexity of web pages also accounts for the ‘malleability’ of genre. As Yates and Orlikowski (1992) and others have stressed, genres are rarely homogeneous. On the contrary, they tend to overlap and mix. In an open communication space like the web, where many communities meet, each with its own genre system and repertoire (cf. Crowston & Williams, 2000), phenomena such as genre colonization,⁵ genre combination (cf. Østerlund, 2006) and genre contamination, common also in other environments, are likely to occur.

Although it is possible to do genre classification leaving the concept of genre implicit, and simply rely on intuition, I would like to propose a characterization of genre classes that is suitable for complex documents, like web pages. I see genres as named communication artefacts, linked to a society or community, characterized by conventions, raising expectations, showing hybridism or individualization, and undergoing evolution. As exemplified earlier with FAQs, genres show sets of standardized or conventional characteristics that make them recognizable, and this identity raises specific expectations. Genres also evolve over time, following social and cultural needs. For instance, longitudinal studies show that the NEWSPAPER genre has changed when moving from paper printing to the web (Ihlstrom & Henfridsson, 2005). Together with conventions and expectations, genres have many other traits. With my characterization, I would like to focus on two traits that seem to be important on the web, namely hybridism and individualisation. As a matter of fact, genres are not mutually exclusive and different genres can be merged into a single document, generating hybrid forms. Additionally, genres allow us a certain freedom of variation and consequently can be individualised. It is also important to note that before genre conventions become fully standardized, a genre does not have an official name. A genre name becomes acknowledged when the genre itself has an active role and a communicative function in a community or society (Swales, 1990, pp. 54–57). Before this acknowledgement, a genre shows hybrid or individualised forms, and indistinct functions. Currently, many web pages seem to be characterized by the attributes of hybridism and individualization, showing multiple genres or no genre at all.

5 According to Beghtol (2001), genre colonization occurs when “the vocabulary and text forms of one field are used to rationalize and legitimize changes in another. For example, discussions of students as both the “consumers” and the “products” of an educational institution use terms from the field of marketing to create new kinds of expectations in the field of education. This analogical reasoning likening one field to another extends to the development of analogous text genres such as the creation of marketing plans, mission statements and outcome analyses for educational institutions. In cases of genre colonization, readers must have expectations both for the genres of one field and of the standard structures and expectations of another field”.

Since web pages tend to be complex documents, the study presented in this article aims at investigating whether the classification of a web page by a single genre label does account for the users’ perspective.

The difficulty of assigning a single genre to web pages is well expressed by Rosso (2005), when he reports the comments of the participants in his studies:

“In summary, the comments provided much insight into participants’ experiences of single-genre web-page categorization: problem with pages fitting multiple categories, problems with pages fitting no categories, and general recognition of the characteristic formal elements of many of the web pages.” (Rosso, 2005, p. 116)

This difficulty has been pointed out also by other scholars and researchers who carried out surveys about the genre of web pages, namely, Dewe, Karlgren, and Bretan (1998),⁶ Haas and Grams (2000),⁷ Crowston and Williams (2000),⁸ Roussinov et al. (2001),⁹ and Meyer zu Eissen and Stein (2004).¹⁰ All of them have hinted at the difficulty of fitting a web page into a single genre for different reasons: either the page is multi-genre, or without any genre; either genre conventions are unclear, or genre taxonomy is fuzzy; and so forth. However, the authors of these surveys limit themselves to pointing out this difficulty without questioning or discussing the adequacy of the single genre scheme in their classification results. Naturally, this stance is justified by the practical purpose of exploring genres on the web (Crowston & Williams, 2000), or finding genre categories useful for genre retrieval (Dewe et al., 1998; Meyer zu Eissen & Stein, 2004; Roussinov et al., 2001), or finding suitable categories for web pages (Haas & Grams, 1998). Therefore, these studies are very informative in many respects.

My aim is different. As mentioned earlier, I wish to investigate whether the classification of a web page by a single genre represents the users’ perspective on the genre of web pages. My hypothesis is that a single genre label does not account for the users’ perspective on the individual web page. As explained above, the visual organization of web pages favours the tiling of different types of text, not necessarily connected to each other. For this reason, users tend to focus on the type of text, or on the textual function, they are more interested in, when they classify a web page by a single genre. In this way, they create different genre perspectives on the same web page. In order to test my hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects) asking them to assign only a single genre label to each of the web pages. Users could choose from a list of 21 genre labels, or select one of the two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale is to observe the level of agreement on a single genre label per web page, and draw some conclusions about the appropriateness of limiting the assignment to only a single label when doing genre classification of web pages. Results show that users largely disagree on the label to be assigned to a web page.

6 “Several respondents pointed out that the categories were not mutually exclusive. In summary, the most central objections were either such that would be remedied in an interactive situation where examples are readily available, or requests for more flexible genre assignment” (Dewe et al., 1998).

7 “In coding the pages in our samples for page type, we repeatedly encountered pages that could be interpreted as comprising more than one page type” (Haas & Grams, 2000).

8 “[. . .] we had difficulty assigning genres to a number of the pages, most often when we agreed there was a genre, but simply did not know the name. In other cases, we could not determine the purpose of communication, making the assignment of a genre problematic” (Crowston & Williams, 2000).

9 “Unfortunately, not all pages could be classified.” And also “[. . .] many disagreements where hierarchical rather than disagreements in kind. More generally, since some genres are similar to each other (e.g. NEWS BULLETIN and PRESS RELEASE) we would like to develop an agreement metrics that takes this similarity into consideration” (Roussinov et al., 2001).

10 “An inherent problem of Web genre classification is that even humans are not able to consistently specify the genre of a given web page. Take for example a tutorial on machine learning that could be either classified as scholar material or as article” (Meyer zu Eissen & Stein, 2004).


This article is organized as follows: Section 2 describes the principle of ‘user warrant’ applied and adapted to analyse the users’ perspective on genre of web pages; Section 3 explains the study design; in Section 4, I report the experimental results and discuss them; finally in Section 5, I draw some conclusions.

2. Users’ perspective on the genre of web pages: the principle of ‘user warrant’ applied to genre

One may argue with Stam (2000, p. 14) “Are genres really ‘out there’ in the world, or are they merely the constructions of analysts?”. Although the views of intellectuals, academics, or genre analysts are undoubtedly important because they contain the level of abstraction or generalization that non-experts usually do not have, the users’ view on genre cannot be simply ignored. As a matter of fact, the users’ perspective is fundamental when it comes to the application of the concept of genre for practical purposes, like building a corpus annotated by genre or enabling users to search by genre on the web or in a digital library. This view from the bottom has already been applied in similar tasks, for example by Aires, Santos, and Aluísio (2005) for compiling a corpus according to what the user wants, or by Rosso (2005) to assemble a genre palette useful for web searches.

In the present study, the users’ perspective is expressed by the assignment of a single (genre) label to a web page. I put the term ‘genre’ into brackets because the full set of labels employed in the study includes 21 web genre labels plus two ‘escape’ labels, namely the ‘Add a label’ and ‘I don’t know’ options.

The difficulty of assigning a single label to a web page is analysed using the principle of ‘user warrant’. According to the standard ANSI/NISO¹¹ Z39.19-2005 (NISO, 2005):

“User warrant is generally reflected by the use of terms [my emphasis] in requests for information on the concept or from searches on the term by users of an information storage and retrieval system”. (Z39.19-2005, NISO, 2005)

This concept was discussed by Lancaster (1986) to explain the inclusion of certain indexing terms in retrieval systems. Currently, the “Information Retrieval Expert Answers”¹² web-based service defines this concept as

“User warrant means that the vocabulary of users or potential users should be accepted as terminology for index headings, descriptors, or preferred terms in thesauri, because it is warranted (authorized) through actual usage by users.”

User warrant is one of the standards used by multi-lingual countries to develop their vocabularies. For example, it is included in the Guide to the Development and Maintenance of Controlled Vocabularies in the Government of Canada,¹³ supported by the Treasury Board Secretariat of Canada (TBS), where it is defined as follows:

“‘User warrant’ refers to a justification for selecting terms based on words or phrases employed by users [my emphasis] of information resources for information retrieval or information management. Evidence of such usage may be derived from search engine logs or interviews. User warrant ensures that the language of the vocabulary matches the language of the user community.” (Guide to the Development and Maintenance of Controlled Vocabularies in the Government of Canada)

More related to my study is the application of the principle of user warrant as proposed in Rosso (2005, p. 104):

“if the majority of the participants agree that a web page is of a particular genre, then it is. The higher the level of agreement, the more we might infer that a particular genre is socially recognized” (Rosso, 2005, p. 104).

Building on Rosso’s user warrant interpretation, I extend this principle to adapt it to the purpose of my study in the following way:

11 “NISO, the National Information Standards Organization, a non-profit association accredited by the American National Standards Institute (ANSI), identifies, develops, maintains, and publishes technical standards to manage information in our changing and ever-more digital environment. NISO standards apply both traditional and new technologies to the full range of information-related needs, including retrieval, re-purposing, storage, metadata, and preservation.” See <http://www.niso.org/about/index.html>.

12 Available at: <http://information-retrieval.expert-answers.net/information-retrieval-glossary/en/>.

13 Available at: <http://www.tbs-sct.gc.ca/im-gi/mwg-gtm/cvsg-sgvc/docs/2005/vocab/vocab08_e.asp>.


Single genre: when the agreement of many users on a single genre label is high, then the web page is not difficult to classify into a single genre, i.e. a single-genre classification can be profitable. This applies also to a case when the majority of web users use the ‘Add a label’ option to add new, but consistent, genre labels for the same web page.

Multiple genres: when the agreement of users on the genre of a web page is fragmented across a limited number of genre labels, then the web page is difficult to sort into a single genre, i.e. a multi-genre classification can be profitable.

No genre: when a large part of agreement aggregates around the two ‘escape’ options, ‘Add a label’ and ‘I don’t know’, or is fragmented across many labels, then the web page cannot be easily sorted into any genre, i.e. a zero-genre classification can be more appropriate.
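The three cases above can be read as a decision rule over the distribution of the labels assigned to one web page. The following Python sketch is only one possible reading of that rule, not the procedure used in the study. The study states no numeric cut-offs at this point, so the thresholds `high`, `minor` and `max_genres`, the `added:` prefix used to mark labels typed through ‘Add a label’, and the function names are all assumptions introduced for illustration.

```python
from collections import Counter

IDK = "I don't know"      # the 'I don't know' escape option
ADDED_PREFIX = "added:"   # hypothetical marker for labels typed via 'Add a label'


def label_shares(labels):
    """Return the percentage of responses that each label received."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def warrant_category(labels, high=50.0, minor=10.0, max_genres=3):
    """Sort one web page into 'single', 'multi' or 'zero' genre.

    All three thresholds are illustrative assumptions, not values from the study.
    """
    shares = label_shares(labels)
    # Listed labels and consistently added labels both count as named genres.
    named = {l: s for l, s in shares.items() if l != IDK}
    if max(named.values(), default=0.0) >= high:
        return "single"
    # Share of responses that fell on either of the two escape options.
    escape = sum(s for l, s in shares.items()
                 if l == IDK or l.startswith(ADDED_PREFIX))
    if escape >= high:
        return "zero"
    # Named labels that each attracted a substantial share of the responses.
    strong = [s for s in named.values() if s >= minor]
    if 2 <= len(strong) <= max_genres and sum(strong) >= high:
        return "multi"
    # Agreement is scattered across many labels.
    return "zero"


votes = ["BLOG"] * 4 + ["PERSONAL HOME PAGE"] * 4 + ["CLOG"] * 2
print(warrant_category(votes))
```

In this sketch the single-genre test runs first, so that a consistent label added by the majority of users counts as agreement rather than as an escape, which mirrors the second sentence of the single-genre case. Different thresholds would move pages between the three cases, so any concrete values would need to be justified against the observed data.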

3. Study design

As mentioned earlier, this study relies on the users’ perspective of the genre of web pages. The users’ perspective is expressed by labelling a web page using one label, and analysed using the principle of user warrant. I adapted this principle to my needs building on Rosso’s interpretation. The idea behind my adaptation of this principle is to exploit the different levels of agreement on the genres of web pages to assess the difficulty of classifying a web page by a single genre. For instance, if many users agree mostly on a single genre label when they see a web page, then the web page is not difficult to classify into a single genre. By contrast, when the agreement of users on the genre of a web page is fragmented across a number of genre labels, then the web page is difficult to sort into a single genre. The unit of analysis is the individual web page.

The screenshots of the 25 web pages used in this study, and the original URLs (Table A1), are shown in the Appendix in sequential order. These web pages are also available online as bitmap files at <http://www.nltg.brighton.ac.uk/home/Marina.Santini/>. The web pages included in the study were downloaded in early 2005.

Criteria of selection:

I collected from the web 21 web pages that could represent the 21 genre labels. Ideally, each web page represents a single genre or a predominant genre. In other words, each of these web pages is a stereotypical exemplar of a web genre. For the association one web genre = one web page I used a criterion of ‘objective sources’ (Santini, 2006a). In brief, I selected these ‘representative’ web pages in either of the following ways:

• The web pages showed the name of the genre in the title. For example, FAQS was in the title of webpage_12, and HOTLIST was in the title of webpage_15 (see Appendix);
• The web pages were downloaded from genre-specific portals or archives. For example, webpage_01 representing an ESHOP was downloaded through <http://www.shops.co.uk>;
• The genre was included in the URL. For example, webpage_10 representing the ABOUT PAGE (see Appendix) was downloaded from <http://infogistics.com/about.html>.

In the set of web pages, I also included three web pages from the SPIRIT collection¹⁴ (see webpage_18, webpage_21, and webpage_23 in Appendix), and a common web page type, without any official genre name, downloaded from the live web (see webpage_17 in Appendix). I assigned nicknames to these four web pages to facilitate any reference to them during the discussion of the results. These nicknames are the following:
– webpage_17 = RYANAIR;
– webpage_18 = ADIRONDACK;
– webpage_21 = CITIDEX;
– webpage_23 = LENS HOLDER.

14 The SPIRIT collection is a random crawl carried out in 2001 and bootstrapped by a set of educational websites (Joho & Sanderson, 2004). It contains individual web pages and not full websites.


In summary, the total number of web pages included in the study is 25, namely 21 web pages selected with the criterion of ‘objective sources’, and four arbitrary web pages.

Table 1 lists the web genres included in the study, and the web page representing that genre. It is important to note that my initial association was never shown to the participants and has no influence on the results of the study. Discussion and conclusions are based on the labels assigned by the 135 web users and summarized in Table 3, and not on my initial choices or associations, which should be merely considered as a random starting point. The web pages presented to the subjects are arbitrary web pages that can be found on the web, and the labels were suggested by the web page creators, either through the title or in some other way. It is worth recalling that the subjects had the possibility of specifying their own labels by using the two ‘escape’ options.

3.1. Comparison with Rosso’s experiment 3

Mark Rosso conducted an experiment (Rosso, 2005, pp. 103–131) that may appear similar to the study that I present in this article. I will highlight the main similarities and differences between the two experiments to understand better the rationale behind them. This comparison does not imply that my experimental choices are better than Rosso’s, or vice versa. The breakdown of similarities and differences should help the readers understand the different experimental stances. In brief, Rosso’s objective is to measure the users’ recognition of the genres in his palette. He emphasizes the consensus among users in order to validate a genre palette to be used for web searches. My objective is to explore to what extent users agree on a single genre label per web page. I analyse the distribution of the different labels assigned to a web page, and emphasize the dissension among users. My motivation is to justify the adoption of a more flexible genre classification scheme that goes beyond the single genre assignment.

3.1.1. Similarities

• The use of user warrant to assess the users’ perception on genre of web pages, instead of the traditional user-respondent agreement.
• The selection of different degrees of genre typicality, including web pages that do not seem to fit into any categories.
• The use of human subjects belonging to a university environment.
• The use of percentage to measure the agreement (or “consensus”, in Rosso’s terminology).

3.1.2. Differences

• My goal is to show that single-genre classification of web pages does not represent the users’ perspective.
– Rosso’s goal is to validate a genre palette by showing that users can recognize the 18 genres he has identified in previous experiments.

• I do not provide any definition list of the genres presented to the users. I prefer that users follow their own idea of genre.
– Rosso provides a definition list, which users can go back to at any time to check the descriptions of genres.

• I present 25 web pages to the subjects with a list of 21 genre labels plus two ‘escape’ labels, the ‘Add a label’ and ‘I don’t know’ options, for a total of 23 labels. The idea is that if the subjects are not happy with the proposed labels they can add their own label or say that they do not know.
– Rosso presents 55 web pages to be classified with one of the 18 genres in his palette. He also provides an option called ‘None of the above’.

• I select 21 out of 25 web pages according to the criterion of ‘objective sources’. Ideally each of these web pages should represent a genre on the web. However, this ‘objective’ classification is never shown to the subjects. The aim is not to see if they comply with a given classification, but to see to what extent they agree with each other in assigning a single genre to these web pages. If the subjects are not happy with the list of genres, they can use the two ‘escape’ options. This means that they are free to express their own perspective on the web pages and are not influenced by my view of genre. For example, one of the subjects added a new label for all the web pages shown in the study.
– Rosso selects his web pages through the use of an algorithm and the specification of queries in a search engine. He selects at least two web pages to represent each genre.

• My experiment relies on the participation of 135 subjects.
– Rosso’s experiment relies on 257 subjects.

Table 1
Web pages and (genre) labels

Web page name    Genre name
webpage_01       ESHOP
webpage_02       PERSONAL HOME PAGE
webpage_03       EMAIL
webpage_04       ONLINE FRONT PAGE
webpage_05       SEARCH PAGE
webpage_06       SITEMAP
webpage_07       BLOG
webpage_08       ACADEMIC HOME PAGE
webpage_09       ONLINE FORM
webpage_10       ABOUT PAGE
webpage_11       CORPORATE HOME PAGE
webpage_12       FAQS
webpage_13       EZINE
webpage_14       ORGANIZATIONAL HOME PAGE
webpage_15       HOTLIST
webpage_16       CLOG
webpage_17       NO OFFICIAL GENRE (RYANAIR)
webpage_18       NO OFFICIAL GENRE (ADIRONDACK)
webpage_19       NEWSLETTER
webpage_20       HOW-TO
webpage_21       NO OFFICIAL GENRE (CITIDEX)
webpage_22       ONLINE TUTORIAL
webpage_23       NO OFFICIAL GENRE (LENS HOLDER)
webpage_24       SPLASH SCREEN
webpage_25       NET AD

3.2. Possible objections to the study design

The study is based on 25 web pages labelled by 135 web users using a single label. Several objections might come to one’s mind regarding this experimental design.

The first objection is about the number of web pages. One might argue that 25 web pages are not representative because it could be that a particular genre works better than another, or that a particular page is not as exemplary as I supposed it to be. This objection would be valid if the aim of the study was to “guess” the “right” genre for a web page, or to verify that users can recognize genres of web pages. In such cases presenting only one page per genre would surely involve a bias. Instead, the aim of this study is to observe the extent to which a large number of users spontaneously agree on the same label. The study is designed to leave space for users’ own genre vocabulary, so when they are not happy with the labels listed on the screen they can either add their own label (‘Add a label’) or say ‘I don’t know’. In brief, the aim of the study is not to find the correct genre for a web page, or to check that a particular genre is recognised, but to observe the degree of agreement in choosing a genre label for a web page.

A second objection is about the study design as a whole. In this study web users had to select a single label by clicking on a radio button. One might argue that the selection of multiple labels through checkboxes and the observation of the agreement on several labels would have supported the need for multiple genre classification better, because users might have all agreed on the same two, or three or four genre labels. This objection points to the next step of the research: once the inappropriateness of the single genre classification is fully acknowledged, then we can start thinking of finding solutions to assign multiple genre labels to the individual web page. At the current stage, my aim is to show the inadequacy of the single-genre labelling.

Undoubtedly, the multi-genre assignment seems to be the unavoidable future direction, as already pointed out by Rosso:

“Soliciting more information than just a single genre per page could conceivably increase the level of participant agreement by allowing them to choose multiple genres.” (Rosso, 2005, 125)

However, designing an experiment based on multiple genre classification is not a trivial endeavour, and many issues need to be addressed. For instance:


(1) Should users add their own genre labels (thus giving priority to users’ own vocabulary) or should they choose labels from a list (thus validating a number of pre-set genres)?

(2) If they add their own labels, how shall we face the problems of genre synonyms, like NEWS BULLETIN and PRESS RELEASE (cf. Roussinov et al., 2001), genre granularity, like inclusion of super-genres or subgenres, and genre similarity, like HOW-TOS and TUTORIALS?

(3) Should genre labels be rated, assigning a confidence score for each rating as suggested by Rosso (2005, p. 129)? If so, how are we going to assess the inter-rater agreement in these conditions?

These and other questions are indeed the focus of future research on genre classification.

3.3. Experimental choices

I made a number of experimental choices in this study. First, I included only one page per web genre to keep a reasonably short run time. I tried to include as many genre labels as possible (in order to have a broader view) that could be processed in no more than 15–30 minutes. Time limitation for a volunteering audience was the main reason for ‘offering’ only one web page per genre. Nonetheless, some of the subjects complained that the study was too long and they lost interest along the way.15

Second, the term ‘genre’ was never mentioned in the study because I did not wish to trigger any reflection or afterthought about this term. I preferred having a labelling/classification that was as spontaneous as possible. The goal of the study was not declared either. Participants were only told to select a label for each web page. The idea was to ask for a genre classification of web pages implicitly, and observe the users’ reactions. However, some subjects enquired about the goal of the study.16

Third, when the genre was mentioned in the main heading of the original web page, I deleted the genre name. I thought that it would have been too easy to make a choice in the conditions shown in Fig. 1. Therefore, users were presented with the web page as shown in the Appendix (webpage_15). The only exception to this rule was the ABOUT PAGE, which was presented untouched (see Appendix).

Fourth, I also arbitrarily decided the kind and number of genres to be included in the study. I also used some of the web genres included in this study for my experiments on automatic genre identification (see Santini, 2006a; Santini et al., 2006). One of the subjects complained that I did not include American genre names, without, however, indicating any example.17

The main limitation of this study is graphical. Web pages were presented as screenshots (exactly as they appear in the Appendix, or in the bitmap files available at <http://www.nltg.brighton.ac.uk/home/Marina.Santini/>). For this reason, certain genres suffer from the absence of context (e.g. the EMAIL genre, webpage_03, see Appendix). The lack of context was noted by the subjects, who complained straightaway.18

Fig. 1. Erasing genre name from the title (cf. webpage_15 in Appendix).

15 Subjects’ comments: “Too long, I lost interest after about 18” (see drop-outs section).
16 Subjects’ comments: “Very interesting. What’s it all about?”; “Very interesting. I’d love to know the research question behind the study!”; somebody asked explicitly “What is the aim of the study?”.
17 Subjects’ comments: “I noticed that some of the names were British. I recommend adding North American expressions alongside with them”.
18 Subjects’ comments: “You have cut headers and footers off some of the pages. These are the parts of the page which give a clear indication (often) about what kind of page/site it is”; “Hard to determine the type of page without context”.

3.4. Participant sample

Asking a web user to classify a web page by genre is not an easy task because genre identification requires a level of abstraction that is often linked to a number of different factors, such as the society in which the subjects live (in a society without computer-mediated communications, web genres would make no sense), cultural/educational level (without a certain degree of education, an academic home page cannot be identified or perceived as belonging to a (sub)genre), and the membership in a community (being a ‘blogger’ makes it easier to identify a BLOG or a CLOG). In this study, I assumed that the level of abstraction needed to perceive web genres could be found in a university environment where medium-high education, free access to computer-mediated communication and familiarity with the web are the norm. More specifically, I assumed that the subjects of the study were presumably students, researchers, teachers, or administrative staff, and had three elements in common:

(1) They had at least a secondary school diploma, or equivalent, which is usually the basic requirement to access university courses or to work at university in white-collar positions. However, many actually have higher education, such as degrees, PhDs or post-doctorates.

(2) They were used to computer-mediated communication, as much interchange at university occurs via emails and mailing lists.

(3) They were familiar with the web, for instance to carry out bibliographic research, or to provide information about themselves, about courses or exams, and so on.

Giving priority to the level of education and to Internet literacy, I left details about age, gender, profession and more personal or demographic information as optional additions, and I will not include them in the discussion of the results.

The study is based on participants who volunteered within the University of Brighton (UK), the Informatics Department of University of Sussex (UK), a Department of Dalhousie University (Canada), a Department of Syracuse University (USA), plus a small number of academics in other universities and research institutes in Europe. Potential participants were sent an email, and this email was re-distributed within their university or department.

One hundred and ninety-eight users in total began the experiment. A number of them (27 users) did not write any name or nickname and a small portion (6 users) wrote only their first name, but this was not enough to let them continue because first name and surname were set as required fields; a larger part (30 users) dropped out at different stages of the study; in the end, the total number of participants who went through the whole study and provided valid responses for the experiment amounted to 135 users.
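The participant figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers reported in this article (the drop-out breakdown is the one given in Section 3.5); the variable names are illustrative and not part of the study software.

```python
# Participant attrition, using only the figures reported in the text.
started = 198
no_name = 27           # wrote no name or nickname
first_name_only = 6    # surname missing, so they could not continue
dropped_out = 30       # abandoned the study at some later stage

completed = started - no_name - first_name_only - dropped_out

# Breakdown of the 30 drop-outs as reported in Section 3.5.
drop_out_points = {
    "welcome page (full name given)": 12,
    "first or second page": 7,
    "between second and tenth page": 7,
    "page 14": 1,
    "page 16": 1,
    "page 17": 2,
}

print(completed)                        # 135
print(sum(drop_out_points.values()))    # 30
```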

Fig. 2 shows the welcome page of the study. Only Surname and First Name were set as required fields, but participants were told that they could use pseudonyms/nicknames. These two fields were used to build unique identifiers for each of the subjects (a sequential number was added in order to avoid homonyms). Later on, all the names were eliminated and replaced with serial numbers to ensure confidentiality.
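The two-step identification scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the identifier format, the `S001`-style serial numbers and the function names are hypothetical, since the article does not document how the identifiers were actually built.

```python
def build_identifiers(names):
    """Build one unique identifier per subject from (surname, first_name).

    A sequential number is appended so that two subjects with the same
    name (or the same pseudonym) still receive distinct identifiers.
    The exact format is an assumption made for illustration.
    """
    identifiers = []
    for seq, (surname, first_name) in enumerate(names, start=1):
        key = f"{surname.strip().lower()}_{first_name.strip().lower()}_{seq:03d}"
        identifiers.append(key)
    return identifiers


def anonymise(identifiers):
    """Map every name-based identifier to a serial number (hypothetical format)."""
    return {ident: f"S{n:03d}" for n, ident in enumerate(identifiers, start=1)}


subjects = [("Smith", "Ann"), ("Smith", "Ann"), ("Rossi", "Luca")]
ids = build_identifiers(subjects)
serials = anonymise(ids)
print(ids)
print(list(serials.values()))
```

After the second step only the serial numbers need to be kept with the responses, so the stored labellings no longer carry any name.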


Fig. 2. The welcome page.


3.5. Participant task

The study was web-based. It was uploaded on to the ITRI server (University of Brighton, UK) at the end of February 2005, and kept online for one month up to the end of March 2005.

The participants’ task was straightforward. Users were asked to go through 25 web pages and label each page by selecting a radio button. There were 23 labels available for selection, i.e. 21 genre labels and the two ‘escape’ labels, ‘I don’t know’ and ‘Add a label’. Subjects appreciated the possibility of adding their own label19 and the possibility of stating they did not know how to label a web page.

Fig. 3 shows how a web page was presented to the user. The general structure includes a space for the snapshot of the web page in the upper box, 21 radio buttons with genre labels in the bottom box on the left-hand side (LHS), and two ‘escape’ buttons (‘I don’t know’ and ‘Add a label’) plus a text box for optional comments in the bottom box on the right-hand side (RHS). Most participants did not comment on the arrangement of the study. However, some suggested a different layout of the screen,20 and a few complained about it.21

As it was not possible to move to the next page without selecting a radio button, participants who completed the test assigned a label to each of the 25 web pages. No restriction was given on the selection of labels, i.e. users could use the same label for more than one web page, or never employ a label if deemed inappropriate.

Participants could add additional personal details in the welcome page (Fig. 2) if they wished to do so (some of them asked to be sent the results of the study by email), and specify optional comments on each web page. In the last page of the study, where they were thanked, they could add additional comments.

The web pages were presented to the subjects in the following random sequence: webpage_04, webpage_25, webpage_03, webpage_02, webpage_05, webpage_19, webpage_23, webpage_15, webpage_11, webpage_13, webpage_21, webpage_12, webpage_07, webpage_09, webpage_24, webpage_08, webpage_20, webpage_14, webpage_16, webpage_10, webpage_06, webpage_18, webpage_22, webpage_01, and webpage_17.

Among the drop-outs, 12 left at the welcome page, but wrote a full name; 7 left at the first or second page of the study; 7 left between the second page and the tenth page; 2 left at page 17; 1 left at page 14; and finally one left at page 16. Among the possible reasons for dropping out, some are easy to identify: lack of interest in the study, tiredness (too many web pages to be assessed), or lack of familiarity with the suggested labels. Very few users started the study, left and resumed later on. Most of the participants went through the study without any break.

Fig. 3. Example of the screen.

19 Subjects’ comment: “Nice that you can add your own categories”.
20 Subjects’ comments: “Perhaps one suggestion would be to show the options of the radio button at the very beginning”.
21 Subjects’ comments: “The way the types of webpages are listed is not very user friendly”.

4. Results

The counts discussed in the following subsections are mostly based on percentage.22 I grouped results according to three ranges of agreement on the most voted label: an agreement above 80% is considered to be high; an agreement between 50% and 80% is considered to be medium; finally, an agreement below 50% is considered to be low. I will show later whether these thresholds make any sense.

22 Recently, Rosso has suggested two interesting measures to assess users’ recognition of genre in web pages (Rosso, 2005, 109ff). The first is the users’ average agreement over the pages in which a consensus of 50% was achieved in a particular genre. The second measure provides an indication of “false hit” per genre. According to Rosso, these two measures together provide a more complete picture of the users’ recognition. However, their full interpretation and effectiveness are still under investigation. For this reason, I have not applied them yet in this study.

4.1. Single-label agreement

A view of the data of this study is offered in Table 2. This table shows the number of subjects assigning a particular label to a particular web page and the percentage of the most voted label. For example, the label ESHOP (8th row) was assigned to WP1 (first column; see footnote 23) by 119 subjects (highlighted in bold), which corresponds to 88.15% (bottom row). Four subjects thought that WP1 was a CORPORATE HOME PAGE, seven selected NET AD, one subject chose FRONT PAGE, another subject selected HOTLIST, one DID NOT KNOW, and two added a new label for it. In order to understand whether these results were due to chance, I submitted them to statistical tests of significance (Chi-square, Fisher’s Exact test, and Likelihood Ratio, as implemented in SPSS). These tests showed that the results were statistically significant (Sig. .000), i.e. the association of labels to web pages was not due to chance.
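The way the bottom row of Table 2 is derived from a column of raw counts can be illustrated with the WP1 column. The following is a minimal sketch, not the procedure actually used in the study; the counts are transcribed from Table 2 and the function name is illustrative.

```python
# Non-zero counts of the WP1 column, transcribed from Table 2.
wp1_counts = {
    "ESHOP": 119,
    "NETAD": 7,
    "CORPORATE_HP": 4,
    "ADD_LABEL": 2,
    "FRONTPAGE": 1,
    "HOTLIST": 1,
    "DONT_KNOW": 1,
}


def most_voted(counts):
    """Return (label, votes, percentage) for the most voted label.

    In case of a tie the label that comes first in the dictionary wins.
    """
    total = sum(counts.values())
    label, votes = max(counts.items(), key=lambda item: item[1])
    return label, votes, round(100 * votes / total, 2)


print(sum(wp1_counts.values()))   # 135
print(most_voted(wp1_counts))     # ('ESHOP', 119, 88.15)
```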

Using the three thresholds mentioned above (i.e. above 80%, between 80% and 50% and below 50%) I could identify three ranges of agreement in the users’ perspective. Table 3 shows the percentage of agreement by range: Top (Above 80%), Middle (Between 50% and 80%), and Bottom (Below 50%). The records in the third column, Agreement on the labels of web pages, show the name of the page, the genre label that received the highest agreement, the percentage of agreement, and the value of the pairwise agreement. For example, webpage_02 was classified as a PERSONAL HOME PAGE by 88.89% of participants, with a pairwise agreement of 0.79.

According to the percentages shown in Table 3, participants show the highest agreement, i.e. above 80%, on the genre of five web pages: webpage_02 was classified as a PERSONAL HOME PAGE by 88.89% of the participants, webpage_01 as an ESHOP by 88.15%, webpage_11 as a CORPORATE HOME PAGE by 88.15%, webpage_12 as FAQS by 83.7%, and finally webpage_05 as a SEARCH PAGE by 82.96%.

In the middle range (agreement between 50% and 80%) two cases are particularly interesting, webpage_10 and webpage_21. Webpage_10 was classified as a CORPORATE HOME PAGE by 94 web users (69.63%), while only 32 participants (23.7%) labelled it as ABOUT PAGE. One of the four pages without any official name, i.e. webpage_21 (CITIDEX), was assessed as a SEARCH PAGE by 57.8% of participants.

The bottom range contains four interesting cases, namely webpage_17, webpage_18, webpage_23, and webpage_15. Webpage_17 (RYANAIR) and webpage_18 (ADIRONDACK) were two of the web pages without any official genre. Webpage_17 was classified as ONLINE FORM by 42.22% of participants, and webpage_18 as ORGANIZATIONAL HOME PAGE by 38.52%. Although these web pages do not look like stereotypical exemplars of a web genre, a relative majority of users (a thin one, indeed) could perceive some genre conventions in them. More problematic is another of the web pages without an official genre, i.e. webpage_23, for which the largest group of users (only 26.7%) agreed in adding a new label for it. Interestingly, most of the added labels24 are consistent and all roughly indicate some kind of ‘product information page’. However, more than 10% preferred to select ‘I don’t know’. Similarly, webpage_15 shows a relative majority of 23.7% of users who preferred to add a label.25

A first conclusion can be drawn from these results. According to the labelling expressed by the 135 subjects on 25 web pages, a high agreement on a single label exists only on a restricted number of web pages, namely 20% (i.e. 5 web pages out of 25). For the rest, i.e. 80% of the web pages, the subjects show different levels of disagreement, in some cases particularly acute, as in the case of webpage_23 and webpage_15.
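The grouping of the 25 web pages into the three ranges can be reproduced from the votes of the most voted label on each page. The sketch below is illustrative only: the counts are transcribed from Table 2, and the handling of the exact boundary values (50% and 80%) is an assumption, since no page falls exactly on a threshold.

```python
from collections import Counter

N_SUBJECTS = 135

# Votes received by the most voted label of each page (transcribed from Table 2).
TOP_VOTES = {
    "webpage_01": 119, "webpage_02": 120, "webpage_03": 66, "webpage_04": 55,
    "webpage_05": 112, "webpage_06": 64, "webpage_07": 90, "webpage_08": 79,
    "webpage_09": 102, "webpage_10": 94, "webpage_11": 119, "webpage_12": 113,
    "webpage_13": 81, "webpage_14": 69, "webpage_15": 32, "webpage_16": 70,
    "webpage_17": 57, "webpage_18": 52, "webpage_19": 60, "webpage_20": 73,
    "webpage_21": 78, "webpage_22": 88, "webpage_23": 36, "webpage_24": 61,
    "webpage_25": 39,
}


def agreement_range(pct):
    """Map a percentage of agreement to the top, middle or bottom range."""
    if pct > 80:
        return "top"
    if pct >= 50:      # boundary handling is an assumption
        return "middle"
    return "bottom"


ranges = {
    page: agreement_range(100 * votes / N_SUBJECTS)
    for page, votes in TOP_VOTES.items()
}
summary = Counter(ranges.values())
print(summary["top"], summary["middle"], summary["bottom"])   # 5 10 10
```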

Fig. 4 shows the charted percentages. The average agreement is 86.37% in the top range, 61.04% in the middle range, and 38.67% in the bottom range.

23 WP1, WP2, WP3, etc. are short forms of webpage_01, webpage_02, webpage_03, etc.
24 Added labels for webpage_23: content page, information page, online product information, product catalogue, product documentation, product information, product manual, product specification page, sub page of an online store, tech specifications, technical description, technical documentation, technical information page, technical instructions, technical product description, and (normal) webpage (sic).
25 Added labels for webpage_15: academic document, catalogue, classification page, contents page, database listing, encyclopedia, expert information, index, index of links, index page, information page, itemization page, knowledge directory entry, link list/page, menu page, navigation page, online encyclopedia, online reference, online textbook, primary navigation tool, reference, select from list, (online) table of contents, and topic indices.


Table 2

Raw counts and percentages of the most voted labels

WP1 WP2 WP3 WP4 WP5 WP6 WP7 WP8 WP9 WP10 WP11 WP12 WP13 WP14 WP15 WP16 WP17 WP18 WP19 WP20 WP21 WP22 WP23 WP24 WP25

ABOUT_PAGE 0 3 20 0 3 1 25 11 0 32 0 2 2 12 6 0 1 22 4 3 3 2 28 6 3

ACADEMIC_HP 0 0 0 0 0 0 0 79 0 1 0 1 0 0 8 0 0 0 2 0 0 1 0 0 0

BLOG 0 10 6 0 0 0 90 0 0 1 0 0 0 0 0 18 0 1 1 0 0 2 1 2 0

CLOG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0 0 0 0 0 0 0

CORPORATE_HP 4 0 0 8 4 1 0 0 0 94 119 2 5 13 0 0 10 0 0 0 4 0 5 3 19

EMAIL 0 0 66 1 0 0 1 0 0 0 0 0 0 0 0 3 0 0 5 0 0 1 0 0 0

ESHOP 119 0 0 0 0 0 0 0 26 1 5 1 0 3 0 0 28 0 0 0 6 0 20 1 39

EZINE 0 0 0 11 2 0 0 0 0 0 0 0 81 0 2 2 0 1 37 0 0 0 0 0 0

FAQS 0 0 0 0 0 0 0 0 0 0 0 113 0 0 4 1 0 1 0 26 1 0 3 0 0

FRONTPAGE 1 0 0 55 3 1 0 1 1 1 1 0 15 8 2 0 2 4 3 0 1 0 0 2 2

HOTLIST 1 0 0 0 0 10 0 0 0 0 0 0 1 0 29 2 0 8 0 0 0 0 0 0 0

HOWTO 0 0 0 0 0 0 0 0 0 0 0 6 0 2 1 1 0 3 2 73 5 29 14 0 1

NETAD 7 0 0 1 1 0 0 0 0 0 4 0 0 2 0 0 1 0 0 0 3 0 3 6 39

NEWSLETTER 0 1 3 15 0 0 1 0 0 0 0 1 16 0 0 7 0 14 60 0 0 0 0 0 0

ONLINE_FORM 0 0 1 0 0 0 1 0 102 0 0 0 0 1 0 1 57 0 0 0 13 0 1 0 0

ORGANIZATIONAL_HP 0 0 0 9 4 3 0 1 0 5 2 0 7 69 0 0 0 52 0 0 4 0 0 7 5

PERSONAL_HP 0 120 1 0 0 0 0 32 0 0 0 0 1 0 3 0 0 2 0 0 0 0 0 1 0

SEARCH_PAGE 0 0 1 1 112 64 1 1 0 0 0 1 0 7 7 0 26 2 1 0 78 0 1 0 0

SITEMAP 0 0 0 0 0 48 0 2 0 0 1 0 0 4 23 0 1 3 1 0 1 0 0 1 1

SPLASHSCREEN 0 0 0 1 0 1 0 0 0 0 0 1 1 2 0 1 0 0 0 0 1 0 0 61 5

TUTORIAL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 1 30 0 88 9 0 0

ADD_LABEL 2 1 34 31 6 5 10 6 6 0 2 3 4 8 32 11 8 8 9 2 9 8 36 21 17

DONT_KNOW 1 0 3 2 0 1 6 2 0 0 1 4 2 4 7 18 1 14 9 1 6 4 14 24 4

Total 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135 135

Percentage 88.15 88.89 48.89 40.74 82.96 47.41 66.67 58.52 75.56 69.63 88.15 83.7 60 51.11 23.7 51.85 42.22 38.52 44.44 54.07 57.8 65.2 26.7 45.2 28.89


Table 3
Ranges of agreement

Range                 # of web pages   Web page name   Genre label         %        Pairwise agreem.
Top (above 80%)       5                webpage_02      PERSONAL_HP         88.89%   0.79
                                       webpage_01      ESHOP               88.15%   0.78
                                       webpage_11      CORPORATE_HP        88.15%   0.78
                                       webpage_12      FAQS                83.7%    0.70
                                       webpage_05      SEARCH_PAGE         82.96%   0.69
Middle (50% – 80%)    10               webpage_09      ONLINE_FORM         75.56%   0.61
                                       webpage_10      CORPORATE_HP        69.63%   0.54
                                       webpage_07      BLOG                66.67%   0.48
                                       webpage_22      TUTORIAL            65.2%    0.47
                                       webpage_13      EZINE               60%      0.39
                                       webpage_08      ACADEMIC_HP         58.52%   0.40
                                       webpage_21      SEARCH_PAGE         57.8%    0.35
                                       webpage_20      HOWTO               54.07%   0.38
                                       webpage_16      CLOG                51.85%   0.31
                                       webpage_14      ORGANIZ_HP          51.11%   0.29
Bottom (below 50%)    10               webpage_03      EMAIL               48.89%   0.32
                                       webpage_06      SEARCH_PAGE         47.41%   0.35
                                       webpage_24      SPLASHSCREEN        45.2%    0.26
                                       webpage_19      NEWSLETTER          44.44%   0.28
                                       webpage_17      ONLINE_FORM         42.22%   0.26
                                       webpage_04      FRONTPAGE           40.74%   0.24
                                       webpage_18      ORGANIZ_HP          38.52%   0.20
                                       webpage_25      NETAD/ESHOP         28.89%   0.20
                                       webpage_23      ADD_LABEL           26.7%    0.16
                                       webpage_15      ADD_LABEL           23.7%    0.14


One problem with percentage analysis is that it fails to take account of agreement for categories other than the most popular, so a 60–10–10–10–10 split among five categories can look better than a 50–50 split among two categories, even though the votes in the latter are less scattered. For this reason it is customary to look at pairwise agreement, i.e. the percentage of agreeing judgment pairs. For example, with 100 judgments (4950 pairs in total) the 60–10–10–10–10 split has 60 * 59/2 + 4 * 10 * 9/2 = 1950 agreeing pairs (39.4%), whereas the 50–50 split has 2 * 50 * 49/2 = 2450 agreeing pairs (49.5%), so the latter has higher agreement overall.

The pairwise agreement on the individual web pages is shown in the column ‘Pairwise Agreem.’ in Table 3,and it matches the percentages except for a slight difference in very few cases.
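The pairwise agreement can be computed directly from the vote counts of a page. The following is a minimal sketch of the measure as described above (agreeing judgment pairs divided by all judgment pairs); the function name is illustrative, and the WP2 counts are transcribed from Table 2.

```python
def pairwise_agreement(counts):
    """Proportion of judgment pairs that agree on the same label.

    counts: number of votes received by each label for one web page.
    """
    n = sum(counts)
    agreeing_pairs = sum(c * (c - 1) // 2 for c in counts)
    all_pairs = n * (n - 1) // 2
    return agreeing_pairs / all_pairs


# The two splits used in the worked example (100 judgments each).
print(round(100 * pairwise_agreement([60, 10, 10, 10, 10]), 1))   # 39.4
print(round(100 * pairwise_agreement([50, 50]), 1))               # 49.5

# webpage_02 in Table 2: PERSONAL_HP 120, BLOG 10, ABOUT_PAGE 3,
# NEWSLETTER 1, ADD_LABEL 1 (135 votes in total).
print(round(pairwise_agreement([120, 10, 3, 1, 1]), 2))           # 0.79
```

A label that receives a single vote contributes no agreeing pair, which is why scattered votes lower the measure even when the most voted label keeps the same percentage.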

In conclusion, the agreement expressed by the 135 users appears to be stable on a single label only in the top range.

4.2. Beyond the single label

In the previous section, I showed the agreement on the most voted labels as expressed by 135 web users on 25 web pages. According to the subjects’ labelling, a single-label agreement is reasonably stable on five web pages. In most cases (20 web pages out of 25), the agreement on a single label is moderate (middle range) or low (bottom range). In this section, I show another view of the data based on the distribution of agreement across several labels. It is worth saying that participants in this user study explicitly complained about the limitation of the labelling task to a single label.26 However, the very aim of the study is to investigate the effects of this restriction.

26 Some comments from participants: “A page can be a search page and almost anything else, right?”; “Topologies based on single descriptors are grossly inadequate for interactive website”; “Some pages could belong to multiple types”; “Many of the pages have characteristics of more than one of the types listed”; “I don’t think, the types you listed are mutually exclusive. For example an online shop page can be an online form. That made it rather difficult to answer anything other than Don’t Know”.


Fig. 4. Percentage of agreement per page.


Table 4 shows a breakdown of the raw counts and percentages of the five web pages where users reached an agreement of above 80%. The main aggregation of agreement is around the first label (1st). However, three web pages (webpage_02, webpage_01, and webpage_05) also show that some users preferred other labels (2nd). In the representation of the agreement on different labels I applied a threshold in order not to clutter the table with too much data. More specifically, I included in the table only labels that received at least six votes, that is 4.4% of agreement. A threshold of 4.4% approximately corresponds to the chance of randomly assigning one of the 23 labels to a web page [(1:23) * 100]. For this reason, I did not include a second label for webpage_11 and webpage_12, since they did not reach six votes (cf. Table 2).
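The six-vote threshold and its relation to the chance level can be sketched as follows. This is an illustration rather than the procedure actually used; the counts are transcribed from the WP2 and WP11 columns of Table 2, and the function name is hypothetical.

```python
N_SUBJECTS = 135
N_LABELS = 23      # 21 genre labels plus the two 'escape' labels
MIN_VOTES = 6

chance_level_pct = 100 / N_LABELS            # about 4.35%
threshold_pct = 100 * MIN_VOTES / N_SUBJECTS  # about 4.44%


def labels_above_threshold(counts, min_votes=MIN_VOTES):
    """Keep the labels with at least min_votes votes, most voted first."""
    kept = [(label, votes) for label, votes in counts.items() if votes >= min_votes]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Non-zero counts transcribed from Table 2.
wp2 = {"PERSONAL_HP": 120, "BLOG": 10, "ABOUT_PAGE": 3, "NEWSLETTER": 1, "ADD_LABEL": 1}
wp11 = {"CORPORATE_HP": 119, "ESHOP": 5, "NETAD": 4, "ORGANIZATIONAL_HP": 2,
        "ADD_LABEL": 2, "FRONTPAGE": 1, "SITEMAP": 1, "DONT_KNOW": 1}

print(round(chance_level_pct, 2), round(threshold_pct, 2))   # 4.35 4.44
print(labels_above_threshold(wp2))    # [('PERSONAL_HP', 120), ('BLOG', 10)]
print(labels_above_threshold(wp11))   # [('CORPORATE_HP', 119)]
```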

The situation appears much more fragmented in Tables 5 and 6. Each web page in these tables has been assigned to at least three labels. The main difference between Tables 5 and 6 is that in Table 5 the gap between the first label and the next label is wider than in Table 6. The average gap between the 1st label and the 2nd label in Table 5 is about 43%, while it decreases to about 15% in Table 6. In other words, in Table 6 the agreement is more evenly spread over several labels.
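The average gap between the first and the second label can be recomputed from the raw counts. In the sketch below the vote pairs are transcribed from Table 2 (first and second most voted label of each page in the middle and bottom ranges); the variable and function names are illustrative. Under these counts the averages come out at roughly 43.7 and 15.0 percentage points.

```python
N_SUBJECTS = 135

# (votes for 1st label, votes for 2nd label), transcribed from Table 2.
MIDDLE_RANGE = [(102, 26), (94, 32), (90, 25), (88, 29), (81, 16),
                (79, 32), (78, 13), (73, 30), (70, 18), (69, 13)]
BOTTOM_RANGE = [(66, 34), (64, 48), (61, 24), (60, 37), (57, 28),
                (55, 31), (52, 22), (39, 39), (36, 28), (32, 29)]


def mean_gap_pct(pairs, n_subjects=N_SUBJECTS):
    """Average difference, in percentage points, between 1st and 2nd label."""
    gaps = [first - second for first, second in pairs]
    return 100 * sum(gaps) / (len(gaps) * n_subjects)


middle_gap = mean_gap_pct(MIDDLE_RANGE)
bottom_gap = mean_gap_pct(BOTTOM_RANGE)
print(round(middle_gap, 1), round(bottom_gap, 1))
```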

Table 4
Top range agreement (agreement above 80%)

Web page     Rank   Label          Votes   %
webpage_02   1st    PERSONAL_HP    120     88.89
             2nd    BLOG            10      7.41
webpage_01   1st    ESHOP          119     88.15
             2nd    NETAD            7      5.19
webpage_11   1st    CORPORATE_HP   119     88.15
webpage_12   1st    FAQS           113     83.70
webpage_05   1st    SEARCH_PAGE    112     82.96
             2nd    ADD_LABEL        6      4.44


Table 5
Middle range agreement (agreement between 50% and 80%)

Web page     Rank   Label               Votes   %
webpage_09   1st    ONLINE_FORM         102     75.56
             2nd    ESHOP                26     19.26
             3rd    ADD_LABEL             6      4.44
webpage_10   1st    CORPORATE_HP         94     69.63
             2nd    ABOUT_PAGE           32     23.70
             3rd    ORGANIZATIONAL_HP     5      3.70
webpage_07   1st    BLOG                 90     66.67
             2nd    ABOUT_PAGE           25     18.52
             3rd    ADD_LABEL            10      7.41
             4th    DONT_KNOW             6      4.44
webpage_22   1st    TUTORIAL             88     65.19
             2nd    HOWTO                29     21.48
             3rd    ADD_LABEL             8      5.93
webpage_13   1st    EZINE                81     60.00
             2nd    NEWSLETTER           16     11.85
             3rd    FRONTPAGE            15     11.11
             4th    ORGANIZATIONAL_HP     7      5.19
webpage_08   1st    ACADEMIC_HP          79     58.52
             2nd    PERSONAL_HP          32     23.70
             3rd    ABOUT_PAGE           11      8.15
             4th    ADD_LABEL             6      4.44
webpage_21   1st    SEARCH_PAGE          78     57.78
             2nd    ONLINE_FORM          13      9.63
             3rd    ADD_LABEL             9      6.67
             4th    DONT_KNOW             6      4.44
             5th    ESHOP                 6      4.44
webpage_20   1st    HOWTO                73     54.07
             2nd    TUTORIAL             30     22.22
             3rd    FAQS                 26     19.26
webpage_16   1st    CLOG                 70     51.85
             2nd    BLOG                 18     13.33
             3rd    DONT_KNOW            18     13.33
             4th    ADD_LABEL            11      8.15
             5th    NEWSLETTER            7      5.19
webpage_14   1st    ORGANIZATIONAL_HP    69     51.11
             2nd    CORPORATE_HP         13      9.63
             3rd    ABOUT_PAGE           12      8.89
             4th    FRONTPAGE             8      5.93
             5th    ADD_LABEL             8      5.93
             6th    SEARCH_PAGE           7      5.19


In Table 6, two cases are particularly fragmented, i.e. webpage_23 and webpage_15. These two web pages were very difficult to classify according to the participants’ behaviour. Most users preferred to add their own label, a fair number of users declared that they did not know the most suitable label for these two web pages, and the rest of the votes are spread over a large number of labels. By adding up the percentages of the two ‘escape’ options, I get a percentage of 37.04% for webpage_23 and of 28.89% for webpage_15.
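The combined share of the two ‘escape’ options follows directly from the raw counts in Table 2. A minimal sketch, with an illustrative function name:

```python
N_SUBJECTS = 135


def escape_share(add_label_votes, dont_know_votes, n_subjects=N_SUBJECTS):
    """Percentage of subjects who chose 'Add a label' or 'I don't know'."""
    return round(100 * (add_label_votes + dont_know_votes) / n_subjects, 2)


# ADD_LABEL and DONT_KNOW counts transcribed from Table 2.
print(escape_share(36, 14))   # webpage_23: 37.04
print(escape_share(32, 7))    # webpage_15: 28.89
```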

In conclusion, according to the labelling of 135 subjects, all the 25 web pages except two (namely, webpage_11 and webpage_12) have received more than one label. The aggregation of agreement around a single label is stable only for the five pages of the top range. The rest of the web pages show aggregation of agreement around at least three labels.

Table 6
Bottom range agreement (agreement below 50%)

Web page     Rank   Label               Votes   %
webpage_03   1st    EMAIL                66     48.89
             2nd    ADD_LABEL            34     25.19
             3rd    ABOUT_PAGE           20     14.81
             4th    BLOG                  6      4.44
webpage_06   1st    SEARCH_PAGE          64     47.41
             2nd    SITEMAP              48     35.56
             3rd    HOTLIST              10      7.41
webpage_24   1st    SPLASHSCREEN         61     45.19
             2nd    DONT_KNOW            24     17.78
             3rd    ADD_LABEL            21     15.56
             4th    NETAD                 6      4.44
webpage_19   1st    NEWSLETTER           60     44.44
             2nd    EZINE                37     27.41
             3rd    ADD_LABEL             9      6.67
             4th    DONT_KNOW             9      6.67
webpage_17   1st    ONLINE_FORM          57     42.22
             2nd    ESHOP                28     20.74
             3rd    SEARCH_PAGE          26     19.26
             4th    CORPORATE_HP         10      7.41
             5th    ADD_LABEL             8      5.93
webpage_04   1st    FRONTPAGE            55     40.74
             2nd    ADD_LABEL            31     22.96
             3rd    NEWSLETTER           15     11.11
             4th    EZINE                11      8.15
             5th    ORGANIZATIONAL_HP     9      6.67
             6th    CORPORATE_HP          8      5.93
webpage_18   1st    ORGANIZATIONAL_HP    52     38.52
             2nd    ABOUT_PAGE           22     16.30
             3rd    NEWSLETTER           14     10.37
             4th    DONT_KNOW            14     10.37
             5th    HOTLIST               8      5.93
             6th    ADD_LABEL             8      5.93
webpage_25   1st    NETAD                39     28.89
             2nd    ESHOP                39     28.89
             3rd    CORPORATE_HP         19     14.07
             4th    ADD_LABEL            17     12.59
webpage_23   1st    ADD_LABEL            36     26.67
             2nd    ABOUT_PAGE           28     20.74
             3rd    ESHOP                20     14.81
             4th    HOWTO                14     10.37
             5th    DONT_KNOW            14     10.37
             6th    TUTORIAL              9      6.67
webpage_15   1st    ADD_LABEL            32     23.70
             2nd    HOTLIST              29     21.48
             3rd    SITEMAP              23     17.04
             4th    TUTORIAL             11      8.15
             5th    ACADEMIC_HP           8      5.93
             6th    SEARCH_PAGE           7      5.19
             7th    DONT_KNOW             7      5.19
             8th    ABOUT_PAGE            6      4.44

4.3. Discussion

From the data shown in the previous subsections, it appears that only a small number of web pages would unproblematically fit into a single genre, namely the five web pages shown in Table 4. Most of the web pages included in the study would benefit from a multi-label classification, because users perceive them from different angles.

I included in the study both ‘objective’ exemplars (21 web pages) and web pages without any official genre (four web pages). From the results, it appears that most of these web pages were assigned to at least three labels. This means that a multi-genre classification would be not only beneficial, but also more realistic. When the fragmentation across many labels is very large (say more than 4 or 5 labels), it would probably be more appropriate to assign a zero-genre label, thus signalling the uncertainty of the genre assignment.

The thresholds and the percentage ranges (above 80%, between 50% and 80%, and below 50%) are informative. Web pages with an agreement of above 80% aggregate most votes on at most two labels (Table 4). Web pages with an agreement between 50% and 80% aggregate votes on at least three labels, but the gap between the first label and the other labels is quite wide (Table 5). Web pages with an agreement below 50% aggregate votes on at least three labels, but the distribution of votes is spread much more evenly across the labels (Table 6).
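One way of turning these observations into a rule of thumb is sketched below. This is only an illustration and not a scheme proposed or tested in this study: the cut-off of five labels, the order of the checks, the decision to count the two ‘escape’ options as labels, and the function name are all assumptions. The counts are transcribed from Table 2.

```python
def suggest_scheme(counts, n_subjects=135, min_votes=6, high=80.0, max_labels=5):
    """Suggest 'zero', 'single' or 'multi' genre labelling for one page.

    counts: votes per label (the two 'escape' options count as labels here).
    All thresholds are illustrative assumptions.
    """
    supported = [votes for votes in counts.values() if votes >= min_votes]
    top_pct = 100 * max(counts.values()) / n_subjects
    if len(supported) > max_labels:
        return "zero"      # votes too fragmented to name any genre
    if top_pct > high:
        return "single"    # one label clearly dominates
    return "multi"         # a few labels share the votes


wp2 = {"PERSONAL_HP": 120, "BLOG": 10, "ABOUT_PAGE": 3, "NEWSLETTER": 1, "ADD_LABEL": 1}
wp9 = {"ONLINE_FORM": 102, "ESHOP": 26, "ADD_LABEL": 6, "FRONTPAGE": 1}
wp15 = {"ADD_LABEL": 32, "HOTLIST": 29, "SITEMAP": 23, "TUTORIAL": 11,
        "ACADEMIC_HP": 8, "SEARCH_PAGE": 7, "DONT_KNOW": 7, "ABOUT_PAGE": 6,
        "FAQS": 4, "PERSONAL_HP": 3, "EZINE": 2, "FRONTPAGE": 2, "HOWTO": 1}

print(suggest_scheme(wp2), suggest_scheme(wp9), suggest_scheme(wp15))
# single multi zero
```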

In conclusion, I interpret the large disagreement on a single genre assignment as a motivation for a more flexible classification scheme, i.e. a scheme that goes beyond the single genre classification.

It is worth noting that this interpretation of the results is only one of the possible views of the data collected in this study. I suggested a preliminary interpretation in terms of genre evolution in Santini (2006b). In future, these data can be exploited to produce further findings. For example, they would be useful in investigating the following issues:

• web genre granularity: to what extent can participants distinguish among different types of HOME PAGES, such as personal, academic, corporate, or organizational?
• web genre exposure: to what extent are participants familiar with recent labels, such as CLOG or EZINE? Are they able to assign them to the appropriate web pages? Or are these labels still very opaque to them?
• web genre clairvoyance: are participants able to suggest labels for complex or hybrid web pages?
• unpopularity of web genre label: why are SPLASH SCREEN and HOT LIST such unpopular choices? Have they been replaced by other genre labels?
• relations among genres: what is the relation among genres? Tables 5 and 6 help identify genres that users see as similar or related. For example, labels such as HOW-TO and TUTORIAL are often used interchangeably.

5. Conclusions

The goal of the study presented in this article was to investigate the extent to which the classification of a web page by a single genre matches the users’ perspective.

Since web pages are a complex type of document – encompassing several texts not necessarily related to one another – my hypothesis was that users tend to focus on the type of text, or on the textual function, they are more interested in when they classify a web page by a single genre. This results in the creation of different genre perspectives on the same web page. In order to test this hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects), asking them to assign only a single genre label (choosing among 21 genre labels and two ‘escape’ labels) to each of the web pages. According to the labelling expressed by the 135 subjects who took part in this study:

• five out of 25 web pages received an agreement on a single genre label above 80%;
• 20 out of 25 web pages were labelled using at least three genre labels;
• when genre conventions are particularly weak or unclear, web users tended to disagree more, and this resulted in the use of a larger number of genre labels.


These findings suggest that:

• multi-genre labelling is likely to better represent the users’ perspective, which appears diversified because a web page can be seen from different angles;
• a zero-genre label may be useful to signal the uncertainty (represented by the selection of the two ‘escape’ options) or extreme fragmentation of genre assignment.

These findings show that there is a need for a more flexible genre classification scheme that goes beyond the single-genre classification. I propose using a zero-to-multi-genre classification scheme that could help account for the current situation of genres on the web, where it is often difficult to fit a web page into a single genre. Following this scheme, a zero-genre label is assigned when a web page appears very individualized, with unclear genre conventions, and the agreement of web users appears sparse or spread over a large number of genre labels; a single-genre label is assigned when a web page complies with the conventions of a single genre; finally, multiple genre labels are assigned when a web page is hybrid and composite.
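The scheme can be illustrated as a simple decision rule over the votes collected for a page. This is only a sketch under stated assumptions: the article does not define numeric cut-offs for the scheme, so the 80% threshold for a single genre, the 10% minimum share for retaining a label, and the limit of four retained labels (loosely inspired by the "more than 4 or 5 labels" remark in the discussion) are all hypothetical, as are the function and constant names.

```python
from typing import Dict, List

ESCAPE_OPTIONS = {"ADD_LABEL", "DON'T KNOW"}  # the two 'escape' options

# Hypothetical cut-offs, NOT taken from the article.
SINGLE_THRESHOLD = 0.80  # share required for a single-genre assignment
MIN_SHARE = 0.10         # minimum share for a label to be retained
MAX_LABELS = 4           # more retained labels than this counts as fragmentation


def zero_to_multi(votes: Dict[str, int], n_subjects: int = 135) -> List[str]:
    """Return zero, one, or several genre labels for a web page."""
    shares = {label: count / n_subjects for label, count in votes.items()}
    if not shares:
        return []
    top_label = max(shares, key=lambda label: shares[label])

    # Zero-genre: the most frequent choice is an 'escape' option.
    if top_label in ESCAPE_OPTIONS:
        return []
    # Single-genre: one label clearly dominates.
    if shares[top_label] >= SINGLE_THRESHOLD:
        return [top_label]
    # Otherwise keep every genre label with a sufficient share, most voted first.
    retained = [
        label
        for label, share in sorted(shares.items(), key=lambda kv: -kv[1])
        if share >= MIN_SHARE and label not in ESCAPE_OPTIONS
    ]
    # Zero-genre again if the votes are too fragmented.
    if len(retained) > MAX_LABELS:
        return []
    return retained


# Vote counts listed for webpage_25 and webpage_23 in the study.
webpage_25 = {"NET AD": 39, "ESHOP": 39, "CORPOR_HP": 19, "ADD_LABEL": 17}
webpage_23 = {"ADD_LABEL": 36, "ABOUT_P": 28, "ESHOP": 20,
              "HOW_TO": 14, "DON'T KNOW": 14, "TUTORIAL": 9}

print(zero_to_multi(webpage_25))  # multi-genre under these cut-offs
print(zero_to_multi(webpage_23))  # zero-genre: an escape option leads
```

With different cut-offs the same pages could fall into a different category, which is why the thresholds would need to be settled empirically before the rule is used for annotation.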

This interpretation of the findings complies with the characterization of the genre of web pages that I suggested in the Introduction. This characterization includes the two attributes of genre hybridism and individualization, where the first attribute accounts for multi-genre classification, and the latter for zero-genre classification.

A viable future experiment would be the investigation of the extent to which web users agree on the same set of genre labels for a web page. This could be done by allowing multiple classification. However, as mentioned earlier, doing multiple classification implies addressing a number of thorny issues. Discussion on these issues would provide new valuable insights useful for genre classification practices.

Once these and similar issues have been settled, we could imagine a future scenario where the creation of genre-annotated corpora is carried out by the web users themselves, maybe within a net of Internet social networks.27 This would also allow for genre investigations through the lens of social network analysis, an approach pioneered by Paolillo, Warren, and Kunz (2007). It is worth stressing that the creation of reliable and shareable genre-annotated corpora is fundamental for a number of tasks, such as language studies, genre theory, genre analyses, automatic genre classification, and evaluation of web applications.

In conclusion, genre is an important concept that shapes communication and social interaction. Hopefully future research will head towards a more fine-grained genre classification scheme, overriding the oversimplifying assumption that a document can be assigned to only one genre.

Acknowledgements

I gratefully thank the reviewers of this article for their useful comments. Some of their suggestions could not be implemented at this stage of research. They will be the object of future investigations. The responsibility for the remaining flaws is mine alone.

Appendix

This appendix contains the source URLs (see Table A1) and screenshots of the 25 web pages utilized in the user study described in the article. The screenshots are also available as .jpg files at <http://www.nltg.brighton.ac.uk/home/Marina.Santini/>.

27 Similar to what happens for folksonomic tagging on Flickr and del.icio.us (see http://en.wikipedia.org/wiki/Folksonomy).


Table A1
Source URLs of the 25 web pages

Web page name    URL

webpage_01    http://shop.panasonic.co.uk/
webpage_02    http://www.satansbarber.co.uk/
webpage_03    http://torvald.aksis.uib.no/corpora/2004-3/0239.html
webpage_04    http://www.nytimes.com/
webpage_05    http://www.dogpile.com/
webpage_06    http://www.thebritishmuseum.ac.uk/sitemap/sitemap.html
webpage_07    http://journals.aol.com/brucer5150/AGimpsLife/
webpage_08    http://www.cs.brown.edu/people/ec/
webpage_09    https://www2.bookryanair.com/thepaymentpage
webpage_10    http://www.infogistics.com/about.html
webpage_11    http://www.intel.com/index.htm?iid=Homepage+Header_UShome&
webpage_12    http://www.pharmaceuticalsaleshelp.com/faq.php
webpage_13    http://www.splendidezine.com/
webpage_14    http://kycares.ky.gov
webpage_15    http://www.fi.edu/tfi/hotlists/insects.html
webpage_16    http://weblog.flashline.com/weblogs/
webpage_17    http://www2.bookryanair.com/skylights/cgi-bin/skylights.cgi?language=EN (dynamic web page)
webpage_18    SPIRIT Collection. This page is also available at: http://web.archive.org/web/20010426092202/http://faculty.plattsburgh.edu/nancy.allen/aok.htm
webpage_19    http://www.freepint.com/issues/100205.txt
webpage_20    http://wt.xpilot.org/publications/linux/howtos/cd-writing/html/
webpage_21    SPIRIT Collection. This page is also available at: http://web.archive.org/web/20020725160529/http://www.citidex.net/896.htm
webpage_22    http://www.intap.net/~drw/cpp/
webpage_23    SPIRIT Collection. This page is also available at: http://web.archive.org/web/20020812134857/http://www.oceanoptics.com/products/ach.asp
webpage_24    http://www.lotekk.net/index.php?page=moz&sub=splash
webpage_25    http://www.bcchyundai.co.uk/


References

All URLs cited in this list of references were active in May 2007.

Aires, R., Santos, D., & Aluísio, S. (2005). “Yes, user!”: compiling a corpus according to what the user wants. In Proceedings of corpus linguistics 2005, Birmingham, 14–17 July 2005. http://www.corpus.bham.ac.uk/PCLC/finalAiresetal_cl2005.doc.

Beghtol, C. (2001). The concept of genre and its characteristics. Bulletin of The American Society for Information Science and Technology, 27(2). http://www.asis.org/Bulletin/Dec-01/beghtol.html.

Crowston, K., & Kwasnik, B. (2004). A framework for creating a facetted classification for genres: addressing issues of multidimensionality. In Proceedings of the 37th Hawaii international conference on system science (HICSS-37). http://csdl2.computer.org/comp/proceedings/hicss/2004/2056/04/205640100a.pdf.

Crowston, K., & Williams, M. (1999). The effects of linking on genres of web documents. In Proceedings of the 32nd Hawaii international conference on system sciences (HICSS-32). http://csdl2.computer.org/comp/proceedings/hicss/1999/0001/02/00012006.PDF.

Crowston, K., & Williams, M. (2000). Reproduced and emergent genres of communication on the World-Wide Web. The Information Society, 16(3), 201–216.

Dewe, J., Karlgren, J., & Bretan, I. (1998). Assembling a balanced corpus from the Internet. In Proceedings of the 11th Nordic conference of computational linguistics, Copenhagen. http://www.sics.se/diglib/DropJaw/korpus.html.

Dillon, A. (2000). Spatial semantics and individual differences in the perception of shape in information space. Journal of the American Society for Information Science, 51(6), 521–528.

Dillon, A., & Vaughan, M. (1997). It’s the journey and the destination: Shape and the emergent property of genre in digital documents. New Review of Multimedia and Hypermedia, 3, 91–106.

Erickson, T. (1999). Rhyme and punishment: the creation and enforcement of conventions in an on-line participatory limerick genre. In Proceedings of the 32nd Hawaii international conference on system sciences (HICSS-32). http://www.pliant.org/personal/Tom_Erickson/limerick.html.

Furuta, R., & Marshall, C. (1996). Genre as reflection of technology in the World-Wide Web. In Proceedings of the international workshop on hypermedia design (IWHD 95) (pp. 182–195). Heidelberg–New York–London: Springer-Verlag.


Haas, S., & Grams, E. (1998). Page and link classifications: connecting diverse resources. In Proceedings of digital libraries ’98 – Third ACM conference on digital libraries (pp. 99–107).

Haas, S., & Grams, E. (2000). Readers, authors, and page structure: a discussion of four questions arising from a content analysis of web pages. Journal of the American Society for Information Science, 51(2), 181–192.

Held, G. (2005). Magazine covers – a multimodal pretext-genre authors. Folia Linguistica, 39(1–2). http://www.atypon-link.com/WDG/doi/abs/10.1515/flin.2005.39.1-2.173.

Ihlstrom, C., & Henfridsson, O. (2005). Online newspapers in Scandinavia: a longitudinal study of genre change and interdependency. Information Technology & People, 18(2), 172–192.

Johnsen, L. (2000). Rhetorical clustering and perceptual cohesion in technical (Online) documentation. In A. Trosborg (Ed.), Analysing professional genres (pp. 193–206). Amsterdam–Philadelphia: J. Benjamins.

Joho, H., & Sanderson, M. (2004). The SPIRIT collection: an overview of a large web collection. SIGIR Forum, 38(2).

Jucker, A. (2002). Textuality and typology of hypertext. In A. Fisher, G. Tottie, & H. M. Lehmann (Eds.), Text types and corpora (pp. 29–51). Tubingen: Gunter Narr Verlag.

Karlgren, J. (2004). The whereas and whyfores for studying textual genre computationally. In Papers from the AAAI fall symposium (style and meaning in language, art, music, and design). Arlington, VA.

Kessler, B., Numberg, G., & Shutze, H. (1997). Automatic detection of text genre. In Proceedings of the 35th annual meeting of the association for computational linguistics and 8th conference of the European chapter of the association for computational linguistics. http://www.cs.mu.oz.au/acl/P/P97/P97-1005.pdf.

Lancaster, J. (1986). Vocabulary control for information retrieval (2nd ed.). Arlington, VA: Information Resources Press.

Lee, D. (2001). Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC Jungle. Language Learning & Technology, 5(3), 37–72.

Lim, C. S., Lee, K. J., & Kim, G. C. (2005). Automatic genre detection of web documents. In K. Su, J. Tsujii, J. Lee, & O. Y. Kwong (Eds.), Natural language processing – IJCNLP 2004 (pp. 310–319). Berlin: Springer.

Mehler, A., & Gleim, R. (2006). The net for the graphs: towards webgenre representation for corpus. In M. Baroni & S. Bernardini (Eds.), WaCky! Working papers on the web as corpus, GEDIT (pp. 191–334). Bologna.

Meyer zu Eissen, S., & Stein, B. (2004). Genre classification of web pages: user study and feasibility analysis. In S. Biundo, T. Fruhwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence (pp. 256–269). Berlin–Heidelberg–New York: Springer.

NISO (2005). Guidelines for the construction, format, and management of monolingual controlled vocabularies. http://www.niso.org/standards/resources/Z39-19-2005.pdf.

Østerlund, C. (2006). Combining genres: how practice matters. In Proceedings of the 39th annual Hawaii international conference on system sciences (HICSS-39). http://ieeexplore.ieee.org/iel5/10548/33363/01579388.pdf?arnumber=1579388.

Ounis, I., Rijke, (de) M., Macdonald, C., Mishne, G., & Soboroff, I. (2006). Overview of the TREC-2006 blog track. TREC 2006 Working Notes, NIST (pp. 15–27).

Paolillo, J., Warren, J., & Kunz, B. (2007). Social network and genre emergence in amateur flash multimedia. In Proceedings of the 40th Hawaii international conference on system sciences. http://csdl2.computer.org/comp/proceedings/hicss/2007/2755/00/27550070b.pdf.

Rehm, G. (2005). Language-independent text parsing of arbitrary HTML-documents. Towards a foundation for web genre identification, LDV forum. Corpus Linguistics, 20(2), 53–74 (special issue).

Rosso, M. (2005). Using genre to improve web search. Thesis submitted for the degree of Doctor of Philosophy, University of North Carolina, Chapel Hill, USA. http://ils.unc.edu/~rossm/Rosso_dissertation.pdf.

Roussinov, D., Crowston, K., Nilan, M., Kwasnik, B., Cai, J., & Liu, X. (2001). Genre based navigation on the web. In Proceedings of the 34th Hawaii international conference on system sciences (HICSS-34). http://csdl2.computer.org/comp/proceedings/hicss/2001/0981/04/09814013.pdf.

Santini, M. (2005). Genres in formation? An exploratory study of web pages using cluster analysis. In Proceedings of the 8th annual colloquium for the UK special interest group for computational linguistics (CLUK 2005). http://www.nltg.brighton.ac.uk/home/Marina.Santini/ITRI-05-01.pdf.

Santini, M. (2006a). Common criteria for genre classification: annotation and granularity. In Proceedings of the workshop on text-based information retrieval (TIR-06) (held in conjunction with ECAI 2006). http://www.uni-weimar.de/medien/webis/research/tir/tir-06/proceedings/santini06-genre-annotation-granularity.pdf.

Santini, M. (2006b). Interpreting genre evolution on the Web. In Proceedings of the Workshop on NEW TEXT. Wikis and blogs and other dynamic text sources (held in conjunction with EACL 2006). http://www.sics.se/jussi/newtext/working_notes/06_santini.pdf.

Santini, M., Power, R., & Evans, R. (2006). Implementing a characterization of genre for automatic genre identification of web pages. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (ACL/COLING 2006). http://acl.ldc.upenn.edu/P/P06/P06-2090.pdf.

Shepherd, M., & Watters, C. (1999). The functionality attribute of cybergenres. In Proceedings of the 32nd Hawaii international conference on system sciences (HICSS-32). http://csdl2.computer.org/comp/proceedings/hicss/1999/0001/02/00012007.PDF.

Stam, R. (2000). Film theory. Oxford: Blackwell.

Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press.

Toms, E., & Campbell, D. (1999). Genre as interface metaphor: exploiting form and function in digital environments. In Proceedings of the 32nd annual Hawaii international conference on systems sciences (HICSS-32). http://csdl2.computer.org/comp/proceedings/hicss/1999/0001/02/00012008.PDF.


Tyrvainen, P., & Paivarinta, T. (1999). On rethinking organizational document genres for electronic document management. In Proceedings of the 32nd Hawaii international conference on system sciences (HICSS-32). http://csdl2.computer.org/comp/proceedings/hicss/1999/0001/02/00012011.PDF.

Waller, R. (1987). The typographic contribution to language. Thesis submitted for the degree of Doctor of Philosophy, University of Reading (UK). http://www.robwaller.org/RobWaller_thesis87.pdf.

Watters, C., & Shepherd, M. (1997). The digital broadsheet: an evolving genre. In Proceedings of the 30th annual Hawaii international conference on system sciences (HICSS-30). http://csdl2.computer.org/comp/proceedings/hicss/1997/7734/06/7734060022.pdf.

Yates, S., & Sumner, T. (1997). Digital genres and the new burden of fixity. In Proceedings of the 30th Hawaii international conference on system sciences (HICSS-30). http://www.cs.colorado.edu/~sumner/articles/trs97-HICSS.pdf.

Yates, J., & Orlikowski, W. (1992). Genres of organizational communication: a structural approach to studying communications and media. Academy of Management Review, 17(2), 229–326.