

1

An Overview of Studies on Automatic Genre Identification

Marina Santini

University of Brighton

UK

From Biberian Text Types to Genres of Web Pages

Université de Toulouse-Le Mirail, Maison de la recherche

Toulouse, 5-6 October 2006

<http://w3.univ-tlse2.fr/erss/textes/seminaires/sc2006/sc2006.html>

GENRE TEXTUEL/DOMAINE/ACTIVITÉ

Study days organised by the «Sémantique et Corpus» research group

Overview of the Talk

Elusiveness of the concept of genre

Genre & neighbouring terms

Corpus-based approach & automatically-extractable features

Automatic text type identification

Automatic genre classification

- Electronic corpora

- Collections of web pages

2

Elusiveness of the concept of genre

A codification of discursive properties (Todorov, 1978);

A social action (Miller, 1984);

A persuasive classifying statement (Rosmarin, 1985)

A pattern or a recurring type of text (Erickson, 1999)

An interface metaphor (Toms and Campbell, 1999);

A typified communicative action (Yates & Orlikowski, 1992).

and so on…

Slippery but Intuitive

Academic papers

Fables

Editorials

Sonnets

Novels

Interviews

Letters

Recipes

Patient information leaflets

Reviews

and so forth…

3

Genre & neighbouring terms

Genre

Register

Text types

Domain

Topic

Folksonomies

[…]

Text Types, Style, Genre: Overlap

Klavans and Kan (1998)

Johannesson & Wallström (1999)

Karlgren (2000)

Stamatatos et al. (2001)

Dewdney et al. (2001)

Rehm (2006)

etc.

4

Text Types, Genre, Domain & Topic:

Some Distinctions

Text Types:

- Biberian text types: intimate interpersonal interaction, informational interaction, scientific exposition, etc.

- Rhetorical text types: narration, instruction, argumentation, etc.

Genre:

- Text categories, e.g. news story, academic paper, interview, etc.

Domain:

- Subject fields, e.g. religion, hobbies, etc.

Topic:

- The content, i.e. what the text is about, e.g. Chirac, nuclear weapons, greenhouse effect, etc.

Genre, Register, Text Types, Domain & Style: Lee (2001)

Corpus-based approach &

automatically-extractable features

Corpora: unsupervised approach vs. supervised approach

Features: limitations of automatically-extractable genre-revealing features

5

Automatic Text Type Identification:

the Multi-Dimensional Analysis

Task: Language Analysis

Corpus-based technique

Unsupervised (bottom-up, inductive)

A representative corpus (genres, registers, other categories)

Countable features

Factor analysis

Interpretation of the factors

Validation of the factors with confirmatory statistical techniques
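The "countable features" step can be sketched as follows. This is a minimal illustration, assuming the common description of the approach in which feature counts are normalised per 1,000 words and standardised across the corpus before dimension scores are summed. The three regular-expression features and the function names are hypothetical, and the factor analysis itself (which decides which features group together) is not shown.

```python
import re
from statistics import mean, pstdev

# Hypothetical feature patterns, chosen only for illustration.
FEATURES = {
    "first_person": re.compile(r"\b(?:i|we|me|us|my|our)\b", re.I),
    "past_tense_ed": re.compile(r"\b\w+ed\b", re.I),
    "contraction": re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.I),
}

def rate_per_1000(text):
    """Count each feature and normalise to occurrences per 1,000 words."""
    n_words = len(re.findall(r"\b\w+\b", text)) or 1
    return {name: 1000 * len(rx.findall(text)) / n_words
            for name, rx in FEATURES.items()}

def standardise(rows):
    """Convert every feature column to z-scores across the corpus."""
    names = list(rows[0])
    stats = {}
    for n in names:
        column = [r[n] for r in rows]
        stats[n] = (mean(column), pstdev(column))
    return [
        {n: (r[n] - stats[n][0]) / stats[n][1] if stats[n][1] else 0.0
         for n in names}
        for r in rows
    ]

def dimension_score(z_row, positive, negative=()):
    """Sum z-scores of positively loaded features minus negatively loaded ones.

    Which features load on which side is an assumption supplied by the
    caller; in a real study it would come from the factor analysis.
    """
    return sum(z_row[n] for n in positive) - sum(z_row[n] for n in negative)
```

Usage: call `rate_per_1000` on each text, pass the resulting list to `standardise`, then call `dimension_score` on each standardised row with the feature groupings under study.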

Criticism: Lee (1999)

Biberian

Text Types

Biber (1988)

Biber (1989)

Biber (1993)

Biber (1995)

Biber (2004a)

Biber (2004b)

Biber et al. (2005)

etc.

Genres/Registers/Other Categories

vs.

Text Types

External Features

vs.

Internal Features

“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993).

6

Multi-Dimensional Analysis

Factor Analysis, Factor Scores (Biber, 1988)

Cluster Analysis (Biber, 1989)

Additional Statistical Tests (Biber, 2004a; 2004b, etc.)

1. intimate interpersonal interaction

2. informational interaction

3. scientific exposition

4. learned exposition

5. imaginative narrative

6. general narrative exposition

7. situated reportage

8. involved persuasion

Cluster Analysis - Biber (1989)

Factor 2 - Biber (1988)

Biberian features

“The notion of function is closely associated with the notion of situation. A primary motivation for analysis of the components of situation is the desire to link the functions of particular linguistic features to variation in the communicative situation” (Biber, 1988: 33).

7

Linguistic Features → Text Types

Text types refer to groupings of texts that are similar with respect to their linguistic form, irrespective of genre/register/other categories.

Example: Conversation Text Types, Biber (2004a, JADT)

Microscopic Analysis + Macroscopic Analysis

Microscopic analysis is necessary to pinpoint the exact communicative functions of individual linguistic features.

Macroscopic analysis is needed to identify the underlying textual dimensions in a set of texts, enabling an overall account of linguistic variations among those texts.

8

Text Type-Oriented Identification

Nakamura (1993)

Takahashi (1997)

Sigley (1997)

Yin and Power (2006)

TypTEXT (Illouz et al., 2000; Folch et al., 2000)

TyPWEB (Beaudouin et al., 2001a, 2001b; Illouz & Habert, 2002)

Automatic Genre Classification:

Electronic Corpora

Task: Classification

Corpus-based

Supervised: discriminant analysis, logistic regression, classifiers (SVM, C4.5, Naive Bayes, Neural Network, etc.)

Corpus of documents labelled by genre

Countable features

Off-the-shelf statistical algorithms

Evaluation: cross-validation or test set
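The supervised workflow above (labelled corpus, countable features, a classifier, cross-validation) can be sketched in a few lines. This is an illustration under stated assumptions: a nearest-centroid classifier stands in for the algorithms named on the slide (SVM, C4.5, Naive Bayes, etc.), and the function names and toy feature vectors are hypothetical.

```python
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(samples):
    """samples: list of (feature_vector, genre_label) pairs.

    Returns a mapping from genre label to the centroid of its vectors.
    """
    by_genre = defaultdict(list)
    for vec, label in samples:
        by_genre[label].append(vec)
    return {genre: centroid(vecs) for genre, vecs in by_genre.items()}

def predict(model, vec):
    """Return the genre whose centroid is nearest (squared Euclidean distance)."""
    return min(
        model,
        key=lambda g: sum((a - b) ** 2 for a, b in zip(model[g], vec)),
    )

def cross_validate(samples, k):
    """k-fold cross-validation: every k-th sample is held out in turn.

    Returns overall accuracy over all held-out predictions.
    """
    correct = 0
    for fold in range(k):
        held_out = samples[fold::k]
        training = [s for i, s in enumerate(samples) if i % k != fold]
        model = train(training)
        correct += sum(predict(model, vec) == label for vec, label in held_out)
    return correct / len(samples)

# Toy, hand-made feature vectors (not real corpus data).
toy_samples = [
    ([1.0, 0.0], "editorial"), ([0.0, 1.0], "fiction"),
    ([0.9, 0.1], "editorial"), ([0.1, 0.9], "fiction"),
    ([0.8, 0.2], "editorial"), ([0.2, 0.8], "fiction"),
]
toy_accuracy = cross_validate(toy_samples, k=3)
```

With a separate test set instead of cross-validation, `train` would be called once on the training portion and `predict` applied to the held-out documents.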

9

From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)

Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis

20 features

Discriminant analysis

Brown corpus

10

Stamatatos et al. (2000):

Text Genre Detection Using Common Word Frequencies

50 common words

Discriminant analysis

Wall Street Journal Corpus
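A common-word feature vector of the kind named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer is an assumption, and the vocabulary is taken from the toy corpus itself purely for convenience. Any fixed list of frequent words could be passed in instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def most_common_words(corpus, n):
    """Return the n most frequent words over a list of texts."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]
```

Each document then becomes a vector with one relative frequency per common word (50 in the study cited above), which can be passed to discriminant analysis or any other classifier.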

The multi-faceted approach: Kessler et al. (1997): Automatic Detection of Text Genre

Three categorial facets:

Brow (popular, middle, upper-middle and high)

Narrative (narrative, non-narrative)

Genre (reportage, editorial, scitech, legal, non-fiction, or fiction)

A set of 55 lexical, character-level and derivative cues: very easy to extract!
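To show why such cues are easy to extract, here is a small sketch that computes a handful of surface measures with no parser or tagger. The six cues below are illustrative assumptions, not the 55 cues of the original study: two character-level counts, one lexical count, and three derived ratios.

```python
import re

def surface_cues(text):
    """Compute a few cheap surface cues from raw text."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1
    n_sentences = len(sentences) or 1
    return {
        # Character-level cues: punctuation counts.
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        # Lexical cue: words starting with a capital letter.
        "capitalised_words": sum(w[0].isupper() for w in words),
        # Derived cues: ratios built from the raw counts.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sentences,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
    }
```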

11

Summary

Two main tendencies:

Descriptive Framework: Biber & multi-dimensional analysis

Classificatory Framework: Karlgren, Stamatatos, Kessler

In both frameworks:

lack of an external reference corpus

Genre &

Layout/Shape

Dillon & Vaughan, 1997

Toms & Campbell, 1999a, 1999b

Bagdanov and Worring (2001a, 2001b)

Bagdanov (2004)

12

Automatic Genre Classification:

Collections of Web Pages

Task: Classification of individual web pages

Corpus-based

Supervised: discriminant analysis, classifiers (SVM, Neural Networks, TiMBL, etc.)

Corpus of web pages labelled by genre

Countable features

Off-the-shelf statistical algorithms

Evaluation: cross-validation or test set

Features for

Web Pages

The functionality attribute:

HTML, URLs, other cues.
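HTML and URL cues of this kind can be pulled from a page with the Python standard library alone. The sketch below is an assumption about what such features might look like (tag counts, URL depth, a tilde in the path). It does not reproduce the feature set of any of the studies cited here, and the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def web_page_cues(html, url):
    """Return a few hypothetical HTML- and URL-based cues for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    path = urlparse(url).path
    return {
        "links": parser.tags["a"],
        "forms": parser.tags["form"],
        "images": parser.tags["img"],
        # Number of non-empty path segments in the URL.
        "url_depth": len([part for part in path.split("/") if part]),
        # A tilde in the path often marks a personal directory.
        "url_has_tilde": "~" in path,
    }
```

Usage: `web_page_cues(page_source, page_url)` returns a dictionary that can be merged with text-based features before classification.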

13

16 Web Genres

Lim et al. (2005)

Authors using a supervised approach

on genre of individual web pages

Shepherd et al. (2004)

Kennedy and Shepherd (2005)

Lee and Myaeng (2002, 2004)

Meyer zu Eissen and Stein (2004)

Boese (2005)

Lim et al. (2005)

14

Analysing Hypertext Structure:

Genres of Websites

Academic’s personal home page: Rehm (2002, 2005, 2006)

Analysing Hypertext Structure:

Web Genre Representation for Corpus

Linguistic Studies

Websites as instances of web genres: Mehler & Gleim (2006)

15

Summary

Main tendency:Use of supervised methods

“one web page = one genre”

Rehm, Mehler: HTML parsers that understand the structure of the website

The website may include web pages with differing genres

My position:

Automatic Identification of Genre in Web Pages

A web page may have zero, one, or multiple genres, because of:

the complexity of web pages

the fluidity and the fast-paced evolution of the web
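One simple way to allow zero, one, or several genres per page is to score each genre independently and keep every genre whose score clears a threshold. The sketch below only illustrates that idea and is not the method proposed in the talk: the per-genre scores, the threshold value, and the function name are all hypothetical.

```python
def assign_genres(scores, threshold=0.5):
    """Return every genre whose score reaches the threshold, best first.

    scores: mapping from genre label to a score in [0, 1], e.g. produced
    by one independent classifier per genre (an assumption of this sketch).
    The result may be empty (zero genres) or hold several labels.
    """
    chosen = [genre for genre, score in scores.items() if score >= threshold]
    return sorted(chosen, key=lambda genre: scores[genre], reverse=True)
```

Under "one web page = one genre", by contrast, the classifier would always return the single highest-scoring label, even when no score is convincing.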

16

Zero-, One-, Multiple Genres

Genre-Based Search Engines

Shepherd et al. (2004)

Meyer zu Eissen & Stein (2004)

Rehm (2005)

etc.

(Rosso, 2005)

17

Genre Retrieval and Visualization

Roussinov et al. (2001)

Karlgren et al. (1998a, 1998b)

Dimitrova et al. (2002)

Genre & Domain/Topic

Wolters and Kirsten (1999): German, genre & domain

Poudat & Cleuziou (2003): French, genre & domain

Péry-Woodley & Rebeyrolle (1998): French, feature patterns for retrieving definitional text across genres and domains

Lee and Myaeng (2002, 2004): English and Korean: can we use genre to help topical categorization?

18

Clustering Genres and Topic

Rauber and Müller-Kögler (2001):

Digital Libraries

Genre as Polarities: Domain Transfer

Finn & Kushmerick (2006)

19

Thank you for your attention…