1
An Overview of Studies on Automatic Genre Identification
Marina Santini
University of Brighton
UK
From Biberian Text Types to Genres of Web Pages
Université de Toulouse-Le Mirail, Maison de la Recherche, Toulouse, 5 and 6 October 2006
<http://w3.univ-tlse2.fr/erss/textes/seminaires/sc2006/sc2006.html>
GENRE TEXTUEL/DOMAINE/ACTIVITÉ
Study days organised by the «Sémantique et Corpus» research group
Overview of the Talk
Elusiveness of the concept of genre
Genre & neighbouring terms
Corpus-based approach & automatically-extractable features
Automatic text type identification
Automatic genre classification
  Electronic corpora
  Collections of web pages
2
Elusiveness of the concept of genre
A codification of discursive properties (Todorov, 1978);
A social action (Miller, 1984);
A persuasive classifying statement (Rosmarin, 1985);
A pattern or a recurring type of text (Erickson, 1999);
An interface metaphor (Toms and Campbell, 1999);
A typified communicative action (Yates & Orlikowski, 1992).
and so on…
Slippery but Intuitive
Academic papers
Fables
Editorials
Sonnets
Novels
Interviews
Letters
Recipes
Patient information leaflets
Reviews
and so forth…
3
Genre & neighbouring terms
Genre
Register
Text types
Domain
Topic
Folksonomies
[…]
Text Types, Style, Genre: Overlap
Klavans and Kan (1998)
Johannesson & Wallström (1999)
Karlgren (2000)
Stamatatos et al. (2001)
Dewdney et al. (2001)
Rehm (2006)
etc.
4
Text Types, Genre, Domain & Topic: Some Distinctions
Text Types:
• Biberian text types: intimate interpersonal interaction, informational interaction, scientific exposition, etc.
• Rhetorical text types: narration, instruction, argumentation, etc.
Genre:
• Text categories, e.g. news story, academic paper, interview, etc.
Domain:
• Subject fields, e.g. religion, hobbies, etc.
Topic:
• The content, i.e. what the text is about, e.g. Chirac, nuclear weapons, greenhouse effect, etc.
Genre, Register, Text Types, Domain, & Style, Lee (2001)
Corpus-based approach &
automatically-extractable features
Corpora: unsupervised approach vs. supervised approach
Features: limitation of automatically-extractable genre-revealing features
5
Automatic Text Type Identification:
the Multi-Dimensional Analysis
Task: Language Analysis
Corpus-based technique
Unsupervised (bottom-up, inductive)
A representative corpus (genres, registers, other categories)
Countable features
Factor analysis
Interpretation of the factors
Validation of the factors with statistical confirmatory techniques
Criticism: Lee (1999)
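The bottom-up step from countable features to a factor can be sketched as follows. This is only an illustration under stated assumptions: the feature names and frequencies are invented, and the code extracts a single principal component from the feature correlation matrix by power iteration, which is a simpler stand-in for the full factor analysis (with rotation) used in the multi-dimensional studies.

```python
import math

def standardize(rows):
    """Turn each feature column into z-scores (mean 0, sample sd 1)."""
    n = len(rows)
    z_columns = []
    for column in zip(*rows):
        mean = sum(column) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
        z_columns.append([(v - mean) / sd for v in column])
    return [list(row) for row in zip(*z_columns)]

def correlation_matrix(z_rows):
    """Correlation between every pair of standardized feature columns."""
    n, k = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(k)]
            for i in range(k)]

def first_component(corr, iterations=200):
    """Power iteration: dominant eigenvalue and loadings of the matrix."""
    k = len(corr)
    vec = [float(i + 1) for i in range(k)]  # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * corr[i][j] * vec[j]
                     for i in range(k) for j in range(k))
    loadings = [v * math.sqrt(eigenvalue) for v in vec]
    return eigenvalue, loadings

# Hypothetical frequencies per 1,000 words; one row per text.
FEATURES = ["first_person_pronouns", "contractions", "nouns", "prepositions"]
counts = [
    [60, 40, 150, 70],   # conversation-like text
    [55, 35, 160, 80],   # conversation-like text
    [5, 1, 300, 140],    # academic-like text
    [8, 2, 290, 150],    # academic-like text
    [20, 8, 250, 120],   # news-like text
]

eigenvalue, loadings = first_component(correlation_matrix(standardize(counts)))
for name, loading in zip(FEATURES, loadings):
    print(f"{name:24s} {loading:+.2f}")
```

Features whose loadings share a sign co-occur across texts; interpreting what such a group has in common functionally is the manual step listed above as "Interpretation of the factors". The overall sign of the loadings is arbitrary.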
Biberian Text Types
Biber (1988)
Biber (1989)
Biber (1993)
Biber (1995)
Biber (2004a)
Biber (2004b)
Biber et al. (2005)
etc.
Genres/Registers/Other Categories
vs.
Text Types
External Features
vs.
Internal Features
“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993).
6
Multi-Dimensional Analysis
Factor Analysis, Factor Scores (Biber, 1988)
Cluster Analysis (Biber, 1989)
Additional Statistical Tests (Biber, 2004a; 2004b, etc.)
1. intimate interpersonal interaction
2. informational interaction
3. scientific exposition
4. learned exposition
5. imaginative narrative
6. general narrative exposition
7. situated reportage
8. involved persuasion
Cluster Analysis - Biber (1989)
Factor 2 - Biber (1988)
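A factor score for a text can be sketched as below. The sketch assumes the usual description of the procedure: frequencies are standardized across the corpus, and a text's score on a dimension is the sum of its standardized frequencies for positively loading features minus the sum for negatively loading ones. The feature names, their assignment to the positive or negative pole, and the frequencies are all invented for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize one feature across all texts."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def dimension_scores(counts, positive, negative):
    """counts maps a feature name to its per-text normalized frequencies."""
    standardized = {name: z_scores(vals) for name, vals in counts.items()}
    n_texts = len(next(iter(counts.values())))
    scores = []
    for i in range(n_texts):
        pos = sum(standardized[f][i] for f in positive)
        neg = sum(standardized[f][i] for f in negative)
        scores.append(pos - neg)
    return scores

# Hypothetical frequencies per 1,000 words for four texts:
# two conversation-like texts followed by two academic-like texts.
counts = {
    "first_person_pronouns": [60, 55, 5, 8],
    "contractions": [40, 35, 1, 2],
    "nouns": [150, 160, 300, 290],
    "prepositions": [70, 80, 140, 150],
}

scores = dimension_scores(
    counts,
    positive=["first_person_pronouns", "contractions"],
    negative=["nouns", "prepositions"],
)
print([round(s, 2) for s in scores])
```

Texts with similar scores across several dimensions are the ones a subsequent cluster analysis would group into one text type.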
Biberian features
“The notion of function is closely associated with the notion of situation. A primary motivation for analysis of the components of situation is the desire to link the functions of particular linguistic features to variation in the communicative situation” (Biber, 1988: 33).
7
Linguistic Features → Text Types
Text types refer to groupings of texts that are similar with respect to their linguistic form, irrespective of genre/register/other categories.
Example: Conversation Text Types, Biber (2004a, JADT)
Microscopic Analysis + Macroscopic Analysis
Microscopic analysis is necessary to pinpoint the exact communicative functions of individual linguistic features.
Macroscopic analysis is needed to identify the underlying textual dimensions in a set of texts, enabling an overall account of linguistic variations among those texts.
8
Text Type-Oriented Identification
Nakamura (1993)
Takahashi (1997)
Sigley (1997)
Yin and Power (2006)
TypTEXT (Illouz et al., 2000; Folch et al., 2000)
TyPWEB (Beaudouin et al., 2001a, 2001b; Illouz & Habert, 2002)
Automatic Genre Classification:
Electronic Corpora
Task: Classification
Corpus-based
Supervised: discriminant analysis, logistic regression, classifiers (SVM, C4.5, Naive Bayes, Neural Network, etc.)
Corpus of documents labelled by genre
Countable features
Off-the-shelf statistical algorithms
Evaluation: cross-validation or test set
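The supervised workflow listed above (labelled corpus, countable features, a learning algorithm, cross-validation) can be sketched in a few lines. The nearest-centroid classifier here is a deliberately simple stand-in for the algorithms named on the slide, and the two-feature vectors and genre labels are invented.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(samples):
    """samples: list of (feature_vector, genre). Returns genre -> centroid."""
    by_genre = {}
    for vector, genre in samples:
        by_genre.setdefault(genre, []).append(vector)
    return {genre: centroid(vectors) for genre, vectors in by_genre.items()}

def predict(model, vector):
    """Pick the genre whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda genre: math.dist(model[genre], vector))

def cross_validate(samples, k):
    """k-fold cross-validation; fold i holds out samples i, i+k, i+2k, ..."""
    correct = 0
    for i in range(k):
        held_out = samples[i::k]
        training = [s for j, s in enumerate(samples) if j % k != i]
        model = train(training)
        correct += sum(predict(model, v) == g for v, g in held_out)
    return correct / len(samples)

# Hypothetical corpus: (pronouns per 1,000 words, nouns per 1,000 words).
samples = [
    ([60, 150], "conversation"), ([5, 300], "academic"),
    ([55, 160], "conversation"), ([8, 290], "academic"),
    ([58, 155], "conversation"), ([6, 310], "academic"),
]

accuracy = cross_validate(samples, k=3)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Every document is held out exactly once, so the reported accuracy is never measured on a text the model was trained on.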
9
From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis
20 features
Discriminant analysis
Brown corpus
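"Simple metrics" of this kind are cheap to compute from raw text. The slide does not list the 20 features, so the five below (and the six-letter cut-off for a "long" word) are illustrative assumptions rather than the authors' feature set.

```python
import re

def simple_metrics(text):
    """Surface-level counts and ratios; assumes the text contains words."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lowered = [w.lower() for w in words]
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "long_word_ratio": sum(len(w) > 6 for w in words) / len(words),
        "type_token_ratio": len(set(lowered)) / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }

metrics = simple_metrics("The cat sat. The cat ran away quickly.")
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Each text becomes one row of such values, which is the input format a discriminant analysis expects.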
10
Stamatatos et al. (2000):
Text Genre Detection Using Common Word Frequencies
50 common words
Discriminant analysis
Wall Street Journal Corpus
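A common-word representation can be sketched as follows. One assumption to flag: the word list here is taken from the toy collection itself, whereas a real study would fix the list of common words in advance; the tokenizer and example texts are likewise invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def most_common_words(texts, n):
    """The n most frequent word forms across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]

texts = ["the cat and the dog", "the end of the story"]
vocabulary = most_common_words(texts, 3)  # 50 in the study cited above
vector = frequency_vector(texts[0], vocabulary)
print(vocabulary, vector)
```

Because the vocabulary consists of very frequent, largely topic-neutral words, the resulting vectors say more about how a text is written than about what it discusses.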
The multi-faceted approach: Kessler et al. (1997): Automatic Detection of Text Genre
Three categorial facets:
Brow (popular, middle, upper-middle and high)
Narrative (narrative, non-narrative)
Genre (reportage, editorial, scitech, legal, non-fiction, or fiction)
A set of 55 lexical, character-level and derivative cues: very easy to extract!
11
Summary
Two main tendencies:
Descriptive Framework: Biber & multi-dimensional analysis
Classificatory Framework: Karlgren, Stamatatos, Kessler
In both frameworks:
lack of an external reference corpus
Genre &
Layout/Shape
Dillon & Vaughan, 1997
Toms & Campbell, 1999a, 1999b
Bagdanov and Worring (2001a, 2001b)
Bagdanov (2004)
12
Automatic Genre Classification:
Collections of Web Pages
Task: Classification of individual web pages
Corpus-based
Supervised: discriminant analysis, classifiers (SVM, Neural Networks, TiMBL, etc.)
Corpus of web pages labelled by genre
Countable features
Off-the-shelf statistical algorithms
Evaluation: cross-validation or test set
Features for Web Pages
The functionality attribute:
HTML, URLs, other cues.
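Functionality cues of this kind can be read off the markup and the address of a page. The sketch below uses only Python's standard library; the particular cues chosen (link, form and image counts, URL depth, a tilde in the path) and the example page are assumptions for illustration, not a feature set taken from any of the cited studies.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def web_page_features(html, url):
    """A few countable HTML and URL cues for one web page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    path = urlparse(url).path
    segments = [part for part in path.split("/") if part]
    return {
        "links": parser.tags["a"],
        "forms": parser.tags["form"],
        "images": parser.tags["img"],
        "url_depth": len(segments),
        "url_has_tilde": "~" in path,
    }

# Hypothetical page and address.
page = ('<html><body><a href="x">x</a><a href="y">y</a>'
        '<form><input></form><img src="p.png"></body></html>')
features = web_page_features(page, "http://www.example.org/~jane/index.html")
print(features)
```

Such counts can be appended to the linguistic features of the page text before training a classifier.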
13
16 Web Genres
Lim et al. (2005)
Authors using a supervised approach
on genre of individual web pages
Shepherd et al. (2004)
Kennedy and Shepherd (2005)
Lee and Myaeng (2002, 2004)
Meyer zu Eissen and Stein (2004)
Boese (2005)
Lim et al. (2005)
14
Analysing Hypertext Structure:
Genres of Websites
Academic’s personal home page: Rehm (2002, 2005, 2006)
Analysing Hypertext Structure:
Web Genre Representation for Corpus
Linguistic Studies
Websites as instances of web genres: Mehler & Gleim (2006)
15
Summary
Main tendency: Use of supervised methods
“one web page = one genre”
Rehm, Mehler: HTML parser that understands the structure of the website
The website may include web pages with differing genres
My position: Automatic Identification of Genre in Web Pages
a web page may have zero, one, or multiple genres, because of:
the complexity of web pages
the fluidity and the fast-paced evolution of the web
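One way to allow zero, one, or multiple genres per page is to score each genre independently and keep every genre that passes a cut-off, instead of forcing a single best label. The slide does not say how the assignment is made, so the thresholding scheme, the 0.5 cut-off, and the scores below are all assumptions.

```python
def assign_genres(scores, threshold=0.5):
    """Return every genre whose score reaches the threshold (possibly none)."""
    return sorted(genre for genre, score in scores.items() if score >= threshold)

# Hypothetical per-genre scores for three web pages.
one_genre = assign_genres({"blog": 0.9, "faq": 0.1})
two_genres = assign_genres({"blog": 0.7, "personal home page": 0.6})
no_genre = assign_genres({"blog": 0.2, "faq": 0.3})

print(one_genre, two_genres, no_genre)
```

A page that matches no known genre yields an empty list, which leaves room for genres that are still emerging on the web.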
16
Zero-, One-, Multiple Genres
Genre-Based Search Engines
Shepherd et al. (2004)
Meyer zu Eissen & Stein (2004)
Rehm (2005)
etc.
(Rosso, 2005)
17
Genre Retrieval and Visualization
Roussinov et al. (2001)
Karlgren et al. (1998a, b)
Dimitrova et al. (2002)
Genre & Domain/Topic
Wolters and Kirsten (1999): German, genre & domain
Poudat & Cleuziou (2003): French, genre & domain
Pery-Woodley & Rebeyrolle (1998): French, feature patterns for retrieving definitional text across genres and domains
Lee and Myaeng (2002, 2004): English and Korean: can we use genre to help topical categorization?
18
Clustering Genres and Topic
Rauber and Müller-Kögler (2001):
Digital Libraries
Genre as Polarities: Domain Transfer
Finn & Kushmerick (2006)