Computational Models for Automatic WebGenre Identification
Marina Santini, Artificial Solutions; KYH Agile Web Development, Stockholm
Uppsala University, Department of Linguistics and Philology, Seminar Series, Fri 4 March 2011


1. Marina Santini, Artificial Solutions; KYH Agile Web Development, Stockholm. Uppsala University, Department of Linguistics and Philology, Seminar Series, Fri 4 March 2011

2. Genres on the Web (GoWeb)
3. Outline: What is genre? What is web genre? What is the difference between genre and web genre? Why is (web) genre important? Automatic web genre identification: the very beginning (Biber; Karlgren & Cutting), Sharoff, Kim & Ross, Santini, Stein et al. Web genre identification by humans: Karlgren, Rosso & Haas, Crowston et al. Future directions.
4. What is genre? The beginning: Aristotle (4th cent. BC): drama, lyric, epic. Drama: tragedy, comedy, satyr play. Literary theory and literary genres. Library classification, also used in online bookshops (e.g. Amazon). Music genres (jazz, rock, etc.), film genres (thriller, drama, western, etc.).
5. More recently: genre in academic contexts, in workplace and professional contexts, in public contexts, in pedagogy (teaching writing), etc. (research articles, essays, emails, memos, etc.).
6. Recent Genre Definitions: 2008-2010
7. Genre & Corpus Linguistics: Surprisingly, no explicit definition of what genre is. Brown Corpus (1961): 15 genres; Stockholm-Umeå Corpus (SUC) (1990s); British National Corpus (1990s); etc.
8. David Lee and the BNC Jungle
9. Why is genre important? It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced. Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers, reportages.
10. Benefits (I): Being a context carrier. Complexity reduction: a text receives identity through belonging to a certain genre. Predictivity: genre reduces information overload. Findability: genre helps find web documents relevant to our information needs.
11. Benefits (II): Genre competence increases information understanding. Genre competence increases self-protection against digital crimes (phishing, hoaxes, cyberbullying) because it can help us spot genre anomalies and, consequently, malicious intentions. Genre competence helps implement democracy: some educational programs (e.g. in Australia) teach genre from primary school onwards, because those who leave school after primary school without genre competence become socially disadvantaged in the structure of power.
12. What is webgenre? All types of genres that are on the web: paper genres that have been uploaded in any format, plus genres that do not have any counterpart in the paper world, e.g. home page, About Us, FAQs, webzine, personal blog, corporate weblogs.
13. How is webgenre different from paper genre? On the web, there are new communicative settings and new communicative contexts, so new genres are spawned. These new communicative settings have been spurred by a proliferation of new technologies that ease, foster and model our communication, e.g. chats, blogs, and social networks like Facebook, Twitter, LinkedIn.
14. Then, a written text is not only topic. There are many dimensions of variation: domain, topic, register, sentiment, level of complexity or difficulty or specialisation, trustworthiness and credibility, etc. Genre is a dimension of variation. Genre gives us a topic packaged in a certain way. From the package, we are able to identify the communicative purpose of the text and the communicative context that has spawned such a text.
15. A step back. Biber (1988): genre; text types; 66 linguistically-motivated features; Multi-Dimensional Analysis; ad-hoc corpus. Karlgren & Cutting (1994): genre; 20 shallow features; Brown Corpus.
16. Biberian Text Types: Biber (1988), Biber (1989), Biber (1993), Biber (1995), Biber (2004a), Biber (2004b), Biber et al. (2005), etc. Genres/Registers vs. Text Types; External Features vs. Internal Features. "I have used the term genre (or register) for text varieties that are readily recognized and named within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term text type for varieties that are defined linguistically (rather than perceptually)" (Biber, 1993).
17. Multi-Dimensional Analysis: Factor Analysis, Factor Scores (Biber, 1988); Cluster Analysis (Biber, 1989); Additional Statistical Tests (Biber, 2004a; 2004b, etc.). Text types: (1) intimate interpersonal interaction, (2) informational interaction, (3) scientific exposition, (4) learned exposition, (5) imaginative narrative, (6) general narrative exposition, (7) situated reportage, (8) involved persuasion. Figures: Cluster Analysis - Biber (1989); Factor 2 - Biber (1988). Criticism: Lee (1999).
18. From Biber's text types to genres of electronic corpora: Karlgren and Cutting (1994)
19. Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. 20 features; discriminant analysis; Brown Corpus.
20. POSs & SUC
21. More than 15 years later. Grieve, Biber et al.: "We define a genre in a very similar manner to how we define register, i.e. as a variety of language defined by the external situation in which it is produced. However, while a register is characterized by pervasive linguistic features, a genre is characterized by conventionalized linguistic features." Karlgren: "Genre is a vague but well-established notion, and genres are explicitly identified and discussed by language users even while they may be difficult to encode and put into practical use." (GoWeb)
22. The concept of genre is beneficial but difficult to pin down and to agree upon (GoWeb). In the book, we do not propose a single and unified definition of genre. Authors give their different views on genre.
23. Do we really need a definition? After all, once we are convinced that genre is useful, we could just say that genre is a classificatory principle based on a number of attributes. The web is immense: we cannot think of classifying web documents by genre manually, can we? Let's just focus on AUTOMATIC web GENRE CLASSIFICATION!
24. What do we need for Automatic webGenre Identification (AGI)? We need: a genre taxonomy (palette) and a corpus; measurable attributes (features) that can be extracted automatically; an automatic classifier, i.e. a computational model that does the classification for us.
25. Vector representation & supervised machine learning algorithms (esp. SVM)
26. Models for AGI: Scenarios. Serge Sharoff; Kim & Ross; Santini; Stein et al.; others. (GoWeb)
27. Morphology & the Linguist. Aim: find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages. A functional genre palette inspired by J. Sinclair. Many corpora: English and Russian. Classifier: SVM. Features: POS trigrams (577 for Russian; 593 for English), e.g. ADV ADJ NOUN. GoWeb: Sharoff
28. The expert (the linguist) decides:
29. Results
30. KRYS I and Harmonic Descriptor Representation (HDR). Information studies, digital libraries: semantic concept. Features: HDR = FP, LP or AP (between 1 and T/(N x MP)). Number of features: 7431. Classifier: SVM. KRYS I + 7-webgenre collection (total: 24 + 7 genre classes, 3452 documents). 2477 words. GoWeb: Kim & Ross
31. KRYS I & 7-webgenre collection
32. Accuracies
33. What about morphology & syntax? What about noise? Collection: 7-webgenre collection + others. Features: 100 facets. Genre palette: 7 webgenres + other. Classifier: inferential model (subjective Bayesian method). GoWeb: Santini
34. 7-webgenre collection. Balanced (200 web pages per genre class). Genre palette. Not annotated manually. Built following 2 principles: objective sources; consistent genre granularity.
35. 100 Facets
36. Inferential model: It is a simple probabilistic model based on rules. It allows some reasoning through the use of weights (closer to artificial intelligence than machine learning).
37. Comparisons (I)
38. Different types of noise!
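As a rough illustration of the vector representation mentioned in slide 25 and the POS-trigram features in slide 27, the sketch below turns a sequence of POS tags into relative trigram frequencies over a fixed feature list. It is only a minimal sketch: the tag sequence, the three-trigram vocabulary and the function names are made up for the example and do not reproduce the 577/593-trigram feature sets or any author's actual implementation.

```python
from collections import Counter


def pos_trigrams(tags):
    """Return every run of three consecutive POS tags as a tuple."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]


def trigram_vector(tags, vocabulary):
    """Relative frequency of each vocabulary trigram in the tag sequence."""
    counts = Counter(pos_trigrams(tags))
    total = sum(counts.values())
    if total == 0:
        # Documents shorter than three tags have no trigrams at all.
        return [0.0] * len(vocabulary)
    return [counts[trigram] / total for trigram in vocabulary]


# Hypothetical, hand-tagged toy document (assumption, not real corpus data).
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]

# Hypothetical feature list; a real palette would use hundreds of trigrams.
vocabulary = [
    ("ADV", "ADJ", "NOUN"),
    ("ADJ", "NOUN", "VERB"),
    ("NOUN", "VERB", "ADV"),
]

vector = trigram_vector(tags, vocabulary)
print(vector)
```

One such vector per web page, paired with its genre label, is the kind of input a supervised classifier such as an SVM would be trained on.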
39. Results
40. Three experimental settings, three different genre needs: (1) genre comparison across corpora; (2) digital libraries, where documents can be more easily monitored; (3) the wild web, where everything is uncertain and noisy. WEGA prototype: a retrieval model for genre-enabled web search.
41. Genre retrieval model. Genre collection and palette: KI-04 corpus, 8 webgenres. Firefox add-on. Model: lightweight GenreRich model (linear discriminant analysis). Features: HTML, link features, character features, vocabulary concentration features (< 100 features). GoWeb: Stein, Meyer zu Eissen, Lipka
42. WEGA (WEb Genre Analysis)
43. KI-04 genre collection: 8 webgenres
44. Genre Classes & Human Recognition. How can we decide on the most representative genre classes? Let's ask users. Yes indeed, but how? (1) questionnaires (Karlgren); (2) card sorting (Rosso & Haas); (3) task-oriented studies (Crowston et al.); (4) others.
45. Questionnaires: what genres are available on the internet?
46. User Warrant. Collecting genre terminology in the users' own words (3 participants). Make the users classify web pages and create piles (rationale?). Users choose the best of the collected genre terminology (102 participants). User validation of the genre palette (257 participants). Genres' usefulness for web search (32 participants). GoWeb: Rosso & Haas
47. Final genre palette: 18 genres
48. Genres & Tasks. 3 groups of respondents: teachers, journalists, engineers. Respondents were asked to carry out a web search for a real task of their own choice. What is your search goal? What type of web page would you call this? What is it about the page that makes you call it that? Was this page useful to you? GoWeb: Crowston et al.
49. What type of web page would you call this? 522 unique terms about 300
50. Syracuse corpus & AGI. ACL 2010 (Uppsala): "Fine-Grained Genre Classification Using Structural Learning Algorithms", Zhili Wu, Katja Markert and Serge Sharoff. The whole corpus: 3027 annotated webpages divided into 292 genres. Focussing on genres containing 15 or more examples, the corpus has about 2293 examples and 52 genres.
51. Conclusions (I): Do we really need a definition of genre? (1) Take a number of web pages belonging to different web genres (e.g. blogs, home pages, news stories, FAQs, etc.). (2) Identify and extract genre-revealing features. (3) Feed an automatic classifier. Where is the problem?
52. Conclusions (II): The problem with this approach is that without a theoretical definition and characterization of the concept of genre, it is not clear: how to create a genre taxonomy whose classes both humans and automatic classifiers can easily discriminate between; how to select a representative corpus for the genre classes in the taxonomy, since there is a lot of variation in users' assessments; how to identify the optimal genre-revealing features.
53. Future Work. Genre is a high-level concept: we NEED a theoretical definition of genre for computational and empirical purposes. Without a theoretical definition, genres become lifeless texts, merely characterized by formal attributes, and the communicative context, i.e. the thing that makes genre important, is completely stripped out. Although in some restricted experimental settings this formalistic approach is quite rewarding (more than 95% success rate), we can hardly generalize from it.
54. Future directions: AGI is a fertile land for research and development. Now that basic explorations have been carried out, we should concentrate more on the correlation and interrelation of the following variables: human agreement; representation of genre classes; number of genre classes; nature of genre classes; size of the whole corpus; structured and unstructured noise; genre-revealing features that account for the context that genres carry with them; new computational models and algorithms.
55. Certainties. Genre is a useful concept in many disciplines. Automatic genre classification is feasible, and there is ample space for improvement. I am interested in your views on (web) genre: send me your impressions, ideas, gut feelings and your genre classes. Facebook page: www.facebook.com/genresontheweb. Genre blog: www.forum.santini.se. Webriders, short proposal to EU: www.webrider.se
56. Thank you for your attention!
57. References (I)
Bateman, John (2008) Multimodality and Genre, Palgrave Macmillan
Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre: An Introduction to History, Theory, Research, and Pedagogy (free book), http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf
Bruce, Ian (2008) Academic Writing and Genre, Continuum
Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic Variation and Genre, De Gruyter Mouton
58. References (II)
Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in the Internet, John Benjamins Publishing Company
Heyd, Theresa (2008) Email Hoaxes: Form, function, genre ecology, John Benjamins Publishing Company
Lee, David (2001) Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, Language Learning & Technology, September 2001, Vol. 5, Num. 3, pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
59. References (III)
Luzón, María José, Ruiz-Madrid, María Noelia and Villanueva, María Luisa (eds) (2010) Digital Genres, New Literacies and Autonomy in Language Learning, Cambridge Scholars Publishing
Martin, James and Rose, David (2008) Genre Relations: Mapping Culture, Equinox
Puschmann, Cornelius (2010) The corporate blog as an emerging genre of computer-mediated communication: features, constraints, discourse situation, Universitätsverlag Göttingen
WEGA prototype download, documentation and references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html