

Page 1: Gramsci ’ s authorship attribution  of anonymus newspapers articles

Gramsci’s authorship attribution of anonymous newspaper articles

Maurizio Lana

Histoire et informatique: Textométrie des sources historiques

6.6.2014

Page 2: Gramsci ’ s authorship attribution  of anonymus newspapers articles

who we are

• Maurizio Lana

• Mirko Degli Esposti

• Emanuele Caglioti

• Dario Benedetto

• 1 scholar and 3 mathematical physicists

Page 3: Gramsci ’ s authorship attribution  of anonymus newspapers articles

it’s always data

• once a phenomenon of the physical world has been digitized, the same analysis can work equally well on

• CT scans, • songs, • ECGs, • texts, • …

Page 4: Gramsci ’ s authorship attribution  of anonymus newspapers articles

reason for the study

• national edition of Gramsci’s works, by the Ministero dei Beni Culturali

• new work on the newspaper articles

• many anonymous newspaper articles in the journals and newspapers Gramsci wrote for: Il Grido del Popolo, Avanti!, La Città Futura

• request from the Fondazione Gramsci to start the study of the anonymous articles anew, to find new evidence of Gramsci’s writings

• this was in 2005

Page 5: Gramsci ’ s authorship attribution  of anonymus newspapers articles

a little background

• the start is in 1847: V.J. Bunjakovskij, On the possibility to apply determining measures of confidence to the results of some observing sciences, particularly statistics

• 1897-98, W. Lutosławski, “On Stylometry”; “Principes de stylométrie”

• 1959, D. R. Cox and L. Brandwood, On a discriminatory problem connected with the works of Plato

• 1962, A. Ellegård, Who was Junius?

• 1964, F. Mosteller and D. Wallace, Inference and Disputed Authorship: The Federalist

• 1978, A. Kenny, The Aristotelian ethics: a study of the relationship between the Eudemian and Nicomachean ethics of Aristotle

• 1980, J.P. Benzécri, Pratique de l’analyse des données

• 1987, J. F. Burrows, Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style, “LLC”, 2, 1987, pp. 61-70

Page 6: Gramsci ’ s authorship attribution  of anonymus newspapers articles

in common…

• … they all work at the level of words

Page 7: Gramsci ’ s authorship attribution  of anonymus newspapers articles

the turning point

• G. Ledger, Re-counting Plato: A Computer Analysis of Plato’s Style, Oxford, Clarendon Press, 1989

• the units counted are: words containing a specified letter; words ending in a specified letter; words with a specified letter as the penultimate one

• that is, semantically and linguistically meaningless parts of words

• “I have departed from the traditional approach of stylometry by ignoring entirely meanings and grammatical functions, measuring instead the frequencies of words according to their orthographic content”
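
As a rough illustration of the three orthographic counts listed on this slide, here is a minimal Python sketch. The function name `ledger_features`, the tokenization, and the use of relative frequencies are assumptions made for the example; the slide does not describe how Ledger actually computed or combined his variables.

```python
import re

def ledger_features(text, letter):
    """Share of words that contain `letter`, end in it, or have it as penultimate letter."""
    # Assumption: a "word" is any run of alphabetic characters, case-folded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {"contains": 0.0, "ends": 0.0, "penultimate": 0.0}
    n = len(words)
    return {
        "contains": sum(letter in w for w in words) / n,
        "ends": sum(w.endswith(letter) for w in words) / n,
        "penultimate": sum(len(w) >= 2 and w[-2] == letter for w in words) / n,
    }

features = ledger_features("la casa nuova", "s")
```

Computing such a dictionary for every letter of the alphabet gives one numeric profile per text, with no reference to meaning or grammar.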

Page 8: Gramsci ’ s authorship attribution  of anonymus newspapers articles

today, for me (for us)

• the key is: a latent mathematical structure of the text

• from: L. Doležel, A note on quantification in text theory, in: “Text Processing”, S. Allén ed., Stockholm, 1982, pp. 539-552

• an expression of the idea: D. Khmelev, F. Tweedie, Using Markov chains for identification of writers, “LLC”, 16, 4, 2001, pp. 299-307

Page 9: Gramsci ’ s authorship attribution  of anonymus newspapers articles

today, for me (for us)

• another expression: D. Benedetto, E. Caglioti, V. Loreto et al., Language Trees and Zipping, “Phys. Rev. Lett.” 88, n. 4, 048702-1, 048702-4 (2002)

• take one text and compress it with Zip;

• then take another text and compress it with the compression dictionary of the first one;

• measure the difference in size: this is the measure of the relative entropy
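
The compression idea can be sketched in a few lines of Python. This is only an approximation under stated assumptions: it uses the standard-library `zlib` (DEFLATE) instead of the Zip tool named on the slide, and it stands in for "compress with the dictionary of the first text" by appending the fragment to the reference text and measuring how many extra bytes the compressed output needs. The names `cross_cost` and `closest_reference` are made up for the example.

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def cross_cost(reference: str, fragment: str) -> int:
    """Extra compressed bytes needed to encode `fragment` after `reference`."""
    ref = reference.encode("utf-8")
    frag = fragment.encode("utf-8")
    # The compressor has "learned" the reference when it reaches the fragment,
    # so a fragment in a similar style costs fewer additional bytes.
    return compressed_size(ref + frag) - compressed_size(ref)

def closest_reference(fragment: str, references: dict) -> str:
    """Label of the reference text against which the fragment is cheapest to encode."""
    return min(references, key=lambda label: cross_cost(references[label], fragment))
```

One caveat of this sketch: DEFLATE only looks back over a 32 KB window, so with long reference texts only the final part of the reference influences the result, and the fragment should be short compared with the reference.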

Page 10: Gramsci ’ s authorship attribution  of anonymus newspapers articles

then came the AAAC

• in 2004 the American mathematician Patrick Juola proposed the Ad-hoc Authorship Attribution Competition (AAAC) to find experimentally the best method for correctly attributing anonymous works: http://www.mathcs.duq.edu/~juola/authorship_contest.html

• the second-best scorer was Vlado Keselj, with a method based on measuring n-gram frequencies

Page 11: Gramsci ’ s authorship attribution  of anonymus newspapers articles

the state of the QAA world in 2005

• in his 2002 thesis “Quantitative Authorship Attribution: A History And An Evaluation Of Techniques”, Jack Grieve counted at least 39 known and used methods, with 93 variants, for Quantitative AA

• the aim of the AAAC: prune the useless methods

• nevertheless: this continues to be not science, but craftsmanship

Page 12: Gramsci ’ s authorship attribution  of anonymus newspapers articles

in 2005 we started

• we had to prove to the Fondazione Gramsci that the Quantitative AA produced good results

• we chose to use two QAA methods:

– relative entropy (already described)

– n-gram distances (which gave Keselj 2nd place in the AAAC)

Page 13: Gramsci ’ s authorship attribution  of anonymus newspapers articles

the protocol

• phase 1: 50 certainly Gramscian texts; 50 certainly non-Gramscian texts;

– do whatever you like to be able to recognize the Gramscian as Gramscian and the non-Gramscian as non-Gramscian

• phase 2 (blind test): 40 unidentified texts, some Gramscian and some not: classify them correctly

Page 14: Gramsci ’ s authorship attribution  of anonymus newspapers articles

text preparation

• deletion of:

– quotations of any length

– proper nouns

– numbers

• no lemmatization: e.g. the choice of a given tense and person of a verb carries some amount of information that we cannot evaluate well enough to justify discarding it
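
The slides do not say how the deletions were carried out (they may well have been done by hand). Purely as an illustration, the sketch below approximates the three deletions with regular expressions. All three patterns are assumptions made for the example, and the proper-noun rule in particular is a crude heuristic (a capitalized word that follows a lowercase word) that will miss names at the start of a sentence.

```python
import re

# Assumption: quotations are marked with «...», “...” or "...".
QUOTED = re.compile(r'«[^»]*»|“[^”]*”|"[^"]*"')
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Crude proxy for proper nouns: a capitalized word right after a lowercase
# letter, comma or semicolon followed by a single space.
PROPER = re.compile(r"(?<=[a-zà-ù,;] )[A-ZÀ-Ù][\w’']*")

def prepare(text: str) -> str:
    """Remove quotations, numbers and (approximate) proper nouns; keep word forms intact."""
    text = QUOTED.sub(" ", text)
    text = NUMBER.sub(" ", text)
    text = PROPER.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = prepare("Nel 1917 il compagno Rossi disse «basta» e tornò a Torino.")
```

Note that the function never lemmatizes: inflected forms such as "disse" and "tornò" are left exactly as written, in line with the choice stated on the slide.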

Page 15: Gramsci ’ s authorship attribution  of anonymus newspapers articles

n-grams

• sequences of n entities of a kind you must choose (we chose characters)

• sliding n-grams: in “final” a 3-gram reads fin, ina, nal

• to find the right n you run tests

• n-grams capture fragments of meaning, syntax, collocations/co-occurrences, etc.

• you have a dictionary of Gramscian n-grams

• you check the n-grams of your anonymous texts; you count the matches and the non-matches and take the algebraic sum: if positive the text is Gramscian, if negative not
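
The sliding n-grams and the match count can be sketched as follows. This is a literal reading of the slide under an explicit assumption: every match counts +1 and every non-match counts −1. The weighting actually used in the study is not given here, and the function names are hypothetical.

```python
def char_ngrams(text: str, n: int) -> list:
    """Sliding character n-grams: char_ngrams("final", 3) gives fin, ina, nal."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_score(anonymous: str, gramsci_ngrams: set, n: int) -> int:
    """Algebraic sum: +1 for each n-gram found in the dictionary, -1 for each one not found."""
    grams = char_ngrams(anonymous, n)
    matches = sum(g in gramsci_ngrams for g in grams)
    return matches - (len(grams) - matches)

# Toy dictionary built from a single known sentence (illustrative only).
dictionary = set(char_ngrams("la classe operaia", 3))
score = ngram_score("classe", dictionary, 3)
```

With a positive score the text would be read as Gramscian, with a negative one as not; in practice the dictionary would be built from the full set of known texts and n would be chosen by testing.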

Page 16: Gramsci ’ s authorship attribution  of anonymus newspapers articles

strategy

• maximize the correct attributions

• at the same time, avoid false attributions

• = some missed attributions are OK if you don’t produce false attributions

• the institution that commissioned the work must be able to trust you
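
One way to picture this trade-off is a decision threshold set high enough that no known non-Gramscian text would be attributed, at the price of missing some genuine ones. The sketch below is only an illustration of that idea; the slides do not state that the study used a threshold of this kind, and the function names and sample scores are invented.

```python
def safe_threshold(scores, is_gramsci):
    """Highest score reached by any known non-Gramscian text."""
    others = [s for s, g in zip(scores, is_gramsci) if not g]
    return max(others) if others else float("-inf")

def outcome(scores, is_gramsci, threshold):
    """Count correct, false and missed attributions when attributing only above the threshold."""
    attributed = [s > threshold for s in scores]
    pairs = list(zip(attributed, is_gramsci))
    return {
        "correct": sum(a and g for a, g in pairs),
        "false": sum(a and not g for a, g in pairs),
        "missed": sum(g and not a for a, g in pairs),
    }

# Invented scores for five known texts (True = certainly Gramscian).
scores = [0.9, 0.4, 0.2, -0.3, 0.5]
labels = [True, True, False, False, False]
result = outcome(scores, labels, safe_threshold(scores, labels))
```

Here one Gramscian text is missed, but nothing is falsely attributed, which is the outcome the strategy prefers.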

Page 17: Gramsci ’ s authorship attribution  of anonymus newspapers articles

strategy 2

• we don’t know if, how, and how much the “parole” of an author changes across matters, audience, genre, time, …

• so we decided to work on well-defined periods, leaving it to the Gramsci experts to set their boundaries

• 1st period: 1914-1921

Page 18: Gramsci ’ s authorship attribution  of anonymus newspapers articles

a little maths

• having two methods at work, we could build a Cartesian plane on which the results of the measurements were plotted after being normalized to the range -1 to +1
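
A minimal sketch of how two sets of scores could be brought into the range -1 to +1 and paired as coordinates. The slide does not say which normalization was used; dividing by the largest absolute value is an assumption chosen here because it keeps the sign of each score. The sketch also assumes that both methods are oriented the same way (higher means more Gramscian).

```python
def normalize(scores):
    """Scale scores into [-1, 1] by the largest absolute value, preserving sign."""
    peak = max(abs(s) for s in scores)
    if peak == 0:
        return [0.0 for _ in scores]
    return [s / peak for s in scores]

def to_plane(entropy_scores, ngram_scores):
    """One (x, y) point per text: x from the entropy method, y from the n-gram method."""
    return list(zip(normalize(entropy_scores), normalize(ngram_scores)))

points = to_plane([2, -4], [10, -5])
```

Texts on which both methods agree then fall in the upper-right or lower-left quadrant, while the other two quadrants collect the cases where the methods disagree.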

Page 19: Gramsci ’ s authorship attribution  of anonymus newspapers articles

phase 1 - setup

Page 20: Gramsci ’ s authorship attribution  of anonymus newspapers articles

phase 2 – blind test

Page 21: Gramsci ’ s authorship attribution  of anonymus newspapers articles

the day after

• we started to do the attributions (paid by the Fondazione Gramsci for it) without knowing anything about the texts, and gave periodic reports to the historians who were editing the various volumes of the national edition of Gramsci’s works

• we got the texts, normalized them, measured them, and produced a Report we sent to Fondazione Gramsci

• the historians’ evaluation of the QAA: no proposed attribution was unacceptable, even if not every proposed attribution was accepted

• [example of report]

Page 22: Gramsci ’ s authorship attribution  of anonymus newspapers articles

now we have stopped

• due to the cuts to research funds, the national edition is currently on hold

Page 23: Gramsci ’ s authorship attribution  of anonymus newspapers articles

some practical principles on AA

• no tool can ‘read’ a text and tell you: this text was written by Francesco Stella

• you can only classify the texts you chose to work on, crunched by the tool you use

• all of the texts will be connected: you must interpret the results

• you must mix anonymous or disputed works with “control works”: same period, same genre, same language, same author, similar authors, …

Page 24: Gramsci ’ s authorship attribution  of anonymus newspapers articles

be careful

• when you have proper nouns in your works, it’s easy to classify them:

• R. Clement and D. Sharp, Ngram and Bayesian Classification of Documents for Topic and Authorship, “LLC”, 2003, 18(4):423-447

• but you don’t really classify the texts, you classify the collections of proper nouns they contain

Page 25: Gramsci ’ s authorship attribution  of anonymus newspapers articles

why the Gramsci case was/is difficult and strange

• articles are very short: between 300 and 1000/1200 words

• all of these articles share: matters, ideology, context

• there is no countercheck, and you work for a scientific and productive initiative (it’s not ‘simply’ an experiment)

• the tables showing the matches are sparse tables, nevertheless these data work well

Page 26: Gramsci ’ s authorship attribution  of anonymus newspapers articles

now what

• Patrick Juola, the mathematician who proposed the AAAC, has released JGAAP, a package offering various tools for QAA:

• http://evllabs.com/jgaap/w/index.php/

• the R package stylo is impressive and I wish we had had it when we started our work on the Gramsci texts

Page 27: Gramsci ’ s authorship attribution  of anonymus newspapers articles

some references to start from

• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, An example of mathematical authorship attribution, “Journal of Mathematical Physics”, 2008, 49, pp. 1-20

• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, L'attribuzione dei testi gramsciani: metodi e modelli matematici, “La Matematica nella Società e nella Cultura”, 2010, 3, pp. 235-269

• M. Lana, Come scriveva Gramsci? Metodi matematici per riconoscere scritti gramsciani anonimi, “Informatica Umanistica”, 2010, 3, pp. 31-56

Page 28: Gramsci ’ s authorship attribution  of anonymus newspapers articles

some references (2)

• M. Lana, Individuare scritti gramsciani anonimi in un "corpus" giornalistico. Il ruolo dei metodi quantitativi, “Studi storici: rivista trimestrale dell'Istituto Gramsci”, 52 (4), pp. 859-880

• P. Juola, Authorship Attribution, “Foundations and Trends in Information Retrieval”, Vol. 1, No. 3 (2006), pp. 233-334, http://www.conll.org/~walter/educational/material/fnt-aa.pdf

• J. Grieve, Quantitative Authorship Attribution: An Evaluation of Techniques, “LLC”, 22, pp. 251-270, http://dl.dropboxusercontent.com/u/99161057/Grieve_authorshipattribution.pdf

Page 29: Gramsci ’ s authorship attribution  of anonymus newspapers articles

thanks!