authorship attribution pydata london

Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas Kostas Perifanos, Search & Analytics Engineer @perifanoskostas Learner Analytics & Data Science Team

Upload: kperi

Post on 06-May-2015






Page 1: Authorship attribution   pydata london

Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas

Kostas Perifanos, Search & Analytics Engineer, @perifanoskostas
Learner Analytics & Data Science Team

Page 2: Authorship attribution   pydata london


“Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” [Love, 2002]

Page 3: Authorship attribution   pydata london

Domains of application
● Author attribution
● Author verification
● Plagiarism detection
● Author profiling [age, education, gender]
● Stylistic inconsistencies [multiple collaborators/authors]
● Can also be applied to computer code, music scores, ...

Page 4: Authorship attribution   pydata london

“Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt”

“Automation”, “identification”, “text”: Machine Learning

Page 5: Authorship attribution   pydata london

A classification problem

● Define classes
● Extract features
● Train ML classifier
● Evaluate

Page 6: Authorship attribution   pydata london

Class definition[s]

● AuthorA, AuthorB, AuthorC, …
● Author vs rest-of-the-world [1-class classification problem]
● Or even, in extended contexts, a clustering problem
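A minimal sketch of the author-vs-rest setting, assuming scikit-learn's OneClassSVM over character n-gram TF-IDF vectors. The texts, the `nu` value and all variable names are made up for illustration and are not taken from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# made-up texts written by the single known author
known_texts = [
    "indeed, I suppose the weather is rather fine.",
    "indeed, I suppose the coffee is rather fine.",
    "indeed, I suppose the match is rather fine.",
    "indeed, I suppose the movie is rather fine.",
    "indeed, I suppose the train is rather fine.",
]

# character 2- and 3-grams, TF-IDF weighted
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_known = vectorizer.fit_transform(known_texts)

# train on the known author only; nu=0.5 is an arbitrary illustrative choice
detector = OneClassSVM(kernel="linear", nu=0.5).fit(X_known)

# made-up questioned texts; predict() returns +1 (looks like the author) or -1 (does not)
questioned = [
    "indeed, I suppose the book is rather fine.",
    "lol the book is sooo good!!!",
]
predictions = detector.predict(vectorizer.transform(questioned))
print(predictions)
```

With so little data the predictions are not meaningful; the point is only the shape of the problem: one class at training time, an accept/reject decision at prediction time.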

Page 7: Authorship attribution   pydata london

Feature extraction

● Lexical features
● Character features
● Syntactic features
● Application specific

Page 8: Authorship attribution   pydata london

Feature extraction

● Lexical features
  ● Word length, sentence length etc.
  ● Vocabulary richness [lexical density: function words vs content words ratio]
  ● Word frequencies
  ● Word n-grams
  ● Spelling errors
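A few of the lexical features above can be sketched in plain Python. This is an illustration only: the sentence splitter is naive, the function-word list is a tiny hypothetical sample, and the type-token ratio is used here as one common proxy for vocabulary richness rather than the measure used in the talk.

```python
import re

# tiny, hypothetical function-word list; a real one would be much longer
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def lexical_features(text):
    """Return a few simple lexical style markers for one text."""
    # naive sentence split on ., ! and ? (an assumption, not a real tokenizer)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    n_function = sum(w in FUNCTION_WORDS for w in words)
    n_content = len(words) - n_function
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # distinct words / total words
        "type_token_ratio": len(set(words)) / len(words),
        # function words vs content words
        "function_content_ratio": n_function / max(n_content, 1),
    }

features = lexical_features("The cat sat. The cat ran away!")
print(features)
```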

Page 9: Authorship attribution   pydata london

Feature extraction

● Character features
  ● Character types (letters, digits, punctuation)
  ● Character n-grams (fixed and variable length)
  ● Compression methods [Entropy, which is really nice but for another talk :) ]
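As a rough illustration of the first and last bullets, the sketch below counts character types and uses the zlib compression ratio as a crude stand-in for a compression-based measure. The function names are hypothetical and the talk does not say which compression method it has in mind.

```python
import string
import zlib

def char_type_profile(text):
    """Share of letters, digits and punctuation among all characters."""
    n = len(text)
    return {
        "letters": sum(c.isalpha() for c in text) / n,
        "digits": sum(c.isdigit() for c in text) / n,
        "punctuation": sum(c in string.punctuation for c in text) / n,
    }

def compression_ratio(text):
    """Compressed size / raw size; repetitive text gives a lower ratio."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

profile = char_type_profile("Hi, it's 2015!")
print(profile)
print(compression_ratio("ab" * 200))
```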

Page 10: Authorship attribution   pydata london

Feature extraction

● Syntactic features
  ● Part-of-speech tags [e.g. Verbs (VB), Nouns (NN), Prepositions (IN) etc.]
  ● Sentence and phrase structure
  ● Errors

Page 11: Authorship attribution   pydata london

Feature extraction

● Semantic features
  ● Synonyms
  ● Semantic dependencies
● Application specific features
  ● Structural
  ● Content specific
  ● Language specific

Page 12: Authorship attribution   pydata london

Demo application

Let’s apply a classification algorithm on texts, using word and character n-grams and POS n-grams

Data set (1): 12867 tweets from 10 users, in Greek, collected in 2012 [4]
Data set (2): 1157 judgments from 2 judges, in English [5]

Page 13: Authorship attribution   pydata london

But what’s an “n-gram”?

[…] an n-gram is a contiguous sequence of n items from a given sequence of text.

So, for the sentence above:
word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), … ]
char 2-grams: [ 'an', 'n ', ' n', 'n-', '-g', … ]

We will use the TF-IDF weighted frequencies of both word and character n-grams as features.
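The two n-gram lists above can be reproduced with a few lines of plain Python. This is a sketch with hypothetical helper names and whitespace tokenization; in the demo itself the n-grams are produced by scikit-learn's TfidfVectorizer.

```python
def word_ngrams(text, n):
    """Contiguous runs of n whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Contiguous runs of n characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "an n-gram is a contiguous sequence"
print(word_ngrams(sentence, 2)[:4])
# [('an', 'n-gram'), ('n-gram', 'is'), ('is', 'a'), ('a', 'contiguous')]
print(char_ngrams(sentence, 2)[:5])
# ['an', 'n ', ' n', 'n-', '-g']
```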

Page 14: Authorship attribution   pydata london

Enter Python

Flashback [or, transforming experiments to accepted papers in t<=2h]

A few months earlier, Dec 13, just one day before my holidays, I get this call...

Page 15: Authorship attribution   pydata london

Load the dataset

# assume we have the data in 10 tsv files, one file per author.
# each file consists of two columns, id and actual text
from os import listdir
from os.path import isfile, join

import pandas as pd

def load_corpus(input_dir):
    # every regular file in input_dir is one author's tsv file
    trainfiles = [f for f in listdir(input_dir) if isfile(join(input_dir, f))]
    trainset = []
    for filename in trainfiles:
        df = pd.read_csv(join(input_dir, filename), sep="\t",
                         dtype={'id': object, 'text': object})
        # the file name serves as the author label
        for row in df['text']:
            trainset.append({"label": filename, "text": row})
    return trainset

Page 16: Authorship attribution   pydata london

Extract features [1]

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# TF-IDF weighted word bigrams, capped at 2000 features
word_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                              max_features=2000, binary=False)
# TF-IDF weighted character 2- and 3-grams; min_df=1 (the default) keeps every n-gram
char_vector = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                              max_features=2000, binary=False, min_df=1)

# split the loaded trainset into raw texts and author labels
corpus, classes = [], []
for item in trainset:
    corpus.append(item["text"])
    classes.append(item["label"])

# our vectors are the feature union of word/char ngrams
vectorizer = FeatureUnion([("chars", char_vector), ("words", word_vector)])

# load corpus, use fit_transform to get vectors
X = vectorizer.fit_transform(corpus)

Page 17: Authorship attribution   pydata london

Extract features [2]

import nltk
import scipy.sparse as sp

# generate POS tags using nltk, return the sequence as whitespace separated string
def pos_tags(txt):
    tokens = nltk.word_tokenize(txt)
    return " ".join([tag for (word, tag) in nltk.pos_tag(tokens)])

# one tag sequence per document, in the same order as corpus
tags = [pos_tags(txt) for txt in corpus]

# combine word and char ngrams with POS-ngrams
tag_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                             binary=False, max_features=2000, decode_error='ignore')
X1 = vectorizer.fit_transform(corpus)
X2 = tag_vector.fit_transform(tags)

# concatenate the two matrices
X = sp.hstack((X1, X2), format='csr')
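One detail worth checking when vectorizing tag strings: with its default settings TfidfVectorizer lowercases the input and keeps only tokens of two or more word characters, so a punctuation tag such as `.` is dropped. The sketch below uses hand-written tag strings (no tagger is run) and a custom `token_pattern`; both are illustrative assumptions, not the settings used in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hand-written POS tag sequences, one string per document
tag_docs = ["DT NN VBZ DT NN .", "PRP VBD RB ."]

# default tokenization: lowercased, punctuation tags discarded
default_vec = TfidfVectorizer(analyzer="word", ngram_range=(2, 2))
default_vec.fit(tag_docs)

# keep every whitespace-separated tag exactly as written
keep_all_vec = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                               token_pattern=r"\S+", lowercase=False)
keep_all_vec.fit(tag_docs)

print(sorted(default_vec.vocabulary_))
print(sorted(keep_all_vec.vocabulary_))
```

Whether punctuation tags help is an empirical question; the sketch only shows how to keep them if they are wanted.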

Page 18: Authorship attribution   pydata london

Extract features [2.1]

# this last part is a little bit tricky
X = sp.hstack((X1, X2), format='csr')

There was no (obvious) way to use FeatureUnion here: it feeds the same input to every transformer, while the POS vectorizer needs the tag strings rather than the raw text.

X1 and X2 are sparse matrices, so we use hstack to stack them horizontally (column-wise).
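A toy illustration of the horizontal stacking step, using two small hand-made sparse matrices in place of the real feature matrices. The values are arbitrary; what matters is that both matrices have one row per document, in the same order.

```python
import numpy as np
import scipy.sparse as sp

# two feature blocks for the same 2 documents (3 and 2 features)
X1 = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                             [0.0, 3.0, 0.0]]))
X2 = sp.csr_matrix(np.array([[0.0, 5.0],
                             [6.0, 0.0]]))

# columns of X2 are appended to the right of X1; the row count is unchanged
X = sp.hstack((X1, X2), format="csr")
print(X.shape)  # (2, 5)
print(X.toarray())
```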

Page 19: Authorship attribution   pydata london

Put everything together

Author: a function of the feature vector components
● word n-grams
● character n-grams
● POS tag n-grams (optional)

Page 20: Authorship attribution   pydata london

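Fit the model and evaluate (10-fold-CV)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# loss='hinge' is the current name for what older scikit-learn releases called 'l1'
model = LinearSVC(loss='hinge', dual=True)
# X is the sparse feature matrix built above; LinearSVC accepts it directly
scores = cross_val_score(estimator=model, X=X, y=np.asarray(classes), cv=10)
print("10-fold cross validation results:", "mean score =", scores.mean(),
      "std =", scores.std(), ", num folds =", len(scores))

Results: 96% accuracy for two authors, using 10-fold-CV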
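The whole procedure can be tried end to end on a toy corpus. The sketch below is not the talk's code: the texts are made up, and the vectorizers are placed inside a scikit-learn pipeline so that the TF-IDF vocabulary is learned from the training folds only, which is a design choice of this example. The scores on such a tiny corpus say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# made-up corpus: two "authors" with visibly different habits
nouns = ["weather", "coffee", "match", "movie", "train", "book"]
texts = (["indeed, I suppose the %s is rather fine." % w for w in nouns]
         + ["lol the %s is sooo good!!!" % w for w in nouns])
labels = ["A"] * len(nouns) + ["B"] * len(nouns)

# character and word n-gram TF-IDF features, concatenated
features = FeatureUnion([
    ("chars", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
model = make_pipeline(features, LinearSVC(loss="hinge", dual=True, random_state=0))

# 3 folds only because the toy corpus has 6 texts per author
scores = cross_val_score(model, texts, labels, cv=3)
print("mean score =", scores.mean(), "std =", scores.std())
```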

Page 21: Authorship attribution   pydata london

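Evaluate (train set vs test set)

import matplotlib.pyplot as pl
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

model = LinearSVC(loss='hinge', dual=True)
y = np.asarray(classes)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit on the training split, predict on the held-out split
y_pred = model.fit(X_train, y_train).predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# show the confusion matrix as an image
pl.matshow(cm)
pl.title('Confusion matrix')
pl.ylabel('True label')
pl.xlabel('Predicted label')
pl.show()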

Page 22: Authorship attribution   pydata london

[[ 57   1   2   0   4   8   3  27  13   2]
 [  0  71   1   1   0  13   0   8   6   0]
 [  3   0  51   1   3   5   4   8  25   0]
 [  0   1   0 207   2   8   8   8  82   2]
 [  5   4   3   7 106  30  10  25  23   3]
 [  9  11   3  15  11 350  14  46  42  12]
 [  3   1   3   8  13  16 244  21  38   5]
 [  8  12  10   3  11  46  13 414  39   8]
 [  8   4   7  59  11  21  31  49 579  10]
 [  2   6   1   4   3  24  13  29  15  61]]

Confusion Matrix
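Overall accuracy and per-author recall can be read off the matrix directly. The sketch below copies the numbers from the slide and assumes the scikit-learn convention that rows are true labels and columns are predicted labels, which matches the axis labels in the plotting code.

```python
import numpy as np

cm = np.array([
    [57, 1, 2, 0, 4, 8, 3, 27, 13, 2],
    [0, 71, 1, 1, 0, 13, 0, 8, 6, 0],
    [3, 0, 51, 1, 3, 5, 4, 8, 25, 0],
    [0, 1, 0, 207, 2, 8, 8, 8, 82, 2],
    [5, 4, 3, 7, 106, 30, 10, 25, 23, 3],
    [9, 11, 3, 15, 11, 350, 14, 46, 42, 12],
    [3, 1, 3, 8, 13, 16, 244, 21, 38, 5],
    [8, 12, 10, 3, 11, 46, 13, 414, 39, 8],
    [8, 4, 7, 59, 11, 21, 31, 49, 579, 10],
    [2, 6, 1, 4, 3, 24, 13, 29, 15, 61],
])

# correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
# per-author recall: correct predictions / all texts truly by that author
recall = np.diag(cm) / cm.sum(axis=1)

print(cm.sum(), np.trace(cm))  # 3217 2140
print(round(accuracy, 3))      # 0.665
print(np.round(recall, 2))
```

The matrix covers 3217 test texts across 10 authors, so it appears to come from the tweet data set; the 96% figure quoted earlier refers to the two-author case.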

Page 23: Authorship attribution   pydata london

Interesting questions
● Many authors?
● Short texts / “micro messages”?
● Is writing style affected by time/age?
● Can we detect “mood”?
● Psychological profiles?
● What about obfuscation?
● Even more subtle problems [PAN Workshop 2013]
● Other applications (code, music scores etc.)

Page 24: Authorship attribution   pydata london

References & Libraries
1. Authorship Attribution: An Introduction, Harold Love, 2002
2. A Survey of Modern Authorship Attribution Methods, Efstathios Stamatatos, 2007
3. Authorship Attribution, Patrick Juola, 2008
4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos, 2012
5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi, Zukerman, Bohnert, 2011

Python libraries:

● Pandas
● Scikit-learn
● nltk


Demo Python code:

Page 25: Authorship attribution   pydata london


Page 26: Authorship attribution   pydata london

Thank you!