author profiling from personal content blogscse.iitkgp.ac.in/~pawang/courses/snlp15/projects... ·...

Author Profiling from Personal Content Blogs

Aseem Patni (12CS10008), Ayush Singhal (12CS10002)

Soham Dan (12CS10059), Pranay Yadav (12CS30025)

Shubham Saxena (12CS30032), Bhushan Kulkarni (12CS30016)

Mentors: Sruthi Warrier, Anurag Verma

Abstract

● This project aims to predict personally identifiable information (PII), such as the age and gender of an author, by extracting features from the text of his/her personal content blog.

● We consider content-based and semantic features for the classification task.

● We also use Word2Vec vectors along with 1-gram and 2-gram counts as features.

Datasets Used

● Blog Authorship Corpus
○ 19,320 bloggers gathered from blogger.com in August 2004.
○ 681,288 posts and over 140 million words.
○ 35 posts and 7,250 words per person.
○ Each blogger falls in one of three age groups: "10s" [13-17], "20s" [23-27], "30s" [33-47].
○ Equal sampling of male and female bloggers.

● PAN '14 and '15 Corpus
○ Made available for use during the Author Profiling Shared Task at PAN '14.
○ Under consideration: English blog posts.
○ Each author falls in one of these age groups: [18-24], [25-34], [35-49], [50-64], [65-xx].
○ 2,278 posts, 148 authors.
○ On average, 15 blogs per author.

Pre-Processing

Blog Authorship Corpus
● Initially, 19,320 XML files, each pertaining to a unique author.
● Each XML file contains the date of posting followed by the post.
● All HTML links in a post are replaced by a unique tag 'urlLink'.
● We cleaned the data by discarding the following blog posts:
○ empty blog posts.
○ posts which contain only HTML links and no text.
○ posts with sentence length < 3 words.
● Exported to JSON for further analysis.
● For a few tasks, we also removed stop words from further consideration.
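The cleaning rules for the Blog Authorship Corpus can be sketched as below. This is a minimal illustration, not the project's actual code: the function names `keep_post` and `clean_posts` are hypothetical, posts are assumed to be plain strings in which links already appear as the 'urlLink' tag, and "sentence length < 3 words" is interpreted here as fewer than three words remaining once link tags are ignored.

```python
URL_TAG = "urlLink"  # placeholder the corpus uses for HTML links


def keep_post(text, min_words=3):
    """Return True if a post survives the cleaning rules (assumed reading)."""
    # Ignore link placeholders, then count the remaining whitespace tokens.
    words = [w for w in text.split() if w != URL_TAG]
    # Empty and link-only posts have zero words left; very short posts
    # have fewer than min_words.
    return len(words) >= min_words


def clean_posts(posts):
    """Keep only the posts that pass keep_post, preserving their order."""
    return [p for p in posts if keep_post(p)]
```

For example, `clean_posts(["", "urlLink urlLink", "hi there", "I went home today"])` keeps only the last post.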

PAN '14 and PAN '15 Corpus
● Initially, 148 XML files, each pertaining to a unique author.
● Each XML file contains the author ID and the blog posts.
● Wrote regular expressions to translate HTML entities (such as those encoding '&' and '“') into regular text.
● Exported to JSON for further analysis.
● For a few tasks, we also removed stop words from further consideration.
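The slides describe decoding HTML entities with regular expressions. As an alternative illustration (not the project's method), the Python standard library function `html.unescape` performs the same translation; the wrapper name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text):
    """Translate HTML entities in a post into regular characters."""
    # html.unescape handles named entities (e.g. &amp;, &ldquo;) as well
    # as decimal and hexadecimal character references.
    return html.unescape(text)
```

For instance, `decode_entities("Tom &amp; Jerry")` yields `"Tom & Jerry"`.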

Features

Semantic Features

● Topic Modeling using LDA
○ Variance in the probability distribution of the topics the author blogs about.
○ Using the gensim package API.

● Opinion score of the blogs
○ Average opinion score of the blogs by the author.
○ Using the SentiWordNet package.

● Word2Vec
○ Average Word2Vec representation of the blogs by the author.
○ Using the gensim package API.

● Discourse Analysis
○ Average % of explicit and implicit discourse relations in the blogs by the author.
○ Using a Java-based discourse parser (GNU Public License).

Syntactic Features

Feature Selected                 Feature Selected
# of Hyperlinks                  Readability Metrics:
# of Quotations                  Automated Readability Index
Average Sentence Length          Flesch Reading Ease
Distribution of POS Tags         Flesch-Kincaid Grade Level
Distribution of Punctuators      Gunning Fog Index
# of Named Entities              SMOG Index
# of Non-Word Errors             Coleman-Liau Index
# of References to Past/Future   LIX
# of Non-English Words           RIX
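Three of the readability metrics listed among the syntactic features (Automated Readability Index, LIX and RIX) depend only on character, word and sentence counts, so they can be sketched without a syllable counter. The function name `readability`, the simple regex-based tokenisation and the sentence splitting on '.', '!' and '?' are assumptions made for illustration; the slides do not say which implementation was used.

```python
import re


def readability(text, long_word_len=7):
    """Compute ARI, LIX and RIX from simple surface counts."""
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, digits and apostrophes (an assumption).
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        return None
    chars = sum(len(w) for w in words)
    # LIX and RIX treat words of more than six characters as "long".
    long_words = sum(1 for w in words if len(w) >= long_word_len)
    n_s, n_w = len(sentences), len(words)
    return {
        "ARI": 4.71 * chars / n_w + 0.5 * n_w / n_s - 21.43,
        "LIX": n_w / n_s + 100.0 * long_words / n_w,
        "RIX": long_words / n_s,
    }
```

Flesch Reading Ease, Flesch-Kincaid, Gunning Fog and SMOG additionally need syllable counts, which is why they are left out of this sketch.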

Feature Selection (Contd.)

Analysis of Variance

Topic Modelling as a feature

Using the gensim LDA API with the number of topics to be retrieved set to 10.

Feature used due to significant positive correlation with both gender and age.
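The topic feature is described as the variance of an author's topic-probability distribution. A minimal sketch of that computation follows, assuming the 10-topic distribution has already been obtained from an LDA model; the function name `topic_variance` and the use of population variance are assumptions, since the slides do not give the exact formula.

```python
def topic_variance(topic_probs):
    """Population variance of an author's topic-probability distribution."""
    k = len(topic_probs)
    mean = sum(topic_probs) / k
    # Low variance: the author spreads evenly over topics.
    # High variance: the author concentrates on a few topics.
    return sum((p - mean) ** 2 for p in topic_probs) / k
```

An author who writes evenly about all ten topics gets a variance near zero, while one who concentrates on a single topic gets a much larger value.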

Opinion Scores as a feature

Using SentiWordNet 3.0.0 to assign an average sentiment score to the blogs by a particular user.

Feature rejected due to low correlation across age and gender.

Discourse Analysis as a feature

$$F = \frac{\text{No. of explicit and implicit discourse relations}}{\text{No. of sentences}}$$

Both intra- and inter-sentential discourse relations are considered.

Feature rejected due to low correlation.

Word2Vec vectors as a feature

Using the GraphLab and gensim APIs for the word2vec word embedding of each blog.

An average word2vec vector is obtained for each author by averaging the individual vectors of each of his/her blogs.

The Word2Vec neural net was trained on the entire Blog Authorship Corpus.

As we shall see, this feature resulted in good accuracy in the classification task.
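The averaging step can be sketched as follows. This is an illustration under stated assumptions: `embedding` is a plain dictionary standing in for a trained Word2Vec model, each blog vector is taken to be the mean of its in-vocabulary word vectors (the slides do not say how a blog vector is formed), and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def author_vector(blogs, embedding):
    """blogs: list of token lists; embedding: dict mapping word -> vector."""
    blog_vecs = []
    for tokens in blogs:
        # Skip out-of-vocabulary tokens.
        known = [embedding[t] for t in tokens if t in embedding]
        if known:
            blog_vecs.append(mean_vector(known))
    # The author vector is the mean of the per-blog vectors.
    return mean_vector(blog_vecs)
```

Averaging per blog first, rather than over all words at once, gives each blog equal weight regardless of its length.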

Learning Techniques and Evaluation

Gender: Learning Techniques Comparison

Features used: Semantic + Content - {Neural Word Embeddings, 1-gram and 2-gram features}

Technique                  Precision   Recall   ROC     Accuracy (%)
Logistic Regression        0.697       0.692    0.733   69.23
CART                       0.661       0.651    0.729   65.41
Random Forest              0.696       0.683    0.731   70.87
SVM (Normalised Kernel)    0.689       0.679    0.732   68.29
AdaBoost                   0.701       0.698    0.723   71.03
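For reference, precision, recall and accuracy for a two-class task such as gender prediction can be computed from predicted and true labels as sketched below. The function name `binary_metrics` is hypothetical, and the slides do not state whether the reported precision and recall are for one class or averaged over classes, so this shows the single-class definition only.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, accuracy) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy
```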

Age: Learning Techniques Comparison

Features used: Semantic + Content - {Neural Word Embeddings, 1-gram and 2-gram features}

Technique                  Accuracy (%)
Logistic Regression        58.34
CART                       53.11
Random Forest              59.89
SVM (Normalised Kernel)    57.17
AdaBoost                   60.70

Word2Vec

81.9% accuracy in gender prediction

67.05% accuracy in age group prediction

Gender : Evaluation (contd.)

Complete Blog Authorship Corpus was used for Evaluation.

Training : Test Split was set at 80 : 20

Features Used : Unigram and Bigram counts, Neural Word Embeddings.

Precision : 82%

Recall : 80.4%

Accuracy : 81.9% (using Boosted Trees)

Age : Evaluation (contd.)

Complete Blog Authorship Corpus was used.

Training : Test Split was set at 80 : 20

Features Used : Unigram and Bigram counts, Neural Word Embeddings.

Precision : 74.5%

Recall : 66.67%

Accuracy : 67.85% (using Boosted Trees)
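The unigram and bigram count features used in both evaluations can be sketched with the standard library. The function name `ngram_counts` and whitespace tokenisation are assumptions for illustration; the slides do not describe the tokeniser or any vocabulary pruning.

```python
from collections import Counter


def ngram_counts(tokens, n_values=(1, 2)):
    """Count n-grams (as tuples) for each n in n_values."""
    counts = Counter()
    for n in n_values:
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts("i like blogs i like".split())` counts the unigram `("i",)` twice and the bigram `("i", "like")` twice.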

PAN Author Profiling Task at CLEF 2013

Reference: [1]

Features: Wikipedia Semantic + Word n-grams + POS n-grams

Classifier: SVM

Gender Prediction Accuracy: 62.12%

Age Prediction Accuracy: 66.51%

Conclusion

An extensive list of syntactic features was used with a variety of learning methods, including ensemble learning, which gave decent accuracy in both gender and age prediction.

Incorporating semantic features gave far better accuracy than using only content-based features.

This points to the strength of neural word embeddings (Word2Vec) for age and gender prediction from text.

Future Work

Investigating further to find a suitable combination of syntactic and semantic features that can further boost the accuracy of predictions.

Using textual features to predict other personally identifiable information about authors, such as stylometric information.

References

1. Varma et al. 2013. Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors. Notebook for PAN at CLEF 2013.

2. Smith et al. 2011. Author Age Prediction from Text using Linear Regression. In Proceedings of the ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LATECH 2011), Portland, OR, June 2011.

3. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. Proceedings of PAN at CLEF 2013.

4. Argamon et al. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM 52 (2): 119-123.

5. Schler et al. 2006. Effects of Age and Gender on Blogging. In Proceedings of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.

6. Meina et al. 2013. Ensemble-based Classification for Author Profiling Using Various Features. Notebook for PAN at CLEF 2013.
